Training Pipeline
Learn common requirements and patterns in building training pipelines.
Training pipeline
- A training pipeline needs to handle a large volume of data at low cost. One common solution is to store data in a column-oriented format such as Parquet or ORC. These formats enable high throughput for ML and analytics use cases. In other use cases, the TFRecord format (a TensorFlow format for storing a sequence of binary records) is widely used in the TensorFlow ecosystem.
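To see why a column-oriented layout gives high throughput for analytics-style reads, here is a minimal sketch in plain Python with made-up feature names: computing a statistic over one feature touches a single contiguous column instead of every record.

```python
# Row-oriented layout: each record stores all of its features together,
# so reading one feature still means touching every record.
rows = [
    {"user_id": 1, "age": 34, "clicks": 12},
    {"user_id": 2, "age": 28, "clicks": 7},
    {"user_id": 3, "age": 45, "clicks": 31},
]

# Column-oriented layout (what Parquet/ORC do on disk): each feature is
# stored as its own array, so one feature can be scanned in isolation.
columns = {
    "user_id": [1, 2, 3],
    "age": [34, 28, 45],
    "clicks": [12, 7, 31],
}

# Computing the mean of one feature only needs that one column.
mean_age = sum(columns["age"]) / len(columns["age"])
print(mean_age)  # ≈ 35.67
```

Real columnar files add compression and per-column statistics on top of this layout, which is what lets query engines skip data they do not need.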
Data partitioning
- Parquet and ORC files are usually partitioned by time for efficiency, as this avoids scanning the whole dataset. In this example, we partition data first by year, then by month. In practice, most common services on AWS, e.g., Amazon Athena (an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL) and Amazon Redshift (a fully managed, petabyte-scale data warehouse service in the cloud), support Parquet and ORC.
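The year-then-month scheme is often laid out as Hive-style partition directories, which engines such as Athena can use to prune partitions and skip irrelevant files. A minimal sketch, with a hypothetical bucket name and file name:

```python
from datetime import date

def partition_path(base: str, d: date) -> str:
    """Build a Hive-style year=YYYY/month=MM partition path for a record date."""
    return f"{base}/year={d.year}/month={d.month:02d}/data.parquet"

# Hypothetical bucket; a query filtering on year/month reads only this prefix.
print(partition_path("s3://my-bucket/training-data", date(2023, 5, 17)))
# s3://my-bucket/training-data/year=2023/month=05/data.parquet
```

Partitioning too finely (e.g., by hour for small datasets) can backfire by producing many tiny files, so the granularity should match how the data is actually queried.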