...

Training Pipeline

Learn common requirements and patterns in building training pipelines.

Training pipeline

  • A training pipeline needs to handle a large volume of data at low cost. One common solution is to store data in a column-oriented format such as Parquet or ORC. These formats enable high read throughput for ML and analytics use cases. In the TensorFlow ecosystem, the TFRecord format (TensorFlow's format for storing a sequence of binary records) is also widely used.

Data partitioning

  • Parquet and ORC files are usually partitioned by time for efficiency, since this lets queries avoid scanning the whole dataset. In this example, we partition data first by year, then by month. In practice, the most common AWS services, such as Redshift (a fully managed petabyte-scale cloud data warehouse) and Athena (an interactive query service for analyzing data in Amazon S3 with standard SQL), support Parquet and ORC. In
...