Programmatic ETL with AWS Glue
Explore how to build scalable ETL pipelines with AWS Glue to prepare data for machine learning on AWS. Understand schema discovery with the Data Catalog, apply PySpark transformation logic like deduplication and outlier handling, and validate data quality before training. Learn integration patterns with S3 and Redshift to optimize inputs for SageMaker training jobs.
We'll cover the following...
- Schema discovery with the Data Catalog and crawlers
- Transformation logic in Glue ETL jobs
- AWS Glue Data Quality for validation before training
- Integrating AWS Glue with S3 and Redshift
AWS Glue serves as a primary engine for converting raw data in S3 or Redshift into ML-ready datasets. As a fully managed, serverless ETL service built on Apache Spark, Glue provides the distributed processing power required to clean, validate, and restructure data at scale. By automating the discovery and transformation of these assets, Glue helps ensure that SageMaker training jobs receive high-quality, structured inputs, effectively bridging the gap between raw storage and predictive modeling.
This lesson covers the full Glue ETL workflow for ML data engineering, including the Data Catalog and crawlers for automated schema discovery; PySpark-based transformation logic for deduplication and outlier handling; AWS Glue Data Quality for pretraining validation; and integration patterns with S3 and Redshift that feed SageMaker training pipelines. One optimization worth noting early is writing Glue ETL output in Parquet format with Snappy compression, which can significantly improve I/O performance and reduce storage costs for downstream SageMaker jobs.
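The deduplication and outlier-handling steps mentioned above can be sketched in plain Python on a small sample. This is a local illustration, not Glue code: in an actual Glue job the same logic would run on Spark DataFrames at scale, but the Tukey IQR-fence arithmetic is identical. The field names (`id`, `amount`) and the sample values are hypothetical.

```python
import statistics

def deduplicate(records, key_fields):
    """Keep the first record seen for each unique key tuple."""
    seen = {}
    for rec in records:
        seen.setdefault(tuple(rec[f] for f in key_fields), rec)
    return list(seen.values())

def filter_outliers(records, field, k=1.5):
    """Drop records whose `field` falls outside Tukey's IQR fences."""
    values = [rec[field] for rec in records]
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartiles of the column
    lo, hi = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [rec for rec in records if lo <= rec[field] <= hi]

# Hypothetical sample: one exact duplicate and one extreme outlier.
rows = [{"id": i, "amount": a} for i, a in enumerate(
    [9.0, 9.5, 10.0, 10.5, 11.0, 11.2, 11.5, 12.0, 12.5, 13.0, 500.0])]
rows.append({"id": 0, "amount": 9.0})  # duplicate of the first row

deduped = deduplicate(rows, ["id", "amount"])  # 12 -> 11 records
clean = filter_outliers(deduped, "amount")     # drops the 500.0 outlier
```

In a Glue PySpark job, these steps map onto the Spark DataFrame API, e.g. `df.dropDuplicates()` for deduplication and a `filter()` over the computed IQR bounds, with the work distributed across executors rather than run in a single process.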
Understanding how Glue discovers and organizes metadata is the first step in building any programmatic ETL pipeline.
Schema discovery with the Data Catalog and crawlers
The AWS Glue Data Catalog is a centralized metadata repository that stores table definitions, schemas, and partition information for datasets in S3, Redshift, and other sources. Rather than maintaining these definitions by hand, you can let crawlers scan your data and populate the catalog automatically.
How crawlers populate the Data Catalog
The crawler workflow follows a predictable sequence that reduces manual metadata management overhead.
Source configuration: You point the crawler at a data source path, such as an S3 bucket ...