
Programmatic ETL with AWS Glue

Explore how to build scalable ETL pipelines with AWS Glue to prepare data for machine learning on AWS. Understand schema discovery with the Data Catalog, apply PySpark transformation logic like deduplication and outlier handling, and validate data quality before training. Learn integration patterns with S3 and Redshift to optimize inputs for SageMaker training jobs.

AWS Glue serves as a primary engine for converting raw data in S3 or Redshift into ML-ready datasets. As a fully managed, serverless ETL service built on Apache Spark, Glue provides the distributed processing power required to clean, validate, and restructure data at scale. By automating the discovery and transformation of these assets, Glue helps ensure that SageMaker training jobs receive high-quality, structured inputs, effectively bridging the gap between raw storage and predictive modeling.

This lesson covers the full Glue ETL workflow for ML data engineering, including the Data Catalog and crawlers for automated schema discovery; PySpark-based transformation logic for deduplication and outlier handling; AWS Glue Data Quality for pretraining validation; and integration patterns with S3 and Redshift that feed SageMaker training pipelines. One optimization worth noting early is writing Glue ETL output in Parquet format with Snappy compression, which can significantly improve I/O performance and reduce storage costs for downstream SageMaker jobs.
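To make that optimization concrete, here is a minimal sketch of a Glue PySpark job that reads a cataloged table and writes the result to S3 as Snappy-compressed Parquet. The database, table, and bucket names below are placeholders, not values from this lesson.

```python
import sys
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the source table registered in the Data Catalog (placeholder names).
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="ml_raw", table_name="raw_events"
)

# Write ML-ready output as Parquet with Snappy compression for efficient
# downstream reads by SageMaker training jobs.
glue_context.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": "s3://example-bucket/curated/events/"},
    format="parquet",
    format_options={"compression": "snappy"},
)

job.commit()
```

Snappy is the usual default codec for Parquet in Spark, so the explicit format option mainly documents intent and guards against configuration drift.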

Understanding how Glue discovers and organizes metadata is the first step in building any programmatic ETL pipeline.

Schema discovery with the Data Catalog and crawlers

The AWS Glue Data Catalog, a centralized, persistent metadata repository that stores table definitions, schema information, and partition metadata for datasets across S3, Redshift, and RDS, serves as the single source of truth for all dataset metadata in a Glue-based pipeline. Rather than manually defining table schemas, engineers rely on automated agents called crawlers that scan data sources, infer schemas using built-in classifiers for formats such as JSON, CSV, Parquet, and Avro, and then populate the Data Catalog with structured table metadata.
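As a rough sketch of how this is wired up programmatically, the boto3 calls below define and start a crawler against an S3 prefix. The crawler name, IAM role ARN, database, table prefix, and path are illustrative placeholders, not values prescribed by this lesson.

```python
import boto3

glue = boto3.client("glue")

# Define a crawler that scans an S3 prefix and writes inferred table
# metadata into a Data Catalog database (all names are placeholders).
glue.create_crawler(
    Name="raw-events-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="ml_raw",
    Targets={"S3Targets": [{"Path": "s3://example-bucket/raw/events/"}]},
    TablePrefix="raw_",
    SchemaChangePolicy={
        "UpdateBehavior": "UPDATE_IN_DATABASE",
        "DeleteBehavior": "LOG",
    },
)

# Run the crawler; once it completes, the inferred tables appear in the
# Data Catalog and can be referenced by Glue ETL jobs.
glue.start_crawler(Name="raw-events-crawler")
```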

How crawlers populate the Data Catalog

The crawler workflow follows a predictable sequence that reduces manual metadata management overhead.

  • Source configuration: You point the crawler at a data source path, such as an S3 bucket ...