Programmatic ETL with AWS Glue
Explore how to build scalable ETL pipelines with AWS Glue to prepare data for machine learning on AWS. Understand schema discovery with the Data Catalog, apply PySpark transformation logic like deduplication and outlier handling, and validate data quality before training. Learn integration patterns with S3 and Redshift to optimize inputs for SageMaker training jobs.
We'll cover the following...
- Schema discovery with the Data Catalog and crawlers
- Transformation logic in Glue ETL jobs
- AWS Glue Data Quality for validation before training
- Integrating AWS Glue with S3 and Redshift
AWS Glue serves as a primary engine for converting raw data in S3 or Redshift into ML-ready datasets. As a fully managed, serverless ETL service built on Apache Spark, Glue provides the distributed processing power required to clean, validate, and restructure data at scale. By automating the discovery and transformation of these assets, Glue helps ensure that SageMaker training jobs receive high-quality, structured inputs, effectively bridging the gap between raw storage and predictive modeling.
This lesson covers the full Glue ETL workflow for ML data engineering, including the Data Catalog and crawlers for automated schema discovery; PySpark-based transformation logic for deduplication and outlier handling; AWS Glue Data Quality for pretraining validation; and integration patterns with S3 and Redshift that feed SageMaker training pipelines. One optimization worth noting early is writing Glue ETL output in Parquet format with Snappy compression, which can significantly improve I/O performance and reduce storage costs for downstream SageMaker jobs.
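The deduplication and outlier-handling steps mentioned above can be sketched in plain Python on a small sample. This is a local illustration, not Glue code: in an actual Glue job the same logic would run on Spark DataFrames at scale, but the Tukey IQR-fence arithmetic is identical. The field names (`id`, `amount`) and the sample values are hypothetical.

```python
import statistics

def deduplicate(records, key_fields):
    """Keep the first record seen for each unique key tuple."""
    seen = {}
    for rec in records:
        seen.setdefault(tuple(rec[f] for f in key_fields), rec)
    return list(seen.values())

def filter_outliers(records, field, k=1.5):
    """Drop records whose `field` falls outside Tukey's IQR fences."""
    values = [rec[field] for rec in records]
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartiles of the column
    lo, hi = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [rec for rec in records if lo <= rec[field] <= hi]

# Hypothetical sample: one exact duplicate and one extreme outlier.
rows = [{"id": i, "amount": a} for i, a in enumerate(
    [9.0, 9.5, 10.0, 10.5, 11.0, 11.2, 11.5, 12.0, 12.5, 13.0, 500.0])]
rows.append({"id": 0, "amount": 9.0})  # duplicate of the first row

deduped = deduplicate(rows, ["id", "amount"])  # 12 -> 11 records
clean = filter_outliers(deduped, "amount")     # drops the 500.0 outlier
```

In a Glue PySpark job, these steps map onto the Spark DataFrame API, e.g. `df.dropDuplicates()` for deduplication and a `filter()` over the computed IQR bounds, with the work distributed across executors rather than run in a single process.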
Understanding how Glue discovers and organizes metadata is the first step in building any programmatic ETL pipeline.
Schema discovery with the Data Catalog and crawlers
The AWS Glue Data Catalog is a centralized metadata repository that stores table definitions, schemas, and partition information for datasets in S3, Redshift, and other sources. Rather than maintaining these definitions by hand, you can let crawlers scan your data and populate the catalog automatically.
How crawlers populate the Data Catalog
The crawler workflow follows a predictable sequence that reduces manual metadata management overhead.
Source configuration: You point the crawler at a data source path, such as an S3 bucket ...