AWS Programmable Batch Ingestion Services
Programmable batch ingestion in AWS lets data engineers manage extraction, transformation, and loading (ETL) at scale through four key services: AWS Glue, Amazon Redshift, Amazon EMR, and AWS Lambda. AWS Glue is the preferred choice for serverless ETL, with schema discovery and incremental processing built in. Amazon Redshift handles bulk loading and unloading through its COPY and UNLOAD commands, EMR is suited to custom distributed frameworks, and AWS Lambda processes small files in near real time in response to events. Each service has a distinct role, and the shared optimization principles emphasize columnar formats and appropriately sized files.
Programmable batch ingestion enables data engineers to move beyond drag-and-drop interfaces and take full control of how data is extracted, transformed, and loaded at scale. This lesson focuses on four programmable services that let you write transformation code, tune execution parameters, and control every stage of the ingestion pipeline. These services (AWS Glue, Amazon Redshift, Amazon EMR, and AWS Lambda) appear repeatedly on the AWS Certified Data Engineer – Associate exam, and understanding when to select each one is a critical skill. All four services read from or write to S3, reinforcing the data lake landing-zone pattern.
The guiding exam principles are straightforward:
Glue is the AWS-preferred default for batch ETL with schema evolution,
Redshift COPY handles bulk warehouse loading,
EMR is reserved for custom distributed framework requirements, and
Lambda handles lightweight, event-driven file processing under its 15-minute execution ceiling.
AWS Glue for batch ETL ingestion
AWS Glue is a fully serverless ETL service that eliminates infrastructure management while giving engineers fine-grained control over data transformation. It sits at the center of most AWS batch ingestion architectures because it combines schema discovery, transformation execution, and incremental processing into a single, managed platform.
Glue crawlers and the Data Catalog
A Glue Crawler scans data sources, such as S3 prefixes or JDBC-connected databases, infers the underlying data schema, and registers table definitions in the AWS Glue Data Catalog. The Data Catalog functions as a centralized metadata repository that downstream services, including Amazon Athena, Redshift Spectrum, and Amazon EMR, query to understand table structures, column types, and partition layouts. When source schemas evolve over time (new columns added, data types changed), crawlers detect these changes and automatically update the catalog, making Glue the AWS-preferred choice for environments with frequent schema evolution.
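To make this concrete, here is a minimal boto3 sketch that registers and starts a crawler against an S3 prefix. The crawler name, database name, bucket path, and IAM role ARN are all hypothetical placeholders:

```python
import boto3

glue = boto3.client("glue")

# Hypothetical crawler that scans an S3 prefix and registers
# the resulting table in the "sales_db" Data Catalog database.
glue.create_crawler(
    Name="orders-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # assumed IAM role
    DatabaseName="sales_db",
    Targets={"S3Targets": [{"Path": "s3://example-bucket/raw/orders/"}]},
    SchemaChangePolicy={
        "UpdateBehavior": "UPDATE_IN_DATABASE",  # rewrite catalog entries when the schema evolves
        "DeleteBehavior": "LOG",
    },
)
glue.start_crawler(Name="orders-crawler")
```

The UPDATE_IN_DATABASE behavior is what allows the crawler to refresh the catalog entry automatically when new columns appear in the source data.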
ETL jobs, DPUs, and job bookmarks
Glue ETL Jobs execute PySpark or Python Shell scripts that read from sources registered in the Data Catalog, apply transformations such as filtering, joining, and format conversion to Parquet, and write results to target locations. The compute power behind these jobs is measured in Data Processing Units (DPUs); each DPU provides 4 vCPUs and 16 GB of memory, and you scale a job by adjusting its DPU allocation. Job bookmarks track which data a job has already processed, so each subsequent run picks up only new files or rows, providing the incremental processing highlighted in the exam principles above.
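Below is a minimal PySpark Glue job sketch showing these pieces together: it reads a catalog table, filters rows, and writes Parquet to S3. The database, table, and bucket names are assumed for illustration; the transformation_ctx arguments and the final job.commit() are what wire the job into the bookmark mechanism:

```python
import sys
from awsglue.transforms import Filter
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

# Standard Glue job boilerplate: resolve arguments and initialize contexts
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read from a table registered by a crawler; transformation_ctx
# lets the job bookmark track what this step has already consumed
src = glueContext.create_dynamic_frame.from_catalog(
    database="sales_db",      # hypothetical catalog database
    table_name="raw_orders",  # hypothetical table
    transformation_ctx="src",
)

# Example transformation: drop rows with a missing amount
filtered = Filter.apply(frame=src, f=lambda row: row["amount"] is not None)

# Convert to Parquet on write, per the columnar-format optimization principle
glueContext.write_dynamic_frame.from_options(
    frame=filtered,
    connection_type="s3",
    connection_options={"path": "s3://example-bucket/curated/orders/"},
    format="parquet",
    transformation_ctx="sink",
)

job.commit()  # persists bookmark state so the next run processes only new data
```

Omitting job.commit() (or running without bookmarks enabled) causes the job to reprocess the full source on every run, which is a common cause of duplicate data in downstream targets.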