
Glue Data Quality

Explore how AWS Glue Data Quality helps automate data validation through declarative rulesets, ensuring ML datasets meet key quality dimensions such as completeness, uniqueness, and validity. Understand integration with ETL workflows and SageMaker Pipelines to protect ML models from poor data quality and schema drift, and learn how continuous monitoring and automated alerts maintain dataset reliability over time.

Machine learning models are only as reliable as the data used for training and inference. Incomplete records, duplicate entries, schema mismatches, and anomalous values can silently degrade model accuracy, introduce bias, and cause training failures. For the AWS Certified Machine Learning Engineer – Associate exam, understanding how to prevent these issues through automated data quality validation is essential. Rather than relying on manual inspection, production ML pipelines integrate declarative quality checks directly into ETL workflows, catching problems before data reaches a training job.
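To make these failure modes concrete, the following sketch implements the basic checks a quality pipeline automates, completeness, uniqueness, and validity, in plain Python. This is illustrative only (the record layout and column names are hypothetical); production pipelines express these as managed rules rather than hand-rolled code.

```python
# Hypothetical sample records exhibiting the problems described above.
records = [
    {"customer_id": "C1", "age": 34},
    {"customer_id": "C2", "age": None},  # incomplete record
    {"customer_id": "C1", "age": 34},    # duplicate entry
    {"customer_id": "C3", "age": 260},   # anomalous value
]

def completeness(rows, col):
    """Fraction of rows with a non-null value in col."""
    return sum(r[col] is not None for r in rows) / len(rows)

def uniqueness(rows, col):
    """Fraction of values in col that are distinct."""
    values = [r[col] for r in rows]
    return len(set(values)) / len(values)

def validity(rows, col, lo, hi):
    """Fraction of non-null values in col that fall within [lo, hi]."""
    values = [r[col] for r in rows if r[col] is not None]
    return sum(lo <= v <= hi for v in values) / len(values)

print(completeness(records, "age"))        # → 0.75
print(uniqueness(records, "customer_id"))  # → 0.75
print(validity(records, "age", 0, 120))    # → ~0.667
```

Each function returns a score between 0 and 1; a pipeline would compare these scores against declared thresholds and fail or quarantine the batch when a threshold is violated.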

AWS Glue Data Quality is AWS’s managed service for rule-based, automated data quality validation in Glue ETL and Data Catalog workflows. It enables engineers to define constraints using a declarative, no-code syntax, evaluate datasets against those constraints during ETL execution, and route records based on pass/fail outcomes. This is distinct from SageMaker Data Wrangler, which focuses on interactive data preparation and transformation rather than systematic validation.
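A ruleset in this declarative syntax (DQDL, covered later in this lesson) might look like the following sketch; the column names and thresholds are hypothetical:

```
Rules = [
    IsComplete "customer_id",
    IsUnique "order_id",
    ColumnValues "age" between 0 and 120,
    Completeness "email" > 0.95
]
```

Each rule evaluates to pass or fail against the dataset, and the ETL job can route records or halt based on the aggregate result.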

AWS Glue DataBrew complements AWS Glue Data Quality by providing visual tools for dataset profiling, preparation, and ruleset-based data quality validation within profiling jobs. Engineers can explore datasets, detect anomalies, apply transformations interactively, validate data quality using built-in rules, and generate quality reports. AWS Glue Data Quality, in contrast, focuses on automated, rule-based validation integrated directly into ETL pipelines and AWS Glue Data Catalog workflows for continuous enforcement.

SageMaker Data Wrangler can also be used upstream to prepare features and clean datasets interactively before exporting the resulting data to destinations such as Amazon S3 or SageMaker Feature Store. The exported dataset can then be cataloged and validated using AWS Glue Data Quality for automated, rule-based checks. Understanding these distinctions is important for the AWS Certified Machine Learning Engineer – Associate exam.

This lesson progresses from understanding quality dimensions to implementing Data Quality Definition Language (DQDL) rulesets, integrating checks into pipelines, and establishing continuous monitoring to protect downstream ML models.

Key data quality dimensions

Before defining any validation rules, ML ...