AWS Glue DataBrew
Learn about the key features of AWS Glue DataBrew, including how it can be used to profile and transform data.
AWS Glue DataBrew is a visual data preparation tool that makes it easier to clean and normalize data before it is used for analytics and machine learning.
Launched in 2020, Glue DataBrew includes over 250 prebuilt transformations that simplify data preparation tasks, such as removing null values and fixing inconsistencies.
Glue DataBrew also has features for evaluating data quality, including detecting data anomalies.
Glue DataBrew can read input data from file uploads, Amazon S3, and other data sources, and it writes its output to S3.
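DataBrew resources can also be managed programmatically. Below is a minimal boto3 sketch, assuming a hypothetical bucket, object key, and dataset name, that registers a CSV file stored in S3 as a DataBrew dataset:

# Minimal sketch: register an S3 object as a DataBrew dataset.
# The dataset name, bucket, and key below are placeholders.
import boto3

databrew = boto3.client("databrew")

response = databrew.create_dataset(
    Name="baby-names-2020",                        # hypothetical dataset name
    Format="CSV",
    Input={
        "S3InputDefinition": {
            "Bucket": "my-databrew-input-bucket",  # placeholder bucket
            "Key": "raw/baby_names_2020.csv",      # placeholder object key
        }
    },
)
print(response["Name"])                            # name of the registered dataset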
Data profiling features
Understanding new or unfamiliar datasets is a common challenge for data analysts and data scientists. Because data is often collected by other people or processes, it can take time to interpret it correctly. When data isn’t clearly defined or well organized, it becomes harder to draw valid and useful insights from it.
AWS Glue DataBrew offers some features that can assist in understanding datasets. Let’s look into these features using a sample dataset.
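One of these features is the profile job, which scans a dataset and writes a statistical report to S3. Before the console walkthrough, here is a hedged boto3 sketch of the same idea; the job name, dataset name, output bucket, and role ARN are all placeholders:

# Sketch: create and start a profile job for an existing DataBrew dataset.
import boto3

databrew = boto3.client("databrew")

databrew.create_profile_job(
    Name="baby-names-profile",                     # hypothetical job name
    DatasetName="baby-names-2020",                 # dataset registered earlier
    RoleArn="arn:aws:iam::123456789012:role/AWSGlueDataBrewServiceRole-DataBrewRole",
    OutputLocation={
        "Bucket": "my-databrew-output-bucket",     # placeholder bucket for the profile report
        "Key": "profiles/",
    },
)

run = databrew.start_job_run(Name="baby-names-profile")
print(run["RunId"])  # the profile report lands under the output location when the run completes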
Selecting a dataset
From the AWS Glue DataBrew section of the AWS Console, we click the “Create sample project” button. A pop-up appears that lets us choose a sample dataset and an IAM role.
For simplicity, we choose the “Popular names for babies in 2020” sample dataset. The dataset description indicates that the names are from the year 2020 and that the data records the number of occurrences of each name.
If we don’t already have an IAM role that we use with DataBrew, we can create a new one through this pop-up. For example, if we enter “DataBrewRole” as the suffix, DataBrew creates a new role named “AWSGlueDataBrewServiceRole-DataBrewRole” that we can reuse in the future.
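The console handles this role creation for us, but the equivalent can be scripted with the IAM API. The sketch below creates a role that the DataBrew service can assume, with a deliberately minimal read-only inline policy on a placeholder bucket; the console-created role is granted broader permissions than this:

# Sketch: create a role DataBrew can assume, with minimal S3 read access.
# The role name, policy name, and bucket are placeholders.
import json
import boto3

iam = boto3.client("iam")

trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "databrew.amazonaws.com"},  # DataBrew service principal
        "Action": "sts:AssumeRole",
    }],
}

iam.create_role(
    RoleName="AWSGlueDataBrewServiceRole-DataBrewRole",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

iam.put_role_policy(
    RoleName="AWSGlueDataBrewServiceRole-DataBrewRole",
    PolicyName="DataBrewS3ReadAccess",
    PolicyDocument=json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::my-databrew-input-bucket",      # placeholder bucket
                "arn:aws:s3:::my-databrew-input-bucket/*",
            ],
        }],
    }),
)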
We click “Create project” and wait for DataBrew to set up the project session. (The first 40 interactive sessions are free for first-time users of DataBrew.)
Interpreting the dataset
The AWS Glue DataBrew user interface can feel overwhelming. Let’s try to interpret our selected dataset through the “Projects” interface.
Scrolling horizontally in the default project view shows the 5 columns in our dataset: “count,” “gender,” “id,” “name,” and “year.”
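If we download a copy of the file, a few lines of pandas reproduce a similar first look at these columns; the local file path below is a placeholder:

# Rough equivalent of the first-look inspection DataBrew gives us, using pandas.
import pandas as pd

df = pd.read_csv("baby_names_2020.csv")   # placeholder path to a local copy of the sample data

print(df.columns.tolist())        # expect: count, gender, id, name, year
print(df.dtypes)                  # data type of each column
print(df["year"].unique())        # should contain only 2020 if the description holds
print(df["gender"].value_counts())
print(df[["count"]].describe())   # distribution of name occurrence counts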
We make these initial observations: ...