AWS Glue DataBrew
Learn about the key features of AWS Glue DataBrew, including how it can be used to profile and transform data.
AWS Glue DataBrew is a visual data preparation tool that makes it easier to clean and normalize data before it is used for analytics and machine learning.
Launched in 2020, Glue DataBrew includes over 250 prebuilt transformations that simplify data preparation tasks, such as removing null values and fixing inconsistencies.
Glue DataBrew also has features for evaluating data quality, including detecting data anomalies.
Glue DataBrew can read input data from file uploads, Amazon S3, and other data sources, and it writes its output to S3.
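DataBrew resources can also be managed programmatically. Below is a minimal boto3 sketch, assuming a hypothetical bucket, object key, and dataset name, that registers a CSV file stored in S3 as a DataBrew dataset:

# Minimal sketch: register an S3 object as a DataBrew dataset.
# The dataset name, bucket, and key below are placeholders.
import boto3

databrew = boto3.client("databrew")

response = databrew.create_dataset(
    Name="baby-names-2020",                        # hypothetical dataset name
    Format="CSV",
    Input={
        "S3InputDefinition": {
            "Bucket": "my-databrew-input-bucket",  # placeholder bucket
            "Key": "raw/baby_names_2020.csv",      # placeholder object key
        }
    },
)
print(response["Name"])                            # name of the registered dataset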
Data profiling features
Understanding new or unfamiliar datasets is a common challenge for data analysts and data scientists. Because data is often collected by other people or processes, it can take time to interpret it correctly. When data isn’t clearly defined or well organized, it becomes harder to draw valid and useful insights from it.
AWS Glue DataBrew offers some features that can assist in understanding datasets. Let’s look into these features using a sample dataset.
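One of these features is the profile job, which scans a dataset and writes a statistical report to S3. Before the console walkthrough, here is a hedged boto3 sketch of the same idea; the job name, dataset name, output bucket, and role ARN are all placeholders:

# Sketch: create and start a profile job for an existing DataBrew dataset.
import boto3

databrew = boto3.client("databrew")

databrew.create_profile_job(
    Name="baby-names-profile",                     # hypothetical job name
    DatasetName="baby-names-2020",                 # dataset registered earlier
    RoleArn="arn:aws:iam::123456789012:role/AWSGlueDataBrewServiceRole-DataBrewRole",
    OutputLocation={
        "Bucket": "my-databrew-output-bucket",     # placeholder bucket for the profile report
        "Key": "profiles/",
    },
)

run = databrew.start_job_run(Name="baby-names-profile")
print(run["RunId"])  # the profile report lands under the output location when the run completes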
Selecting a dataset
From the AWS Glue DataBrew section of the AWS Console, we click the “Create sample project” button. A pop-up appears that lets us choose a sample dataset and an IAM role.
For simplicity, we choose the “Popular names for babies in 2020” sample dataset. The dataset description indicates that the names are from the year 2020 and that the data records the number of occurrences of each name.
If we don’t already have an IAM role that we use with DataBrew, we can create a new one through this pop-up. For example, if we enter “DataBrewRole” as the suffix, DataBrew creates a new role named “AWSGlueDataBrewServiceRole-DataBrewRole” that we can reuse in the future.
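The console handles this role creation for us, but the equivalent can be scripted with the IAM API. The sketch below creates a role that the DataBrew service can assume, with a deliberately minimal read-only inline policy on a placeholder bucket; the console-created role is granted broader permissions than this:

# Sketch: create a role DataBrew can assume, with minimal S3 read access.
# The role name, policy name, and bucket are placeholders.
import json
import boto3

iam = boto3.client("iam")

trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "databrew.amazonaws.com"},  # DataBrew service principal
        "Action": "sts:AssumeRole",
    }],
}

iam.create_role(
    RoleName="AWSGlueDataBrewServiceRole-DataBrewRole",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

iam.put_role_policy(
    RoleName="AWSGlueDataBrewServiceRole-DataBrewRole",
    PolicyName="DataBrewS3ReadAccess",
    PolicyDocument=json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::my-databrew-input-bucket",      # placeholder bucket
                "arn:aws:s3:::my-databrew-input-bucket/*",
            ],
        }],
    }),
)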
We click “Create project” and wait for DataBrew to set up the project session. (The first 40 interactive sessions are free for first-time users of DataBrew.)
Interpreting the dataset
The AWS Glue DataBrew user interface can feel overwhelming. Let’s try to interpret our selected dataset through the “Projects” interface.
Scrolling horizontally in the default project view shows the 5 columns in our dataset: “count,” “gender,” “id,” “name,” and “year.”
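If we download a copy of the file, a few lines of pandas reproduce a similar first look at these columns; the local file path below is a placeholder:

# Rough equivalent of the first-look inspection DataBrew gives us, using pandas.
import pandas as pd

df = pd.read_csv("baby_names_2020.csv")   # placeholder path to a local copy of the sample data

print(df.columns.tolist())        # expect: count, gender, id, name, year
print(df.dtypes)                  # data type of each column
print(df["year"].unique())        # should contain only 2020 if the description holds
print(df["gender"].value_counts())
print(df[["count"]].describe())   # distribution of name occurrence counts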
We make these initial observations: ...