AWS Glue

Understand the key features of AWS Glue, including how to use AWS Glue Studio to set up and run an ETL job.

AWS Glue is a platform for extract, transform, and load (ETL) workflows. As the name implies, data is extracted from a source and can be transformed before it’s loaded into a new location.

AWS Glue is a tool that can move data from a source to a target location
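
The three ETL stages can be sketched in plain Python. This is only an illustration of the pattern, not AWS Glue code: the function names, the sample CSV, and the JSON Lines target are all assumptions made for the example.

```python
import csv
import io
import json


def extract(csv_text):
    """Extract: read rows from the source (here, CSV text) as dictionaries."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def transform(rows):
    """Transform: tidy customer names and convert amounts from strings to numbers."""
    return [
        {"customer": row["customer"].strip().title(), "amount": float(row["amount"])}
        for row in rows
    ]


def load(rows):
    """Load: write the transformed rows to the target (here, JSON Lines text)."""
    return "\n".join(json.dumps(row) for row in rows)


# Hypothetical source data with inconsistent formatting.
source = "customer,amount\n alice ,10.50\nBOB,3\n"
target = load(transform(extract(source)))
print(target)
```

In a real AWS Glue job, the source and target would typically be data stores such as Amazon S3, and the transform step would run on a managed engine rather than in a local script.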

Launched in 2017, AWS Glue is designed to be a serverless platform where computing power is used (and billed) only when actively required. Behind the scenes, AWS Glue uses Apache Spark, an open-source engine for processing large data volumes.

AWS Glue Data Catalog

AWS Glue and AWS Lake Formation use a data catalog to understand the structure of an existing data source. The catalog holds metadata: not the data itself, but descriptions of the data, including the schema. Another way to describe a data catalog is as a set of references to data in other locations.

Analogous to a library catalog, the AWS Glue Data Catalog is a reference to data stored in other locations
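
To make the "metadata, not data" idea concrete, here is a minimal sketch of what a catalog entry conceptually holds. The `CatalogTable` and `Column` classes, the table name, and the bucket path are hypothetical and are not the AWS Glue API; they only show that an entry stores a schema plus a pointer to where the data lives.

```python
from dataclasses import dataclass, field


@dataclass
class Column:
    name: str
    type: str


@dataclass
class CatalogTable:
    """Hypothetical catalog entry: a schema and a reference, but no rows."""
    name: str
    location: str  # where the data actually lives
    columns: list = field(default_factory=list)


orders = CatalogTable(
    name="orders",
    location="s3://example-bucket/orders/",  # hypothetical bucket
    columns=[Column("order_id", "bigint"), Column("total", "double")],
)


def describe(table):
    """Summarize the schema without reading any of the underlying data."""
    return ", ".join(f"{c.name} {c.type}" for c in table.columns)


print(describe(orders))
```

Deleting such an entry would remove only the description; the files at the referenced location would be unaffected.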

If you’ve previously set up a data lake using AWS Lake Formation, you might have noticed that some of the Lake Formation functionality relies on AWS Glue. This is because Lake Formation and AWS Glue share the same data catalog.

Features that involve the AWS Glue Data Catalog include:

  • Creating a database in the data catalog that references data stored elsewhere (such as Amazon S3).

  • Creating a crawler that fills the data catalog with the schema for a table (which references data that’s stored elsewhere).

  • Running a crawler on demand or on a schedule, and adding optional classifiers to the crawler so that it can understand more types of data.

    • AWS Glue has built-in classifiers to understand CSV, JSON, XML, and common relational database management systems.
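
The features above could be scripted roughly as follows with boto3, the AWS SDK for Python. This is a sketch under stated assumptions: the database name, crawler name, bucket path, role ARN, and schedule are all hypothetical, and the `apply` helper is defined but not called because it needs boto3, AWS credentials, and an IAM role that AWS Glue can assume. The request builders are plain functions so that the parameters can be inspected without touching AWS.

```python
def database_request(name):
    """Keyword arguments for the Glue client's create_database call."""
    return {"DatabaseInput": {"Name": name}}


def crawler_request(name, role_arn, database, s3_path, schedule=None, classifiers=None):
    """Keyword arguments for the Glue client's create_crawler call."""
    request = {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }
    if schedule:
        request["Schedule"] = schedule  # omit to run the crawler on demand only
    if classifiers:
        request["Classifiers"] = list(classifiers)  # names of custom classifiers
    return request


def apply(db_request, crawler_req, run_now=False):
    """Send the requests to AWS. Requires boto3 and credentials; not called here."""
    import boto3

    glue = boto3.client("glue")
    glue.create_database(**db_request)
    glue.create_crawler(**crawler_req)
    if run_now:
        glue.start_crawler(Name=crawler_req["Name"])


# All values below are hypothetical placeholders.
db = database_request("sales_db")
crawler = crawler_request(
    name="sales-crawler",
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueCrawlerRole",
    database="sales_db",
    s3_path="s3://example-bucket/sales/",
    schedule="cron(0 2 * * ? *)",  # daily at 02:00 UTC
)
print(crawler)
```

Leaving out `schedule` yields a crawler that runs only when started explicitly, which matches the on-demand option described in the list.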

Those familiar with MySQL-compatible databases might know that metadata can be stored within the same database system. For example, the SQL commands SHOW TABLES ...
