ETL Example—Extraction

Understand how extraction works in the ETL pipeline with the help of an example.

ETL pipelines should only be as complicated as they need to be. It's easy to get swept away by the latest industry trends and feel overwhelmed with new software and tools. When building an ETL pipeline, we’ll need to choose the most appropriate tool for each step. But not all pipelines have to be complex.

To demonstrate this, let’s build an entire ETL pipeline from scratch using the shell scripting language Bash. This gives us a concrete example of an ETL pipeline and shows that even a simple tool like Bash can do the job efficiently.

Bash is a Unix shell and scripting language first released in 1989. Its name stands for “Bourne Again SHell.” It lets users interact with the operating system from the command line. Although it’s rarely the main tool in an ETL process, it’s a valuable one that performs surprisingly well when processing batches of data. Let’s look at how we can build an ETL pipeline using Bash.

ETL example: Transferring lottery data

In this example, we’ll build an ETL pipeline to transfer data about past winning lottery numbers. We need to extract raw data from an external source, transform and clean it, and load it to a PostgreSQL database for later analysis.

Extracting data

The first step is to extract the data. Normally, we would extract data from the company's internal systems, such as databases, APIs, data warehouses, and cloud services. Occasionally, however, we need to pull data from an external source, as we do in this example.

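We extract the raw data on winning lottery numbers from a CSV file hosted in a GitHub repository. curl is a separate command-line tool rather than part of Bash itself, but we can call it from a Bash script to download a file from a URL: the curl -o <FileName> <URL> command saves the response to the named file. We begin the project by creating a file called extract_data.sh.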
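As a rough illustration, extract_data.sh might look something like the minimal sketch below. The lesson doesn't give the actual CSV URL here, so the script reads it from a SOURCE_URL environment variable. That variable, the OUTPUT_FILE variable, the default file name lottery_raw.csv, and the extract_data function name are all assumptions made for this example. The extra curl flags (-f, -sS, -L) are optional additions to the basic curl -o form.

```shell
#!/usr/bin/env bash
# extract_data.sh: download the raw lottery CSV to a local file (illustrative sketch).
set -eu

# Assumption: the raw CSV URL is supplied by the caller; it is not hardcoded here.
SOURCE_URL="${SOURCE_URL:-}"
# Assumption: hypothetical default name for the downloaded file.
OUTPUT_FILE="${OUTPUT_FILE:-lottery_raw.csv}"

extract_data() {
  local url="$1"
  local output="$2"

  # -f  : return a non-zero exit code on HTTP errors instead of saving an error page
  # -sS : hide the progress meter but still print error messages
  # -L  : follow redirects
  # -o  : write the response to the given file
  curl -fsSL -o "$output" "$url"

  # Sanity check: the downloaded file must exist and be non-empty.
  [ -s "$output" ]
}

# Only run the download when a URL has been provided.
if [ -n "$SOURCE_URL" ]; then
  extract_data "$SOURCE_URL" "$OUTPUT_FILE"
  echo "Saved raw data to $OUTPUT_FILE"
fi
```

We could then run it as SOURCE_URL=<URL> bash extract_data.sh. Because of set -e, a failed download stops the script, so later pipeline steps never run on a missing or empty file.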
