Meet the Data
Learn how to fetch data from files, APIs, and the web using Python.
Retrieving data is the first and most critical step in any data engineering workflow. Whether you are pulling a file from local storage, connecting to an external API, or scraping content from the web, everything starts with getting the data in. Before a data engineer can build robust systems, they must be able to fetch clean, relevant, structured data. This is where the real work begins, and reliable engineering makes all the difference.
Why does data fetching matter?
Fetching data isn’t just a “first step”; it’s a foundational skill. Data doesn’t always arrive neatly packaged and waiting in a database. Sometimes you pull logs from a server, read raw files, call APIs, or scrape a site for details. The more comfortable you are fetching data from different places, the more flexible and powerful your systems will be.
In this lesson, we’ll explore three common sources of data and how to fetch each using Python:
Files (like CSVs and JSON)
APIs (Application Programming Interfaces)
Web pages (using scraping)
Let’s head into the pantry and start collecting what we need.
1. Working with files
Files are one of the most common ways to store and exchange data. We often encounter two popular formats: CSV (Comma-Separated Values) and JSON (JavaScript Object Notation).
A simple way to read data from these files is to use pandas, a powerful Python library built for data manipulation and analysis. At the heart of pandas is the DataFrame, a two-dimensional, table-like structure with rows and columns, similar to a spreadsheet.
With pandas, we can easily load data into a DataFrame. Once the data is in this structure, it can be explored, modified, cleaned, and analyzed using pandas’ rich built-in functions.
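To make the DataFrame idea concrete, here is a minimal sketch that builds one from a plain Python dictionary and inspects its shape and columns. It assumes pandas is installed (`pip install pandas`); the column names and values are made up purely for illustration.

```python
import pandas as pd

# Build a small DataFrame from a dictionary of columns.
# Each key becomes a column name; each list becomes that column's values.
df = pd.DataFrame({
    "item": ["flour", "sugar", "eggs"],
    "quantity": [2, 1, 12],
})

print(df.shape)          # (3, 2): three rows, two columns
print(list(df.columns))  # ['item', 'quantity']
```

Once data is in a DataFrame like this, the same exploration and cleaning methods apply no matter whether it originally came from a CSV file, a JSON file, or an API response.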
Working with CSV files using pandas
CSV (Comma-Separated ...