Table Talk: Meet pandas
Learn to master pandas Series and DataFrames for data engineering workflows.
Let’s be real—raw data is messy. It’s like dumping a truckload of unsorted LEGO bricks on your desk. Before you can build anything cool (say, a reporting dashboard or a real-time pipeline), you need to organize the chaos.
That’s where pandas comes in.
While you might associate pandas with data scientists, smart data engineers know that mastering pandas can significantly improve their ability to wrangle, validate, and prepare data before it hits a database or pipeline. Think of pandas as a versatile toolkit for data—it helps you inspect, reshape, clean, and validate datasets fast, especially during the prototyping phase.
In this lesson, we’re going to get hands-on with the two core building blocks of pandas: Series and DataFrames. These are the tools that help transform unruly data into structured formats ready for the next stage of your data pipeline.
What is a Series?
A Series is like a column in a spreadsheet—but with more flexibility. It's a one-dimensional, labeled array that lets you assign meaningful names (indexes) to each item.
import pandas as pd
data = [100, 200, 300, 400]
s = pd.Series(data)
print(s)
This is the simplest way to create a Series: pass in a list of values. Each item in the list becomes a data point, and pandas automatically assigns a default integer index starting from 0 (0, 1, 2, ...). This default is helpful when quickly scanning through unknown data.
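To make the default index concrete, here is a minimal sketch that inspects a Series built from the same list. The variable names (first, last) are illustrative only; index, values, and iloc are standard pandas attributes.

```python
import pandas as pd

data = [100, 200, 300, 400]
s = pd.Series(data)

# The default index is a RangeIndex running from 0 to len(data) - 1.
print(s.index)          # RangeIndex(start=0, stop=4, step=1)

# The underlying values are exposed as a NumPy array.
print(s.values)         # [100 200 300 400]

# Position-based access with iloc; -1 refers to the last element.
first = s.iloc[0]
last = s.iloc[-1]
print(first, last)      # 100 400
```

With the default index, label 0 and position 0 happen to coincide, which is why quick exploratory lookups on unlabeled data feel like indexing a plain list.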
Custom index
Adding your own index is like adding labels on test tubes in a lab. It turns data from generic into useful. We can label the data points with names, IDs, or any other identifiers using the index parameter in Series().
import pandas as pd
s = pd.Series([100, 200, 300, 400], index=['a', 'b', 'c', 'd'])
print(s)
We create another Series using the same data list, but this time we specify custom labels for the index using the index parameter. Instead of default numerical indices, we use the labels 'a', 'b', 'c', and 'd'...
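One payoff of custom labels is that values can be looked up by name rather than by position. The sketch below assumes the same labeled Series as in the lesson; the variable names (value_b, subset, large) are hypothetical and chosen for illustration.

```python
import pandas as pd

s = pd.Series([100, 200, 300, 400], index=['a', 'b', 'c', 'd'])

# Look up a single value by its label.
value_b = s['b']
print(value_b)                  # 200

# Select several labels at once with loc; the result is a new Series.
subset = s.loc[['a', 'd']]
print(subset.tolist())          # [100, 400]

# Labels survive filtering, so each value stays tied to its identifier.
large = s[s > 250]
print(list(large.index))        # ['c', 'd']
```

Because the labels travel with the data through selection and filtering, a value can still be traced back to its identifier after the Series has been reshaped.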