Pandas Cheat Sheet
Discover how to effectively use the pandas library to manipulate, clean, and analyze data in Python. Learn key concepts including data structures, file operations, selection, modifying data, handling missing values, aggregations, and combining datasets. Gain practical skills to preprocess and prepare data for machine learning workflows.
Pandas is one of the most popular tools for data analysis in Python. This open-source library is the backbone of many data projects and is used for data cleaning and data manipulation.
With pandas, you gain greater control over complex datasets. It’s an essential tool in the data analysis tool belt. If you’re not using pandas, you’re not making the most of your data.
In this lesson, we’ll explore a quick guide to the most essential operations and commands.
Getting started with pandas
Pandas is a powerful Python library that makes data manipulation, analysis, and preprocessing intuitive and efficient. It provides flexible data structures such as Series and DataFrames, which are essential for handling structured data in Python.
Pandas import convention
Import the pandas library using the conventional alias pd for data manipulation.
import pandas as pd
We can install pandas using pip:
$ pip install pandas
Create and name a Series
A pandas series is a one-dimensional array that can hold any data type. We can create it using pd.Series() by passing a list of values, and optionally assign custom indices for each element.
# Create a Series with custom indicesseries1 = pd.Series([1,2,3,4], index=['a','b','c','d'])
Assign a descriptive name to a Series for clarity.
# Assign a meaningful name to the Seriesseries1.name = "Monthly Sales"
Series name helps identify what the data represents when exporting, plotting, or combining with other data.
Next, name the index of a Series for better organization.
series1.index.name = "Product ID"
Index name makes it easier to understand and reference elements, especially when working with multiple Series.
Create a DataFrame
A DataFrame is a 2D table-like structure in pandas, composed of rows and columns. By default, pandas creates numeric column names 0, 1, … and row indices 0, 1, ….
# Create DataFrame from a 2D listdata = [[10, 20], [30, 40], [50, 60]]df = pd.DataFrame(data)# Output# 0 1# 0 10 20# 1 30 40# 2 50 60
DataFrames are the core data structure for preprocessing features, analyzing patterns, and feeding data to ML models.
Reading and writing data
Pandas supports a wide range of file formats, making it versatile for different data sources. Learning how to efficiently read and write data ensures our workflows are reproducible and ready for analysis or modeling.
Read and write to CSV file
CSV files are commonly used to store datasets. Efficiently reading and writing CSVs allows smooth preprocessing, experimentation, and sharing of data.
We can specify a column to use as the DataFrame index when reading a CSV. The first column is numbered 0, but we can also use column names. If no index is specified, pandas will create a default numeric index starting from 0.
import pandas as pd# Read CSV using full file pathdata = pd.read_csv('/path/to/file.csv') # Specify full path to file# Read CSV without specifying an index (default numeric index will be created)data = pd.read_csv('file.csv') # File is in current working directory# Read CSV and use the first column as the indexdata = pd.read_csv('data.csv', index_col=0)# Read CSV and use a column by name as the indexdata = pd.read_csv('data.csv', index_col='ID')# Write DataFrame to CSV (exclude index if not needed)data.to_csv('output.csv', index=False)
index_col=0: Uses the first column as the index.index_col='ID': Uses the column named'ID'as the index.If no index is specified, pandas generates a default numeric index
0, 1, 2….
This ensures our data is loaded with the correct indexing for downstream analysis and can be saved back to CSV efficiently.
Read and write to an Excel file
Call the read_excel function to access an Excel file. Pass the name of the Excel file as an argument. Use to_excel to save a DataFrame back to an Excel file, optionally specifying the sheet name.
# Read an Excel file into a DataFramedf = pd.read_excel('file.xlsx')# Write the DataFrame to an Excel file, specifying the sheet namedf.to_excel('dir/myDataFrame.xlsx', sheet_name='Sheet2')
Excel files are often used for ...