Pandas Cheat Sheet
Discover how to effectively use the pandas library to manipulate, clean, and analyze data in Python. Learn key concepts including data structures, file operations, selection, modifying data, handling missing values, aggregations, and combining datasets. Gain practical skills to preprocess and prepare data for machine learning workflows.
Pandas is one of the most popular tools for data analysis in Python. This open-source library is the backbone of many data projects and is used for data cleaning and data manipulation.
With pandas, you gain greater control over complex datasets. It’s an essential tool in the data analysis tool belt. If you’re not using pandas, you’re not making the most of your data.
In this lesson, we’ll explore a quick guide to the most essential operations and commands.
Getting started with pandas
Pandas is a powerful Python library that makes data manipulation, analysis, and preprocessing intuitive and efficient. It provides flexible data structures such as Series and DataFrames, which are essential for handling structured data in Python.
Pandas import convention
If pandas is not already available, install it with pip:
$ pip install pandas
Then import the library using the conventional alias pd for data manipulation.
import pandas as pd
Create and name a Series
A pandas Series is a one-dimensional labeled array that can hold any data type. We can create it using pd.Series() by passing a list of values, and optionally assign custom indices for each element.
# Create a Series with custom indices
series1 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
Assign a descriptive name to a Series for clarity.
# Assign a meaningful name to the Series
series1.name = "Monthly Sales"
Series name helps identify what the data represents when exporting, plotting, or combining with other data.
Next, name the index of a Series for better organization.
series1.index.name = "Product ID"
Index name makes it easier to understand and reference elements, especially when working with multiple Series.
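Putting these steps together, here is a minimal runnable sketch (the values and names are illustrative):

```python
import pandas as pd

# Build a Series with custom indices, then name the data and the index
sales = pd.Series([120, 95, 180], index=['P01', 'P02', 'P03'])
sales.name = "Monthly Sales"
sales.index.name = "Product ID"

# Both names survive conversion to a DataFrame:
# the column is labeled "Monthly Sales", the index "Product ID"
df = sales.to_frame()
```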
Create a DataFrame
A DataFrame is a 2D table-like structure in pandas, composed of rows and columns. By default, pandas creates numeric column names 0, 1, … and row indices 0, 1, ….
# Create a DataFrame from a 2D list
data = [[10, 20], [30, 40], [50, 60]]
df = pd.DataFrame(data)

# Output:
#     0   1
# 0  10  20
# 1  30  40
# 2  50  60
DataFrames are the core data structure for preprocessing features, analyzing patterns, and feeding data to ML models.
Reading and writing data
Pandas supports a wide range of file formats, making it versatile for different data sources. Learning how to efficiently read and write data ensures our workflows are reproducible and ready for analysis or modeling.
Read and write to CSV file
CSV files are commonly used to store datasets. Efficiently reading and writing CSVs allows smooth preprocessing, experimentation, and sharing of data.
We can specify a column to use as the DataFrame index when reading a CSV. The first column is numbered 0, but we can also use column names. If no index is specified, pandas will create a default numeric index starting from 0.
import pandas as pd

# Read CSV using the full file path
data = pd.read_csv('/path/to/file.csv')

# Read CSV without specifying an index (a default numeric index is created)
data = pd.read_csv('file.csv')  # File is in the current working directory

# Read CSV and use the first column as the index
data = pd.read_csv('data.csv', index_col=0)

# Read CSV and use a column by name as the index
data = pd.read_csv('data.csv', index_col='ID')

# Write DataFrame to CSV (exclude the index if not needed)
data.to_csv('output.csv', index=False)
index_col=0: uses the first column as the index.
index_col='ID': uses the column named 'ID' as the index.
If no index is specified, pandas generates a default numeric index 0, 1, 2….
This ensures our data is loaded with the correct indexing for downstream analysis and can be saved back to CSV efficiently.
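As a quick sanity check, a round-trip sketch that writes to a temporary file (the path and column names are illustrative):

```python
import os
import tempfile

import pandas as pd

# A tiny illustrative DataFrame
df = pd.DataFrame({'ID': [101, 102], 'score': [0.5, 0.9]})

# Write without the numeric index, then read back with 'ID' as the index
path = os.path.join(tempfile.gettempdir(), 'roundtrip.csv')
df.to_csv(path, index=False)
restored = pd.read_csv(path, index_col='ID')
```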
Read and write to an Excel file
Call the read_excel function to access an Excel file. Pass the name of the Excel file as an argument. Use to_excel to save a DataFrame back to an Excel file, optionally specifying the sheet name.
# Read an Excel file into a DataFrame
df = pd.read_excel('file.xlsx')

# Write the DataFrame to an Excel file, specifying the sheet name
df.to_excel('dir/myDataFrame.xlsx', sheet_name='Sheet2')
Excel files are often used for small datasets and manual feature inspection before converting to CSV or DataFrame for ML tasks.
Read and write to SQL Query
Pandas can interact with SQL databases to load or save structured datasets, making it easier to feed data into ML pipelines.
from sqlalchemy import create_engine
import pandas as pd

# Create an in-memory SQLite database
engine = create_engine('sqlite:///:memory:')

# Read data from SQL
pd.read_sql("SELECT * FROM my_table;", engine)
pd.read_sql_table('my_table', engine)
pd.read_sql_query("SELECT * FROM my_table;", engine)

# Write a DataFrame to SQL
df.to_sql('myDf', engine)
Using pandas with SQL enables seamless data fetching, preprocessing, and saving directly from or to databases for ML workflows.
Selecting and accessing data
Selecting and accessing data efficiently is fundamental in pandas. It allows us to inspect and filter specific elements of the Series or DataFrame. Mastering these techniques is essential for preprocessing, analysis, and feature engineering in machine learning workflows.
Access elements in a Series
Retrieve elements by position or slice.
# First element (by position)
ser.iloc[0]

# First 5 elements
ser.iloc[:5]

# Last 5 elements
ser.iloc[-5:]
Selecting specific values helps in inspecting features, labels, or subsets of data during preprocessing or debugging.
Select a single value in DataFrame
Access individual elements by position or label using pandas indexing methods. This is useful for debugging, checking feature values, and validating preprocessing steps.
# By position
# Using iloc (row position, column position)
df.iloc[[0], [0]]

# Using iat (faster, returns a single value)
df.iat[0, 0]

# By label
# Using loc (row label, column label)
df.loc[[0], ['Label']]

# Using at (faster, returns a single value)
df.at[0, 'Label']
iloc: access elements by integer position; returns a DataFrame or Series.
iat: access a single element by integer position; faster than iloc.
loc: access elements by row and column labels; returns a DataFrame or Series.
at: access a single element by row and column labels; faster than loc.
Access a DataFrame with a boolean index
In boolean indexing, we filter data using a boolean condition on one or more columns.
import pandas as pd

# Dictionary of lists
data = {'name': ["name1", "name2", "name3", "name4"],
        'degree': ["degree1", "degree2", "degree3", "degree4"],
        'score': [1, 2, 3, 4]}
df = pd.DataFrame(data)

# Boolean indexing: select rows where score > 2
filtered_df = df[df['score'] > 2]
Boolean indexing is key for filtering datasets, creating training/testing splits, and selecting rows based on conditions. The expression df['score'] > 2 generates a boolean vector [False, False, True, True], which is then used to filter the DataFrame.
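Conditions can be combined with & (and), | (or), and ~ (not); each condition needs its own parentheses. A sketch using the same df as above:

```python
import pandas as pd

data = {'name': ["name1", "name2", "name3", "name4"],
        'degree': ["degree1", "degree2", "degree3", "degree4"],
        'score': [1, 2, 3, 4]}
df = pd.DataFrame(data)

# Rows where score is strictly between 1 and 4
mid = df[(df['score'] > 1) & (df['score'] < 4)]

# Rows where score is 1 OR the degree is "degree4"
either = df[(df['score'] == 1) | (df['degree'] == "degree4")]
```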
Modifying DataFrames
DataFrames can be easily modified to add, remove, or rename columns and rows. These operations are essential for feature engineering, data cleaning, and preparing datasets for machine learning.
Create a new column in a DataFrame
Add a new column with default or calculated values.
df['new_column'] = 0 # Set all rows in 'new_column' to 0
Adding columns allows feature engineering or tracking new derived features for ML models.
Specify columns and index in a DataFrame
Defining columns and indices clearly organizes the data, making it easier to access, manipulate, and interpret during analysis or modeling. Proper labeling also helps avoid errors when performing operations like merging, filtering, or feature engineering.
Specify only column names
This is the simplest way to organize the data. Use this approach when we want to create a DataFrame with default numeric indices but custom column names.
import pandas as pd

# Create a DataFrame and specify column names only
df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
                  columns=['a', 'b', 'c'])
Specify both columns and index using dictionary (name: values)
This method allows us to define both columns and custom row labels by passing a dictionary of column names and their corresponding values. It’s useful when we want fully labeled data for easier access and readability.
# Create a DataFrame using dictionary form
df = pd.DataFrame({'a': [1, 4, 7],
                   'b': [2, 5, 8],
                   'c': [3, 6, 9]},
                  index=['row1', 'row2', 'row3'])
Specify both columns and index using columns= and index= with a 2D array
Here, we pass the data as a 2D array and explicitly define column and row labels using the columns= and index= parameters. This is flexible for structured datasets where both dimensions need clear labeling.
# Create a DataFrame using a 2D array, explicitly specifying columns and index
df = pd.DataFrame([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]],
                  index=['row1', 'row2', 'row3'],
                  columns=['a', 'b', 'c'])
Rename columns in a DataFrame
When working with machine learning datasets, column names are often generic or unclear. Renaming them improves readability and makes feature tracking easier during preprocessing or model training.
# Rename columns
df.columns = ['Age', 'Income', 'Purchased']
Drop rows
Remove specific rows (or elements in a Series) by label or index.
# For a Series: remove elements with labels 'a' and 'c'
s.drop(['a', 'c'])

# For a DataFrame: remove rows with index 0 and 2
df.drop([0, 2], axis=0)  # axis=0 tells pandas to drop rows
Dropping rows with irrelevant or missing data helps clean the dataset before model training.
Drop columns
Remove one or more entire columns from a DataFrame.
# Remove the column named 'City' from the DataFrame
df.drop('City', axis=1)  # axis=1 tells pandas to drop columns
Dropping unneeded features prevents irrelevant data from affecting model performance.
Set and reset index
We can set a column as the index of a DataFrame for easier access and organization, and later reset it back to the default numeric index if needed.
# Sample DataFrame
df = pd.DataFrame({'ID': [101, 102, 103],
                   'Name': ['Alice', 'Bob', 'Charlie'],
                   'Score': [85, 90, 78]})

# Set the 'ID' column as the index
df.set_index('ID', inplace=True)

# Reset the index back to the default numeric index
df.reset_index(inplace=True)
A set index makes data easier to access and reference by meaningful labels instead of numeric positions, while resetting the index is useful after filtering or sorting to restore a clean, default index for iteration or export.
Handle duplicates
Duplicate rows in a dataset can cause bias or redundancy in analysis and machine learning. Pandas makes it easy to remove duplicate rows.
# Remove duplicate rows
df_clean = df.drop_duplicates()
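drop_duplicates also accepts subset= (which columns define a duplicate) and keep= (which copy survives). A minimal sketch with illustrative data:

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 2], 'value': [10, 10, 30]})

# Exact duplicate rows removed; the first occurrence is kept (the default)
clean = df.drop_duplicates()

# Treat rows as duplicates based on 'id' only, and keep the last occurrence
by_id = df.drop_duplicates(subset='id', keep='last')
```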
Sorting and ranking
Sorting and ranking data help arrange values in a meaningful sequence for analysis and comparison. These operations are commonly used to identify top or bottom records, prioritize features, and prepare data for reporting or modeling.
Sort Series by index
Sort a Series by its index labels. Returns a new Series sorted by index labels if inplace=False; otherwise, it updates the original Series and returns None.
# Sort a Series by its index labels
s.sort_index()

# Sort in descending order
s.sort_index(ascending=False)

# Place NaN index labels first instead of last
s.sort_index(na_position='first')

# Sort in place (modifies s and returns None)
s.sort_index(inplace=True)
Sorting by index ensures consistent ordering, which is important for data alignment, comparison, and visualization in ML pipelines.
Sort values along an axis
Sorting data by values can help in feature analysis, ranking, or identifying outliers. We can sort in ascending or descending order depending on the task.
# Sort DataFrame by the column named 'Sales'
df.sort_values(by='Sales')

# Sort DataFrame by 'Sales' in descending order
df.sort_values(by='Sales', ascending=False)
Adding ranks to particular entries
Ranking values is useful for feature engineering or for creating ordinal features. Pandas allows us to rank values directly.
# Input data
df = pd.DataFrame({'sales': [100, 200, 150],
                   'profit': [20, 30, 25]})

# Rank all numeric columns
df.rank()  # Ranks each numeric column independently

# Rank a specific column
df["sales"].rank()  # Ranks only the 'sales' column
When using df.rank(), pandas assigns ranks based on the ascending order of values in each numeric column. In the 'sales' column, the smallest value, 100, receives rank 1; 150 receives rank 2; and 200 receives rank 3. Similarly, in the 'profit' column, 20 receives rank 1, 25 receives rank 2, and 30 receives rank 3. If you rank a specific column, such as df["sales"].rank(), only that column is ranked in the same way, returning a Series of ranks.
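When values tie, rank() averages their positions by default; the method= parameter changes this behavior. A sketch with an illustrative Series:

```python
import pandas as pd

s = pd.Series([100, 200, 100])

# Default 'average': the two 100s share positions 1 and 2, so each gets 1.5
avg = s.rank()

# 'min': tied values all receive the lowest position in the tie
low = s.rank(method='min')

# 'dense': like 'min', but ranks increase by exactly 1 between groups
dense = s.rank(method='dense')
```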
Descriptive information
Pandas provides multiple ways to inspect the structure and summary of a dataset. Accessing descriptive information and metadata helps identify missing values, understand feature types, and plan preprocessing steps before modeling.
Retrieve rows and columns description
Before training a model, knowing the shape of the data helps verify preprocessing steps and ensure the dataset matches the expected input dimensions.
df.shape
Describe columns of the DataFrame
Checking column names is important to identify features, targets, or columns requiring transformation before modeling.
df.columns
Retrieve index description
Understanding the index helps in aligning datasets, especially when merging or joining multiple tables in ML preprocessing.
df.index
Get information on DataFrame
df.info() gives a quick overview of data types, null values, and memory usage, which is essential for cleaning and preprocessing data for ML.
df.info()
Handling missing values
Dealing with missing values is a crucial step in preprocessing, as null or NaN values can cause errors or bias in machine learning models. Pandas provides tools to detect, count, and handle missing values efficiently.
Retrieve the number of non-NA values
Counting non-missing values helps quickly identify which columns have incomplete data. This is useful for planning imputation or data cleaning strategies.
df.count()           # Non-NA count per column
df['sales'].count()  # Non-NA count for the 'sales' column
Detect missing values
Detecting missing values allows us to locate gaps in the dataset before deciding how to handle them.
df.isna()   # Returns True for missing values
df.notna()  # Returns True for non-missing values
Handle missing values
After detecting missing values, we can either fill them with a specific value or remove rows/columns containing nulls. This ensures the dataset is ready for modeling without introducing errors.
df.fillna(0)  # Replace missing values with 0
df.dropna()   # Remove rows with any missing values
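A common imputation strategy fills each column's missing values with that column's mean; a sketch with illustrative data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'sales': [10.0, np.nan, 30.0],
                   'profit': [1.0, 2.0, np.nan]})

# Fill each numeric column's NaNs with that column's mean
imputed = df.fillna(df.mean())
```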
Aggregations and mathematical operations
Pandas provides a variety of aggregation and mathematical operations to summarize, transform, and analyze datasets. These operations are essential for feature engineering, data inspection, trend tracking, and preparing inputs for machine learning models.
Get the sum of values
Calculating the sum of values provides a quick way to aggregate numeric data. It can be used to generate total metrics, monitor cumulative trends over time, or create new features for machine learning models, such as total sales, total profit, or combined scores.
df.sum()           # Sum of values for each numeric column in the DataFrame
df["sales"].sum()  # Sum of the 'sales' column

# Cumulative sum
df.cumsum()

# Example input
#    sales
# 0     10
# 1     20
# 2     15
# 3     25

# df.sum() result: 70
# df.cumsum() result:
#    sales
# 0     10
# 1     30
# 2     45
# 3     70
Add or subtract from all values
Adding or subtracting constants is often part of feature scaling or creating new derived features.
s.add(2)  # Add 2 to each element in the Series
s.sub(2)  # Subtract 2 from each element in the Series
Multiply or divide all values
Multiplying or dividing helps with normalization, scaling, or generating new features.
s.mul(2)  # Multiply each element by 2
s.div(2)  # Divide each element by 2
Find min/max values
Identifying extreme values is helpful for detecting outliers and understanding the range of features.
# DataFrame level
df.min()  # Returns the min of each numeric column
df.max()  # Returns the max of each numeric column

# Column level
df['sales'].min()
df['sales'].max()
Get min/max index values
Knowing which row contains the min or max is useful for data inspection or targeting specific samples.
# Index of minimum value per column (default axis=0)
df.idxmin()

# Index of maximum value per column (default axis=0)
df.idxmax()

# Column label of the minimum value per row
df.idxmin(axis=1)

# Column label of the maximum value per row
df.idxmax(axis=1)
Get the median or mean of the values
Calculating mean or median is essential for understanding feature distributions and for imputation strategies.
# DataFrame level
df.mean()    # Mean of each numeric column
df.median()  # Median of each numeric column

# Column level
df['profit'].mean()
df['profit'].median()
Value counts and unique values
Counting occurrences and finding unique values helps understand feature distribution and categorical variables.
df['category'].value_counts()  # Count of each unique value
df['category'].unique()        # Array of unique values
df['category'].nunique()       # Number of unique values
GroupBy operations
Grouping data enables aggregation, transformation, or applying custom functions to subsets. Useful for creating aggregated features.
# Sum of sales per region
df.groupby('region')['sales'].sum()

# Multiple aggregations: mean sales and max profit per region
df.groupby('region').agg({'sales': 'mean', 'profit': 'max'})
Describe a summary of data statistics
Provides descriptive statistics (mean, std, quartiles, min, max), which is key for understanding features and preprocessing decisions.
# DataFrame level
df.describe()  # Count, mean, std, min, 25%, 50%, 75%, max for all numeric columns

# Column level
df['sales'].describe()  # Summary stats for the 'sales' column only
Combining and merging data
Combining and merging data allows us to integrate multiple datasets into a single cohesive DataFrame. These operations are essential for preparing complete datasets for analysis or machine learning workflows.
Merge two DataFrames
Merging datasets is common when combining features from multiple sources before training a model. Pandas merge() works like SQL joins and allows us to combine data on a key column or index.
# Merge two DataFrames on the common column 'subject_id', similar to a SQL join
pd.merge(df1, df2, on='subject_id')

# Merge using a left join
merged_df = pd.merge(df1, df2, on='subject_id', how='left')

# Merge on different column names
merged_df = pd.merge(df1, df2, left_on='id_1', right_on='id_2')

# Merge using indices
merged_df = pd.merge(df1, df2, left_index=True, right_index=True)

# Validate the merge to ensure a one-to-one mapping
merged_df = pd.merge(df1, df2, on='subject_id', validate='one_to_one')
how: type of merge ('inner', 'left', 'right', or 'outer').
on, left_on, right_on: columns to merge on.
left_index, right_index: merge using the index instead of columns.
validate: checks the expected merge type, e.g. 'one_to_one' or 'many_to_one'.
Combine DataFrames
Concatenating DataFrames helps append additional data or stack multiple datasets for larger ML training sets.
print(pd.concat([df1, df2]))
merge combines DataFrames based on a common column or index, performing database-style joins, whereas concat simply appends or stacks them.
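Two concat options worth knowing: ignore_index=True rebuilds a clean row index, and axis=1 stacks frames side by side instead of vertically. A sketch with illustrative frames:

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2]})
df2 = pd.DataFrame({'a': [3, 4]})

# Stack rows and rebuild the index as 0..3 instead of 0, 1, 0, 1
stacked = pd.concat([df1, df2], ignore_index=True)

# Place the frames side by side (aligned on the row index)
side = pd.concat([df1, df2.rename(columns={'a': 'b'})], axis=1)
```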
Pivot tables
Pivot tables help aggregate and summarize data across multiple dimensions, similar to Excel pivot tables. They are useful for analyzing trends, computing averages, sums, or other statistics grouped by categories.
# Create a pivot table showing average sales per region and product
pivot_df = df.pivot_table(values='sales', index='region',
                          columns='product', aggfunc='mean')

# Example output
# product      A      B
# region
# East     100.0  150.0
# West     120.0  130.0
Pivot tables are important for aggregating features, identifying patterns, and creating structured inputs for modeling.
Melt/reshape
The melt() function reshapes a DataFrame from wide to long format, making it easier to analyze, visualize, or prepare for machine learning models that require tidy data.
# Melt the DataFrame to long format
melted_df = df.melt(id_vars=['region'], value_vars=['sales', 'profit'],
                    var_name='metric', value_name='value')

# Example input
#   region  sales  profit
# 0   East    100      30
# 1   West    120      50

# Example output
#   region  metric  value
# 0   East   sales    100
# 1   West   sales    120
# 2   East  profit     30
# 3   West  profit     50
Melted data is especially useful when performing group operations, visualizations, or feeding data into ML pipelines.
Apply a function to a dataset
Applying functions allows flexible transformations or feature engineering across a DataFrame.
# Apply a function along rows or columns
f = lambda x: x * 2  # example function
df.apply(f)  # applies the function to each column (default axis=0)

# Apply a function element-wise to each value in the DataFrame
df.applymap(f)  # deprecated since pandas 2.1; use df.map(f) instead
apply() works along an axis (rows or columns) and is useful for column-wise or row-wise transformations. applymap() works element-wise and is useful for transforming every value individually.
This is especially useful in ML pipelines for feature scaling, encoding, or custom transformations.
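For example, min-max scaling each column with apply (an illustrative transformation, not the only way to scale features):

```python
import pandas as pd

df = pd.DataFrame({'age': [20, 30, 40], 'income': [1000, 2000, 4000]})

# Min-max scale every column to the [0, 1] range, column by column
scaled = df.apply(lambda col: (col - col.min()) / (col.max() - col.min()))
```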
Pandas is a versatile library that makes data analysis in Python efficient and intuitive. Mastering its key operations will give you the foundation to clean, manipulate, and explore datasets with confidence.