Pandas Cheat Sheet

Discover how to effectively use the pandas library to manipulate, clean, and analyze data in Python. Learn key concepts including data structures, file operations, selection, modifying data, handling missing values, aggregations, and combining datasets. Gain practical skills to preprocess and prepare data for machine learning workflows.

Pandas is one of the most popular tools for data analysis in Python. This open-source library is the backbone of many data projects and is used for data cleaning and data manipulation.

With pandas, you gain greater control over complex datasets. It’s an essential tool in the data analysis tool belt. If you’re not using pandas, you’re not making the most of your data.

In this lesson, we’ll explore a quick guide to the most essential operations and commands.

Getting started with pandas

Pandas is a powerful Python library that makes data manipulation, analysis, and preprocessing intuitive and efficient. It provides flexible data structures such as Series and DataFrames, which are essential for handling structured data in Python.

Pandas import convention

Import the pandas library using the conventional alias pd for data manipulation.

import pandas as pd

We can install pandas using pip:

$ pip install pandas

Create and name a Series

A pandas Series is a one-dimensional labeled array that can hold any data type. We can create one using pd.Series() by passing a list of values, and optionally assign custom indices for each element.

# Create a Series with custom indices
series1 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])

Assign a descriptive name to a Series for clarity.

# Assign a meaningful name to the Series
series1.name = "Monthly Sales"

The Series name helps identify what the data represents when exporting, plotting, or combining with other data.

Next, name the index of a Series for better organization.

series1.index.name = "Product ID"

The index name makes elements easier to understand and reference, especially when working with multiple Series.
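Putting the naming steps together, a minimal runnable sketch (with hypothetical data) shows how both names travel with the Series:

```python
import pandas as pd

# Hypothetical product sales data
series1 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
series1.name = "Monthly Sales"
series1.index.name = "Product ID"

# Both names become column headers when the Series is converted to a DataFrame
df = series1.reset_index()
print(df.columns.tolist())  # ['Product ID', 'Monthly Sales']
```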

Create a DataFrame

A DataFrame is a 2D table-like structure in pandas, composed of rows and columns. By default, pandas creates numeric column names 0, 1, … and row indices 0, 1, ….

# Create DataFrame from a 2D list
data = [[10, 20], [30, 40], [50, 60]]
df = pd.DataFrame(data)
# Output:
#     0   1
# 0  10  20
# 1  30  40
# 2  50  60

DataFrames are the core data structure for preprocessing features, analyzing patterns, and feeding data to ML models.

Reading and writing data

Pandas supports a wide range of file formats, making it versatile for different data sources. Learning how to efficiently read and write data ensures our workflows are reproducible and ready for analysis or modeling.

Read and write to CSV file

CSV files are commonly used to store datasets. Efficiently reading and writing CSVs allows smooth preprocessing, experimentation, and sharing of data.

We can specify a column to use as the DataFrame index when reading a CSV. The first column is numbered 0, but we can also use column names. If no index is specified, pandas will create a default numeric index starting from 0.

import pandas as pd
# Read CSV using full file path
data = pd.read_csv('/path/to/file.csv') # Specify full path to file
# Read CSV without specifying an index (default numeric index will be created)
data = pd.read_csv('file.csv') # File is in current working directory
# Read CSV and use the first column as the index
data = pd.read_csv('data.csv', index_col=0)
# Read CSV and use a column by name as the index
data = pd.read_csv('data.csv', index_col='ID')
# Write DataFrame to CSV (exclude index if not needed)
data.to_csv('output.csv', index=False)
Read and write CSV files with optional indexing
  • index_col=0: Uses the first column as the index.

  • index_col='ID': Uses the column named 'ID' as the index.

  • If no index is specified, pandas generates a default numeric index 0, 1, 2….

This ensures our data is loaded with the correct indexing for downstream analysis and can be saved back to CSV efficiently.

Read and write to an Excel file

Call the read_excel function to access an Excel file. Pass the name of the Excel file as an argument. Use to_excel to save a DataFrame back to an Excel file, optionally specifying the sheet name.

# Read an Excel file into a DataFrame
df = pd.read_excel('file.xlsx')
# Write the DataFrame to an Excel file, specifying the sheet name
df.to_excel('dir/myDataFrame.xlsx', sheet_name='Sheet2')

Excel files are often used for small datasets and manual feature inspection before converting to CSV or DataFrame for ML tasks.

Read and write to SQL Query

Pandas can interact with SQL databases to load or save structured datasets, making it easier to feed data into ML pipelines.

from sqlalchemy import create_engine
import pandas as pd
# Create an in-memory SQLite database
engine = create_engine('sqlite:///:memory:')
# Read data from SQL (read_sql dispatches to one of the two calls below)
pd.read_sql("SELECT * FROM my_table;", engine)
pd.read_sql_table('my_table', engine) # Read an entire table
pd.read_sql_query("SELECT * FROM my_table;", engine) # Run a SQL query
# Write a DataFrame to SQL
df.to_sql('myDf', engine)
Read from and write to SQL database

Using pandas with SQL enables seamless data fetching, preprocessing, and saving directly from or to databases for ML workflows.

Selecting and accessing data

Selecting and accessing data efficiently is fundamental in pandas. It allows us to inspect and filter specific elements of the Series or DataFrame. Mastering these techniques is essential for preprocessing, analysis, and feature engineering in machine learning workflows.

Access elements in a Series

Retrieve elements by position or slice. Using .iloc makes the positional intent explicit; plain ser[0] for positional access on a labeled Series is deprecated in recent pandas versions.

# First element
ser.iloc[0]
# First 5 elements
ser.iloc[:5]
# Last 5 elements
ser.iloc[-5:]
Access elements in a Series by position

Selecting specific values helps in inspecting features, labels, or subsets of data during preprocessing or debugging.

Select a single value in DataFrame

Access individual elements by position or label using pandas indexing methods. This is useful for debugging, checking feature values, and validating preprocessing steps.

# By position
# Using iloc (row index, column index)
df.iloc[[0], [0]]
# Using iat (faster, returns a single value)
df.iat[0, 0]
# By label
# Using loc (row label, column label)
df.loc[[0], ['Label']]
# Using at (faster, returns a single value)
df.at[0, 'Label']
Access single values by position or label
  • iloc: Access elements by integer position. Returns a DataFrame or Series.

  • iat: Access a single element by integer position. Faster than iloc.

  • loc: Access elements by row and column labels. Returns a DataFrame or Series.

  • at: Access a single element by row and column labels. Faster than loc.
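The differences between the four accessors can be seen side by side on a small, hypothetical two-column frame:

```python
import pandas as pd

df = pd.DataFrame({'Label': [10, 20], 'Other': [30, 40]})

by_pos = df.iloc[0, 0]            # scalar by integer position
by_pos_fast = df.iat[0, 0]        # same value, scalar-only fast path
by_label = df.loc[0, 'Label']     # scalar by row label and column label
by_label_fast = df.at[0, 'Label']

# All four reach the same underlying value
assert by_pos == by_pos_fast == by_label == by_label_fast == 10
```

Note that the list forms, such as df.iloc[[0], [0]], return a 1x1 DataFrame rather than a scalar.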

Access a DataFrame with a boolean index

In boolean indexing, we filter data using a boolean condition on one or more columns.

import pandas as pd
# Dictionary of lists
data = {
    'name': ["name1", "name2", "name3", "name4"],
    'degree': ["degree1", "degree2", "degree3", "degree4"],
    'score': [1, 2, 3, 4]
}
df = pd.DataFrame(data)
# Boolean indexing: select rows where score > 2
filtered_df = df[df['score'] > 2]
Filter rows using Boolean indexing

Boolean indexing is key for filtering datasets, creating training/testing splits, and selecting rows based on conditions. The expression df['score'] > 2 generates a boolean vector [False, False, True, True], which is then used to filter the DataFrame.
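Conditions can also be combined with & (and) and | (or); each condition must be parenthesized because of operator precedence. A small sketch using the same hypothetical columns:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ["name1", "name2", "name3", "name4"],
    'score': [1, 2, 3, 4]
})

# Parentheses around each condition are required
subset = df[(df['score'] > 1) & (df['score'] < 4)]
```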

Modifying DataFrames

DataFrames can be easily modified to add, remove, or rename columns and rows. These operations are essential for feature engineering, data cleaning, and preparing datasets for machine learning.

Create a new column in a DataFrame

Add a new column with default or calculated values.

df['new_column'] = 0 # Set all rows in 'new_column' to 0

Adding columns allows feature engineering or tracking new derived features for ML models.
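New columns can also be derived from existing ones with vectorized arithmetic. A quick sketch with hypothetical price and quantity columns:

```python
import pandas as pd

df = pd.DataFrame({'price': [10.0, 20.0], 'quantity': [3, 5]})

# Derive a revenue feature from two existing columns (vectorized, no loop)
df['revenue'] = df['price'] * df['quantity']
```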

Specify columns and index in a DataFrame

Defining columns and indices clearly organizes the data, making it easier to access, manipulate, and interpret during analysis or modeling. Proper labeling also helps avoid errors when performing operations like merging, filtering, or feature engineering.

  1. Specify only column names

This is the simplest way to organize the data. Use this approach when we want to create a DataFrame with default numeric indices but custom column names.

import pandas as pd
# Create a DataFrame and specify column names only
df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
                  columns=['a', 'b', 'c'])
  2. Specify both columns and index using a dictionary (name: values)

This method allows us to define both columns and custom row labels by passing a dictionary of column names and their corresponding values. It’s useful when we want fully labeled data for easier access and readability.

# Create a DataFrame using dictionary form
df = pd.DataFrame({
    'a': [1, 4, 7],
    'b': [2, 5, 8],
    'c': [3, 6, 9]
}, index=['row1', 'row2', 'row3'])
  3. Specify both columns and index using columns= and index= with a 2D array

Here, we pass the data as a 2D array and explicitly define column and row labels using the columns= and index= parameters. This is flexible for structured datasets where both dimensions need clear labeling.

# Create a DataFrame using a 2D array, explicitly specifying columns and index
df = pd.DataFrame([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]],
                  index=['row1', 'row2', 'row3'],
                  columns=['a', 'b', 'c'])

Rename columns in a DataFrame

When working with machine learning datasets, column names are often generic or unclear. Renaming them improves readability and makes feature tracking easier during preprocessing or model training.

# Rename columns
df.columns = ['Age', 'Income', 'Purchased']
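Assigning to df.columns replaces every name at once and requires the full list. To rename only selected columns, rename() with a mapping is often safer. A sketch with hypothetical generic names:

```python
import pandas as pd

df = pd.DataFrame({'col0': [25, 32], 'col1': [50000, 64000]})

# Rename only the columns listed in the mapping; unlisted columns keep their names
df = df.rename(columns={'col0': 'Age', 'col1': 'Income'})
```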

Drop rows

Remove specific rows (or elements in a Series) by label or index.

# For a Series: remove elements with labels 'a' and 'c'
s.drop(['a', 'c'])
# For a DataFrame: remove rows with index 0 and 2
df.drop([0, 2], axis=0) # axis=0 tells pandas to drop rows
# Note: drop returns a new object; reassign the result or pass inplace=True to keep the change

Dropping rows with irrelevant or missing data helps clean the dataset before model training.

Drop columns

Remove one or more entire columns from a DataFrame.

# Remove the column named 'City' from the DataFrame
df.drop('City', axis=1) # axis=1 tells pandas to drop columns

Dropping unneeded features prevents irrelevant data from affecting model performance.

Set and reset index

We can set a column as the index of a DataFrame for easier access and organization, and later reset it back to the default numeric index if needed.

# Sample DataFrame
df = pd.DataFrame({
    'ID': [101, 102, 103],
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Score': [85, 90, 78]
})
# Set 'ID' column as the index
df.set_index('ID', inplace=True)
# Reset index back to default numeric index
df.reset_index(inplace=True)
Set or reset DataFrame index

A set index makes data easier to access and reference by meaningful labels instead of numeric positions, while resetting the index is useful after filtering or sorting to restore a clean, default index for iteration or export.

Handle duplicates

Duplicate rows in a dataset can cause bias or redundancy in analysis and machine learning. Pandas makes it easy to remove duplicate rows.

# Remove duplicate rows
df_clean = df.drop_duplicates()
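drop_duplicates() also accepts subset= to deduplicate on specific columns and keep= to choose which occurrence survives. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 2], 'value': [10, 10, 20]})

# Full-row duplicates: the second (1, 10) row is removed
deduped = df.drop_duplicates()

# Deduplicate on 'id' only, keeping the last occurrence of each id
by_id = df.drop_duplicates(subset=['id'], keep='last')
```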

Sorting and ranking

Sorting and ranking data help arrange values in a meaningful sequence for analysis and comparison. These operations are commonly used to identify top or bottom records, prioritize features, and prepare data for reporting or modeling.

Sort Series by index

Sort a Series by its index labels. Returns a new Series sorted by index labels if inplace=False; otherwise, it updates the original Series and returns None.

# Sort a Series by its index with options for order, algorithm, NaN handling, and in-place modification
Series.sort_index(axis=0, level=None, ascending=True, inplace=False, kind='quicksort', na_position='last', sort_remaining=True)

Sorting by index ensures consistent ordering, which is important for data alignment, comparison, and visualization in ML pipelines.
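A minimal runnable example of sort_index on a Series with unordered labels:

```python
import pandas as pd

s = pd.Series([3, 1, 2], index=['c', 'a', 'b'])

sorted_s = s.sort_index()                   # labels in ascending order: a, b, c
reversed_s = s.sort_index(ascending=False)  # labels in descending order: c, b, a
```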

Sort values along an axis

Sorting data by values can help in feature analysis, ranking, or identifying outliers. We can sort in ascending or descending order depending on the task.

df.sort_values(by='Sales') # Sort DataFrame by the column named 'Sales'
# Sort DataFrame by 'Sales' in descending order
df.sort_values(by='Sales', ascending=False)

Adding ranks to particular entries

Ranking values is useful for feature engineering or for creating ordinal features. Pandas allows us to rank values directly.

# Input data
df = pd.DataFrame({
    'sales': [100, 200, 150],
    'profit': [20, 30, 25]
})
# Rank all numeric columns
df.rank() # Ranks each numeric column independently
# Rank a specific column
df["sales"].rank() # Ranks only the 'sales' column
Rank DataFrame values by column

When using df.rank(), pandas assigns ranks based on the ascending order of values in each numeric column. In the 'sales' column, the smallest value, 100, receives rank 1; 150 receives rank 2; and 200 receives rank 3. Similarly, in the 'profit' column, 20 receives rank 1, 25 receives rank 2, and 30 receives rank 3. If you rank a specific column, such as df["sales"].rank(), only that column is ranked in the same way, returning a Series of ranks.

Descriptive information

Pandas provides multiple ways to inspect the structure and summary of a dataset. Accessing descriptive information and metadata helps identify missing values, understand feature types, and plan preprocessing steps before modeling.

Retrieve rows and columns description

Before training a model, knowing the shape of the data helps verify preprocessing steps and ensure the dataset matches the expected input dimensions.

df.shape

Describe columns of the DataFrame

Checking column names is important to identify features, targets, or columns requiring transformation before modeling.

df.columns

Retrieve index description

Understanding the index helps in aligning datasets, especially when merging or joining multiple tables in ML preprocessing.

df.index

Get information on DataFrame

df.info() gives a quick overview of data types, null values, and memory usage, which is essential for cleaning and preprocessing data for ML.

df.info()

Handling missing values

Dealing with missing values is a crucial step in preprocessing, as null or NaN values can cause errors or bias in machine learning models. Pandas provides tools to detect, count, and handle missing values efficiently.

Retrieve the number of non-NA values

Counting non-missing values helps quickly identify which columns have incomplete data. This is useful for planning imputation or data cleaning strategies.

df.count() # Non-NA count per column
df['sales'].count() # Non-NA count for the 'sales' column

Detect missing values

Detecting missing values allows us to locate gaps in the dataset before deciding how to handle them.

df.isna() # Returns True for missing values
df.notna() # Returns True for non-missing values

Handle missing values

After detecting missing values, we can either fill them with a specific value or remove rows/columns containing nulls. This ensures the dataset is ready for modeling without introducing errors.

df.fillna(0) # Replace missing values with 0
df.dropna() # Remove rows with any missing values
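A common imputation strategy is filling each numeric column with its own mean rather than a constant. A sketch with a hypothetical sales column:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'sales': [10.0, np.nan, 30.0]})

# mean() skips NaN by default, so the fill value here is (10 + 30) / 2 = 20
df['sales'] = df['sales'].fillna(df['sales'].mean())
```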

Aggregations and mathematical operations

Pandas provides a variety of aggregation and mathematical operations to summarize, transform, and analyze datasets. These operations are essential for feature engineering, data inspection, trend tracking, and preparing inputs for machine learning models.

Get the sum of values

Calculating the sum of values provides a quick way to aggregate numeric data. It can be used to generate total metrics, monitor cumulative trends over time, or create new features for machine learning models, such as total sales, total profit, or combined scores.

df.sum() # Sum of values for each numeric column in the DataFrame
df["sales"].sum() # Sum of the 'sales' column
# Cumulative sum
df.cumsum()
# Example with a single 'sales' column:
#    sales
# 0     10
# 1     20
# 2     15
# 3     25
# df.sum() result: 70
# df.cumsum() result:
#    sales
# 0     10
# 1     30
# 2     45
# 3     70
Summing and cumulative sum of values

Add or subtract from all values

Adding or subtracting constants is often part of feature scaling or creating new derived features.

s.add(2) # Add 2 to each element in the Series
s.sub(2) # Subtract 2 from each element in the Series

Multiply or divide all values

Multiplying or dividing helps with normalization, scaling, or generating new features.

s.mul(2) # Multiply each element by 2
s.div(2) # Divide each element by 2

Find min/max values

Identifying extreme values is helpful for detecting outliers and understanding the range of features.

# DataFrame level
df.min() # Returns min of all numeric columns
df.max() # Returns max of all numeric columns
# Column level
df['sales'].min()
df['sales'].max()

Get min/max index values

Knowing which row contains the min or max is useful for data inspection or targeting specific samples.

# Index of minimum value per column (default axis=0)
df.idxmin()
# Index of maximum value per column (default axis=0)
df.idxmax()
# Index of minimum value per row
df.idxmin(axis=1)
# Index of maximum value per row
df.idxmax(axis=1)

Get the median or mean of the values

Calculating mean or median is essential for understanding feature distributions and for imputation strategies.

# DataFrame level
df.mean() # Mean of each numeric column
df.median() # Median of each numeric column
# Column level
df['profit'].mean()
df['profit'].median()

Value counts and unique values

Counting occurrences and finding unique values helps understand feature distribution and categorical variables.

df['category'].value_counts() # Count of each unique value
df['category'].unique() # List of unique values
df['category'].nunique() # Number of unique values

GroupBy operations

Grouping data enables aggregation, transformation, or applying custom functions to subsets. Useful for creating aggregated features.

df.groupby('region')['sales'].sum() # Sum of sales per region
# Multiple aggregations
df.groupby('region').agg({'sales':'mean', 'profit':'max'}) # Group by 'region' and compute mean sales and max profit per group
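A minimal runnable groupby example with hypothetical region data:

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['East', 'East', 'West'],
    'sales': [100, 150, 120]
})

# One aggregated value per group
per_region = df.groupby('region')['sales'].sum()
```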

Describe a summary of data statistics

Provides descriptive statistics (mean, std, quartiles, min, max), which is key for understanding features and preprocessing decisions.

# DataFrame level
df.describe() # Mean, std, min, 25%, 50%, 75%, max for all numeric columns
# Column level
df['sales'].describe() # Summary stats for 'sales' column only

Combining and merging data

Combining and merging data allows us to integrate multiple datasets into a single cohesive DataFrame. These operations are essential for preparing complete datasets for analysis or machine learning workflows.

Merge two DataFrames

Merging datasets is common when combining features from multiple sources before training a model. Pandas merge() works like SQL joins and allows us to combine data on a key column or index.

# Merge two DataFrames based on a common column 'subject_id', similar to a SQL join
pd.merge(df1, df2, on='subject_id')
# Merge using a left join
merged_df = pd.merge(df1, df2, on='subject_id', how='left')
# Merge on different column names
merged_df = pd.merge(df1, df2, left_on='id_1', right_on='id_2')
# Merge using indices
merged_df = pd.merge(df1, df2, left_index=True, right_index=True)
# Validate merge to ensure one-to-one mapping
merged_df = pd.merge(df1, df2, on='subject_id', validate='one_to_one')
Merge DataFrames: join by column/index
  • how: Type of merge (e.g., 'inner', 'left', 'right', 'outer').

  • on, left_on, right_on: Columns to merge on.

  • left_index, right_index: Merge using index instead of columns.

  • validate: Ensures expected merge type, e.g., one-to-one or many-to-one.
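The effect of how='left' is easiest to see on two tiny hypothetical frames, where an unmatched key on the left side yields NaN:

```python
import pandas as pd

left = pd.DataFrame({'subject_id': [1, 2, 3], 'score': [85, 90, 78]})
right = pd.DataFrame({'subject_id': [1, 2], 'grade': ['A', 'B']})

# Left join keeps every row of `left`; subject 3 has no match, so grade is NaN
merged = pd.merge(left, right, on='subject_id', how='left')
```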

Combine DataFrames

Concatenating DataFrames helps append additional data or stack multiple datasets for larger ML training sets.

print(pd.concat([df1, df2]))

merge combines DataFrames based on a common column or index, performing database-style joins, whereas concat simply appends or stacks them.
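Two concat options worth knowing: ignore_index=True rebuilds a clean 0..n-1 index after stacking rows, and axis=1 places frames side by side instead. A sketch with hypothetical single-column frames:

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2]})
df2 = pd.DataFrame({'a': [3, 4]})

# Stack rows and rebuild a default numeric index
stacked = pd.concat([df1, df2], ignore_index=True)

# Align on the index and place frames side by side
side_by_side = pd.concat([df1, df2.rename(columns={'a': 'b'})], axis=1)
```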

Pivot tables

Pivot tables help aggregate and summarize data across multiple dimensions, similar to Excel pivot tables. They are useful for analyzing trends, computing averages, sums, or other statistics grouped by categories.

# Create a pivot table showing average sales per region and product
pivot_df = df.pivot_table(values='sales', index='region', columns='product', aggfunc='mean')
# Example output
# product A B
# region
# East 100.0 150.0
# West 120.0 130.0
Pivot table: average sales per region/product

Pivot tables are important for aggregating features, identifying patterns, and creating structured inputs for modeling.

Melt/reshape

The melt() function reshapes a DataFrame from wide to long format, making it easier to analyze, visualize, or prepare for machine learning models that require tidy data.

# Melt the DataFrame to long format
melted_df = df.melt(id_vars=['region'], value_vars=['sales', 'profit'],
                    var_name='metric', value_name='value')
# Example input
# region sales profit
# 0 East 100 30
# 1 West 120 50
# Example output
# region metric value
# 0 East sales 100
# 1 West sales 120
# 2 East profit 30
# 3 West profit 50
Reshape DataFrame from wide to long format

Melted data is especially useful when performing group operations, visualizations, or feeding data into ML pipelines.

Apply a function to a dataset

Applying functions allows flexible transformations or feature engineering across a DataFrame.

# Apply a function along rows or columns
f = lambda x: x * 2 # example function
df.apply(f) # applies the function along each column (default axis=0)
# Apply a function element-wise to each value in the DataFrame
df.map(f) # applies the function to every individual element
# (df.applymap(f) does the same but is deprecated since pandas 2.1)
Apply functions column-wise or element-wise
  • apply() works along an axis (rows or columns) and is useful for column-wise or row-wise transformations.

  • applymap() (renamed map() in pandas 2.1) works element-wise and is useful for transforming every value individually.

This is especially useful in ML pipelines for feature scaling, encoding, or custom transformations.
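With axis=1, apply() receives one row at a time as a Series, which is handy for features that combine several columns. A sketch with hypothetical score columns:

```python
import pandas as pd

df = pd.DataFrame({'math': [80, 90], 'science': [70, 95]})

# axis=1 passes each row (as a Series) to the function
df['total'] = df.apply(lambda row: row['math'] + row['science'], axis=1)
```

For simple sums, the vectorized df.sum(axis=1) is faster; apply(axis=1) is for arbitrary row-wise logic.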




Pandas is a versatile library that makes data analysis in Python efficient and intuitive. Mastering its key operations will give you the foundation to clean, manipulate, and explore datasets with confidence.