How to use categorical function in pandas

The pd.Categorical() constructor in the pandas library converts data into the categorical data type. The categorical data type represents data with a fixed, usually small, set of possible values, known as categories. Using categorical data can reduce memory usage and speed up some operations, especially when dealing with large datasets that contain many repeated values. Here are a few examples of categorical data:

  • Sex: 'Male', 'Female'

  • Grades: 'A', 'B', 'C', 'D', 'E'

  • Blood types: 'A', 'B', 'AB', 'O'

  • Product categories: 'Electronics', 'Clothing', 'Home Appliances'

  • Job titles: 'Manager', 'Engineer', 'Analyst', 'Director'
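As a minimal sketch of what "categories" means in practice, the snippet below wraps a small, hypothetical list of blood types in pd.Categorical() and inspects the two pieces pandas keeps: the distinct categories and an integer code for each value. The sample values are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data: blood types with repeated values
blood_types = pd.Categorical(['A', 'O', 'B', 'O', 'AB', 'A'])

# The distinct categories inferred from the data (sorted)
categories = list(blood_types.categories)

# Each value is stored as the position of its category in `categories`
codes = blood_types.codes.tolist()

print(categories)  # ['A', 'AB', 'B', 'O']
print(codes)       # [0, 3, 2, 3, 1, 0]
```

Because every repeated string is replaced by a small integer, the strings themselves are stored only once, in the categories.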

Syntax

The pd.Categorical() constructor in pandas has the following signature:

pd.Categorical(values, categories=None, ordered=None, dtype=None)

The parameters are described below:

| Parameter  | Category | Description |
|------------|----------|-------------|
| values     | Required | An array-like object that contains the data to be converted into a categorical variable. It can be a list, a NumPy array, or a pandas Series. |
| categories | Optional | An array-like object that specifies the categories for the categorical data. If not provided, the unique values in values are used as the categories. |
| ordered    | Optional | A boolean flag that indicates whether the categories have a meaningful order, for example, "small," "medium," and "large." By default, the categories are treated as unordered (equivalent to False). |
| dtype      | Optional | A CategoricalDtype instance that describes the categories and their ordering for the resulting object. It cannot be combined with categories or ordered. If not provided, the dtype is derived from the other arguments. |
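The following sketch illustrates how the optional parameters interact, using a made-up list of sizes. It passes categories and ordered directly, then builds the same result through a CategoricalDtype passed as dtype. The variable names and sample values are assumptions for illustration only.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical sample values
sizes = ['small', 'large', 'small', 'large']

# Pass categories and ordered directly; 'medium' stays a valid (unused) category
cat = pd.Categorical(sizes, categories=['small', 'medium', 'large'], ordered=True)
category_list = list(cat.categories)
codes = cat.codes.tolist()

print(category_list)  # ['small', 'medium', 'large']
print(codes)          # [0, 2, 0, 2]
print(cat.ordered)    # True

# Bundle the same settings into a reusable CategoricalDtype and pass it as dtype
size_dtype = CategoricalDtype(categories=['small', 'medium', 'large'], ordered=True)
cat_from_dtype = pd.Categorical(sizes, dtype=size_dtype)
same_dtype = cat_from_dtype.dtype == cat.dtype

print(same_dtype)     # True
```

A shared CategoricalDtype is convenient when several columns should use exactly the same categories and order.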

Examples

Let's review a few examples to understand the different usages of pd.Categorical().

Comparing categorical data with string data

We can use pd.Categorical() to create a categorical column, which can help with both memory usage and performance. Internally, pandas stores each value as a small integer code that points into the list of categories, so operations such as sorting a column can compare integers instead of strings. Functions such as groupby() and value_counts() can also run faster on categorical data, although the actual gain depends on the size of the dataset and the number of distinct categories.
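To make this concrete, here is a small sketch that runs value_counts() and groupby() on a categorical column. The product and price values are hypothetical, and observed=True is passed to groupby() so that only categories that actually appear in the data are reported.

```python
import pandas as pd

# Hypothetical sales data with a categorical 'product' column
df = pd.DataFrame({
    'product': pd.Categorical(['Clothing', 'Electronics', 'Clothing', 'Clothing']),
    'price': [20, 300, 35, 15],
})

# Count how often each category occurs
counts = df['product'].value_counts().to_dict()

# Sum prices per category, reporting only categories present in the data
totals = df.groupby('product', observed=True)['price'].sum().to_dict()

print(counts)  # {'Clothing': 3, 'Electronics': 1}
print(totals)  # {'Clothing': 70, 'Electronics': 300}
```

The results are the same as for a plain string column; only the internal representation differs.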

Figure: Conversion to categorical data

The example below times sort_values() on a column containing string data and on a column containing categorical data in the same pandas DataFrame. With only 1,000 rows, the measured difference is small and can vary between runs.

import pandas as pd
import numpy as np
import time
# Create a DataFrame with categorical and string data
data_size = 1000
data = {
    # String data
    'Without_Categorical': pd.Series(np.random.choice(['A', 'B', 'C'], size=data_size)),
    # Categorical data using pd.Categorical
    'With_Categorical': pd.Categorical(np.random.choice(['X', 'Y', 'Z'], size=data_size))
}
df = pd.DataFrame(data)
# Measure the time for sorting by the string column (perf_counter() is a high-resolution timer)
start_time = time.perf_counter()
df.sort_values('Without_Categorical')
end_time = time.perf_counter()
print(f"Time taken to sort by 'Without_Categorical': {round(end_time - start_time, 6)} seconds")
# Measure the time for sorting by the categorical column
start_time = time.perf_counter()
df.sort_values('With_Categorical')
end_time = time.perf_counter()
print(f"\nTime taken to sort by 'With_Categorical': {round(end_time - start_time, 6)} seconds")

Let's break down the code above:

  • Lines 5–12: Create a DataFrame with 1000 rows and two columns. The Without_Categorical column contains the string values 'A', 'B', and 'C'. The With_Categorical column contains the string values 'X', 'Y', and 'Z', converted to the categorical data type.

  • Lines 13–22: Compare the sorting of the two columns, Without_Categorical and With_Categorical, by recording the time taken by the sort_values() operation on each.
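A single timing on 1000 rows can be dominated by noise. As a rough sketch of a steadier comparison, the snippet below uses the standard-library timeit module on a larger, hypothetical dataset. The row count, seed, and repeat settings are illustrative assumptions, and the printed times will differ from machine to machine.

```python
import timeit

import numpy as np
import pandas as pd

# Assumed setup: 100,000 random labels (larger than the 1000 rows used above)
rng = np.random.default_rng(0)
n = 100_000
labels = rng.choice(['A', 'B', 'C'], size=n)

as_str = pd.Series(labels, dtype=object)    # plain Python strings
as_cat = pd.Series(pd.Categorical(labels))  # same values as a categorical

# Take the best of several repeats to reduce the effect of noise
t_str = min(timeit.repeat(lambda: as_str.sort_values(), number=3, repeat=3))
t_cat = min(timeit.repeat(lambda: as_cat.sort_values(), number=3, repeat=3))

print(f"string column:      {t_str:.4f} s for 3 sorts")
print(f"categorical column: {t_cat:.4f} s for 3 sorts")
```

Both columns hold the same values here, so any difference in time comes from the data type rather than from the data.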

Specifying the order of categories

In this example, we'll use pd.Categorical() to convert a DataFrame column into a categorical type while specifying the order of the categories. First, we display the random string data before any order is defined. Then, we define the order ['Low', 'Medium', 'High'] and assign it to the column. Sorting the DataFrame now follows this order instead of alphabetical order. Finally, we perform a comparison to show that ordered categories allow logical comparisons, such as checking whether a value is greater than 'Low'.

import pandas as pd
import numpy as np
# Create a DataFrame with categorical data
np.random.seed(0) # Set seed for reproducibility
data = {'Category': np.random.choice(['Low', 'High', 'Medium'], size=1000)}
df = pd.DataFrame(data)
# Show the first few rows of the DataFrame without ordered categories
print("Initial DataFrame:")
print(df['Category'].head(5))
print("-"*100)
# Define the order of categories
categories = ['Low', 'Medium', 'High']
df['Category'] = pd.Categorical(df['Category'], categories=categories, ordered=True)
# Show the first few rows after ordering the categories
print("\nDataFrame with ordered categories:")
print(df['Category'].head(5))
print("-"*100)
# Demonstrate sorting by the ordered categories
sorted_df = df.sort_values('Category')
print("\nDataFrame sorted by Category:")
print(sorted_df.head(10))
print("-"*100)
# Demonstrate comparisons
df['Comparison'] = df['Category'] > 'Low'
print("\nDataFrame with comparison 'Category' > 'Low':")
print(df.head(10))

Let’s break down the code above:

  • Lines 5–6: Generate a dictionary data with one key, 'Category', whose value is an array of 1000 random choices from the list ['Low', 'High', 'Medium']. Then, use this dictionary to create a DataFrame df.

  • Line 12: Create a list categories that defines the desired order of the categorical data: ['Low', 'Medium', 'High']. In this case, 'Low' is the lowest category, followed by 'Medium', and 'High' is the highest, i.e., 'Low' < 'Medium' < 'High'.

  • Line 13: Convert the 'Category' column in the DataFrame to a categorical type with the specified categories and set ordered=True to indicate that the categories have a specific order.

  • Line 19: Sort the DataFrame by the 'Category' column and store the result in sorted_df.

  • Line 24: Create a new column 'Comparison' in the DataFrame that contains the result of comparing each value in the 'Category' column to 'Low'. The comparison uses the specified order of the categories, where 'Medium' and 'High' are greater than 'Low'.
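The role of ordered=True can be seen by contrasting an ordered and an unordered categorical built from the same values. The sketch below uses a short, made-up series: the ordered version supports min(), max(), and inequality comparisons, while the unordered version raises a TypeError for the same comparison.

```python
import pandas as pd

levels = ['Low', 'Medium', 'High']
values = ['High', 'Low', 'Medium']  # hypothetical sample values

ordered = pd.Series(pd.Categorical(values, categories=levels, ordered=True))
unordered = pd.Series(pd.Categorical(values, categories=levels))

# With an order defined, min/max and inequality comparisons follow `levels`
lowest, highest = ordered.min(), ordered.max()
at_least_medium = (ordered >= 'Medium').tolist()

print(lowest, highest)   # Low High
print(at_least_medium)   # [True, False, True]

# Without an order, only equality checks are allowed
try:
    unordered > 'Low'
    raised = False
except TypeError:
    raised = True

print(raised)            # True
```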

Memory optimization

Let’s look at another example. It shows how converting a column of repeated strings to a categorical type can reduce memory usage while keeping all the information needed for analysis and operations on the data. This is helpful when dealing with large datasets.

import pandas as pd
import numpy as np
# Create a sample dataset
np.random.seed(42)
n = 1000
payment_methods = np.random.choice(['Cash', 'Credit Card', 'Debit Card', 'Online Payment'], size=n)
stores = np.random.randint(1, 101, size=n)
data = {'Store_ID': stores, 'Payment_Method': payment_methods}
df = pd.DataFrame(data)
# Display memory usage before conversion
print("Memory usage before conversion:")
print(df.memory_usage(deep=True))
# Convert the 'Payment_Method' column to categorical
df['Payment_Method'] = pd.Categorical(df['Payment_Method'])
# Display memory usage after conversion
print("\nMemory usage after conversion:")
print(df.memory_usage(deep=True))

Let’s break down the code above:

  • Lines 4–9: Create a DataFrame with 1000 rows containing two columns, Store_ID and Payment_Method.

  • Lines 10–17: Compare the memory usage of the DataFrame before and after using pd.Categorical().

  • Line 14: Convert the Payment_Method column to a categorical type column.
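To see where the saving comes from, the sketch below compares a single column stored as Python strings with the same column stored as a categorical. It uses astype('category') as an alternative way to perform the conversion. The seed and payment labels mirror the example above but are otherwise arbitrary, and the exact byte counts depend on the pandas and Python versions.

```python
import numpy as np
import pandas as pd

# Assumed setup: 1000 random payment labels
rng = np.random.default_rng(42)
methods = rng.choice(['Cash', 'Credit Card', 'Debit Card', 'Online Payment'], size=1000)

as_str = pd.Series(methods, dtype=object)  # one Python string object per row
as_cat = as_str.astype('category')         # integer codes plus four category strings

bytes_str = as_str.memory_usage(deep=True)
bytes_cat = as_cat.memory_usage(deep=True)
codes_dtype = str(as_cat.cat.codes.dtype)

print(f"object column:      {bytes_str} bytes")
print(f"categorical column: {bytes_cat} bytes")
print(codes_dtype)  # int8
```

With only four categories, each row needs a single int8 code, which is why the categorical column is much smaller than the column of strings.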

In summary, pd.Categorical() can make our data analysis workflows more efficient and clearer. By converting columns with repeated values into categorical types, we reduce memory usage and can speed up operations such as sorting and grouping. Ordered categories also allow meaningful sorting and comparisons. This makes the categorical type a useful tool for managing large datasets with repeated values.

Copyright ©2025 Educative, Inc. All rights reserved