The pd.Categorical() constructor in the pandas library converts data into the categorical data type. A categorical represents data that takes a limited, usually fixed, number of possible values, known as categories. Storing such data as a categorical can reduce memory usage and speed up some operations, especially on large datasets with many repeated values. Here are a few examples of categorical data:
Sex: 'Male', 'Female'
Grades: 'A', 'B', 'C', 'D', 'E'
Blood types: 'A', 'B', 'AB', 'O'
Product categories: 'Electronics', 'Clothing', 'Home Appliances'
Job titles: 'Manager', 'Engineer', 'Analyst', 'Director'
The main parameters of pd.Categorical() are shown in the following signature:
pd.Categorical(values, categories=None, ordered=None, dtype=None)
The table below describes each parameter:
Parameter | Category | Description |
values | Required | This is an array-like object that contains the data to be converted into a categorical variable. It can be a list, NumPy array, or a pandas Series. |
categories | Optional | This is an array-like object that specifies the categories for the categorical data. If not provided, the unique values in values are used as the categories (sorted when possible). Values that are not listed in categories become NaN. |
ordered | Optional | It is a boolean flag that indicates whether there is a clear sequence or order among the categories, for example, for data like "small," "medium," and "large." By default, its value is None, which produces an unordered categorical unless dtype specifies otherwise. |
dtype | Optional | It is a CategoricalDtype instance that describes the categories and their order for the resulting categorical object. It cannot be combined with categories or ordered. If not provided, the dtype is built from categories and ordered. |
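To make the parameters concrete, here is a minimal sketch of how they interact. The sizes list and its values are made up for illustration; only the pd.Categorical() and pd.CategoricalDtype() calls come from pandas itself.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration
sizes = ["small", "large", "medium", "small", "huge"]

# "huge" is not listed in `categories`, so it is stored as NaN
cat = pd.Categorical(sizes, categories=["small", "medium", "large"], ordered=True)

print(cat.categories.tolist())  # ['small', 'medium', 'large']
print(cat.ordered)              # True
print(cat.isna().tolist())      # [False, False, False, False, True]

# The same result expressed through `dtype` instead of `categories` and `ordered`
size_dtype = pd.CategoricalDtype(categories=["small", "medium", "large"], ordered=True)
same = pd.Categorical(sizes, dtype=size_dtype)

print(same.dtype == cat.dtype)  # True
```

Passing a CategoricalDtype is convenient when several columns should share exactly the same categories and order.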
Let's review a few examples to understand the different usages of pd.Categorical().
We can use pd.Categorical() to create a categorical column, which can help with both memory usage and performance. Internally, a categorical stores each distinct value once and represents every row as a small integer code, so operations such as sorting can compare integers instead of strings. Methods like groupby() and value_counts() can also run faster on categorical data.
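The integer representation mentioned above can be inspected directly. The following is a small sketch with made-up color labels; it only reads the categories and codes attributes and calls value_counts().

```python
import pandas as pd

# Hypothetical column with many repeated labels
colors = pd.Series(["red", "green", "red", "blue", "green", "red"])
cat = pd.Categorical(colors)

# Each distinct label is stored once ...
print(cat.categories.tolist())  # ['blue', 'green', 'red']

# ... and every row holds a small integer code pointing at its label
print(cat.codes.tolist())       # [2, 1, 2, 0, 1, 2]
print(cat.codes.dtype)          # int8

# Aggregations work as usual on a categorical Series
counts = pd.Series(cat).value_counts().to_dict()
print(counts)
```

With only three categories, pandas picks the smallest integer type that fits, which is where most of the memory savings come from.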
Let's look at the example below, which demonstrates the difference in sorting performance between columns containing string data and columns containing categorical data in a pandas DataFrame.
import pandas as pd
import numpy as np
import time

# Create a DataFrame with categorical and string data
data_size = 1000
data = {
    # String data
    'Without_Categorical': pd.Series(np.random.choice(['A', 'B', 'C'], size=data_size)),

    # Categorical data using pd.Categorical
    'With_Categorical': pd.Categorical(np.random.choice(['X', 'Y', 'Z'], size=data_size))
}

df = pd.DataFrame(data)

# Measuring time for sorting by column with strings
start_time = time.time()
df.sort_values('Without_Categorical')
end_time = time.time()
print(f"Time taken to sort by 'Without_Categorical': {round(end_time - start_time, 4)} seconds")

# Measuring time for sorting by column with categorical data
start_time = time.time()
df.sort_values('With_Categorical')
end_time = time.time()
print(f"\nTime taken to sort by 'With_Categorical': {round(end_time - start_time, 4)} seconds")
Let's break down the code above:
Lines 6–15: Create a DataFrame with 1000 rows and two columns. The Without_Categorical column holds the string values 'A', 'B', and 'C'. The With_Categorical column holds the string values 'X', 'Y', and 'Z', but is converted to the categorical data type.
Lines 18–27: Time the sort_values() operation on each column, Without_Categorical and With_Categorical, and print both results for comparison.
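A single timing of a 1000-row sort can be dominated by noise, so the measured difference may vary from run to run. As a rough sketch, assuming a larger column (200,000 rows, an arbitrary choice) and the standard library's timeit module rather than time.time(), the comparison could be repeated like this. The best_time helper is a hypothetical name introduced here for illustration.

```python
import timeit

import numpy as np
import pandas as pd

# Assumption: a larger column than in the example above, to reduce timing noise
rng = np.random.default_rng(0)
labels = rng.choice(["A", "B", "C"], size=200_000)

as_str = pd.Series(labels)                  # plain string column
as_cat = pd.Series(pd.Categorical(labels))  # same data as a categorical


def best_time(fn, repeats=5):
    """Hypothetical helper: run fn several times and return the fastest run in seconds."""
    return min(timeit.repeat(fn, number=1, repeat=repeats))


t_str = best_time(lambda: as_str.sort_values())
t_cat = best_time(lambda: as_cat.sort_values())

print(f"String sort:      {t_str:.4f} seconds")
print(f"Categorical sort: {t_cat:.4f} seconds")
```

Taking the minimum over several runs is a common way to filter out interference from other processes. The exact numbers depend on the machine and the pandas version.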
In this example, we’ll use pd.Categorical() to convert a DataFrame column into a categorical type while specifying the order of its categories. The goal is to show what an explicit order makes possible. First, we’ll display the randomly generated string column, which has no inherent order. Then, we’ll define the order ['Low', 'Medium', 'High'] and assign it to the categories. Sorting the DataFrame then follows this order instead of alphabetical order. Finally, we perform a comparison to show that ordered categories support logical comparisons, such as checking whether a category is greater than 'Low'.
import pandas as pd
import numpy as np

# Create a DataFrame with categorical data
np.random.seed(0)  # Set seed for reproducibility
data = {'Category': np.random.choice(['Low', 'High', 'Medium'], size=1000)}
df = pd.DataFrame(data)

# Show the first few rows of the DataFrame without ordered categories
print("Initial DataFrame:")
print(df['Category'].head(5))
print("-"*100)
# Define the order of categories
categories = ['Low', 'Medium', 'High']
df['Category'] = pd.Categorical(df['Category'], categories=categories, ordered=True)

# Show the first few rows after ordering the categories
print("\nDataFrame with ordered categories:")
print(df['Category'].head(5))
print("-"*100)

# Demonstrate sorting by the ordered categories
# (rows follow Low < Medium < High rather than alphabetical order)
sorted_df = df.sort_values('Category')
print("\nDataFrame sorted by Category:")
print(sorted_df.head(10))
print("-"*100)

# Demonstrate comparisons
df['Comparison'] = df['Category'] > 'Low'
print("\nDataFrame with comparison 'Category' > 'Low':")
print(df.head(10))
Let’s break down the code above:
Lines 6–7: Generate a dictionary, data, with one key, 'Category', whose value is an array of 1000 random choices from the list ['Low', 'High', 'Medium']. Then, use this dictionary to create a DataFrame, df.
Line 14: Create a list, categories, that defines the desired order of the categorical data: ['Low', 'Medium', 'High']. In this case, 'Low' is the lowest category, followed by 'Medium', with 'High' as the highest, i.e., 'Low' < 'Medium' < 'High'.
Line 15: Convert the 'Category' column in the DataFrame to a pd.Categorical type with the specified categories order, and set ordered=True to indicate that the categories have a specific order.
Line 24: Sort the DataFrame by the 'Category' column and store the result in sorted_df.
Line 30: Create a new column, 'Comparison', in the DataFrame that contains the result of comparing each value in the 'Category' column to 'Low'. The comparison uses the specified order of the categories, where 'Medium' and 'High' are greater than 'Low'.
Let’s look at another example. It shows how converting a column to a categorical type can significantly reduce memory usage while keeping all the information needed for analysis and other operations on the data. This is helpful when dealing with large datasets.
import pandas as pd
import numpy as np

# Create a sample dataset
np.random.seed(42)
n = 1000
payment_methods = np.random.choice(['Cash', 'Credit Card', 'Debit Card', 'Online Payment'], size=n)
stores = np.random.randint(1, 101, size=n)
data = {'Store_ID': stores, 'Payment_Method': payment_methods}
df = pd.DataFrame(data)

# Display memory usage before conversion
print("Memory usage before conversion:")
print(df.memory_usage(deep=True))

# Convert the 'Payment_Method' column to categorical
df['Payment_Method'] = pd.Categorical(df['Payment_Method'])

# Display memory usage after conversion
print("\nMemory usage after conversion:")
print(df.memory_usage(deep=True))
Let’s break down the code above:
Lines 5–10: Create a DataFrame with 1000 rows and two columns, Store_ID and Payment_Method.
Lines 13–21: Print the memory usage of the DataFrame before and after the conversion with pd.Categorical() so the two can be compared.
Line 17: Convert the Payment_Method column from strings to the categorical type.
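To put a single number on the savings, the per-column memory can be compared directly. This is a minimal sketch assuming a larger column (100,000 rows, an arbitrary size) and using astype("category"), which is another common way to obtain a categorical column from an existing Series.

```python
import numpy as np
import pandas as pd

# Assumption: a larger sample than in the example above, to make the gap obvious
rng = np.random.default_rng(42)
methods = ["Cash", "Credit Card", "Debit Card", "Online Payment"]
as_str = pd.Series(rng.choice(methods, size=100_000))

# Alternative spelling of the conversion for an existing Series
as_cat = as_str.astype("category")

before = as_str.memory_usage(deep=True)
after = as_cat.memory_usage(deep=True)

print(f"String column:      {before:,} bytes")
print(f"Categorical column: {after:,} bytes")
print(f"Reduction factor:   {before / after:.1f}x")
```

The savings depend on how often values repeat. When nearly every value is unique, the categorical has to store almost as many categories as rows, and the benefit shrinks or disappears.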
In summary, using pd.Categorical() in pandas can improve both the efficiency and the clarity of our data analysis workflows. By converting columns with repeated values into categorical types, we reduce memory usage and can benefit from faster sorting, grouping, and counting. Ordered categories additionally allow meaningful sorting and comparisons, which makes the feature especially useful for large datasets with a limited set of distinct values.