Categorical Manipulation
Explore the essentials of categorical data manipulation in pandas. Learn how to identify, convert, and optimize categorical data for memory and speed. Understand ordinal and nominal categories and leverage pandas' cat accessor for category management. This lesson also covers common pitfalls and methods for generalizing categories to handle real-world data effectively.
So far, we have dealt with numeric and date data. Another common form of data is textual data, and a subset of textual data is categorical data. Categorical data is textual data whose values repeat because they are drawn from a limited set of labels.
Categorical data
Categories are labels that describe data. Values are often repeated. When the values have an intrinsic order, they are referred to as ordinal values. One example is shirt sizes: small, medium, and large. Unordered values, such as colors, are called nominal values. We can also convert numerical data to categories by binning it.
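The distinction between ordinal values and binned numbers can be sketched as follows. This is a minimal illustration, not the lesson's own listing: the shirt sizes, the mileage numbers, the bin edges, and the labels low, mid, and high are all made up for the example.

```python
import pandas as pd

# Hypothetical shirt sizes. The order small < medium < large is declared explicitly.
sizes = pd.Series(["small", "large", "medium", "small"])
size_type = pd.CategoricalDtype(
    categories=["small", "medium", "large"], ordered=True
)
ordinal = sizes.astype(size_type)

# Ordered categories support comparisons and min/max.
largest = ordinal.max()
at_least_medium = ordinal >= "medium"

# Binning hypothetical numeric values into labeled categories.
# The edges (0, 20], (20, 30], (30, 50] are arbitrary choices for illustration.
mpg = pd.Series([12, 25, 38, 19])
bins = pd.cut(mpg, bins=[0, 20, 30, 50], labels=["low", "mid", "high"])

print(largest)
print(at_least_medium.tolist())
print(bins.tolist())
```

Nominal data such as colors would use the same dtype without `ordered=True`, in which case comparisons like `>=` are not meaningful.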
We’ll start by looking at the categorical values found in the fuel economy dataset. The make column has categorical information:
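A minimal sketch of this step, using a tiny hypothetical stand-in for the fuel economy data. Only the column name make comes from the text; the DataFrame name `autos` and the rows are invented for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the fuel economy dataset (invented rows).
autos = pd.DataFrame(
    {"make": ["Ford", "Toyota", "Ford", "Honda", "Toyota", "Ford"]}
)

# Pull out the column of interest as a Series.
make = autos["make"]
print(make)

# The distinct labels that appear in the column.
print(make.unique())
```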
Frequency counts
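We can use the value_counts method to see how often each value occurs. The number of entries it returns is the cardinality of the column, and the frequencies tell us whether the column is categorical. If values repeat often, the column is a good candidate. If every value is unique or free-form text, it’s not categorical: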
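As a rough sketch, here is value_counts applied to a small made-up series of makes (the values are hypothetical, not taken from the real dataset):

```python
import pandas as pd

# Hypothetical values standing in for the make column.
make = pd.Series(
    ["Ford", "Toyota", "Ford", "Honda", "Toyota", "Ford"], name="make"
)

# Frequency of each distinct value, most common first.
counts = make.value_counts()
print(counts)

# The number of distinct values is the cardinality.
print(len(counts))
```

Here three labels cover six rows, so the values repeat and the column behaves like a categorical. A column of unique identifiers would instead show a count of 1 for every entry.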
We can also inspect the size and the number of unique items to infer the cardinality:
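One way to do this, sketched on the same kind of made-up data, is to compare the `size` attribute with the result of `nunique`. The ratio computed at the end is an informal heuristic added for illustration, not something the lesson prescribes.

```python
import pandas as pd

# Hypothetical values standing in for the make column.
make = pd.Series(
    ["Ford", "Toyota", "Ford", "Honda", "Toyota", "Ford"], name="make"
)

total = make.size          # number of rows
distinct = make.nunique()  # number of distinct non-missing values

# Informal heuristic: a small ratio means the values repeat a lot.
ratio = distinct / total
print(total, distinct, ratio)
```

A ratio close to 1 suggests nearly unique values, such as free-form text, while a ratio close to 0 suggests a small set of repeated labels.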
Benefits of categories
The first benefit of categorical values is that they use less memory. Let’s see how much memory an instance of ...