Encode Categorical Data

Discover the basics of categorical data and how to encode it in pandas.

Overview of categorical data

Besides numerical data, categorical data is another type commonly seen in real-world datasets. Categorical data can be divided into distinct groups based on specific characteristics and takes on a limited number of unique values. It falls into two general types:

  • Nominal categorical data has no order or ranking between categories. Examples of nominal data include gender, blood type, and hair color.

  • Ordinal categorical data has categories that can be ordered or ranked. Examples of ordinal data include educational degrees (e.g., high school, bachelor's, master's), income groups (e.g., low, medium, high), and star ratings (e.g., one star to five stars).
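The distinction between the two types can be sketched in pandas as follows. This is a minimal illustration, not part of the lesson's dataset: the blood_type and income values are made up, and the ordering low < medium < high is an assumption stated through the categories argument.

```python
import pandas as pd

# Nominal: no order among the categories (made-up blood-type values)
blood_type = pd.Categorical(["A", "O", "B", "AB", "O"])

# Ordinal: the order is declared explicitly, from lowest to highest
income = pd.Categorical(
    ["low", "high", "medium", "low"],
    categories=["low", "medium", "high"],
    ordered=True,
)

print(blood_type.ordered)  # False
print(income.ordered)      # True

# Order-aware operations are only meaningful for ordinal data
print(income.min(), income.max())  # low high
print(income < "high")             # element-wise comparison against a category
```

Calling min() or a comparison such as < on the unordered blood_type would raise a TypeError, which is how pandas guards against ranking nominal categories.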

Categorical data in pandas

In pandas, categorical data is represented by the Categorical type, whose dtype is 'category'. A distinctive property of a Categorical is that although it looks like an array of string values, it is stored internally as an array of integer codes that point to a separate array of the unique categories. Because each distinct value is stored only once, this representation typically reduces memory usage and can speed up computations on categorical data, particularly when a column has few unique values relative to its number of rows.
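The integer-code representation can be inspected directly. The snippet below is a small sketch using a made-up education column (not the lesson's credit card dataset); it converts a string column to the category dtype, then looks at the categories, the codes, and the memory footprint before and after.

```python
import pandas as pd

# Made-up column with a few labels repeated many times
education = pd.Series(
    ["bachelor's", "master's", "high school", "bachelor's"] * 1000
)
education_cat = education.astype("category")

print(education_cat.dtype)                        # category
print(list(education_cat.cat.categories))         # unique labels, sorted
print(education_cat.cat.codes.head(4).tolist())   # [0, 2, 1, 0]

# Compare the memory footprint of the two representations
string_bytes = education.memory_usage(deep=True)
category_bytes = education_cat.memory_usage(deep=True)
print(string_bytes, category_bytes)
```

The exact byte counts depend on the pandas version and platform, but with only three unique labels across 4,000 rows the categorical column should be noticeably smaller.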

Suppose we have the following truncated credit card dataset that represents the demographics of a group of credit card holders:
