What is the factorize function in Pandas?

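The factorize function in Pandas encodes an object as an enumerated type or categorical variable: each distinct value is assigned an integer code. This numerical representation is helpful when all we need is to tell the distinct values apart, for example, when turning a column of labels into numbers.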

Syntax

The syntax of the factorize function is as follows:

pandas.factorize(values, sort=False, na_sentinel=-1, size_hint=None)

Parameters

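The factorize function takes in four parameters: values, sort, na_sentinel, and size_hint.

Only the values parameter is required. The rest are optional. Note that this signature applies to Pandas versions before 2.0: na_sentinel was deprecated in version 1.5 and replaced by the boolean use_na_sentinel parameter in version 2.0.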

The description of each parameter is given below:

Parameter	Description
`values`	A one-dimensional sequence, such as a list, an ndarray, or a Series.
`sort`	Whether to sort the unique values and remap the codes so that each code still points to the same value. It takes a `bool` value. By default, it is `False`.
`na_sentinel`	The code used to mark missing values (“not found”). By default, it is `-1`. If it is `None`, `NaN` is not dropped and is counted among the unique values.
`size_hint`	An optional `int` that hints at the size of the internal hash table.
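
To make the effect of sort and the missing-value code concrete, here is a minimal sketch. The sample values are made up for illustration, and the input is wrapped in an object ndarray on the assumption that recent Pandas releases discourage plain lists. It relies only on the default behavior, in which missing values receive the code -1 and are left out of the unique values.

```python
import numpy as np
import pandas as pd

# Illustrative data: None stands in for a missing value.
values = np.array(["b", None, "a", "b"], dtype=object)

# Default: codes follow order of first appearance; the missing value gets -1.
codes, uniques = pd.factorize(values)
print(codes)    # [ 0 -1  1  0]
print(uniques)  # ['b' 'a']

# sort=True sorts the uniques and remaps the codes to match.
sorted_codes, sorted_uniques = pd.factorize(values, sort=True)
print(sorted_codes)    # [ 1 -1  0  1]
print(sorted_uniques)  # ['a' 'b']
```

In both calls, the missing value keeps the code -1 and never appears in the unique values.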

Return value

The factorize function returns two objects: codes and uniques.

codes is an integer ndarray that indexes into uniques: each entry is the position in uniques of the corresponding input value.

uniques holds the unique valid values. Depending on the type of values, it can be an ndarray, an Index, or a Categorical.
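
The relationship between the two return values can be sketched as follows. The sample data is illustrative only. Indexing uniques with codes rebuilds the original values, and the type of uniques follows the type of the input.

```python
import numpy as np
import pandas as pd

values = np.array(["b", "b", "a", "c", "b"], dtype=object)
codes, uniques = pd.factorize(values)

# Indexing uniques with codes rebuilds the original sequence.
rebuilt = uniques[codes]
print(rebuilt.tolist())  # ['b', 'b', 'a', 'c', 'b']

# With a Series as input, uniques comes back as an Index instead of an ndarray.
series_codes, series_uniques = pd.factorize(pd.Series(["x", "y", "x"]))
print(type(uniques).__name__)         # ndarray
print(type(series_uniques).__name__)  # Index
```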

Example

The code snippet below shows how we can use the factorize function in Pandas:

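import pandas as pd
import numpy as np

# Wrap the values in an ndarray: recent Pandas releases deprecate passing a plain list.
values = np.array(['b', 'b', 'a', 'c', 'b'], dtype=object)

# Default: codes follow the order in which the values first appear.
codes, uniques = pd.factorize(values)
print("Codes", codes)      # Codes [0 0 1 2 0]
print("Uniques", uniques)  # Uniques ['b' 'a' 'c']
print()

# With sort=True, the uniques are sorted and the codes are remapped to match.
codes, uniques = pd.factorize(values, sort=True)
print("Codes", codes)      # Codes [1 1 0 2 1]
print("Uniques", uniques)  # Uniques ['a' 'b' 'c']
print()

# Using a Categorical type
# Custom categories for encoding are defined using the categories parameter.
cat = pd.Categorical(['a', 'a', 'c'], categories=['a', 'b', 'c'])
codes, uniques = pd.factorize(cat)
print("Codes", codes)      # Codes [0 0 1]
print("Uniques", uniques)  # A Categorical holding 'a' and 'c'; the unused category 'b' is kept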
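
As a further illustration, the same call can encode a DataFrame column of labels. This is a minimal sketch: the DataFrame, the `city` column, and the `city_code` column name are hypothetical and chosen only for the example.

```python
import pandas as pd

# Hypothetical data: a column of text labels to encode.
df = pd.DataFrame({"city": ["Paris", "Lima", "Paris", "Oslo"]})

# Store the integer codes in a new column and keep the uniques for decoding.
df["city_code"], cities = pd.factorize(df["city"])

print(df["city_code"].tolist())  # [0, 1, 0, 2]
print(list(cities))              # ['Paris', 'Lima', 'Oslo']
```

Keeping the uniques alongside the codes makes it possible to map the integers back to the original labels later.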
