What is the factorize function in Pandas?
The factorize function in Pandas encodes an object as an enumerated type or categorical variable. It maps each distinct value in the object to an integer code, which is helpful when all that matters is identifying the distinct values.
Syntax
The syntax of the factorize function is as follows:
pandas.factorize(values, sort=False, na_sentinel=-1, size_hint=None)
Parameters
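The factorize function takes four parameters: values, sort, na_sentinel, and size_hint.
Only the values parameter is required. The rest are optional. Note that na_sentinel was deprecated in pandas 1.5 and removed in pandas 2.0 in favor of the boolean use_na_sentinel parameter.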
The description of each parameter is given below:
| Parameter | Description |
|---|---|
| values | A one-dimensional sequence, such as a list. |
| sort | Whether to sort the unique values and remap the codes so that the relationship between them is preserved. It takes a bool value. By default, it is False. |
| na_sentinel | The value used to mark missing ("not found") values. By default, it is -1. If it is None, NaN is not dropped and is kept as one of the unique values. |
| size_hint | An optional hint to the hash table sizer. |
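As a minimal sketch of how sort and the missing-value sentinel behave, the snippet below factorizes a small list that contains None. The list contents are made up for illustration, and the call relies only on the default missing-value handling (it does not pass na_sentinel), so it should behave the same way on older and newer pandas releases.

```python
import pandas as pd

# Illustrative input with one missing value.
values = ["b", None, "a", "b"]

# Default behaviour: the missing value is left out of uniques
# and marked in codes with the sentinel -1.
codes, uniques = pd.factorize(values)
print(codes.tolist())   # [0, -1, 1, 0]
print(list(uniques))    # ['b', 'a']

# sort=True sorts uniques and remaps the codes to match.
sorted_codes, sorted_uniques = pd.factorize(values, sort=True)
print(sorted_codes.tolist())   # [1, -1, 0, 1]
print(list(sorted_uniques))    # ['a', 'b']
```

Without sort, the codes follow the order in which values first appear; with sort=True, they follow the sorted order of the unique values.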
Return value
The factorize function returns two objects: codes and uniques.
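codes refers to an integer ndarray that is an indexer into uniques, so each code is the position of the corresponding value in uniques.
uniques refers to the valid unique values. It can be an ndarray, an Index, or a Categorical, depending on the type of values.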
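The relationship between the two return values can be sketched as follows, assuming an input with no missing values. The variable names rebuilt and series_uniques are chosen here for illustration only.

```python
import numpy as np
import pandas as pd

values = ["b", "b", "a", "c", "b"]
codes, uniques = pd.factorize(values)

# Indexing uniques with codes rebuilds the original sequence.
rebuilt = uniques[codes]
print(list(rebuilt))  # ['b', 'b', 'a', 'c', 'b']

# A list input yields a NumPy array of uniques,
# while a Series input yields an Index.
print(type(uniques).__name__)  # ndarray
_, series_uniques = pd.factorize(pd.Series(values))
print(type(series_uniques).__name__)  # Index
```

This round trip only holds when no value is missing: a code of -1 would index the last element of uniques rather than reproduce the missing value.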
Example
The code snippet below shows how we can use the factorize function in Pandas:
    import pandas as pd
    import numpy as np

    codes, uniques = pd.factorize(['b', 'b', 'a', 'c', 'b'])
    print("Codes", codes)
    print("Uniques", uniques)
    print('\n')

    # With sort = True
    codes, uniques = pd.factorize(['b', 'b', 'a', 'c', 'b'], sort=True)
    print("Codes", codes)
    print("Uniques", uniques)
    print('\n')

    # Using a Categorical type
    # Custom categories for encoding are defined using the categories parameter.
    cat = pd.Categorical(['a', 'a', 'c'], categories=['a', 'b', 'c'])
    codes, uniques = pd.factorize(cat)
    print("Codes", codes)
    print("Uniques", uniques)
    print('\n')