Discretize and Clip Data

Learn how to discretize and clip numerical data based on different cutoffs.

Overview

Another common way of handling numerical data is to segment it at various cutoff values and then apply a transformation based on those segments. The functions we’ll explore in this lesson discretize and clip data.

Discretize data

Discretization is the process of dividing a range of continuous numerical values into discrete categories or bins. Although discretization loses some information, it also has several benefits:

  • It simplifies the data by reducing its granularity, making it easier to visualize and analyze.

  • It can improve computational efficiency because there are fewer unique values to consider.

  • It helps anonymize data and protect sensitive information: aggregating values into a small number of bins reduces the risk of identifying individuals.

  • It reduces noise and the influence of outliers, because small fluctuations and extreme values are absorbed into the bins they fall in.

  • It can make the data more interpretable and intuitive.

The pandas functions that let us easily discretize the data into bins are cut() and qcut(). Let’s see these functions in action on a subset of the credit card dataset.
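As a quick orientation before the dataset examples, the sketch below contrasts the two functions on a small made-up series (the values are an assumption for illustration and do not come from the credit card dataset). It relies on the fact that cut() builds equal-width intervals, while qcut() builds quantile-based intervals that hold roughly the same number of values.

```python
import pandas as pd

# Made-up, right-skewed values (not from the credit card dataset).
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# cut(): two equal-width intervals spanning the range from 1 to 100.
equal_width = pd.cut(values, bins=2)

# qcut(): two quantile-based intervals, each holding about half the values.
equal_freq = pd.qcut(values, q=2)

# Compare how many values land in each interval.
print(equal_width.value_counts(sort=False))
print(equal_freq.value_counts(sort=False))
```

With skewed data like this, the single large value stretches the equal-width intervals, so most values share one bin, whereas the quantile-based intervals stay balanced.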

The cut() function

The cut() function transforms a continuous variable into a categorical one by grouping its values into discrete intervals. For example, we can bin the values in the Age column into four equal-width intervals and store the result in a new Age_Group column by passing the integer 4 to the bins parameter. Note that the intervals have equal width, not equal counts, so some groups may contain more rows than others.
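A minimal sketch of this binning step might look like the following. The DataFrame here is a hypothetical stand-in with made-up ages, since the credit card dataset itself is not reproduced in this lesson; only the Age and Age_Group column names come from the text.

```python
import pandas as pd

# Hypothetical stand-in for the credit card subset; the ages are made up.
df = pd.DataFrame({"Age": [21, 25, 30, 38, 41, 47, 52, 58, 64, 69]})

# bins=4 splits the range from the minimum to the maximum Age
# into four intervals of equal width.
df["Age_Group"] = pd.cut(df["Age"], bins=4)

# Each row now carries the interval its Age falls into.
print(df)

# Count how many rows fall into each interval, in interval order.
print(df["Age_Group"].value_counts(sort=False))
```

The new column has a categorical dtype whose categories are interval objects. If readable names are preferred over intervals, cut() also accepts a labels argument with one label per bin.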
