How to bin numerical columns into groups using pandas
Binning is the process of grouping a continuous numerical variable into a smaller number of discrete intervals, or bins, which converts it into a categorical variable. It is a common preprocessing technique in data analysis and machine learning.
Binning can simplify the data and make it easier to summarize, visualize, and model, and it can help reveal patterns or trends.
The cut() function in pandas
The pandas library provides a convenient way of binning numerical columns using the cut() function. This function takes a numerical column as input and either divides its range into equal-width bins, when given a number of bins, or uses the bin edges you provide.
Example
The following example code illustrates how to bin a numerical column using the cut() function.
import pandas as pd

# Create sample DataFrame
data = pd.DataFrame({'Unit': [5, 15, 20, 25, 30, 40, 45, 50]})

# Bin the Unit column into 3 equal-width bins
data['UnitGroup'] = pd.cut(data['Unit'], bins=3)

# Print the DataFrame
print(data)
Explanation
Line 1: We import pandas with the pd alias.
Line 4: We create a DataFrame with a single column named Unit.
Line 7: We use the cut() function to create a new column named UnitGroup by binning the Unit column into 3 equal-width bins.
Line 10: We print the DataFrame, which now has a UnitGroup column holding the bin interval for each row's Unit value.
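To see which edges cut() chose, one option is to pass retbins=True, which returns the computed edges alongside the binned column. The sketch below reuses the Unit data from the example; the variable names binned, edges, and counts are only illustrative.

```python
import pandas as pd

data = pd.DataFrame({'Unit': [5, 15, 20, 25, 30, 40, 45, 50]})

# retbins=True makes cut() return a tuple: (binned values, array of bin edges)
binned, edges = pd.cut(data['Unit'], bins=3, retbins=True)

# Count the rows in each bin, keeping the bins in their natural order
counts = binned.value_counts(sort=False)

print(edges)
print(counts)
```

The bins have equal width, not equal counts. The lowest edge comes out slightly below the minimum value because cut() extends the range a little so that the smallest value falls inside the first bin.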
If you want to display descriptive labels instead of the interval for each bin, you can pass the labels parameter to the cut() function, supplying one label per bin. The qcut() function, covered below, accepts the same parameter.
data['UnitGroup'] = pd.cut(data['Unit'], bins=2, labels=['slightly high', 'Very High'])
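Labels can also be combined with explicit bin edges, which cut() accepts as a list in place of a bin count. The following is a minimal sketch; the edges, the label names, and the UnitLevel column are assumptions chosen for illustration, not values from the example above.

```python
import pandas as pd

data = pd.DataFrame({'Unit': [5, 15, 20, 25, 30, 40, 45, 50]})

# Hypothetical cut points: (0, 20], (20, 40], (40, 60]
edges = [0, 20, 40, 60]
# Hypothetical labels, one per interval
labels = ['low', 'medium', 'high']

# By default each interval includes its right edge, so 20 is 'low' and 40 is 'medium'
data['UnitLevel'] = pd.cut(data['Unit'], bins=edges, labels=labels)

print(data)
```

Explicit edges are useful when the groups must follow fixed thresholds rather than depend on the minimum and maximum of the data. Values that fall outside all of the edges become NaN.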
The qcut() function in pandas
Quantile cut, or qcut(), is a function that divides a set of values into bins according to the sample quantiles. With this function, each bin holds roughly the same number of values, even though the bin widths may differ, which is helpful when working with skewed data.
Example
Here is an illustration of how to bin numerical columns in pandas using the qcut() function:
import pandas as pd

# create a DataFrame with a numerical column
data = pd.DataFrame({'Unit': [5, 15, 20, 25, 30, 40, 45, 50]})

# bin the numerical column into 3 groups
data['QcutBin'] = pd.qcut(data['Unit'], q=3)

# print the new DataFrame with the binned column
print(data)
Explanation
Line 1: We import pandas with the pd alias.
Line 4: We create a DataFrame with a single column named Unit.
Line 7: We use qcut() to bin the values in that column into 3 groups, and store the results in a new column called QcutBin. The q parameter specifies the number of quantile-based bins to create, in this case 3.
Line 10: We print the new DataFrame with the binned column.
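The difference between the two functions is easiest to see on skewed data. The sketch below uses a made-up series with a single large outlier (these values are an assumption for illustration, not part of the examples above) and counts how many values land in each bin.

```python
import pandas as pd

# Hypothetical skewed data: eight small values and one large outlier
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])

# Equal-width bins: the outlier stretches the range, so most values share one bin
equal_width = pd.cut(values, bins=3).value_counts(sort=False)

# Equal-frequency bins: edges follow the quantiles, so the counts are balanced
equal_freq = pd.qcut(values, q=3).value_counts(sort=False)

print(equal_width)
print(equal_freq)
```

With cut(), eight of the nine values fall into the first bin and the middle bin is empty, while qcut() places three values in each bin.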
Conclusion
In this Answer, we learned how to use the cut() and qcut() functions in pandas to group numerical columns into bins: cut() creates equal-width bins or bins with custom edges, while qcut() creates equal-frequency bins. By using these functions effectively, we can convert continuous data into categorical data, which can be simpler to analyze and interpret.