The groupby Function

Learn how to use Pandas groupby function for data analysis.

The lessons up to this point have covered data cleaning, manipulation, and processing with Pandas. Pandas is a great library for data analysis as well. In this chapter, we’ll go over Pandas functions that can be used to analyze tabular data.

Data analysis

Data analysis can be defined as the process of inferring insights, discovering useful information, and drawing results from the data at hand. It’s mainly done to support a decision-making process or to explore the data before creating a machine learning model.

In this lesson, we’ll use a new dataset called grocery that contains the daily sales quantities of some products at a grocery store. Let’s take a quick look at this dataset.

import pandas as pd
grocery = pd.read_csv("grocery.csv")
print("The size of the DataFrame:")
print(grocery.shape)
print("\nThe column names are:")
print(list(grocery.columns))
print("\nThe first five rows:")
print(grocery.head())

One of the most commonly used functions in data analysis is the groupby function. It groups observations (rows) according to the distinct values in a given column. Let’s say we have a DataFrame that contains the sales information about the products in a retail store. Each product belongs to a product group, which is indicated in the product_group column. By using the groupby function, we can group the products based on the product groups they belong to. Then, we can calculate a wide range of aggregations, such as average product price, daily total sales, and so on.

How to use the groupby function

The only requirement is to pass the name of the column used for grouping to the groupby function. Then, the rows are grouped based on the distinct values in the given column. The following drawing illustrates how the groupby function operates.

The groupby function creates the groups, but they don’t provide any information unless we do some aggregations. For instance, in the illustration above, if we apply the mean function to the price column, we can calculate the average product price for each product group. Let’s do this operation on the grocery.

import pandas as pd
grocery = pd.read_csv("grocery.csv")
print(grocery.groupby("product_group").mean())

As we see in the output above, once the groups are formed and the mean function is applied, Pandas calculates the mean value for all the numerical columns. So, we’re able to see the average sales quantities as well. Average product code is meaningless because the product code is just used as an identity.

If we’re only interested in the average price, we can select the columns before applying the groupby function. For instance, in line 5 of the following block of code, we first select the product_group and price columns from the grocery and then group the rows by the product_column. Finally, the mean function is applied to see the average price for each product group.

import pandas as pd
grocery = pd.read_csv("grocery.csv")
print(grocery[["product_group","price"]].groupby("product_group").mean())

We now see only the average price of the products in each group.