# The groupby Function

Learn how to use Pandas groupby function for data analysis.

## We'll cover the following

The lessons up to this point have covered data cleaning, manipulation, and processing with Pandas. Pandas is a great library for data analysis as well. In this chapter, we’ll go over Pandas functions that can be used to analyze tabular data.

## Data analysis

Data analysis can be defined as the process of inferring insights, discovering useful information, and drawing results from the data at hand. It’s mainly done to support a decision-making process or to explore the data before creating a machine learning model.

In this lesson, we’ll use a new dataset called `grocery`

that contains the daily sales quantities of some products at a grocery store. Let’s take a quick look at this dataset.

import pandas as pdgrocery = pd.read_csv("grocery.csv")print("The size of the DataFrame:")print(grocery.shape)print("\nThe column names are:")print(list(grocery.columns))print("\nThe first five rows:")print(grocery.head())

One of the most commonly used functions in data analysis is the `groupby`

function. It groups observations (rows) according to the distinct values in a given column. Let’s say we have a `DataFrame`

that contains the sales information about the products in a retail store. Each product belongs to a product group, which is indicated in the `product_group`

column. By using the `groupby`

function, we can group the products based on the product groups they belong to. Then, we can calculate a wide range of aggregations, such as average product price, daily total sales, and so on.

## How to use the `groupby`

function

The only requirement is to pass the name of the column used for grouping to the `groupby`

function. Then, the rows are grouped based on the distinct values in the given column. The following drawing illustrates how the `groupby`

function operates.

The `groupby`

function creates the groups, but they don’t provide any information unless we do some aggregations. For instance, in the illustration above, if we apply the `mean`

function to the `price`

column, we can calculate the average product price for each product group. Let’s do this operation on the `grocery`

.

import pandas as pdgrocery = pd.read_csv("grocery.csv")print(grocery.groupby("product_group").mean())

As we see in the output above, once the groups are formed and the mean function is applied, Pandas calculates the `mean`

value for all the numerical columns. So, we’re able to see the average sales quantities as well. Average product code is meaningless because the product code is just used as an identity.

If we’re only interested in the average price, we can select the columns before applying the `groupby`

function. For instance, in line **5** of the following block of code, we first select the `product_group`

and `price`

columns from the `grocery`

and then group the rows by the `product_column`

. Finally, the `mean`

function is applied to see the average price for each product group.

import pandas as pdgrocery = pd.read_csv("grocery.csv")print(grocery[["product_group","price"]].groupby("product_group").mean())

We now see only the average price of the products in each group.