Search⌘ K
AI Features

Challenge 5: Multiple Aggregations (Difficult)

Explore how to apply multiple aggregation functions simultaneously on grouped DataFrame columns using Pandas. Understand the use of .agg() with dictionaries to compute sum, mean, and max statistics by country, and learn how to flatten MultiIndex columns for easy access to the resulting data. By the end, you'll be able to convert grouped data into a nested dictionary format useful for analysis or reporting.

Problem definition

Your music analyst is interested in knowing multiple statistics at once, grouped by country.These are as follows:

  1. Sum of plays
  2. Average of plays
  3. Maximum fans from all artists in the country

Expected output

A dict object, with the keys being the country names, and the values being a dict of key-value pairs mapping to the names and values of the required statistics

Example:

{"US": {"fans_max": 10, "plays_sum": 30, "plays_mean": 40}, "UK": etc..}

Challenge

Python 3.5
import pandas as pd
def test():
df = pd.read_csv('music.csv')
pass

Solution

Python 3.5
import pandas as pd
def test():
df = pd.read_csv('music.csv')
grp = df.groupby('country').agg({'plays': ['sum', 'mean'], 'fans': ['max']})
grp.columns = ['_'.join(col) for col in grp.columns.values]
res = grp.to_dict(orient='index')
return res
print(test())

Solution explanation

There are multiple steps to finishing this challenge:

  1. Use the .agg() function to apply multiple functions to different columns in the data. There are multiple methods to call .agg(). The solution here utilizes the method of passing in a dict, where the keys are the column names, and the values are a list of names of the required operations.
  2. Calling .agg() using this method will result in a MultiIndex. So, the trick grp.columns = ['_'.join(col) for col in grp.columns.values] will flatten the index, generating column names as <column>_<operation>, e.g., fans_max.
  3. Now, you have a DataFrame, with the index being the country, and a few columns with the correct names. You call .to_dict(orient='index') to have a dict with the keys as the countries, and the values being a dict of col_name: value.