Challenge 5: Multiple Aggregations (Difficult)
Explore how to apply multiple aggregation functions simultaneously on grouped DataFrame columns using Pandas. Understand the use of .agg() with dictionaries to compute sum, mean, and max statistics by country, and learn how to flatten MultiIndex columns for easy access to the resulting data. By the end, you'll be able to convert grouped data into a nested dictionary format useful for analysis or reporting.
We'll cover the following...
Problem definition
Your music analyst is interested in knowing multiple statistics at once, grouped by country.These are as follows:
- Sum of plays
- Average of plays
- Maximum fans from all artists in the country
Expected output
A dict object, with the keys being the country names, and the values being a dict of key-value pairs mapping to the names and values of the required statistics
Example:
{"US": {"fans_max": 10, "plays_sum": 30, "plays_mean": 40}, "UK": etc..}
Challenge
Solution
Solution explanation
There are multiple steps to finishing this challenge:
- Use the
.agg()function to apply multiple functions to different columns in the data. There are multiple methods to call.agg(). The solution here utilizes the method of passing in adict, where the keys are the column names, and the values are a list of names of the required operations. - Calling
.agg()using this method will result in aMultiIndex. So, the trickgrp.columns = ['_'.join(col) for col in grp.columns.values]will flatten the index, generating column names as<column>_<operation>, e.g.,fans_max. - Now, you have a
DataFrame, with the index being thecountry, and a few columns with the correct names. You call.to_dict(orient='index')to have adictwith the keys as the countries, and the values being adictofcol_name: value.