Shape It Right
Learn to reshape and tidy data with pandas.
As data scientists, we don’t just collect data—we shape it to ask better questions. Yet raw data rarely arrives in the structure we need. Sometimes it comes in wide form, with similar information scattered across multiple columns. Other times, it’s long and stacked, making comparisons or plotting more difficult.
Think of your dataset like a LEGO set: the same pieces can be arranged in different ways depending on what you want to build. Want to compare variables side by side? Go wide. Need to analyze trends across categories or time? Go long.
Reshaping data is not just about formatting; it’s about making data usable. In this lesson, we’ll unpack what tidy means, explore the difference between wide and long formats, and master the reshaping tools in pandas (melt(), pivot(), pivot_table(), stack(), and unstack()) so we can transform a DataFrame to fit the task at hand.
What is tidy data?
Tidy data is a standardized way to organize datasets that makes analysis, modeling, and visualization much easier. It follows three fundamental principles:
Each variable forms a column. Every distinct attribute or measurement is stored in its own column.
Each observation forms a row. Each row represents one complete set of measurements or attributes for a single entity or event.
Each type of observational unit forms a table. Different entities or observational types should be stored in separate tables to avoid mixing unrelated data.
This consistent and predictable structure is essential because many data manipulation and visualization tools expect data to be tidy. When data is tidy, we can easily apply filters, groupings, summaries, and charts without complicated reshaping.
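To make this concrete, here is a minimal sketch of a tidy table and the kind of filter and grouped summary it supports without any reshaping. The table and its values are hypothetical, invented only for illustration: each row is one product-month observation, and each column holds one variable.

```python
import pandas as pd

# Hypothetical tidy table: one row per (product, month) observation,
# one column per variable.
tidy = pd.DataFrame({
    "Product": ["A", "A", "B", "B"],
    "Month": ["Jan", "Feb", "Jan", "Feb"],
    "Sales": [100, 110, 150, 140],
})

# Filter: keep only the January observations.
jan = tidy[tidy["Month"] == "Jan"]

# Group and summarize: total sales per product.
totals = tidy.groupby("Product")["Sales"].sum()

print(jan)
print(totals)
```

Because every variable has its own column, both operations refer to columns by name and need no knowledge of how many months or products the table contains.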
Wide format vs. long format
Understanding how our data is structured helps us decide how to reshape it.
Wide format: The values of one variable are spread across multiple columns. For example, monthly sales might be stored in separate columns like Jan_Sales, Feb_Sales, and Mar_Sales. This makes it easy to compare values side by side, but it can become unwieldy for functions or visualizations that require data in a stacked format. Let’s create a sample dataset and see what a wide format looks like:
import pandas as pd

# Wide format example
df_wide = pd.DataFrame({
    'Product': ['A', 'B', 'C'],
    'Jan_Sales': [100, 150, 200],
    'Feb_Sales': [110, 140, 210],
    'Mar_Sales': [120, 160, 220]
})

print("Wide format data:")
print(df_wide)
Here, each month’s sales sit in a separate column; this is the wide format.
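As a rough illustration of the trade-off described above, the sketch below rebuilds the same sample data and performs one side-by-side comparison and one row-wise total. The column names Jan_to_Feb_Change and Q1_Total are hypothetical, chosen only for this example.

```python
import pandas as pd

# Same sample data as above, rebuilt so this snippet runs on its own.
df_wide = pd.DataFrame({
    "Product": ["A", "B", "C"],
    "Jan_Sales": [100, 150, 200],
    "Feb_Sales": [110, 140, 210],
    "Mar_Sales": [120, 160, 220],
})

# Side-by-side comparison is a plain column subtraction in wide format.
comparison = df_wide.assign(
    Jan_to_Feb_Change=df_wide["Feb_Sales"] - df_wide["Jan_Sales"]
)

# A per-product total needs the month columns listed explicitly,
# and this list has to grow by hand whenever a new month is added.
month_cols = ["Jan_Sales", "Feb_Sales", "Mar_Sales"]
comparison["Q1_Total"] = df_wide[month_cols].sum(axis=1)

print(comparison)
```

The subtraction shows why wide data is convenient for comparing two columns, while the hard-coded month_cols list hints at why the same layout gets unwieldy as the number of columns grows.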
Long format: The data stacks repeated measurements into one column and uses another ...