Shape It Right
Learn to reshape and prepare structured data for clear, analysis-ready insights using pandas.
As data analysts, we rely on structure to make sense of information. Imagine trying to analyze survey responses where each answer is stored in a separate file, or trying to compare monthly sales when each month is its own column. That kind of clutter makes it nearly impossible to run clean comparisons or build effective visuals.
That’s where data reshaping comes in. Reshaping turns scattered, inconsistent structures into tidy, streamlined tables in which each row is an observation, each column is a variable, and every piece of data fits into place.
In this lesson, we’ll unpack what tidy data really means, explore wide vs. long formats, and get hands-on with pandas tools like melt(), pivot(), pivot_table(), stack(), and unstack(), so we can reshape any DataFrame to suit our analysis.
What is tidy data?
“Tidy” sounds like a colloquial term, right? In technical terms, however, tidy data follows three simple rules:
Each variable forms a column. Every distinct attribute or measurement is stored in its own column. For example, in a student dataset, Name, Subject, and Score should each be in separate columns, not combined into one.
Name | Subject | Score |
Alice | Math | 89 |
Bob | Math | 77 |
Each observation forms a row. Each row represents one complete set of measurements or attributes for a single entity or event. For example, a single row for “Alice’s Math score” means that her Name, Subject, and Score are all in one row.
Name | Subject | Score |
Alice | Math | 89 |
Alice | Science | 90 |
Each type of observational unit forms a table. Different entities or observational types should be stored in separate tables to avoid mixing unrelated data. For example, use one table for student scores:
Name | Subject | Score |
Alice | Math | 89 |
And a separate table for teacher information:
Name | Subject | Room |
Mr. John | Math | 101 |
This consistent and predictable structure is essential because many data manipulation and visualization tools expect data to be tidy. When data is tidy, we can easily apply filters, groupings, summaries, and charts without complicated reshaping.
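To make that concrete, here is a minimal sketch that puts the student scores from the tables above into a pandas DataFrame. The variable names (scores, math_scores, mean_by_subject) are our own illustrative choices, not part of any required API. Because the data is already tidy, filtering and summarizing each take a single line.

```python
import pandas as pd

# Student scores in tidy form: one row per (Name, Subject) observation,
# one column per variable. Values are taken from the example tables above.
scores = pd.DataFrame(
    {
        "Name": ["Alice", "Alice", "Bob"],
        "Subject": ["Math", "Science", "Math"],
        "Score": [89, 90, 77],
    }
)

# Filter: keep only the Math observations.
math_scores = scores[scores["Subject"] == "Math"]

# Group and summarize: average score per subject.
mean_by_subject = scores.groupby("Subject")["Score"].mean()

print(math_scores)
print(mean_by_subject)
```

Neither step needs to know how many subjects or students exist, which is the practical payoff of keeping one variable per column.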
Wide format vs. long format
Understanding how our data is structured is key to effective analysis and visualization. Two common data shapes, namely wide and long formats, determine how we organize variables and observations. Knowing the difference between the two helps us decide when and how to reshape the data for different tasks.
Wide format: In wide format, similar measurements are spread across multiple columns. This can be convenient for quick human inspection, but is harder to automate.
For example, monthly sales might be stored in separate columns like Jan_Sales, Feb_Sales, ...
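As a rough sketch of what such a wide table looks like in pandas, the snippet below builds a small DataFrame with one column per month. The Store column and all sales figures are made up for illustration. It then uses melt(), one of the tools named in the introduction, to show how the month columns can be gathered into rows.

```python
import pandas as pd

# Hypothetical wide-format sales table: one column per month.
# The "Store" column and all numbers are invented for this example.
wide_sales = pd.DataFrame(
    {
        "Store": ["A", "B"],
        "Jan_Sales": [100, 80],
        "Feb_Sales": [120, 95],
    }
)

# melt() keeps "Store" as an identifier and turns the month columns
# into (Month, Sales) pairs, giving one row per store per month.
long_sales = wide_sales.melt(
    id_vars="Store",
    var_name="Month",
    value_name="Sales",
)

print(wide_sales)
print(long_sales)
```

In the wide table, adding a new month means adding a new column, so any code that lists the month columns by name has to change. In the melted table, a new month is just more rows.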