Shape the Story
Learn to create histograms, box plots, heatmaps, and stackplots with Matplotlib, and understand how these tools reveal distributions, outliers, and complex relationships in your data.
In our last lesson, we took our first exciting steps into data visualization, learning to tell basic stories with line, bar, scatter, and pie charts. Now, we’re ready to dive deeper. Imagine not just seeing the outline of a character in our data story, but truly understanding their personality, their range, and how they interact with others. In this lesson, we will explore powerful visualization techniques that help us understand the distribution of our data and the intricate relationships between variables, pushing our storytelling abilities beyond the basics.
Understanding data distribution
When we talk about “data distribution,” we’re asking: How are our values spread out? Are they clustered together? Do they lean toward one side? Are there any extreme values? Understanding distribution is essential to making sense of data.
Histogram
A histogram is a special kind of bar chart that helps us understand the distribution of a single numerical variable. Instead of showing categories, it groups data into “bins” (ranges) and then shows us how many data points fall into each bin. Think of it like sorting people by height into different height groups, and then counting how many people are in each group.
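To make the idea of bins concrete, here is a minimal sketch in plain Python that does the sorting and counting by hand. The heights are made-up values, and the 10 cm bin width is an assumption chosen for illustration; Matplotlib chooses its own bin edges by default.

```python
from collections import Counter

# Hypothetical heights in centimeters (illustrative values, not real data).
heights_cm = [152, 158, 161, 163, 165, 167, 168, 170, 171,
              172, 174, 175, 178, 181, 183, 186, 192]

bin_width = 10  # assumed width of each height group, in cm

# Map each height to the lower edge of its bin, then count per bin.
# Each bin covers [start, start + bin_width), so 170 lands in the 170-180 bin.
bin_counts = Counter((h // bin_width) * bin_width for h in heights_cm)

# Print a simple text histogram, one row per bin.
for start in sorted(bin_counts):
    count = bin_counts[start]
    print(f"{start}-{start + bin_width} cm: {'#' * count} ({count})")

# Output:
# 150-160 cm: ## (2)
# 160-170 cm: ##### (5)
# 170-180 cm: ###### (6)
# 180-190 cm: ### (3)
# 190-200 cm: # (1)
```

Each row is one bin, and the length of the row plays the role of the bar height in a real histogram.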
When we look at a histogram, we can quickly see where most of our data lies and whether the values cluster around a certain number. We can also observe the shape of the data, noting whether it's spread evenly or has a long “tail” on one side (skewed). Furthermore, we can spot multiple peaks, which suggest that our data might contain different groups. For instance, if we’re looking at customer ages, a histogram can show us whether most of our customers are young, old, or evenly distributed across all age groups. It’s an invaluable tool for univariate analysis, giving us a rough visual sense of where the mean and median fall and how skewed the data is.
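The link between skewness, mean, and median can be checked with a few lines of code. The sketch below uses Python's built-in statistics module on a small set of made-up customer ages (an assumption for illustration). A mean noticeably above the median often suggests a long tail on the right, although this is a rule of thumb rather than a guarantee.

```python
import statistics

# Hypothetical customer ages (illustrative values, not real data).
# Most customers are in their twenties, with a few much older ones.
ages = [22, 23, 24, 25, 25, 26, 27, 28, 30, 35, 48, 67]

mean_age = statistics.mean(ages)      # pulled upward by the older customers
median_age = statistics.median(ages)  # middle value, less affected by extremes

print(f"mean = {mean_age:.1f}, median = {median_age}")
# Output: mean = 31.7, median = 26.5

# Rule of thumb: mean > median hints at a right (positive) skew.
if mean_age > median_age:
    print("The mean sits above the median: likely a tail on the right.")
```

On a histogram of these ages, we would expect a tall cluster of bars on the left and a thin tail stretching to the right.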
Fun fact: The term “histogram” was introduced by Karl Pearson, a pioneer in mathematical statistics, in the late 19th century. Histograms remain one of the most effective ways to summarize the shape of continuous data.
To create a histogram in Matplotlib (with pyplot imported as plt via import matplotlib.pyplot as plt), use:
plt.hist(data, bins=10, color='skyblue', edgecolor='black')
This command creates a histogram where data is your numerical column, bins determines how many ranges the data is divided into, and the color and edgecolor arguments help with visual clarity. ...
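Putting the pieces together, a complete script might look like the sketch below. The ages list is made-up sample data, and the title and axis labels are our own choices for illustration; only the plt.hist call comes from the lesson.

```python
import matplotlib.pyplot as plt

# Hypothetical customer ages (illustrative values, not real data).
ages = [18, 21, 22, 22, 23, 24, 25, 25, 26, 27,
        27, 28, 29, 30, 31, 32, 33, 35, 36, 38,
        40, 42, 45, 47, 50, 53, 57, 61, 66, 72]

# Draw the histogram with 10 equal-width bins.
# plt.hist also returns the count per bin and the bin edges.
counts, bin_edges, _ = plt.hist(ages, bins=10, color='skyblue', edgecolor='black')

# Labels so the chart can be read on its own.
plt.title("Distribution of customer ages")
plt.xlabel("Age (years)")
plt.ylabel("Number of customers")

plt.show()
```

Besides drawing the chart, plt.hist hands back the counts and the bin edges, so we can inspect the exact numbers behind each bar. With bins=10 there are 11 edges, because every bin needs both a left and a right boundary.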