Unveiling the Numbers
Discover how to analyze a single data variable's behavior through univariate analysis. Learn to measure central tendency, variability, and distribution shape using mean, median, mode, variance, standard deviation, skewness, and kurtosis. This lesson equips you to identify data patterns, spot outliers, and prepare for deeper analysis.
Before building models or drawing conclusions, analysts must first understand how each variable behaves on its own. This lesson focuses on exploring central tendency, dispersion, and the shape of a distribution, including skewness and kurtosis.
Imagine we’re handed a complex machine with hundreds of moving parts, all connected and interacting. Before trying to understand how the machine works as a whole, wouldn’t we want to inspect each part individually? See how it moves, how it’s built, and whether it’s functioning properly. That’s exactly what univariate analysis does for our data.
In the last lesson, we briefly examined our data’s structure and types. Now, it’s time to listen more closely to what each column tells us. Each variable holds clues about typical values, spread, unusual observations, and overall shape that guide how we analyze the data.
In this lesson, we’ll focus on univariate statistics: exploring one variable at a time. We’ll measure its center, spread, and shape to understand its underlying distribution and spot any red flags.
What is univariate analysis?
Univariate analysis is simply the examination of a single variable in isolation. It’s about understanding its distribution, its central tendency (the typical value), and its variability (how much values differ). At this stage, we’re not concerned with relationships or interactions with other variables, just the characteristics of one variable by itself.
Whether the variable is numerical or categorical, univariate analysis helps us see what we’re dealing with. This clarity is essential: it helps detect early data issues, informs how we preprocess data, and guides the next steps of analysis.
Informational note: Univariate analysis is the “solo performance” of the data. Each variable gets its moment in the spotlight to reveal its unique characteristics.
Descriptive statistics
In univariate analysis, we aim to understand the characteristics of a single variable. Descriptive statistics provide a compact summary of that variable’s distribution, offering insights into its central tendency, spread, and shape.
Fun fact: Descriptive statistics are our data’s “elevator pitch”: quick, concise summaries that tell us the most important things in seconds.
Central tendency: Mean vs Median vs Mode
Central tendency metrics describe where the center of the data lies. For a data analyst, this means identifying what’s representative in the dataset. For example: What’s the average customer rating on our product? The three key measures of central tendency are the mean, the median, and the mode.
Let’s start with the most familiar one: the mean.
Mean
The mean is the arithmetic average. It’s calculated by summing all values and dividing by the count. It’s sensitive to extreme values (outliers), which can pull the average up or down.
For example, if five customers give ratings of
That may sound fine at first, but look closely: most ratings are around 4, yet the average is over 5. Why is that so? Because that single high rating of 10 pulls the mean upward. This shows how the mean can be influenced by extreme values.
Fun fact: The mean is so sensitive to outliers that it behaves like the average net worth at a billionaires’ party: a few extreme values can drastically distort what looks “typical”!
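To make the outlier effect concrete, here is a minimal sketch. The five ratings are hypothetical values chosen for illustration (mostly around 4, with a single 10); they are an assumption, not necessarily the ratings used in the lesson.

```python
import pandas as pd

# Hypothetical ratings (an assumption for illustration): mostly around 4, one extreme 10.
ratings = pd.Series([4, 4, 5, 3, 10])

# Mean of all five ratings: (4 + 4 + 5 + 3 + 10) / 5
mean_with_outlier = ratings.mean()

# Mean after setting the single extreme rating aside: (4 + 4 + 5 + 3) / 4
mean_without_outlier = ratings[ratings < 10].mean()

print(mean_with_outlier)     # 5.2
print(mean_without_outlier)  # 4.0
```

One rating out of five is enough to move the mean by more than a full point, which is why the mean alone can misrepresent the typical value.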
Median
The median is the middle value when the data is sorted. If there’s an odd number of values, it’s the one in the center; if even, it’s the average of the two central numbers. The median is more robust than the mean when dealing with skewed data.
Using the same five ratings,
Notice how the high rating of
Fun fact: The median is the fair referee, always finding the true middle, even when some scores are incredibly high or low!
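The odd and even cases of the median rule can be sketched in plain Python. The helper name `median_of` and both sample lists are hypothetical, introduced only for this illustration; in practice, pandas provides `Series.median()`.

```python
def median_of(values):
    """Return the middle value of the sorted data.

    With an even count, return the average of the two central values.
    """
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of empty data is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Odd count: exactly one value sits in the center.
        return ordered[mid]
    # Even count: average the two central values.
    return (ordered[mid - 1] + ordered[mid]) / 2


# Hypothetical sample data (assumptions for illustration).
odd_ratings = [4, 4, 5, 3, 10]   # sorted: 3, 4, 4, 5, 10
even_ratings = [3, 4, 5, 10]     # sorted: 3, 4, 5, 10

print(median_of(odd_ratings))   # 4
print(median_of(even_ratings))  # 4.5
```

Replacing the 10 with 100 leaves both results unchanged, which is what makes the median robust to extreme values.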
Mean vs Median
Understanding mean vs median is critical in data analysis:
If the mean > median → data is likely right-skewed
If the mean < median → data is likely left-skewed
Comparing the mean with the median gives a quick first read on skew and shows how far extreme values are pulling the average. Treat it as a rule of thumb rather than a guarantee.
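The comparison can be sketched as a small check. The function name `skew_hint` and both sample Series are hypothetical, made up for illustration; pandas' `Series.skew()` is printed alongside so the heuristic can be compared with a computed skewness value.

```python
import pandas as pd


def skew_hint(series):
    """Heuristic only: compare mean and median to guess the skew direction."""
    mean, median = series.mean(), series.median()
    if mean > median:
        return "likely right-skewed"
    if mean < median:
        return "likely left-skewed"
    return "roughly symmetric"


# Hypothetical data (assumptions for illustration).
long_right_tail = pd.Series([1, 2, 2, 3, 3, 3, 4, 20])       # one large value
long_left_tail = pd.Series([1, 17, 18, 18, 19, 19, 19, 20])  # one small value

for name, data in [("long_right_tail", long_right_tail),
                   ("long_left_tail", long_left_tail)]:
    print(name, data.mean(), data.median(), skew_hint(data), data.skew())
```

In both samples the sign of `skew()` agrees with the hint, but that agreement is not guaranteed for every dataset, so the hint is best treated as a prompt to look at the distribution itself.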
Mode
The mode is the most frequently occurring value in the dataset. It identifies what’s most common and is especially useful for categorical or discrete variables. Again, let's identify it with the same ratings as before:
The mode is especially helpful when we want to know what’s most common, whether we're looking at customer preferences or product categories.
Fun fact: The mode is the “popularity award winner” of our data, literally showing us what’s most common!
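For categorical data, the mode is often the only one of the three measures that applies. The sketch below uses a hypothetical Series of product categories (an assumption for illustration) and shows that a variable can have more than one mode.

```python
import pandas as pd

# Hypothetical product categories (an assumption for illustration).
categories = pd.Series(["books", "toys", "books", "games", "toys", "music"])

# mode() returns a Series because several values can tie for most frequent.
modes = categories.mode()
print(modes.tolist())  # ['books', 'toys']

# value_counts() shows the frequencies behind the mode.
counts = categories.value_counts()
print(counts)
```

Because ties are possible, taking only the first mode can hide an equally common value, so it is worth checking how many modes come back.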
Example
To see these measures in action, we’ll take a simple list of customer review ratings and calculate the mean, median, and mode with pandas. Seeing the three results side by side makes it easier to understand what each one tells us.
ratings.mean(): Calculates the arithmetic average of all ratings.
ratings.median(): Finds the middle value when the ratings are sorted.
ratings.mode().iloc[0]: mode() returns the most frequent rating(s) as a Series, and .iloc[0] selects the first one.
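The three calls described above can be put together as follows. This is a minimal sketch: the rating values are hypothetical (chosen so that most ratings sit around 4 with a single 10) and may differ from the lesson's own data.

```python
import pandas as pd

# Hypothetical customer review ratings (an assumption for illustration).
ratings = pd.Series([4, 4, 5, 3, 10])

mean_rating = ratings.mean()          # arithmetic average
median_rating = ratings.median()      # middle value of the sorted ratings
mode_rating = ratings.mode().iloc[0]  # first of the most frequent rating(s)

print(f"Mean: {mean_rating}")      # Mean: 5.2
print(f"Median: {median_rating}")  # Median: 4.0
print(f"Mode: {mode_rating}")      # Mode: 4
```

With this sample, the median and mode agree on 4 while the mean is pulled up to 5.2 by the single high rating.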
In output, the mean of
Dispersion
While central tendency helps us understand the most common or typical values in a dataset, dispersion shows how much the data varies. This is crucial for analysts trying to assess consistency or variability in customer behavior, product performance, or operational metrics. In other words: Are most ratings similar, or is there a lot of fluctuation? ...