Unveiling the Numbers
Learn to analyze a single variable’s behavior using univariate statistics including mean, median, mode, variance, standard deviation, skewness, and kurtosis. This lesson helps you identify data distribution patterns, detect outliers, and build foundational analysis skills critical for deeper data exploration.
Before building models or drawing conclusions, analysts must first understand how each variable behaves on its own. This lesson focuses on exploring central tendency, dispersion, and the shape of a distribution, including skewness and kurtosis.
Imagine we’re handed a complex machine with hundreds of moving parts, all connected and interacting. Before trying to understand how the machine works as a whole, wouldn’t we want to inspect each part individually? See how it moves, how it’s built, and whether it’s functioning properly. That’s exactly what univariate analysis does for our data.
In the last lesson, we briefly examined our data’s structure and types. Now, it’s time to listen more closely to what each column tells us. Each variable holds clues about typical values, spread, unusual observations, and overall shape that guide how we analyze the data.
In this lesson, we’ll focus on univariate statistics: exploring one variable at a time. We’ll measure its center, spread, and shape to understand its underlying distribution and spot any red flags.
What is univariate analysis?
Univariate analysis is simply the examination of a single variable in isolation. It’s about understanding its distribution, its central tendency (the typical value), and its variability (how much values differ). At this stage, we’re not concerned with relationships or interactions with other variables, just the characteristics of one variable by itself.
Whether the variable is numerical or categorical, univariate analysis helps us see what we’re dealing with. This clarity is essential: it helps detect early data issues, informs how we preprocess data, and guides the next steps of analysis.
Informational note: Univariate analysis is the “solo performance” of the data. Each variable gets its moment in the spotlight to reveal its unique characteristics.
Descriptive statistics
In univariate analysis, we aim to understand the characteristics of a single variable. Descriptive statistics provide a compact summary of that variable’s distribution, offering insights into its central tendency, spread, and shape.
Fun fact: Descriptive statistics are our data’s “elevator pitch”: quick, concise summaries that tell us the most important things in seconds.
Central tendency: Mean vs Median vs Mode
Central tendency metrics describe where the center of the data lies. For a data analyst, this means identifying what’s representative in the dataset. For example: What’s the average customer rating on our product? There are three key measures of central tendency: the mean, the median, and the mode.
Let’s start with the most familiar one: the mean.
Mean
The mean is the arithmetic average. It’s calculated by summing all values and dividing by the count. It’s sensitive to extreme values (outliers), which can pull the average up or down.
For example, if five customers give ratings of
That may sound fine at first, but look closely: most ratings are around 4, yet the average is over 5. Why is that so? Because that single high rating of 10 pulls the mean upward. This shows how the mean can be influenced by extreme values.
Fun fact: The mean is like the “average person at a billionaires’ party”: a few extreme values can drastically skew the perception!
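A minimal sketch of this sensitivity, using Python's standard `statistics` module. The ratings below are hypothetical, chosen only to match the description above (most values near 4, plus one rating of 10); they are not the lesson's original data.

```python
from statistics import mean

# Hypothetical ratings: four typical scores and one extreme score of 10.
ratings = [3, 4, 4, 5, 10]

mean_all = mean(ratings)                    # includes the rating of 10
mean_without_outlier = mean(ratings[:-1])   # drops the last value (the 10)

print(f"Mean with the 10:    {mean_all}")
print(f"Mean without the 10: {mean_without_outlier}")
```

Removing a single extreme value moves the mean noticeably, which is why the mean alone can misrepresent what a "typical" rating looks like.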
Median
The median is the middle value when the data is sorted. If there’s an odd number of values, it’s the one in the center; if even, it’s the average of the two central numbers. The median is more robust than the mean when dealing with skewed data.
Using the same five ratings,
Notice how the high rating of
Fun fact: The median is the fair referee, always finding the true middle, even when some scores are incredibly high or low!
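The odd and even cases of the median rule can be sketched in a few lines of plain Python. The function name and both lists are illustrative assumptions, not part of the lesson's original code.

```python
def median(values):
    """Return the middle value, or the mean of the two middle values if the count is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # odd count: single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2   # even count: average the two middle values

odd_ratings = [3, 4, 4, 5, 10]   # hypothetical, five values
even_ratings = [3, 4, 5, 10]     # hypothetical, four values

print(median(odd_ratings))
print(median(even_ratings))
```

In both lists the rating of 10 sits at the end of the sorted data, so it has no effect on which values land in the middle.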
Mean vs Median:
Understanding mean vs median is critical in data analysis:
If the mean > median → data is likely right-skewed
If the mean < median → data is likely left-skewed
This comparison helps explain the difference between average and median, especially when data contains extreme values.
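As a rough illustration, the comparison can be wrapped in a small helper. The function name `skew_hint`, its `tolerance` parameter, and the sample lists are all hypothetical, and the result is only a rule-of-thumb hint, not a formal skewness test.

```python
from statistics import mean, median

def skew_hint(values, tolerance=0.0):
    """Compare mean and median to suggest a likely skew direction (rule of thumb only)."""
    gap = mean(values) - median(values)
    if gap > tolerance:
        return "likely right-skewed"
    if gap < -tolerance:
        return "likely left-skewed"
    return "roughly symmetric"

print(skew_hint([3, 4, 4, 5, 10]))   # mean is above the median
print(skew_hint([1, 6, 7, 7, 8]))    # mean is below the median
print(skew_hint([2, 4, 6, 8, 10]))   # mean equals the median
```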
Mode
The mode is the most frequently occurring value in the dataset. It helps identify what’s most common and is especially useful for categorical or discrete variables. Again, let's identify it with the same ratings as before:
The mode is especially helpful when we want to know what’s most common, whether we're looking at customer preferences or product categories.
Fun fact: The mode is the “popularity award winner” of our data, literally showing us what’s most common!
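A short sketch of finding the mode for both numeric and categorical values with the standard library. Both lists are hypothetical examples. `statistics.multimode` is used because it returns every value tied for the highest frequency, rather than just one.

```python
from collections import Counter
from statistics import multimode

ratings = [3, 4, 4, 5, 10]                                  # hypothetical numeric ratings
categories = ["Books", "Toys", "Books", "Games", "Toys"]    # hypothetical product categories

# Counter tallies how often each value occurs.
rating_counts = Counter(ratings)
print(rating_counts.most_common(1))   # the most frequent rating with its count

# multimode lists all values that share the top frequency.
print(multimode(ratings))
print(multimode(categories))
```

The categorical list has two equally common values, a reminder that a dataset can have more than one mode.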
Example
To see these measures in action, we’ll use a simple list of customer review ratings to calculate the mean, median, and mode with pandas, so we can see what each one tells us.
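A minimal sketch of such a calculation, assuming a small hypothetical Series of ratings (the values are illustrative, not the lesson's original data):

```python
import pandas as pd

# Hypothetical customer review ratings.
ratings = pd.Series([3, 4, 4, 5, 10])

mean_rating = ratings.mean()
median_rating = ratings.median()
mode_rating = ratings.mode().iloc[0]   # mode() returns a Series; take its first value

print(f"Mean:   {mean_rating}")
print(f"Median: {median_rating}")
print(f"Mode:   {mode_rating}")
```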
ratings.mean(): Calculates the arithmetic average of all ratings.
ratings.median(): Finds the middle value when the ratings are sorted.
ratings.mode().iloc[0]: Returns the most frequent rating(s) in the Series and selects the first one using .iloc[0].
In output, the mean of
Dispersion
While central tendency helps us understand the most common or typical values in a dataset, dispersion shows how much the data varies. This is crucial for analysts trying to assess consistency or variability in customer behavior, product performance, or operational metrics. In other words: Are most ratings similar, or is there a lot of fluctuation?
Fun fact: If central tendency tells us where the target is, dispersion highlights how scattered the arrows are around that target!
There are several key measures of dispersion:
Variance
Variance measures the spread, or dispersion, of the data points in a dataset: how much individual values differ from the mean. It is calculated as the average of the squared differences between each data point and the mean.
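In symbols, $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$. Here, $x_i$ is each data point, $\mu$ is the mean of the dataset, and $N$ is the number of data points.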
For example, if five customers give product ratings of
(twice)
The variance is their average:
Informational note: While mathematically useful, variance is often less intuitive in its raw form because it's in “squared units”, like measuring distance in “square miles” instead of miles.
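The calculation can be traced step by step in plain Python. The ratings are hypothetical, and the sample variance (dividing by N − 1 instead of N) is shown alongside as a common variant that the text above does not cover.

```python
ratings = [3, 4, 4, 5, 10]   # hypothetical ratings

n = len(ratings)
mean = sum(ratings) / n

# Squared difference of each rating from the mean.
squared_diffs = [(x - mean) ** 2 for x in ratings]

population_variance = sum(squared_diffs) / n     # average of squared differences (divide by N)
sample_variance = sum(squared_diffs) / (n - 1)   # common variant that divides by N - 1

print(f"Mean: {mean}")
print(f"Squared differences: {[round(d, 2) for d in squared_diffs]}")
print(f"Population variance: {population_variance:.2f}")
print(f"Sample variance:     {sample_variance:.2f}")
```

The single rating of 10 contributes by far the largest squared difference, which shows how strongly squaring amplifies extreme values.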
Standard deviation
Standard deviation is the square root of the variance, so both metrics measure the same thing: the spread of the data. However, the standard deviation is often more intuitive because it’s expressed in the same units as the data, making it easier to interpret. Here is the formula to calculate standard deviation: $\sigma = \sqrt{\sigma^2}$, where $\sigma^2$ is the variance.
In the example above, the standard deviation is
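A brief sketch of computing the standard deviation in pandas, using hypothetical ratings. One detail worth knowing: pandas' `std()` and `var()` divide by N − 1 by default (`ddof=1`), so `ddof=0` is passed here to get the population version that divides by N.

```python
import math
import pandas as pd

ratings = pd.Series([3, 4, 4, 5, 10])   # hypothetical ratings

population_std = ratings.std(ddof=0)   # divides by N
sample_std = ratings.std()             # pandas default, divides by N - 1

print(f"Population standard deviation: {population_std:.3f}")
print(f"Sample standard deviation:     {sample_std:.3f}")

# The standard deviation is the square root of the matching variance.
print(math.isclose(population_std, math.sqrt(ratings.var(ddof=0))))
```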
Range
Range is the difference between the highest and lowest values: range = maximum − minimum.
How to find outliers using Interquartile Range (IQR)
The IQR helps us understand the central 50% of values in a dataset by ignoring the lowest and highest extremes. It’s especially useful when the data is skewed or contains outliers, as it highlights where most values actually lie. Let’s walk through an example using the dataset:
Sort the data from lowest to highest.
Find the median (Q2): This is the middle value of the sorted data.
Find the first quartile (Q1): Look at the lower half, the values below the median. Since there are two numbers, we take their average to find the median of that half.
Find the third quartile (Q3): Look at the upper half, the values above the median. Again, with two numbers, the median is their average.
Calculate the IQR: IQR = Q3 − Q1.
The middle 50% of the data lies between Q1 and Q3.
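The median-of-halves steps can be sketched as follows. The function name and dataset are hypothetical. The 1.5 × IQR fences used to flag outliers are a widely used rule of thumb and are an assumption here, since the walkthrough itself stops at the IQR.

```python
from statistics import median

def quartiles_by_halves(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method (median excluded when n is odd)."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return median(lower), median(ordered), median(upper)

data = [5, 3, 15, 4, 6, 4, 5, 4, 5]   # hypothetical ratings with one unusually high value
q1, q2, q3 = quartiles_by_halves(data)
iqr = q3 - q1

# Assumed rule of thumb: flag values more than 1.5 * IQR outside the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(f"Q1={q1}, Q2={q2}, Q3={q3}, IQR={iqr}")
print(f"Fences: [{lower_fence}, {upper_fence}]")
print(f"Outliers: {outliers}")
```

Library functions such as pandas' `quantile()` interpolate between values by default, so they can return slightly different quartiles than this hand method on the same data.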
Example
In this example, we’ll calculate the metrics discussed above from the same list of customer ratings. This helps us understand how consistent the review scores are.
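A minimal sketch of these calculations in pandas, assuming a hypothetical Series of ratings (the values are illustrative and will not reproduce the lesson's original output):

```python
import pandas as pd

ratings = pd.Series([3, 4, 4, 5, 10])   # hypothetical customer ratings

std_dev = ratings.std()                        # sample standard deviation (ddof=1 by default)
variance = ratings.var()                       # sample variance (ddof=1 by default)
value_range = ratings.max() - ratings.min()    # highest minus lowest
iqr = ratings.quantile(0.75) - ratings.quantile(0.25)   # Q3 - Q1, linear interpolation by default

print(f"Standard deviation:  {std_dev:.2f}")
print(f"Variance:            {variance:.2f}")
print(f"Range:               {value_range}")
print(f"Interquartile range: {iqr}")
```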
The output shows four key measures of how the customer ratings vary:
Standard deviation: Roughly how far, on average, the ratings fall from the mean, expressed in rating points.
Variance: The squared standard deviation, reflecting overall spread but in squared units.
Range: The difference between the highest and lowest ratings. It shows the total spread but is sensitive to outliers.
Interquartile range: The spread of the middle 50% of ratings, a robust measure less affected by extreme values.
Together, these metrics help us understand how consistent or varied the customer ratings are.
Pandas data snapshots
After exploring central tendency and dispersion, the next step is to summarize our data efficiently. Before diving into deeper analysis, we need a quick look at how values are distributed and which categories are most frequent.
Pandas offers quick ways to summarize data. Here, we discuss two powerful functions: value_counts() for categories and describe() for numbers, giving us a clear snapshot of the data at a glance.
value_counts() is perfect for categorical data; it counts the frequency of each unique category, helping us see which categories dominate or are rare.
describe() provides a comprehensive summary of numerical variables, including count, mean, standard deviation, minimum, quartiles, and maximum values. This statistical overview complements the central tendency and dispersion measures we’ve learned.
Fun fact: describe() is like the “executive summary” of our data, providing key takeaways without drowning us in details!
These tools are our go-to for quick and effective data summaries, letting us grasp the big picture before diving deeper. Here’s a simple example using both methods:
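A small sketch of both methods on a hypothetical DataFrame. The column names `Category` and `Value` follow the lesson's description, but the rows themselves are made up for illustration.

```python
import pandas as pd

# Hypothetical data: three categories, each appearing twice.
df = pd.DataFrame({
    "Category": ["A", "B", "C", "A", "B", "C"],
    "Value": [10, 20, 30, 40, 50, 60],
})

category_counts = df["Category"].value_counts()   # frequency of each category
value_summary = df["Value"].describe()            # count, mean, std, min, quartiles, max

print(category_counts)
print(value_summary)
```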
In output:
Category counts: .value_counts() on the Category column shows how many times each category (A, B, C) appears in the dataset.
Value summary statistics: .describe() on the Value column shows the number of entries, the average (mean), the standard deviation, and the minimum and maximum. The quartiles (25%, 50%, 75%) show how the values are spread between those extremes.
By using value_counts() and describe() together, we gain fast and effective insights into the structure of our dataset: who’s in the data, and how values behave. These quick checks set the stage for deeper exploration and analysis.
Understanding the shape of data distributions
Understanding how data is distributed helps analysts spot patterns, identify unusual shapes, and decide if any adjustments, like transformations, are needed before deeper analysis. Distribution tells us whether values are clustered, spread out, or lean to one side. This is a key part of univariate analysis, where we examine a single variable in isolation to understand its behavior. Two key concepts help here:
Skewness of the distribution
Kurtosis
Skewness of the distribution
Skewness measures how much a distribution deviates from perfect symmetry around its center. It’s a key part of univariate analysis, helping us understand whether a single variable leans to the left or right. It also helps determine whether this asymmetry might distort statistical summaries, such as the mean, or affect model performance.
Right and left skewness
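A skewness value close to 0 indicates a roughly symmetric distribution. A positive value indicates right skew (a longer right tail), and a negative value indicates left skew (a longer left tail).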
Let’s explore each case in detail:
1. Right skewness (positive skew)
In a right-skewed distribution, the tail on the right side is longer than the left. This indicates that there are a few unusually high values pulling the mean to the right. Most data points are concentrated on the left side of the distribution, closer to the lower values. A common example is income distribution: while the majority earns moderate incomes, a few individuals with extremely high salaries stretch the distribution. As a result, the mean is typically greater than the median, which in turn is greater than the mode (Mean > Median > Mode).
2. Left skewness (negative skew)
A left-skewed distribution has a longer tail on the left side. This means that there are a few unusually low values pulling the mean downward, while the majority of data points cluster toward the higher end. The mean is typically less than the median, and the median is less than the mode (Mean < Median < Mode). An example of this skewness is the age at retirement, where most people retire around a common age, but a few retire much earlier, skewing the data to the left.
3. Zero skew (symmetrical)
When a dataset is perfectly symmetrical, it has zero skewness. This means the data is evenly distributed around the mean, and the left and right tails of the distribution are mirror images. In this case, for a single-peaked distribution, the mean, median, and mode are all equal (Mean = Median = Mode). A classic example is the normal distribution, or bell curve.
Fun fact: Symmetrical data is perfectly balanced, like a seesaw with equal weight on both sides.
Calculate skewness
Now that we’ve explored how skewness describes the asymmetry of a distribution, let’s put that understanding into practice. We’ll use Python’s scipy.stats.skew() function to calculate skewness for a simple dataset.
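A minimal sketch of such a calculation. The dataset is hypothetical: response times in minutes that are mostly short, with one long delay.

```python
from scipy.stats import skew

# Hypothetical response times in minutes: mostly short, with one long delay.
response_times = [2, 3, 3, 4, 4, 5, 5, 6, 20]

skewness = skew(response_times, bias=False)   # bias=False applies the small-sample correction
print(f"Skewness: {skewness:.2f}")
```

A positive result points to a longer right tail, and a negative result to a longer left tail.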
Here, we set bias=False to correct for small-sample bias, ensuring that our estimates are more accurate, especially when working with limited data.
In output, the skewness value is approximately
In practical terms, this could reflect a situation like customer response times where most people respond promptly, but a small number take significantly longer, introducing asymmetry into the distribution.
What is kurtosis?
When we look at data, it’s not enough to know the average or how spread out the values are. We also want to understand how the values are distributed, especially in the tails (the extremes). That’s where kurtosis comes in.
Kurtosis tells us how sharp the peak of a distribution is, and how heavy its tails are. It is usually read against the normal distribution, which has a kurtosis of 3:
A kurtosis around 3 indicates a mesokurtic distribution. It is similar to the normal curve, with a moderate peak and moderate tails.
A value greater than 3 signals a leptokurtic shape. It has a sharper peak and heavier tails, meaning more extreme values.
A value less than 3 points to a platykurtic shape. It has a flatter peak and lighter tails, meaning fewer outliers.
Let’s take a closer look at each type of kurtosis distribution:
1. Leptokurtic
A leptokurtic distribution has positive excess kurtosis, that is, a kurtosis greater than 3. Leptokurtic distributions are characterized by tall, sharp peaks and heavy tails. This means that the data points are heavily concentrated near the mean, but there are also more extreme values or outliers than a normal distribution would produce. The presence of heavy tails makes this type of distribution riskier in real-world scenarios like finance, where sudden large gains or losses can occur. Leptokurtic data suggests a high likelihood of rare but significant deviations from the average.
2. Mesokurtic
A mesokurtic distribution has a moderate peak and tails, similar to that of a normal distribution. It indicates a balanced dataset without an excess of extreme values or outliers. This type of distribution is commonly seen in naturally occurring variables like IQ scores, where most data points lie close to the mean, and extreme scores are relatively rare. Mesokurtic is considered the “baseline” against which other kurtosis types are compared.
3. Platykurtic
Platykurtic distributions have negative excess kurtosis, that is, a kurtosis less than 3. Compared to a normal distribution, platykurtic distributions have flatter peaks and thinner tails. This suggests that the data is more evenly spread out, with fewer extreme values or outliers. While this might indicate a lesser risk of outliers in applications like manufacturing or quality control, it can also mean the data lacks strong central tendencies. An example could be a uniform distribution, where all values occur with roughly the same frequency.
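To make the three shapes concrete, the sketch below draws random samples from a Laplace, a normal, and a uniform distribution and compares their Pearson kurtosis. The choice of distributions, the sample size, and the seed are assumptions made for illustration, and sampled values will only approximate the theoretical ones.

```python
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(seed=0)   # fixed seed so the run is repeatable
n = 100_000

samples = {
    "Laplace (leptokurtic)": rng.laplace(size=n),
    "Normal (mesokurtic)": rng.normal(size=n),
    "Uniform (platykurtic)": rng.uniform(size=n),
}

# fisher=False returns Pearson kurtosis, where a normal distribution scores about 3.
pearson = {name: kurtosis(values, fisher=False) for name, values in samples.items()}

for name, value in pearson.items():
    print(f"{name}: {value:.2f}")
```

The Laplace sample should land well above 3, the normal sample close to 3, and the uniform sample well below 3.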
When analyzing data, it’s crucial to understand not just the center but also the shape of the tails along with how likely extreme values or outliers are. Kurtosis measures this tailedness and indicates the likelihood of rare, extreme events.
To better compare distributions, excess kurtosis is used, which subtracts 3 from the kurtosis value. This helps us see if the data is more or less prone to extreme outliers compared to a normal distribution.
💡 Excess kurtosis = Kurtosis − 3
This makes it easy to compare against the normal distribution:
Excess > 0: Heavier tails than normal, so more prone to extreme outliers
Excess < 0: Lighter tails than normal, so less prone to outliers
Calculate kurtosis
Now that we’ve seen how kurtosis reflects the shape and tails of a distribution, let’s move from theory to practice. We’ll use Python’s scipy.stats.kurtosis() function to calculate kurtosis for a simple dataset.
The kurtosis() function can report results on two scales. With fisher=True (the default), it returns excess kurtosis, where a normal distribution has a baseline of 0. This makes it easier to compare tailedness directly against the normal curve. With fisher=False, it returns the Pearson kurtosis, where a normal distribution has a baseline of 3; this is the traditional kurtosis value.
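A short sketch showing both settings on the same data. The dataset is hypothetical, so its values will differ from the lesson's original output.

```python
from scipy.stats import kurtosis

data = [2, 3, 3, 4, 4, 5, 5, 6, 20]   # hypothetical dataset with one extreme value

excess_kurtosis = kurtosis(data, fisher=True)     # normal distribution baseline is 0
pearson_kurtosis = kurtosis(data, fisher=False)   # normal distribution baseline is 3

print(f"Excess kurtosis:  {excess_kurtosis:.2f}")
print(f"Pearson kurtosis: {pearson_kurtosis:.2f}")
print(f"Difference:       {pearson_kurtosis - excess_kurtosis:.2f}")
```

The two results always differ by exactly 3, so either scale can be used as long as it is clear which baseline applies.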
The excess kurtosis value is
The Pearson kurtosis is
Wrap up
Univariate analysis is the foundational step in understanding data, where we inspect one variable at a time to uncover its typical values, spread, and shape. This process helps us detect outliers, identify skewness, and grasp the variability of the data. We use measures of central tendency, like mean, median, and mode, to understand where the data tends to cluster. When combined with measures of dispersion, such as variance, standard deviation, range, and interquartile range, we gain a fuller picture of how each variable behaves. Skewness and kurtosis further refine our understanding of distribution shape, which is crucial for subsequent analysis decisions. Gaining proficiency in univariate analysis allows us to prepare data thoughtfully and build intuition for deeper, multivariate exploration.
Quiz
Which of the following is a measure of central tendency?
Variance
Mean
Range
Skewness