
Unveiling the Numbers

Learn to analyze a single variable’s behavior using univariate statistics including mean, median, mode, variance, standard deviation, skewness, and kurtosis. This lesson helps you identify data distribution patterns, detect outliers, and build foundational analysis skills critical for deeper data exploration.

Before building models or drawing conclusions, analysts must first understand how each variable behaves on its own. This lesson focuses on exploring central tendency, dispersion, and the shape of a distribution, including skewness and kurtosis.

Imagine we’re handed a complex machine with hundreds of moving parts, all connected and interacting. Before trying to understand how the machine works as a whole, wouldn’t we want to inspect each part individually? See how it moves, how it’s built, and whether it’s functioning properly. That’s exactly what univariate analysis does for our data.

In the last lesson, we briefly examined our data’s structure and types. Now, it’s time to listen more closely to what each column tells us. Each variable holds clues about typical values, spread, unusual observations, and overall shape that guide how we analyze the data.

In this lesson, we’ll focus on univariate statistics: exploring one variable at a time. We’ll measure its center, spread, and shape to understand its underlying distribution and spot any red flags.

What is univariate analysis?

Univariate analysis is simply the examination of a single variable in isolation. It’s about understanding its distribution, its central tendency (the typical value), and its variability (how much values differ). At this stage, we’re not concerned with relationships or interactions with other variables, just the characteristics of one variable by itself.

Analyzing a single column

Whether the variable is numerical or categorical, univariate analysis helps us see what we’re dealing with. This clarity is essential: it helps detect early data issues, informs how we preprocess data, and guides the next steps of analysis.

Informational note: Univariate analysis is the “solo performance” of the data. Each variable gets its moment in the spotlight to reveal its unique characteristics.

Descriptive statistics

In univariate analysis, we aim to understand the characteristics of a single variable. Descriptive statistics provide a compact summary of that variable’s distribution, offering insights into its central tendency, spread, and shape.

Fun fact: Descriptive statistics are our data’s “elevator pitch”: quick, concise summaries that tell us the most important things in seconds.

Central tendency: Mean vs Median vs Mode

Central tendency metrics describe where the center of the data lies. For a data analyst, this means identifying what’s representative in the dataset. For example: What’s the average customer rating on our product? There are three key measures of central tendency:

Central tendency measures

Let’s start with the most familiar one: the mean.

Mean

The mean is the arithmetic average. It’s calculated by summing all values and dividing by the count. It’s sensitive to extreme values (outliers), which can pull the average up or down.

For example, if five customers give ratings of 3, 4, 4, 5, and 10, the mean is:

$$\mu = \frac{3 + 4 + 4 + 5 + 10}{5} = \frac{26}{5} = 5.2$$

That may sound fine at first, but look closely: most ratings are around 4, yet the average is over 5. Why is that so? Because that single high rating of 10 pulls the mean upward. This shows how the mean can be influenced by extreme values.

Fun fact: The mean is so sensitive to outliers that it behaves like the “average person at a billionaires’ party”: a few extremes can drastically skew the picture!

Median

The median is the middle value when the data is sorted. If there’s an odd number of values, it’s the one in the center; if even, it’s the average of the two central numbers. The median is more robust than the mean when dealing with skewed data.

Using the same five ratings, 3, 4, 4, 5, 10, we first sort them: 3, 4, 4, 5, 10 (they are already in order). The median is the middle value, which is 4 in this case.

Notice how the high rating of 10 doesn’t affect the median at all. That’s why the median is often a better measure when our data has outliers; it offers a more stable picture of what’s typical.
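To make this robustness concrete, here is a minimal sketch using Python’s built-in statistics module. The second list is a hypothetical variation of our ratings in which the extreme value is made far more extreme, purely for illustration:

```python
from statistics import mean, median

ratings = [3, 4, 4, 5, 10]
# Hypothetical variation: the same ratings with a far more extreme last value
inflated = [3, 4, 4, 5, 100]

mean_original = mean(ratings)
mean_inflated = mean(inflated)
median_original = median(ratings)
median_inflated = median(inflated)

# The mean moves with the extreme value; the median stays where it was
print(f"Mean:   {mean_original} -> {mean_inflated}")
print(f"Median: {median_original} -> {median_inflated}")
```

The mean jumps while the median stays at 4, which is the behavior described above.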

Fun fact: The median is the fair referee, always finding the true middle, even when some scores are incredibly high or low!

Mean vs Median:

Understanding mean vs median is critical in data analysis:

  • If the mean > median → data is likely right-skewed

  • If the mean < median → data is likely left-skewed

This comparison helps explain the difference between average and median, especially when data contains extreme values.
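As a rough sketch, this rule of thumb can be written as a small helper. The function name skew_hint and the second sample list are our own illustrations, and the result is only a hint, not a formal skewness test:

```python
import pandas as pd

def skew_hint(values):
    """Compare mean and median to suggest a likely skew direction (heuristic only)."""
    s = pd.Series(values)
    mean, median = s.mean(), s.median()
    if mean > median:
        return "likely right-skewed"
    if mean < median:
        return "likely left-skewed"
    return "roughly symmetric"

right_hint = skew_hint([3, 4, 4, 5, 10])  # mean 5.2 > median 4
left_hint = skew_hint([1, 6, 7, 7, 8])    # mean 5.8 < median 7 (hypothetical data)
even_hint = skew_hint([2, 4, 6])          # mean 4 == median 4

print(right_hint, "|", left_hint, "|", even_hint)
```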

Mode

The mode is the most frequently occurring value in the dataset. It helps identify what’s most common, which is especially useful for categorical or discrete variables. Using the same ratings as before, 3, 4, 4, 5, 10, the mode is 4 because it appears twice.

The mode is especially helpful when we want to know what’s most common, whether we're looking at customer preferences or product categories.
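Here is a short sketch of the mode on a categorical column. The subscription plan names are made up for illustration. It also shows that pandas returns every tied value when more than one value is the most frequent:

```python
import pandas as pd

# Hypothetical categorical data: subscription plans chosen by customers
plans = pd.Series(["basic", "pro", "basic", "free", "pro", "basic"])
top_plan = plans.mode().iloc[0]  # "basic" appears three times

# When values tie, mode() returns all of them, sorted in ascending order
tied_modes = pd.Series([5, 4, 4, 3, 5]).mode().tolist()

print("Most common plan:", top_plan)
print("Tied modes:", tied_modes)
```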

Fun fact: The mode is the “popularity award winner” of our data, literally showing us what’s most common!

Example

To see these measures in action, we’ll use a simple list of customer review ratings to calculate the mean, median, and mode with pandas. This is so that we can gain a better understanding of what each one tells us.

Python 3.10.4
import pandas as pd
ratings = pd.Series([5, 4, 4, 3, 5, 2, 5, 1, 3, 4])
mean = ratings.mean()
median = ratings.median()
mode = ratings.mode().iloc[0]
print(f"Mean: {mean}, Median: {median}, Mode: {mode}")
  • ratings.mean(): Calculates the arithmetic average of all ratings.

  • ratings.median(): Finds the middle value when the ratings are sorted.

  • ratings.mode().iloc[0]: Returns the most frequent rating(s) in the Series and selects the first one using .iloc[0].

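In the output, the mean of 3.6 gives us the overall average rating, pulled down slightly by the lower scores. The median is 4, and the reported mode is also 4. Note that 4 and 5 each appear three times, so ratings.mode() returns both values in ascending order and .iloc[0] picks 4. Together, these measures suggest that a typical rating sits around 4.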

Dispersion

While central tendency helps us understand the most common or typical values in a dataset, dispersion shows how much the data varies. This is crucial for analysts trying to assess consistency or variability in customer behavior, product performance, or operational metrics. In other words: Are most ratings similar, or is there a lot of fluctuation?

Fun fact: If central tendency tells us where the target is, dispersion highlights how scattered the arrows are around that target!

There are several key measures of dispersion:

Variance

Variance is a statistical measure that represents the spread or dispersion of data points in a dataset. It tells us how much individual data points differ from the mean of the dataset. In simpler terms, variance gives us an idea of how spread out the values are. Variance is calculated by finding the average of the squared differences between each data point, and the mean.

$$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2$$

Here, $x_i$ is each data point, $\mu$ is the mean of the data, and $N$ is the number of data points.

For example, if five customers give product ratings of 3, 4, 4, 5, and 10, the mean is 5.2. The squared differences from the mean are:

  • (3 − 5.2)² = 4.84

  • (4 − 5.2)² = 1.44 (twice)

  • (5 − 5.2)² = 0.04

  • (10 − 5.2)² = 23.04

The variance is their average: 30.8 ÷ 5 = 6.16. This relatively high value tells us that the ratings are quite spread out around the mean. Most values lie between 3 and 5, but the rating of 10 is far from the center and contributes heavily to the overall variance. This highlights how a single extreme value can significantly increase the perceived variability in the dataset. For a data analyst, such dispersion signals the need for closer inspection, possibly indicating outliers or suggesting that the variable needs transformation before drawing conclusions or comparisons.
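The arithmetic above can be reproduced step by step in plain Python. This is a minimal sketch of the population variance (dividing by the number of data points), cross-checked against statistics.pvariance from the standard library:

```python
from statistics import pvariance

ratings = [3, 4, 4, 5, 10]
n = len(ratings)
mu = sum(ratings) / n                              # mean = 5.2

# Squared difference of each rating from the mean
squared_diffs = [(x - mu) ** 2 for x in ratings]

# Population variance: average of the squared differences
population_variance = sum(squared_diffs) / n

print("Squared differences:", [round(d, 2) for d in squared_diffs])
print("Variance:", round(population_variance, 2))
print("Cross-check with pvariance:", round(pvariance(ratings), 2))
```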

Informational note: While mathematically useful, variance is often less intuitive in its raw form because it's in “squared units”, like measuring distance in “square miles” instead of miles.

Standard deviation

Standard deviation is the square root of the variance. Both metrics measure the same thing, the spread of the data, but the standard deviation is often more intuitive because it’s expressed in the same units as the data, making it easier to interpret. Here is the formula to calculate standard deviation:

$$\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2}$$

In the example above, the standard deviation is √6.16 ≈ 2.48. This means that customer ratings typically deviate about 2.48 points from the mean. For a data analyst, standard deviation shows how much values vary. A higher value means less consistency, which can skew how we interpret customer satisfaction or compare results.
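One practical detail worth sketching: pandas computes the sample standard deviation by default (dividing by N − 1), so reproducing the 2.48 from the worked example requires passing ddof=0. The variable names below are our own:

```python
import pandas as pd

ratings = pd.Series([3, 4, 4, 5, 10])

# Population standard deviation: divide by N (matches the worked example)
population_std = ratings.std(ddof=0)

# pandas default (ddof=1): sample standard deviation, divide by N - 1
sample_std = ratings.std()

print(f"Population std (ddof=0): {population_std:.2f}")
print(f"Sample std (ddof=1):     {sample_std:.2f}")
```

The sample version is slightly larger because it divides the same sum of squared differences by a smaller number.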

Range

Range is the difference between the highest and lowest values. Using another set of ratings (1, 2, 3, 4, 5), the range is 5 − 1 = 4. While simple, it gives a quick sense of how spread out the data is. However, it can be overly influenced by outliers.

How to find outliers using Interquartile Range (IQR)

The IQR helps us understand the central 50% of values in a dataset by ignoring the lowest and highest extremes. It’s especially useful when the data is skewed or contains outliers, as it highlights where most values actually lie. Let’s walk through an example using the dataset [1, 2, 3, 4, 10]. We use the following steps to calculate the IQR:

  • Sort the data: 1, 2, 3, 4, 10.

  • Find the median (Q2): The middle value is 3.

  • Find the first quartile (Q1): Look at the lower half [1, 2], which excludes the median. Since there are two numbers, we take their average to find the median: (1 + 2) ÷ 2 = 1.5

  • Find the third quartile (Q3): Look at the upper half [4, 10]. Again, with two numbers, the median is the average: (4 + 10) ÷ 2 = 7

  • Calculate the IQR: Q3 − Q1 = 7 − 1.5 = 5.5

The middle 50% of the data lies between 1.5 and 7. The IQR of 5.5 reflects a moderately spread-out middle range. Even though the maximum value (10) is relatively high, it affects the IQR less than it affects the range (10 − 1 = 9), which is why the IQR is so useful when dealing with outliers or skewed data.
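The steps above can be sketched in code. Two assumptions are worth flagging. First, pandas’ quantile() uses linear interpolation by default, so its quartiles differ from the median-of-halves method used in the walkthrough. Second, the 1.5 × IQR fences shown at the end are a widely used convention for flagging outliers, not something this lesson defines:

```python
import pandas as pd
from statistics import median

data = sorted([1, 2, 3, 4, 10])

# Median-of-halves method from the walkthrough (the median itself is excluded)
half = len(data) // 2
q1_manual = median(data[:half])    # median of [1, 2]  -> 1.5
q3_manual = median(data[-half:])   # median of [4, 10] -> 7.0
iqr_manual = q3_manual - q1_manual

# pandas default: linear interpolation between data points
s = pd.Series(data)
q1_pd, q3_pd = s.quantile(0.25), s.quantile(0.75)
iqr_pd = q3_pd - q1_pd

# Assumed convention: flag values beyond 1.5 * IQR from the quartiles
low_fence = q1_pd - 1.5 * iqr_pd
high_fence = q3_pd + 1.5 * iqr_pd
outliers = s[(s < low_fence) | (s > high_fence)].tolist()

print(f"Manual: Q1={q1_manual}, Q3={q3_manual}, IQR={iqr_manual}")
print(f"pandas: Q1={q1_pd}, Q3={q3_pd}, IQR={iqr_pd}")
print("Flagged outliers (pandas quartiles):", outliers)
```

With only five values, the choice of quartile method changes the result noticeably, so it is worth checking which method a tool uses before comparing IQR values.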

Example

In this example, we’ll calculate the metrics discussed above for the list of customer ratings used in the earlier pandas example. This helps us understand how consistent the review scores are.

Python
import pandas as pd # Import pandas library for data handling
# Create a pandas Series to hold the ratings data
ratings = pd.Series([5, 4, 4, 3, 5, 2, 5, 1, 3, 4])
# Calculate the standard deviation: typical distance of values from the mean
# (pandas uses the sample formula, dividing by N - 1, by default)
std_dev = ratings.std()
# Calculate variance: the square of the standard deviation, showing spread in squared units
variance = ratings.var()
# Calculate the range: difference between max and min values in the data
data_range = ratings.max() - ratings.min()
# Calculate the Interquartile Range (IQR): difference between 75th percentile and 25th percentile
# This captures the spread of the middle 50% of the data, robust to outliers
iqr = ratings.quantile(0.75) - ratings.quantile(0.25)
# Print the results (standard deviation and variance rounded to two decimal places)
print(f"Standard Deviation: {std_dev:.2f}")
print(f"Variance: {variance:.2f}")
print(f"Range: {data_range}")
print(f"IQR: {iqr}")

The output shows four key measures of how the customer ratings vary:

  • Standard deviation (1.35): The ratings typically differ from the mean by about 1.35 points, indicating moderate variability.

  • Variance (1.82): This is the squared standard deviation, reflecting overall spread but in squared units. By default, pandas divides by N − 1 (the sample variance), so this differs slightly from the earlier definition, which averages over all N values.

  • Range (4): The difference between the highest (5) and lowest (1) ratings shows the total spread, but it is sensitive to outliers.

  • Interquartile range (1.75): The spread of the middle 50% of ratings, a robust measure that is less affected by extreme values.

Together, these metrics help us understand how consistent or varied the customer ratings are.

Pandas data snapshots

After exploring central tendency and dispersion, the next step is to summarize our data efficiently. Before diving into deeper analysis, we need a quick look at how values are distributed and which categories are most frequent.

Pandas offers quick ways to summarize data. Here, we discuss two powerful functions: value_counts() for categories and describe() for numbers, giving us a clear snapshot of the data at a glance.

  • value_counts() is perfect for categorical data; it counts the frequency of each unique category, helping us see which categories dominate or are rare.

  • describe() provides a comprehensive summary of numerical variables, including count, mean, standard deviation, minimum, quartiles, and maximum values. This statistical overview complements the central tendency and dispersion measures we’ve learned.

Fun fact: describe() is like the “executive summary” of our data, providing key takeaways without drowning us in details!

These tools are our go-to for quick and effective data summaries, letting us grasp the big picture before diving deeper. Here’s a simple example using both methods:

Python 3.10.4
import pandas as pd
# Sample data
data = {
'Category': ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'C', 'B'],
'Value': [10, 15, 10, 20, 15, 10, 25, 20, 15]
}
df = pd.DataFrame(data)
# Using .value_counts() on categorical column
category_counts = df['Category'].value_counts()
print("Category counts:\n", category_counts)
# Using .describe() on numerical column
value_stats = df['Value'].describe()
print("\nValue summary statistics:\n", value_stats)

In the output:

  • Category counts: Each category (A, B, C) appears 3 times in the dataset, as shown by .value_counts() on the Category column.

  • Value summary statistics: The Value column has 9 entries with an average (mean) of 15.56, ranging from 10 to 25. The quartiles (25%, 50%, 75%) are 10, 15, and 20, respectively.

By using value_counts() and describe() together, we gain fast and effective insights into the structure of our dataset: who’s in the data, and how values behave. These quick checks set the stage for deeper exploration and analysis.
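As a small extension of the example above, value_counts() can also report proportions through its normalize parameter, and it works on discrete numeric columns as well. This is a sketch reusing the same sample data:

```python
import pandas as pd

df = pd.DataFrame({
    "Category": ["A", "B", "A", "C", "B", "A", "C", "C", "B"],
    "Value": [10, 15, 10, 20, 15, 10, 25, 20, 15],
})

# Share of rows per category instead of raw counts
category_shares = df["Category"].value_counts(normalize=True)

# Frequency of each distinct numeric value, ordered by the value itself
value_freq = df["Value"].value_counts().sort_index()

print("Category shares:\n", category_shares.round(2))
print("\nValue frequencies:\n", value_freq)
```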

Understanding the shape of data distributions

Understanding how data is distributed helps analysts spot patterns, identify unusual shapes, and decide if any adjustments, like transformations, are needed before deeper analysis. Distribution tells us whether values are clustered, spread out, or lean to one side. This is a key part of univariate analysis, where we examine a single variable in isolation to understand its behavior. Two key concepts help here:

  • Skewness of the distribution

  • Kurtosis

Skewness of the distribution

Skewness measures how much a distribution deviates from perfect symmetry around its center. It’s a key part of univariate analysis, helping us understand whether a single variable leans to the left or right. It also helps determine whether this asymmetry might distort statistical summaries, like the mean, or affect model performance.

Right and left skewness

A skewness value close to 0 indicates approximate symmetry. Negative values indicate left skew (a longer tail on the left), while positive values indicate right skew (a longer tail on the right).

Let’s explore each case in detail:

1. Right skewness (positive skew)

In a right-skewed distribution, the tail on the right side is longer than the left. This indicates that there are a few unusually high values pulling the mean to the right. Most data points are concentrated on the left side of the distribution, closer to the lower values. A common example is income distribution: while the majority earns moderate incomes, a few individuals with extremely high salaries stretch the distribution. As a result, the mean is typically greater than the median, which in turn is typically greater than the mode (Mean > Median > Mode).

Tail on the right: mean greater than median

2. Left skewness (negative skew)

A left-skewed distribution has a longer tail on the left side. This means that there are a few unusually low values pulling the mean downward, while the majority of data points cluster toward the higher end. The mean is typically less than the median, and the median is typically less than the mode (Mean < Median < Mode). An example of this skewness is the age at retirement, where most people retire around a common age, but a few retire much earlier, skewing the data to the left.

Tail on the left: mean less than median

3. Zero skew (symmetrical)

When a dataset is perfectly symmetrical, it has zero skewness. This means the data is evenly distributed around the mean, and the left and right tails of the distribution are mirror images. In this case, the mean, median, and mode are all equal (Mean = Median = Mode). A classic example is a normal distribution or bell curve, such as IQ scores in a large population, where most values are clustered around the average, with equal spread on both sides. (IQ scores are a standardized way to measure cognitive abilities, and they follow a bell curve where most people score near the average.)

Balanced data: equal mean, median, and mode

Fun fact: Symmetrical data is perfectly balanced, like a seesaw with equal weight on both sides.
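To see the three cases side by side, here is a minimal simulation sketch. The choice of distributions (exponential for right skew, its mirror image for left skew, normal for symmetry), the sample size, and the random seed are all our own assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

# Right-skewed: exponential data has a long tail of large values
right = rng.exponential(scale=2.0, size=10_000)
# Left-skewed: mirroring the same data flips the tail to the left
left = -right
# Symmetric: normally distributed data
symmetric = rng.normal(loc=0.0, scale=1.0, size=10_000)

for name, sample in [("right", right), ("left", left), ("symmetric", symmetric)]:
    print(f"{name:>9}: mean={np.mean(sample):.3f}, median={np.median(sample):.3f}")
```

In the right-skewed sample the mean sits above the median, in the left-skewed sample it sits below, and in the symmetric sample the two nearly coincide.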

Calculate skewness

Now that we’ve explored how skewness describes the asymmetry of a distribution, let’s put that understanding into practice. We’ll use Python’s scipy.stats.skew() function to calculate skewness for a simple dataset.

Python 3.10.4
from scipy.stats import skew
# Sample dataset (e.g., customer response times)
data = [12, 15, 14, 10, 18, 19, 21, 20, 24, 30, 45, 50, 55, 60, 65]
# Calculate skewness (bias-corrected)
skew_val = skew(data, bias=False)
print("Skewness:", skew_val)

Here, we set bias=False to correct for small-sample bias, ensuring that our estimates are more accurate, especially when working with limited data.

In the output, the skewness value is approximately 0.75, indicating a moderate positive skew. This means the distribution has a longer right tail, suggesting that while most values cluster toward the lower end, there are a few larger values pulling the average upward.

In practical terms, this could reflect a situation like customer response times where most people respond promptly, but a small number take significantly longer, introducing asymmetry into the distribution.
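If the data already lives in a pandas Series, the built-in skew() method is an alternative. As a sketch, the snippet below checks that it agrees with scipy’s bias-corrected result for the same sample data; both apply the same small-sample correction:

```python
import pandas as pd
from scipy.stats import skew

data = [12, 15, 14, 10, 18, 19, 21, 20, 24, 30, 45, 50, 55, 60, 65]

scipy_skew = skew(data, bias=False)   # bias-corrected sample skewness
pandas_skew = pd.Series(data).skew()  # pandas applies the same correction

print(f"scipy (bias=False): {scipy_skew:.4f}")
print(f"pandas Series.skew: {pandas_skew:.4f}")
```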

What is kurtosis?

When we look at data, it’s not enough to know the average or how spread out the values are. We also want to understand how the values are distributed, especially in the tails (the extremes). That’s where kurtosis comes in.

Kurtosis tells us how sharp the peak of a distribution is, and how heavy or light the tails are compared to a normal bell-shaped curve. Heavy tails indicate a higher likelihood of extreme values compared to a normal distribution, while light tails suggest fewer extreme values, with data more tightly concentrated around the mean. In simple terms, kurtosis helps us understand how likely we are to see extreme values (outliers) in the data.

  • A kurtosis around 3 indicates a mesokurtic distribution. It is similar to the normal curve, with moderate tails and peak.

  • A value greater than 3 signals a leptokurtic shape. It has a sharper peak and heavier tails, meaning more extreme values.

  • A value less than 3 points to a platykurtic shape. It has a flatter peak and lighter tails, meaning fewer outliers.

Let’s take a closer look at each type of kurtosis distribution:

1. Leptokurtic

A leptokurtic distribution has positive excess kurtosis, meaning a kurtosis value above 3. Leptokurtic distributions are characterized by tall, sharp peaks and heavy tails. This means that the data points are heavily concentrated near the mean, but there are also more extreme values or outliers than a normal distribution would produce. The presence of heavy tails makes this type of distribution riskier in real-world scenarios like finance, where sudden large gains or losses can occur. Leptokurtic data suggests a high likelihood of rare but significant deviations from the average.

Sharp peak with heavy tails

2. Mesokurtic

A mesokurtic distribution has a moderate peak and tails, similar to that of a normal distribution. It indicates a balanced dataset without an excess of extreme values or outliers. This type of distribution is commonly seen in naturally occurring variables like IQ scores, where most data points lie close to the mean, and extreme scores are relatively rare. Mesokurtic is considered the “baseline” against which other kurtosis types are compared.

Moderate peak and tails

3. Platykurtic

Platykurtic distributions have negative excess kurtosis, meaning a kurtosis value below 3. Compared to a normal distribution, platykurtic distributions have flatter peaks and thinner tails. This suggests that the data is more evenly spread out, with fewer extreme values or outliers. While this might indicate a lesser risk of outliers in applications like manufacturing or quality control, it can also mean the data lacks strong central tendencies. An example could be a uniform distribution, where all values occur with roughly the same frequency:

Flat peak with light tails

When analyzing data, it’s crucial to understand not just the center but also the shape of the tails along with how likely extreme values or outliers are. Kurtosis measures this tailedness and indicates the likelihood of rare, extreme events.
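As an illustration of the three shapes, the sketch below draws large random samples and measures their kurtosis with scipy.stats.kurtosis(), passing fisher=False so that a normal distribution scores about 3. The choice of distributions (normal, Laplace, and uniform), the sample size, and the seed are our own assumptions:

```python
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(0)
n = 100_000

samples = {
    "normal (mesokurtic)": rng.normal(size=n),
    "Laplace (leptokurtic)": rng.laplace(size=n),
    "uniform (platykurtic)": rng.uniform(size=n),
}

# fisher=False reports kurtosis on the scale where a normal distribution is about 3
results = {name: kurtosis(x, fisher=False) for name, x in samples.items()}

for name, value in results.items():
    print(f"{name:>22}: kurtosis = {value:.2f}")
```

The normal sample lands near 3, the Laplace sample well above it, and the uniform sample well below it.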

To better compare distributions, excess kurtosis is used, which subtracts 3 from the kurtosis value. This helps us see if the data is more or less prone to extreme outliers compared to a normal distribution.

💡 Excess kurtosis = Kurtosis − 3
This makes it easy to compare against a normal distribution:

  • Excess > 0: More extreme outliers

  • Excess < 0: Less prone to outliers
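To show where these numbers come from, here is a sketch that computes kurtosis directly from its moment definition (the fourth central moment divided by the squared second central moment) and then subtracts 3. The data is a small sample chosen for illustration. The cross-check uses scipy’s default settings, which apply no small-sample bias correction, so the value differs slightly from a bias-corrected result:

```python
import numpy as np
from scipy.stats import kurtosis

data = np.array([12, 15, 14, 10, 18, 19, 21, 20, 24, 30, 45, 50, 55, 60, 65], dtype=float)

deviations = data - data.mean()
m2 = np.mean(deviations ** 2)   # second central moment (population variance)
m4 = np.mean(deviations ** 4)   # fourth central moment

pearson_kurtosis = m4 / m2 ** 2
excess_kurtosis = pearson_kurtosis - 3

print(f"Pearson kurtosis: {pearson_kurtosis:.3f}")
print(f"Excess kurtosis:  {excess_kurtosis:.3f}")
print(f"scipy (default):  {kurtosis(data):.3f}")
```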

Calculate kurtosis

Now that we’ve seen how kurtosis reflects the shape and tails of a distribution, let’s move from theory to practice. We’ll use Python’s scipy.stats.kurtosis() function to calculate kurtosis for a simple dataset.

The kurtosis() function offers two ways to interpret results. First, with fisher=True parameter, it returns excess kurtosis, where a normal distribution has a baseline of 0. This makes it easier to compare tailedness directly against the normal curve. Second, with fisher=False, it gives the Pearson kurtosis, where a normal distribution has a baseline of 3; this is the traditional kurtosis value.

Python 3.10.4
from scipy.stats import kurtosis
# Sample dataset (e.g., customer purchase amounts)
data = [12, 15, 14, 10, 18, 19, 21, 20, 24, 30, 45, 50, 55, 60, 65]
# Calculate excess kurtosis (Fisher’s definition, normal = 0)
excess_kurt = kurtosis(data, fisher=True, bias=False)
print("Excess kurtosis:", excess_kurt)
# Calculate Pearson kurtosis (normal = 3)
pearson_kurt = kurtosis(data, fisher=False, bias=False)
print("Pearson kurtosis:", pearson_kurt)

The excess kurtosis value is −1.071, which is less than 0, indicating a platykurtic distribution. This means the data has a flatter peak and lighter tails than a normal distribution, implying fewer extreme values or outliers.

The Pearson kurtosis is 1.929, also below the normal benchmark of 3. This confirms the same conclusion: the distribution is relatively flat and has a lesser risk of extreme deviations.

Wrap up

Univariate analysis is the foundational step in understanding data, where we inspect one variable at a time to uncover its typical values, spread, and shape. This process helps us detect outliers, identify skewness, and grasp the variability of the data. We use measures of central tendency, like mean, median, and mode, to understand where the data tends to cluster. When combined with measures of dispersion, such as variance, standard deviation, range, and interquartile range, we gain a fuller picture of how each variable behaves. Skewness and kurtosis further refine our understanding of distribution shape, which is crucial for subsequent analysis decisions. Gaining proficiency in univariate analysis allows us to prepare data thoughtfully and build intuition for deeper, multivariate exploration.

Quiz

1. Which of the following is a measure of central tendency?

A. Variance

B. Mean

C. Range

D. Skewness