
Unveiling the Numbers

Explore univariate analysis to examine one variable at a time, understanding its central tendency, dispersion, and distribution shape. Learn key statistical measures including mean, median, mode, variance, standard deviation, skewness, and kurtosis. This lesson helps you detect data issues, interpret distributions, and prepare for deeper data analysis.

Imagine we’re handed a complex machine with hundreds of moving parts, all connected and interacting. Before trying to understand how the machine works as a whole, wouldn’t we want to inspect each part individually? See how it moves, how it’s built, and whether it’s functioning properly. That’s exactly what univariate analysis does for our data.

In the last lesson, we briefly examined our data’s structure and types. Now, it’s time to listen more closely to what each column tells us. Each variable holds clues about typical values, spread, unusual observations, and overall shape that guide how we analyze the data.

In this lesson, we’ll focus on univariate statistics: exploring one variable at a time. We’ll measure its center, spread, and shape to understand its underlying distribution and spot any red flags.

What is univariate analysis?

Univariate analysis is simply the examination of a single variable in isolation. It’s about understanding its distribution, its central tendency (the typical value), and its variability (how much values differ). At this stage, we’re not concerned with relationships or interactions with other variables, just the characteristics of one variable by itself.

Analyzing a single column

Whether the variable is numerical or categorical, univariate analysis helps us see what we’re dealing with. This clarity is essential: it helps detect early data issues, informs how we preprocess data, and guides the next steps of analysis.

Informational note: Univariate analysis is the “solo performance” of the data. Each variable gets its moment in the spotlight to reveal its unique characteristics.

Descriptive statistics

In univariate analysis, we aim to understand the characteristics of a single variable. Descriptive statistics provide a compact summary of that variable’s distribution, offering insights into its central tendency, spread, and shape.

Fun fact: Descriptive statistics are our data’s “elevator pitch”: quick, concise summaries that tell us the most important things in seconds.

Central tendency

Central tendency metrics describe where the center of the data lies. For a data analyst, this means identifying what’s representative in the dataset. For example: What’s the average customer rating on our product? There are three key measures of central tendency:

Central tendency measures

Let’s start with the most familiar one: the mean.

Mean

This refers to the arithmetic average. It’s calculated by summing all values and dividing by the count. It’s sensitive to extreme values (outliers), which can pull the average up or down.

For example, if five customers give ratings of 3, 4, 4, 5, 10, the mean is:

$$\text{mean} = \frac{3 + 4 + 4 + 5 + 10}{5} = \frac{26}{5} = 5.2$$

That may sound fine at first, but look closely: most ratings are around 4, yet the average is over 5. Why is that so? Because that single high rating of 10 pulls the mean upward. This shows how the mean can be influenced by extreme values.
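The pull of that single rating is easy to check numerically. The following is a minimal sketch using Python's standard-library `statistics` module (chosen here for illustration; it is not the lesson's own tooling) that compares the mean of the five ratings with and without the 10.

```python
from statistics import mean

# The five customer ratings from the example above
ratings = [3, 4, 4, 5, 10]

# Mean of all five ratings: (3 + 4 + 4 + 5 + 10) / 5
mean_all = mean(ratings)

# Mean after leaving out the single extreme rating of 10
mean_without_outlier = mean([r for r in ratings if r != 10])

print(f"Mean with the 10:    {mean_all}")
print(f"Mean without the 10: {mean_without_outlier}")
```

Dropping one value out of five moves the mean by more than a full rating point, which is what "sensitive to outliers" means in practice.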

Fun fact: The mean is so sensitive to outliers that it behaves like the “average wealth at a billionaires’ party”: a few extremes can drastically skew the picture!

Median

This refers to the middle value when the data is sorted. If there’s an odd number of values, it’s the one in the center; if even, it’s the average of the two central numbers. The median is more robust than the mean when dealing with skewed data.

Using the same five ratings, 3, 4, 4, 5, 10, let’s sort them in ascending order: 3, 4, 4, 5, 10 (they are already sorted). The median is the middle value, which is 4 in this case.

Notice how the high rating of 10 doesn’t affect the median at all. That’s why the median is often a better measure when our data has outliers; it offers a more stable picture of what’s typical.
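To make that robustness concrete, here is a small sketch using Python's standard-library `statistics` module (an illustrative choice, not the lesson's own tooling). The rating of 1000 and the four-value list are made-up variations on the example, added only to show the odd-count and even-count rules.

```python
from statistics import median

# The five customer ratings from the example above
ratings = [3, 4, 4, 5, 10]

# Odd count: the median is the single middle value after sorting
median_all = median(ratings)

# Hypothetical variation: make the extreme rating far more extreme.
# The middle value stays where it was.
median_extreme = median([3, 4, 4, 5, 1000])

# Hypothetical even-count list: the median is the average of the
# two middle values, here 4 and 5
median_even = median([3, 4, 5, 10])

print(f"Median of the five ratings: {median_all}")
print(f"Median with 1000 instead of 10: {median_extreme}")
print(f"Median of an even-count list: {median_even}")
```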

Fun fact: The median is the fair referee, always finding the true middle, even when some scores are incredibly high or low!

Mode

This is the most frequently occurring value in the dataset. It helps identify what’s most common, especially useful for categorical or discrete variables. Again, let’s identify it with the same ratings as before: 3, 4, 4, 5, 10. The mode is the most frequent value, and here, it’s 4 because it appears twice.

The mode is especially helpful when we want to know what’s most common, whether we're looking at customer preferences or product categories.
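As a brief sketch of both uses, the snippet below relies on Python's standard-library `statistics` module (an illustrative choice, not the lesson's own tooling). The `categories` list is hypothetical data invented for this example; it shows that a variable can have more than one mode when values tie.

```python
from statistics import mode, multimode

# The five customer ratings from the example above
ratings = [3, 4, 4, 5, 10]

# 4 appears twice, every other rating once
most_common_rating = mode(ratings)

# Hypothetical categorical data: product categories of five purchases
categories = ["books", "toys", "books", "games", "toys"]

# multimode() returns every value tied for the highest count,
# in the order each first appears in the data
most_common_categories = multimode(categories)

print(f"Most common rating: {most_common_rating}")
print(f"Most common categories: {most_common_categories}")
```

When a tie like this occurs, reporting a single mode hides part of the picture, so it is worth checking how many values share the top count.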

Fun fact: The mode is the “popularity award winner” of our data, literally showing us what’s most common!

Example

To see these measures in action, we’ll use a simple list of ten customer review ratings and calculate the mean, median, and mode with pandas, so we can compare what each one tells us.

Python 3.10.4
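import pandas as pd

# Ten customer review ratings
ratings = pd.Series([5, 4, 4, 3, 5, 2, 5, 1, 3, 4])

mean = ratings.mean()      # arithmetic average
median = ratings.median()  # middle value of the sorted ratings
# mode() returns every tied value in sorted order; .iloc[0] keeps the first (smallest)
mode = ratings.mode().iloc[0]

print(f"Mean: {mean}, Median: {median}, Mode: {mode}")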
  • ratings.mean() ...