Unveiling the Numbers

Learn how to explore each variable individually to uncover insights, detect outliers, and understand data distribution.

Imagine we’re handed a complex machine with hundreds of moving parts, all connected and interacting. Before trying to understand how the machine works as a whole, wouldn’t we want to inspect each part individually? See how it moves, how it’s built, and whether it’s functioning properly. That’s exactly what univariate analysis does for our data.

In the last lesson, we briefly examined our data’s structure and types. Now, it’s time to listen more closely to what each column tells us. Each variable holds clues about typical values, spread, unusual observations, and overall shape that guide how we analyze the data.

In this lesson, we’ll focus on univariate statistics: exploring one variable at a time. We’ll measure its center, spread, and shape to understand its underlying distribution and spot any red flags.

What is univariate analysis?

Univariate analysis is simply the examination of a single variable in isolation. It’s about understanding its distribution, its central tendency (the typical value), and its variability (how much values differ). At this stage, we’re not concerned with relationships or interactions with other variables, just the characteristics of one variable by itself.

[Figure: Analyzing a single column]

Whether the variable is numerical or categorical, univariate analysis helps us see what we’re dealing with. This clarity is essential: it helps detect early data issues, informs how we preprocess data, and guides the next steps of analysis.
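As a rough sketch of what this first look could involve in pandas, the snippet below summarizes one numerical and one categorical column. The DataFrame and its column names (`rating`, `plan`) are hypothetical and invented purely for illustration.

```python
import pandas as pd

# Hypothetical toy data, made up for this illustration
df = pd.DataFrame({
    "rating": [5, 4, 4, 3, 5, 2],
    "plan": ["basic", "pro", "basic", "free", "basic", "pro"],
})

# Numerical column: count, mean, spread, and quartiles in one call
numeric_summary = df["rating"].describe()
print(numeric_summary)

# Categorical column: how often each distinct value occurs
plan_counts = df["plan"].value_counts()
print(plan_counts)
```

Each column is inspected by itself here; nothing in the snippet relates `rating` to `plan`.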

Informational note: Univariate analysis is the “solo performance” of the data. Each variable gets its moment in the spotlight to reveal its unique characteristics.

Descriptive statistics

In univariate analysis, we aim to understand the characteristics of a single variable. Descriptive statistics provide a compact summary of that variable’s distribution, offering insights into its central tendency, spread, and shape.

Fun fact: Descriptive statistics are our data’s “elevator pitch”: quick, concise summaries that tell us the most important things in seconds.

Central tendency

Central tendency metrics describe where the center of the data lies. For a data analyst, this means identifying what’s representative in the dataset. For example: What’s the average customer rating on our product? There are three key measures of central tendency: the mean, the median, and the mode.

[Figure: Central tendency measures]

Let’s start with the most familiar one: the mean.

Mean

This refers to the arithmetic average. It’s calculated by summing all values and dividing by the count. It’s sensitive to extreme values (outliers), which can pull the average up or down.

For example, if five customers give ratings of 3, 4, 4, 5, 10, the mean is:

$$\text{mean} = \frac{3 + 4 + 4 + 5 + 10}{5} = \frac{26}{5} = 5.2$$

That may sound fine at first, but look closely: most ratings are around 4, yet the average is over 5. Why is that so? Because that single high rating of 10 pulls the mean upward. This shows how the mean can be influenced by extreme values.
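To make the pull of that one rating concrete, here is a small sketch that computes the mean of the five ratings with and without the 10. Dropping the value is done only for comparison; it is not a recommendation to delete outliers.

```python
import pandas as pd

ratings = pd.Series([3, 4, 4, 5, 10])

# Mean of all five ratings: (3 + 4 + 4 + 5 + 10) / 5
mean_all = ratings.mean()

# Mean after leaving out the single rating of 10: (3 + 4 + 4 + 5) / 4
mean_without_10 = ratings[ratings != 10].mean()

print(f"Mean with the 10:    {mean_all}")
print(f"Mean without the 10: {mean_without_10}")
```

With the 10 included the mean is 5.2; without it, the mean is 4.0, which sits much closer to where most of the ratings are.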

Fun fact: The mean is so sensitive to outliers that it’s like computing the average wealth at a billionaires’ party: a few extreme values can drastically skew the result!

Median

This refers to the middle value when the data is sorted. If there’s an odd number of values, it’s the one in the center; if even, it’s the average of the two central numbers. The median is more robust than the mean when dealing with skewed data.

Using the same five ratings, 3, 4, 4, 5, 10, let’s sort them in ascending order: 3, 4, 4, 5, 10. The median is the middle value, which is 4 in this case.

Notice how the high rating of 10 doesn’t affect the median at all. That’s why the median is often a better measure when our data has outliers; it offers a more stable picture of what’s typical.
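One way to check this robustness is to exaggerate the outlier and compare both measures. In the sketch below, the second Series is hypothetical: it is the same data with the 10 replaced by 1000.

```python
import pandas as pd

ratings = pd.Series([3, 4, 4, 5, 10])
extreme = pd.Series([3, 4, 4, 5, 1000])  # hypothetical: same data, exaggerated outlier

# The median only depends on the middle of the sorted values
print(f"Median: {ratings.median()} vs. {extreme.median()}")

# The mean uses every value, so the outlier drags it along
print(f"Mean:   {ratings.mean()} vs. {extreme.mean()}")
```

The median stays at 4.0 in both cases, while the mean jumps from 5.2 to 203.2.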

Fun fact: The median is the fair referee, always finding the true middle, even when some scores are incredibly high or low!

Mode

This is the most frequently occurring value in the dataset. It helps identify what’s most common, especially useful for categorical or discrete variables. Again, let's identify it with the same ratings as before: 3, 4, 4, 5, 10. The mode is the most frequent value, and here, it’s 4 because it appears twice.

The mode is especially helpful when we want to know what’s most common, whether we're looking at customer preferences or product categories.
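Because the mode does not require arithmetic, it works on text labels as well as numbers. The sketch below applies `mode()` to the five ratings and to a hypothetical list of product categories (the category names are invented for illustration).

```python
import pandas as pd

# Numeric example: the five ratings used throughout this section
ratings = pd.Series([3, 4, 4, 5, 10])
rating_modes = ratings.mode().tolist()
print(rating_modes)

# Categorical example: hypothetical product categories
categories = pd.Series(["books", "toys", "toys", "games", "toys"])
top_category = categories.mode().iloc[0]
print(top_category)
```

A mean or median of the category labels would not be meaningful, but the mode still tells us which label is most common.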

Fun fact: The mode is the “popularity award winner” of our data, literally showing us what’s most common!

Example

To see these measures in action, we’ll use a simple list of customer review ratings to calculate the mean, median, and mode with pandas. This is so that we can gain a better understanding of what each one tells us.

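import pandas as pd

# Customer review ratings
ratings = pd.Series([5, 4, 4, 3, 5, 2, 5, 1, 3, 4])

mean = ratings.mean()          # arithmetic average
median = ratings.median()      # middle value of the sorted ratings
mode = ratings.mode().iloc[0]  # mode() lists all tied modes in ascending order; keep the first
print(f"Mean: {mean}, Median: {median}, Mode: {mode}")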
  • ratings.mean(): Calculates the arithmetic average of all ratings.

  • ratings.median(): Finds the middle value when the ratings are sorted.

  • ratings.mode().iloc[0]: mode() returns every value tied for most frequent, sorted in ascending order; .iloc[0] then selects the first (smallest) of them.

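In the output, the mean of 3.6 gives us the overall average rating, pulled slightly below the median by the few low scores. The median is 4, and the reported mode is also 4. Note, however, that 4 and 5 each appear three times, so this data has two modes, and .iloc[0] simply returns the smaller one. Either way, most customers rated the product 4 or 5.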
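Since `.iloc[0]` keeps only one value, it can be worth checking whether several ratings tie for most frequent. A minimal sketch of that check, using the same ten ratings:

```python
import pandas as pd

ratings = pd.Series([5, 4, 4, 3, 5, 2, 5, 1, 3, 4])

# Every value tied for most frequent, in ascending order
all_modes = ratings.mode().tolist()

# Number of occurrences of each rating, ordered by rating value
counts = ratings.value_counts().sort_index()

print(all_modes)
print(counts)
```

Here `all_modes` contains both 4 and 5, each appearing three times, so reporting a single mode hides the tie.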

Dispersion

While central tendency helps us understand the most common or typical values in a dataset, dispersion shows how much the data varies. This is crucial for analysts trying to assess consistency or variability in customer behavior, product performance, or operational metrics. In other words: Are most ratings similar, or is there a lot of fluctuation?

Fun fact: If central tendency tells us where the target is, dispersion highlights how scattered the arrows are around that target!

There are several key measures of dispersion:

Variance

Variance is a statistical measure of the spread of the data points in a dataset. It tells us how much individual data points differ from the mean. Variance is calculated by finding the average of the squared differences between each data point and the mean.
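The calculation can be sketched step by step on the five ratings used earlier in this lesson. One detail to be aware of: averaging over all n values gives the population variance, whereas pandas' `var()` divides by n - 1 by default (the sample variance, `ddof=1`), so the two numbers differ.

```python
import pandas as pd

ratings = pd.Series([3, 4, 4, 5, 10])

mean = ratings.mean()
squared_diffs = (ratings - mean) ** 2  # squared distance of each rating from the mean

# Population variance: plain average of the squared differences (divide by n)
population_variance = squared_diffs.mean()

# Sample variance: divide by n - 1 instead
sample_variance = squared_diffs.sum() / (len(ratings) - 1)

print(population_variance, ratings.var(ddof=0))  # both use n
print(sample_variance, ratings.var())            # pandas defaults to ddof=1
```

For these ratings the squared differences sum to 30.8, giving a population variance of 6.16 and a sample variance of 7.7.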

Here ...