...

Unveil the Numbers and Uncover Insights

Learn to explore each variable individually to uncover insights, detect outliers, and understand data distribution.

Before building models or exploring relationships between variables, it’s important to understand each variable independently. This process is called univariate analysis, which examines one column at a time to understand its distribution, central tendency, variability, and any unusual values.

You looked at your dataset's structure and data types in earlier steps. Now, you’ll dig deeper into each variable to understand what it can tell you. You’ll look at its typical values, how much it varies, and whether it contains any outliers or patterns that might affect your later analysis.

In this lesson, you’ll learn how to apply univariate statistics to explore individual columns. You’ll measure center (mean, median), spread (range, standard deviation), and shape (skewness, kurtosis) to build a clearer picture of your data, one variable at a time.

What is univariate analysis?

Univariate analysis is simply the examination of a single variable in isolation. It’s about understanding its distribution, central tendency (the typical value), and variability (how much values differ). At this stage, we’re not concerned with relationships or interactions with other variables—just the characteristics of one variable by itself.

Analyzing a single column

Whether the variable is numerical or categorical, univariate analysis helps us see what we’re dealing with. This clarity is essential: it helps detect early data issues, informs how we preprocess data, and guides the next steps of analysis.
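As a rough illustration of the numerical versus categorical distinction, the sketch below summarizes one column of each kind with pandas. The DataFrame and its column names (rating, category) are invented for this example and are not part of the lesson's dataset.

```python
import pandas as pd

# Hypothetical data, invented purely for illustration.
df = pd.DataFrame({
    "rating": [5, 4, 4, 3, 5, 2, 5, 1, 3, 4],
    "category": ["toys", "books", "toys", "games", "toys",
                 "books", "toys", "games", "books", "toys"],
})

# Numerical column: count, mean, std, min, quartiles, and max in one call.
rating_summary = df["rating"].describe()

# Categorical column: how often each label occurs, most frequent first.
category_counts = df["category"].value_counts()

print(rating_summary)
print(category_counts)
```

A numerical column is usually summarized with statistics, while a categorical column is usually summarized with counts, which is why the two calls differ.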

Descriptive statistics

In univariate analysis, we aim to understand the characteristics of a single variable. Descriptive statistics provide a compact summary of that variable’s distribution, offering insights into its central tendency, spread, and shape.
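One possible way to map these three aspects onto pandas methods is sketched below. The values are a small made-up Series with one large value, chosen only so that the skew is visibly positive.

```python
import pandas as pd

# Illustrative values only; one large value creates a long right tail.
values = pd.Series([3, 4, 4, 5, 10])

center = values.median()  # central tendency: the middle value
spread = values.std()     # spread: sample standard deviation (ddof=1 by default)
shape = values.skew()     # shape: positive when the right tail is longer

print(f"Center: {center}, Spread: {spread:.2f}, Skew: {shape:.2f}")
```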

Central tendency

Central tendency metrics describe where the center of the data lies. For a data scientist, this means identifying what’s representative in the dataset. For example: What’s the average customer rating on our product? There are three key measures of central tendency: the mean, the median, and the mode.

Central tendency measures

Let’s start with the most familiar one—the mean.

Mean

Mean is the arithmetic average. It’s calculated by summing all values and dividing by the count. It’s sensitive to extreme values (outliers), which can pull the average up or down.

For example, if five customers give ratings of 3, 4, 4, 5, and 10, the mean is:

(3 + 4 + 4 + 5 + 10) / 5 = 26 / 5 = 5.2

That may sound fine at first, but look closely: most ratings are around 4, yet the average is over 5. Why? Because the single high rating of 10 pulls the mean upward. This shows how the mean can be influenced by extreme values.
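To make the outlier's effect concrete, this small sketch computes the mean of the same five ratings with and without the rating of 10. Dropping the value is done here only for comparison, not as a recommended cleaning step.

```python
import pandas as pd

ratings = pd.Series([3, 4, 4, 5, 10])

# Mean of all five ratings: (3 + 4 + 4 + 5 + 10) / 5
mean_all = ratings.mean()

# Mean after excluding the single high rating: (3 + 4 + 4 + 5) / 4
mean_without_10 = ratings[ratings < 10].mean()

print(f"With the 10: {mean_all}, without it: {mean_without_10}")
```

One value out of five shifts the mean by more than a full rating point.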

Median

Median is the middle value when the data is sorted. If there’s an odd number of values, it’s the one in the center; if even, it’s the average of the two central numbers. The median is more robust than the mean when dealing with skewed data.

Using the same five ratings (3, 4, 4, 5, 10), let’s sort them: 3, 4, 4, 5, 10. The median is the middle value, which is 4 in this case.

Notice how the high rating of 10 doesn’t affect the median at all. That’s why the median is often a better measure when our data has outliers. It gives a more stable picture of what’s typical.
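A quick way to check this robustness is to make the outlier far more extreme and compare both measures. The second Series below is hypothetical: it is the same data with the 10 replaced by 100.

```python
import pandas as pd

ratings = pd.Series([3, 4, 4, 5, 10])
extreme = pd.Series([3, 4, 4, 5, 100])  # hypothetical: same data, larger outlier

# The middle value of the sorted data is the same in both cases.
median_ratings = ratings.median()
median_extreme = extreme.median()

# The mean, by contrast, follows the outlier.
mean_ratings = ratings.mean()
mean_extreme = extreme.mean()

print(f"Median: {median_ratings} vs. {median_extreme}")
print(f"Mean: {mean_ratings} vs. {mean_extreme}")
```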

Mode

Mode is the most frequently occurring value in the dataset. It helps identify what’s most common, which is especially useful for categorical or discrete variables. Again with the same ratings: 3, 4, 4, 5, 10. The mode is the most frequent value, and here it’s 4 because it appears twice.

The mode is especially helpful when we want to know what’s most common, whether we're looking at customer preferences or product categories.
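The sketch below shows the mode on a categorical column and what happens when two values tie. The category labels and the tied Series are made up for illustration.

```python
import pandas as pd

# Hypothetical product categories, invented for illustration.
categories = pd.Series(["toys", "books", "toys", "games", "books", "toys"])

# mode() returns a Series, so .iloc[0] picks out the single most common label.
top_category = categories.mode().iloc[0]

# With a tie, mode() returns every tied value in sorted order.
tied = pd.Series([1, 1, 2, 2, 3])
tied_modes = tied.mode().tolist()

print(top_category)
print(tied_modes)
```

Because mode() can return more than one value, taking only the first entry silently hides any tie, so it is worth checking the full result.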

Example

To see these measures in action, we’ll use a simple list of customer review ratings to calculate the mean, median, and mode with pandas, and gain a better understanding of what each one tells us.

Python 3.10.4
import pandas as pd
ratings = pd.Series([5, 4, 4, 3, 5, 2, 5, 1, 3, 4])
mean = ratings.mean()
median = ratings.median()
mode = ratings.mode().iloc[0]
print(f"Mean: {mean}, Median: {median}, Mode: {mode}")
  • ratings.mean(): Calculates the arithmetic average of all ratings.

  • ratings.median(): Finds the middle value when the ratings are sorted.

  • ratings.mode().iloc[0]: Returns the most frequent rating(s) in the Series and selects the first one using .iloc[0].

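In the output, the mean of 3.6 gives us the overall average rating, but it's slightly pulled down by lower scores. The median and the reported mode are both 4, suggesting that most customers rated the product around that level. Note that 4 and 5 each appear three times in this list, so mode() actually returns both values in ascending order, and .iloc[0] keeps only the smaller one.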

Dispersion

While central tendency helps us understand the most common or typical values in a dataset, dispersion shows how much the data varies. This is crucial for data scientists trying to assess consistency or variability in customer behavior, product performance, or operational metrics. In other words: Are most ratings similar, or is there a lot of fluctuation?

There are several key measures of dispersion:

Variance

...