Unveil the Numbers and Uncover Insights
Learn to explore each variable individually to uncover insights, detect outliers, and understand data distribution.
Before building models or exploring relationships between variables, it’s important to understand each variable independently. This process is called univariate analysis, which examines one column at a time to understand its distribution, central tendency, variability, and any unusual values.
You looked at your dataset's structure and data types in earlier steps. Now, you’ll dig deeper into each variable to understand what it can tell you. You’ll look at its typical values, how much it varies, and whether it contains any outliers or patterns that might affect your later analysis.
In this lesson, you’ll learn how to apply univariate statistics to explore individual columns. You’ll measure center (mean, median), spread (range, standard deviation), and shape (skewness, kurtosis) to build a clearer picture of your data, one variable at a time.
What is univariate analysis?
Univariate analysis is simply the examination of a single variable in isolation. It’s about understanding its distribution, central tendency (the typical value), and variability (how much values differ). At this stage, we’re not concerned with relationships or interactions with other variables—just the characteristics of one variable by itself.
Whether the variable is numerical or categorical, univariate analysis helps us see what we’re dealing with. This clarity is essential: it helps detect early data issues, informs how we preprocess data, and guides the next steps of analysis.
Descriptive statistics
In univariate analysis, we aim to understand the characteristics of a single variable. Descriptive statistics provide a compact summary of that variable’s distribution, offering insights into its central tendency, spread, and shape.
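As a quick sketch of what such a compact summary can look like, the snippet below calls pandas' `describe()` on a single column and then asks for two shape measures. The `scores` values are hypothetical, invented only for illustration.

```python
import pandas as pd

# Hypothetical review scores, invented for this sketch.
scores = pd.Series([3, 4, 4, 5, 2, 4, 5, 3], name="score")

# describe() reports count, mean, std, min, quartiles, and max in one call.
summary = scores.describe()
print(summary)

# Shape measures are separate calls on the same Series.
skewness = scores.skew()   # sign indicates the direction of the longer tail
kurtosis = scores.kurt()   # excess kurtosis (a normal distribution is near 0)
print(f"Skewness: {skewness:.3f}, Kurtosis: {kurtosis:.3f}")
```

The `50%` row of the summary is the median, so center and spread can be read side by side before looking at shape.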
Central tendency
Central tendency metrics describe where the center of the data lies. For a data scientist, this means identifying what’s representative in the dataset. For example: What’s the average customer rating on our product? There are three key measures of central tendency: mean, median, and mode.
Let’s start with the most familiar one: the mean.
Mean
Mean is the arithmetic average. It’s calculated by summing all values and dividing by the count. It’s sensitive to extreme values (outliers), which can pull the average up or down.
For example, if five customers give ratings of
That may sound fine at first, but look closely: most ratings are around 4, yet the average is over 5. Why? Because that single high rating of 10 pulls the mean upward. This shows how the mean can be influenced by extreme values.
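The effect can be checked with a small sketch. The two lists below are hypothetical ratings made up for this illustration (not necessarily the lesson's own numbers); they differ only in the last value.

```python
import pandas as pd

# Hypothetical ratings, invented for this sketch.
typical = pd.Series([4, 4, 3, 5, 4])        # no extreme value
with_outlier = pd.Series([4, 4, 3, 5, 10])  # last rating replaced by a 10

mean_typical = typical.mean()        # 20 / 5
mean_outlier = with_outlier.mean()   # 26 / 5

print(f"Without outlier: {mean_typical}")
print(f"With outlier:    {mean_outlier}")
```

Changing one value out of five moves the mean from 4.0 to 5.2, even though four of the five ratings are unchanged.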
Median
Median is the middle value when the data is sorted. If there’s an odd number of values, it’s the one in the center; if even, it’s the average of the two central numbers. The median is more robust than the mean when dealing with skewed data.
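A minimal sketch of the odd and even cases, using hypothetical values chosen only to make the arithmetic easy to follow:

```python
import pandas as pd

# Hypothetical values, invented for this sketch.
odd_count = pd.Series([3, 10, 4, 5, 4])   # sorted: 3, 4, 4, 5, 10
even_count = pd.Series([3, 10, 4, 5])     # sorted: 3, 4, 5, 10

median_odd = odd_count.median()     # the single center value
median_even = even_count.median()   # average of the two central values, 4 and 5

print(f"Odd count median:  {median_odd}")
print(f"Even count median: {median_even}")
print(f"Odd count mean:    {odd_count.mean()}")
```

In the odd-count series the median stays at 4.0 while the mean is 5.2, which is the robustness to extreme values described above.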
Using the same five ratings—
Notice how the high rating of
Mode
Mode is the most frequently occurring value in the dataset. It helps identify what’s most common, especially useful for categorical or discrete variables. Again with the same ratings:
The mode is especially helpful when we want to know what’s most common, whether we're looking at customer preferences or product categories.
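For a categorical column, the mode can be read alongside the full frequency table. The sketch below uses a hypothetical `categories` Series and also shows what happens when two values tie.

```python
import pandas as pd

# Hypothetical product categories, invented for this sketch.
categories = pd.Series(["books", "toys", "books", "games", "toys", "books"])

counts = categories.value_counts()   # frequency of each category, most common first
top = categories.mode().iloc[0]      # the most frequent category

print(counts)
print(f"Most common category: {top}")

# mode() returns every tied value, so a Series can have more than one mode.
tied = pd.Series([1, 2, 2, 3, 3])
tied_modes = tied.mode().tolist()
print(f"Tied modes: {tied_modes}")
```

Because `mode()` always returns a Series, checking its length is a simple way to notice ties before picking a single value with `.iloc[0]`.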
Example
To see these measures in action, we’ll use a simple list of customer review ratings to calculate the mean, median, and mode with pandas, and gain a better understanding of what each one tells us.
import pandas as pd

ratings = pd.Series([5, 4, 4, 3, 5, 2, 5, 1, 3, 4])

mean = ratings.mean()
median = ratings.median()
mode = ratings.mode().iloc[0]

print(f"Mean: {mean}, Median: {median}, Mode: {mode}")
ratings.mean(): Calculates the arithmetic average of all ratings.
ratings.median(): Finds the middle value when the ratings are sorted.
ratings.mode().iloc[0]: Returns the most frequent rating(s) as a Series sorted in ascending order and selects the first one using .iloc[0]. In this Series, 4 and 5 each appear three times, so the selected mode is 4.
In output, the mean of
Dispersion
While central tendency helps us understand the most common or typical values in a dataset, dispersion shows how much the data varies. This is crucial for data scientists trying to assess consistency or variability in customer behavior, product performance, or operational metrics. In other words: Are most ratings similar, or is there a lot of fluctuation?
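To make that question concrete, here is a small sketch comparing two hypothetical sets of ratings that share the same mean but differ in spread. The data is invented for illustration, and pandas' `var()` and `std()` use the sample formula (dividing by n - 1) by default.

```python
import pandas as pd

# Hypothetical ratings, invented for this sketch; both sum to 18.
steady = pd.Series([3, 3, 4, 3, 3, 2])
varied = pd.Series([1, 5, 1, 5, 5, 1])

for name, s in [("steady", steady), ("varied", varied)]:
    value_range = s.max() - s.min()   # distance between the extremes
    variance = s.var()                # sample variance (ddof=1 by default)
    std_dev = s.std()                 # square root of the sample variance
    print(f"{name}: mean={s.mean()}, range={value_range}, "
          f"variance={variance:.2f}, std={std_dev:.2f}")
```

Both series average 3.0, so the mean alone cannot tell them apart; the range and standard deviation are what reveal that the second set fluctuates far more.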
There are several key measures of dispersion:
Variance
...