Connect the Dots
Learn how to explore and visualize relationships between variables in the data using bivariate analysis techniques.
We'll cover the following...
We explored individual variables: looking at their distributions, central tendencies, and spread. That gave us a solid understanding of what each variable looks like. But in the real-world, variables often interact with each other. To answer questions like:
Does more time on a website lead to more purchases?
Is income influenced by education level?
Do taller people tend to weigh more?
We need to move beyond solo stats and explore how variables interact. This brings us to bivariate analysis, which helps us understand relationships between variables.
Bivariate analysis
When working with datasets, we often must examine how two variables relate. This is known as a bivariate relationship. Understanding these relationships helps us uncover meaningful patterns, identify key associations, and generate insights that inform decisions.
The type of analysis depends on the nature of the variables, whether they are numeric, categorical, or a mix of both. In this section, we’ll look at the main types of bivariate relationships, see how to summarize them with the right statistics, and learn how to interpret them visually.
Let’s start with the simplest case: when both variables are numeric.
Numeric vs. numeric
When both variables in our dataset are numeric, we’re often interested in whether a change in one variable corresponds to a change in the other. These relationships are fundamental in data analysis, because they help us understand how two continuous measurements vary.
For example, we might ask: if someone is taller, do they also tend to weigh more? If a company increases its marketing budget, do its sales improve? This type of relationship is at the heart of what’s known as bivariate numerical analysis.
To study this, we typically start with two tools: numerical summarization and visual inspection.
In visual inspection, we’ll focus on interpreting the plots; generating them in Google Sheets will come later in the “Show What You Found” chapter.
Quantifying the relationship
In data analysis, one of the most fundamental questions we ask is: To what extent do two numeric variables move together? This question lies at the heart of generating insights, identifying behavioral drivers, and revealing hidden patterns in the data. The Pearson correlation coefficient offers a precise, quantitative answer. It measures the strength and direction of a linear relationship between two continuous variables, helping analysts interpret associations with clarity and confidence.
Pearson correlation coefficient (r)
The Pearson correlation coefficient (r) measures both the strength and direction of the linear relationship between two numeric variables. In this context, we usually think of one as the independent variable (the factor we believe may be influencing the other) and the second as the dependent variable, which reflects the outcome or response being influenced. Its value always falls between
Where:
, are the individual data points. , are the means of the two variables. , are the standard deviations of the variables. is the number of observations.
The result is bounded between
If
, it shows a perfect positive linear relationship. Every increase in one variable is associated with a proportional increase in the other. For example, if height increases, weight increases at a consistent rate. All points lie exactly on an upward-sloping straight line. If
, it shows a perfect negative linear relationship. Every increase in one variable is associated with a proportional decrease in the other. For instance, if the temperature rises, heating costs consistently decrease. All points lie on a downward-sloping straight line. If
, it implies that there is no linear relationship. There’s no linear association between the variables. The data points appear scattered without any directional trend. This does not indicate no relationship between the variables; it could be categorized as non-linear.
Calculating correlation in Google Sheets
In Google Sheets, we can calculate it using the formula: =PEARSON(data_y, data_x)
. The function takes two parameters, as mentioned below.
data_y
is the range of values for the dependent variable.data_x
is the range of values for the independent variable.
Let’s explore how the Pearson correlation works using a simple dataset that records individuals’ Height
and Weight
. This example demonstrates how to compute, and interpret the correlation coefficient in Google Sheets.
A value of Height
and Weight
. As Height
increases, Weight
tends to increase proportionally. This high correlation suggests that the relationship is nearly linear, and consistent across the observed values.
Visual inspection
While correlation helps us understand how strongly two variables move together, it can mask very different patterns. For example, Anscombe’s quartet (a famous set of four datasets) shares identical summary statistics (means =
Analyst tip: Don’t skip visuals. A quick histogram or scatterplot often reveals issues or insights that statistics alone can’t.
Let’s explore scatterplots to understand how we visualize relationships between variables.
Scatterplot
The scatterplot is one of the most fundamental tools for examining the relationship between two numeric variables. Each point on the plot represents an individual observation, where the horizontal (
Scatterplots help us answer questions like:
Does one variable increase as the other increases?
Are the points tightly clustered (indicating a strong relationship) or widely scattered (suggesting a weak one)?
Is the pattern linear, curved, or more complex?
For example, the scatterplot below shows Height
(cm) on the x-axis and Weight
(kg) on the y-axis. Each dot represents an individual’s height and corresponding weight:
We can observe a clear upward trend in the points, indicating that as Height
increases, Weight
tends to increase as well. This positive linear relationship suggests that the two variables are positively correlated.
Simple linear regression
While correlation highlights how strongly two numeric variables are linearly related, simple linear regression (SLR) takes it a step further; it helps us model that linear relationship and even predict the value of one variable based on the value of the other. It answers the question, “If I know X, what can I expect Y to be?”
If correlation tells us if two variables are friends, simple linear regression tries to draw the “friendship line” and predict where that friendship might go. It’s like predicting how much a plant will grow based on how much water it receives.
Think of it as finding the “line of best fit” through the points on a scatterplot. This line helps us understand the typical change in one variable when the other changes.
The equation for a simple linear regression model is:
Where:
is the dependent variable (the variable we want to predict or explain, e.g., weight). is the independent variable (the variable we use to predict, e.g., height). is the intercept (the predicted value of when is ). Sometimes, this might not make practical sense, but it’s part of the mathematical equation. is the slope (this tells us how much is expected to change for every one-unit increase in ). Epsilon (ε, Ε)) represents the error term or residual, which accounts for the variation in
that cannot be explained by .
Predicting values in Google Sheets
Using our Height
and Weight
data, we can build a simple linear regression model to predict a person’s weight based on their height. Google Sheets provides the FORECAST(x, data_y, data_x)
function that performs the prediction automatically. It takes the parameters mentioned below.
x
: The value of the independent variable for which we want to predict the dependent variable.data_y
: The range of known dependent variable values (the outcomes we want to predict).data_x
: The range of known independent variable values (the inputs corresponding todata_y
).
To predict the Weight
for a new Height
of 172
cm, use the following formula:
This will return the predicted Weight
directly: 68.90 kg
.
Now that we’ve thoroughly explored how numeric variables interact, let’s shift our focus to understanding relationships when both variables are categorical.
Categorical vs. categorical
In many real-world scenarios, we deal with two categorical variables and want to explore whether there’s an association between them. For example:
Are men more likely than women to make a purchase?
Do customer segments vary by region?
Is product return behavior related to payment method?
Understanding such relationships helps with segmentation, targeting, and behavioral analysis. To explore these relationships, we use cross-tabulation (or contingency tables).
Cross-tabulation
A cross-tabulation shows how often combinations of category values occur. It’s a simple yet powerful tool that lays the groundwork for pattern detection. It presents a matrix (or table) where:
Rows represent the categories of one variable.
Columns represent the categories of the second variable.
Cells show the frequency (or proportion) of each combination.
Imagine we have data on Sex
and Purchase
behavior. A crosstab might look like this:
Sex | Purchase: Yes | Purchase: No |
Male | 4 | 1 |
Female | 2 | 3 |
Each cell tells us how often a particular combination occurs:
4 males purchased.
3 females did not purchase.
Example
To analyze the relationship between two categorical variables in Google Sheets, we can use a pivot table, a tool we’re already familiar with from our last chapter. It’s the most powerful way to create a cross-tabulation table and see the frequency of each combination.
Creating a cross-tabulation
First, enter the data into two columns in a Google Sheet. Make sure each column has a header.
Next, select the entire data range (A1:B11
). Then, go to the top menu and click “Insert” > “Pivot table.”
In the Pivot table editor that appears on the right, we will configure our cross-tabulation table.
Rows: Click “Add” next to “Rows” and select
Sex
.Columns: Click “Add” next to “Columns” and select
Purchase
.Values: Click “Add” next to “Values” and select
Purchase
again. In the “Summarize by” drop-down, make sure it is set toCOUNTA
to get a count of each sex-purchase combination.
This will generate a cross-tabulation table, showing the raw counts:
If one group is larger than the other, raw counts can be misleading. That’s where percentages help. We can organize the crosstab by row to see within-group proportions.
In the pivot table editor, go to the “Values” section, where we have the count of purchases. Click the drop-down arrow next to it, and under the “Show as” option, select the “% of the row.” This updates the table to show the percentage of purchases within each sex group:
Now it’s easier to see:
Now that we’ve explored how to analyze relationships between two categorical variables, let’s shift our focus to situations where we have a mix, one numeric and one categorical variable.
Numeric vs. categorical
When working with datasets that include both categorical (e.g., grade, region, product type) and numerical variables (e.g., income, sales, age), a common analysis task is to examine how a numeric variable behaves across different categories. This type of relationship helps reveal disparities, patterns, or outliers within subgroups of the data.
To compare numeric variables across categories, we use a number of techniques mentioned below.
Grouped summary statistics: These provide numerical insights such as mean, median, standard deviation, and quartiles, grouped by category.
Candlestick chart: Visualizes the low, high, opening, and closing values of a numeric variable over a specific period or across categories. It’s commonly used to track fluctuations, such as daily stock prices, and makes it easier to spot patterns, variations, and outliers.
Bar chart of aggregated values: Displays the average, sum, or other aggregate of a numeric variable for each category, helping identify relative differences quickly.
We can find grouped summary statistics in Google Sheets using a pivot table, which we’re already familiar with from our last chapter. Simply put the categorical variable in the “Rows” section, and the numeric variable in the “Values” section. From there, we can change the aggregation from SUM
to other metrics we’ve learned, like AVERAGE
, MEDIAN
, or STDEV
, to see how the numeric data behaves within each group.
Now that we know how to get these numerical summaries, let’s explore an effective way to visualize these differences with a candlestick chart.
Candlestick chart
A candlestick chart is a great way to visualize the spread of a numeric variable across categories. For example, we can track fluctuations in stock prices, rainfall, temperatures, or sales across different regions.
To create this chart, we arrange our data in columns as mentioned below.
First column: Enter a label for the x-axis (e.g., the category).
Second column: Enter the maximum (high) value. This forms the top of the candle’s center line.
Third column: Enter the opening (initial) value. This is one vertical border of the candle.
Fourth column: Enter the closing (final) value. This is the second vertical border of the candle.
Fifth column: Enter the minimum (low) value. This forms the base of the candle’s center line.
Rows: Each row represents a single candlestick marker.
We can now look at a sample dataset of daily stock prices alongside its candlestick chart to see how the values are organized, and represented visually.
The chart shows boxes indicating the range between opening and closing prices, as well as lines representing the minimum and maximum for each Day
. This makes it easy to observe how stock values fluctuated over the week, highlighting daily peaks, dips, and overall movement.
Bar chart
We can also use a bar chart to summarize a numeric variable across categories using an aggregate value, such as the average, sum, or count. This type of chart is especially useful when we want to compare categories at a glance, as each bar represents a single numeric value and differences between categories are immediately visible.
In the following weekly rainfall example, we first calculate the Average Rainfall
for each Day
by taking the mean of the daily measurements and then generate the bar chart to visualize these values.
The bar chart shows one bar per Day
, representing the Average Rainfall
. This allows us to quickly compare days, and identify which day had the highest or lowest rainfall.
Wrap up
In this lesson, we explored how two variables relate to each other using tools like scatterplots, Pearson correlation, grouped statistics, candlestick, and bar charts. We also introduced simple linear regression to move beyond patterns and into prediction, using a line to model and interpret numeric relationships.
These methods are key to uncovering insights in exploratory data analysis. They help us compare groups, measure associations, and even predict one variable from another. This builds a strong foundation for deeper analytical work.