Patterns in Many

Learn how to analyze relationships involving more than two variables to answer complex questions and uncover deeper insights.

We’ve seen how analyzing two variables helps uncover relationships, like how height relates to weight or how sex affects purchasing behavior. But sometimes, those two-variable relationships tell only part of the story. Other variables may also be influencing the outcome.

  • What if weight depends on both height and age?

  • What if customer satisfaction varies by product type, price, and location?

These kinds of questions demand multivariate analysis, where we study how three or more variables interact. It helps us explore complex patterns, uncover hidden structures, and build better predictive models.

What is multivariate analysis?

Multivariate analysis is all about examining relationships between three or more variables simultaneously. It helps us make sense of complex data where simple one-to-one comparisons don’t tell the full story.

Fun fact: Think of multivariate analysis like solving a mystery with multiple clues, not just one or two. The more clues (variables) we consider, the closer we get to the truth!

With multivariate techniques, we can answer questions like:

  • How do multiple factors, say age, income, and education, together influence someone’s likelihood to make a purchase?

  • Are there distinct groups or clusters hidden within our dataset?

These techniques are especially powerful in real-world analysis, where data is rarely simple. By considering multiple variables together, we get a more complete, nuanced picture of what’s going on.
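
To make the idea of several factors acting together concrete, here is a minimal sketch that fits one outcome against two predictors at once. Everything in it is an assumption made for illustration: the data is synthetic, the variable names (height, age, weight) echo the earlier question about weight, and the coefficients used to generate the data are invented. It uses NumPy's least-squares solver, which is one of several ways to fit such a model.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
n = 500

# Synthetic data, invented for illustration only (not from a real study).
height = rng.normal(170, 10, n)   # cm
age = rng.uniform(20, 60, n)      # years
noise = rng.normal(0, 2, n)

# Assumed relationship: weight depends on BOTH height and age.
weight = -100 + 0.9 * height + 0.3 * age + noise  # kg

# Design matrix with an intercept column, then the two predictors.
X = np.column_stack([np.ones(n), height, age])

# Ordinary least squares: estimate all coefficients simultaneously.
coef, *_ = np.linalg.lstsq(X, weight, rcond=None)
intercept, b_height, b_age = coef

print(f"height coefficient: {b_height:.2f}")
print(f"age coefficient:    {b_age:.2f}")
```

Because both predictors enter the model together, each estimated coefficient describes the effect of one variable while the other is held fixed, which is the kind of question a two-variable plot cannot answer on its own.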

Visualizing multivariate relationships

Choosing the right tools to visualize multivariate relationships is key. In this section, we’ll focus on two commonly used approaches:

  1. Colored scatter plots (with size and hue): Add extra variables to a basic scatter plot by mapping color and marker size to additional dimensions.

  2. Correlation heatmaps: Display pairwise correlation coefficients in a grid, making it easy to spot strong or weak relationships at a glance.

We’ll now examine each of these approaches in action to see how they reveal insights that simple plots might miss.

1. Colored scatter plot with size and hue

The colored scatter plot is an enhanced version of the basic scatter plot that lets us visualize more than two variables at once. Just like a standard scatter plot, each point represents an individual observation, with one variable on the x-axis and another on the y-axis. What makes this plot special is the use of color (hue) and marker size (size) to represent additional dimensions.

This allows us to explore questions like:

  • Do patterns vary across different categories or groups (like species or regions)?

  • How does a third variable influence the relationship between two others?

  • Are larger values of a variable concentrated in certain areas of the plot?

Let’s walk through an example using the Iris dataset. We’ll create a scatter plot where the x-axis represents sepal length and the y-axis shows sepal width. Each point is colored by flower species, adding a categorical distinction through hue. To bring in a fourth variable, we scale the size of each point by petal length. This multivariate scatter plot helps us see how different species of flowers cluster based on their sepal measurements, while also giving us a sense of how petal length varies within those groups.
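
The plot described above can be sketched roughly as follows. This assumes seaborn's scatterplot function, since the hue and size terminology matches its parameters; loading Iris from scikit-learn, the shortened column names, the marker size range, and the output filename are all choices made here for illustration rather than the lesson's own code.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris

# Load Iris as a DataFrame and give the columns shorter names (our choice).
iris = load_iris(as_frame=True)
df = iris.frame.rename(columns={
    "sepal length (cm)": "sepal_length",
    "sepal width (cm)": "sepal_width",
    "petal length (cm)": "petal_length",
    "petal width (cm)": "petal_width",
})

# Map the integer target codes (0, 1, 2) to species names.
df["species"] = df["target"].map(dict(enumerate(iris.target_names)))

fig, ax = plt.subplots(figsize=(8, 6))
sns.scatterplot(
    data=df,
    x="sepal_length",       # first variable: horizontal position
    y="sepal_width",        # second variable: vertical position
    hue="species",          # third variable: color
    size="petal_length",    # fourth variable: marker size
    sizes=(20, 200),        # smallest and largest marker area (assumed range)
    alpha=0.7,
    ax=ax,
)
ax.set_xlabel("Sepal length (cm)")
ax.set_ylabel("Sepal width (cm)")
ax.set_title("Iris: sepal dimensions by species and petal length")
fig.tight_layout()
fig.savefig("iris_scatter.png")  # hypothetical output filename
```

The colors in the resulting figure depend on the active seaborn palette, so they may not match the ones named in the lesson's figure.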

Figure: Iris scatter plot showing sepal dimensions with species as hue and petal length as size

We can observe how:

  • Setosa flowers (dark blue) cluster in a distinct region with higher sepal width.

  • Versicolor and virginica overlap more in their sepal measurements but differ in marker size, which reflects petal length.

  • Larger markers (indicating longer petals) appear more in the virginica group.
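
These visual impressions can be cross-checked numerically. The sketch below, which assumes the scikit-learn copy of Iris and pandas group-wise means (neither is confirmed as the lesson's own tooling), computes the average sepal width and petal length for each species.

```python
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.frame

# Map the integer target codes (0, 1, 2) to species names.
df["species"] = df["target"].map(dict(enumerate(iris.target_names)))

# Mean sepal width and petal length per species.
summary = (
    df.groupby("species")[["sepal width (cm)", "petal length (cm)"]]
    .mean()
    .round(2)
)
print(summary)
```

If the plot reads correctly, setosa should show the highest mean sepal width and virginica the highest mean petal length.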

Fun fact: Multivariate scatter plots are like data storytelling with visuals; colors, sizes, and positions all combine to reveal hidden patterns and groupings!

2. Correlation heatmap

A correlation heatmap is a powerful tool for examining the relationships between multiple numeric variables at once. Unlike a scatter plot that shows the relationship between just two variables, a heatmap presents a grid of pairwise correlations, making it easy to spot patterns across an entire dataset.
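
As a rough sketch of how such a grid can be produced, the example below computes pairwise Pearson correlations with pandas and draws them with seaborn's heatmap function. Reusing the Iris measurements here is an assumption, as are the color map, figure size, and output filename; the lesson's own example may use a different dataset.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris

# Use only the four numeric measurement columns.
df = load_iris(as_frame=True).data

# Pairwise Pearson correlation matrix (pandas' default method).
corr = df.corr()

fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(
    corr,
    annot=True,        # print each coefficient inside its cell
    fmt=".2f",
    cmap="coolwarm",   # diverging palette (our choice)
    vmin=-1, vmax=1,   # pin the color scale to the full correlation range
    square=True,
    ax=ax,
)
ax.set_title("Iris: pairwise correlations")
fig.tight_layout()
fig.savefig("iris_corr_heatmap.png")  # hypothetical output filename
```

Fixing vmin and vmax at -1 and 1 keeps the colors comparable across datasets, so a pale cell always means a weak correlation rather than merely the weakest one in the current grid.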

Each cell in the heatmap shows the correlation coefficient (ranging from -1 to +1), which tells us:

  • Positive correlation (closer to +1 ...