Regression Confession
Learn how regression tests explain relationships and predict outcomes in data.
As data analysts, we often try to make sense of patterns in data and explain them in a way others can understand. People will ask questions like: Does income affect purchases? Does battery size predict phone longevity?
We need a reliable method to move beyond simply describing trends and actually test relationships. This is where a regression test comes in.
What is a regression test?
A regression test is a statistical approach that helps us:
Model relationships between variables.
Measure the strength and direction of those relationships.
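Test hypotheses about whether one variable affects another (regression on its own shows association; a causal claim also needs a sound study design).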
Predict future outcomes based on past data.
Instead of relying on assumptions or intuition, regression lets us test claims like: “Does income significantly influence purchase decisions?”
Types of regression tests
Different regression tests are suited for different kinds of outcomes. The test we choose depends on the type of variable we’re trying to predict.
| Regression Type | Description | Target Variable | Common Use Cases |
| --- | --- | --- | --- |
| Linear regression | Tests linear relationships with numeric outcomes | Continuous (e.g., price) | Predicting sales, income, weights |
| Logistic regression | Tests the probability of binary outcomes | Binary (e.g., 0/1) | Churn prediction, marketing conversion |
| Multiple regression | Tests multiple inputs at once | Continuous | Controlling for multiple factors |
| Poisson regression | Tests relationships with count-based targets | Count (e.g., clicks) | Website visits, event occurrences |
| Ordinal regression | Tests effects on ranked outcomes | Ordered categories | Customer satisfaction (low, medium, high) |
In this lesson, we’ll focus on the two most useful regression tests for data analysts: linear regression and logistic regression.
Linear regression
Linear regression is one of the most useful tools when we’re working with numeric outcomes. It’s especially helpful when the variable we want to predict is continuous, like price, revenue, or performance score. It also helps us understand how much each factor contributes to the result.
As analysts, we’re often tasked with more than just reporting values. We’re expected to explain why something happened and what might happen next. That’s where regression tests shine. They let us model relationships between dependent and independent variables, measure how strong those relationships are, and test whether the effects we observe are real or just random noise.
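Written as an equation, a linear regression with k predictors takes the standard textbook form below. The symbols are conventional notation and are introduced here for illustration; they are not names used elsewhere in this lesson.

```latex
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + \varepsilon
```

Here y is the dependent variable, x_1 through x_k are the independent variables, β_0 is the intercept, and ε is random noise. Each β_j is the expected change in y for a one-unit increase in x_j with the other predictors held fixed, which is what "how much each factor contributes" means in practice.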
Example
We have data for 15 houses, including their square footage, number of rooms, and the final sale price. Our goal is to understand how square footage and number of rooms together influence the house price. To do this, we set up a statistical hypothesis test that helps us evaluate whether the observed relationships are meaningful or just due to chance:
Null hypothesis (H₀): Neither square footage nor number of rooms has a relationship with house price.
Alternative hypothesis (H₁): At least one of square footage and number of rooms has a significant relationship with house price.
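One common way to formalize these hypotheses, assuming the standard overall F-test for a linear model, is in terms of the model's coefficients. The β symbols are conventional notation rather than names from the dataset.

```latex
\text{Price} = \beta_0 + \beta_1 \,\text{Square\_Feet} + \beta_2 \,\text{Rooms} + \varepsilon

H_0:\ \beta_1 = \beta_2 = 0
\qquad
H_1:\ \beta_1 \neq 0 \ \text{ or } \ \beta_2 \neq 0
```

Under this reading, a small p-value for the overall F-statistic is evidence against H₀, while the p-value reported for each individual coefficient indicates whether that particular predictor matters once the other is accounted for.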
```python
import pandas as pd
import statsmodels.api as sm

# Load dataset from CSV
df = pd.read_csv("house_prices.csv")

# Define independent variables
X = df[['Square_Feet', 'Rooms']]
X = sm.add_constant(X)  # Add intercept

# Define dependent variable
y = df['Price']

# Fit the regression model
model = sm.OLS(y, X)
results = model.fit()

# Show summary
print(results.summary())
```
Line 5: Reads the CSV file named house_prices.csv and loads it into a DataFrame called df. This file should contain columns like Square_Feet, Rooms, and Price.
Line 8: Selects the independent (predictor) variables from the DataFrame: Square_Feet and Rooms. These will be used to predict the price.
Line 9: Adds a constant term (a column of 1s) to the predictor variables. This represents the intercept (
...