...

Build a Predictor

Learn to pick the right regression model, fit it using scikit-learn, and make predictions on new data.

Imagine we’re trying to guess a specific number. Not just whether something will happen, but exactly how much or what value it will be, like predicting the exact temperature tomorrow or the precise amount of sales next quarter. In machine learning, when we want to make these kinds of numerical forecasts from our data, we use a technique called regression. It helps us find the hidden patterns needed to predict those exact values.

What is regression?

Regression analysis examines how independent variables, or features, relate to a dependent variable, or outcome. It is a predictive modeling approach in machine learning in which an algorithm predicts continuous outcomes. At its core, regression fits a function of the form:

y = f(X)

This function captures how one or more input variables (the features X) relate to a continuous target (y). During training, we show the model many examples of feature vectors X paired with known outcomes y. It then adjusts its internal parameters so that f(X) comes as close as possible to the true y. Once trained, the model applies this learned function to new feature sets to produce numerical predictions, whether forecasting stock prices, estimating customer lifetime value, or projecting energy demand.
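
To make the fit-and-predict workflow concrete, here is a minimal sketch using scikit-learn. The feature values and targets below are synthetic and chosen only for illustration; any regression estimator with fit and predict methods would follow the same pattern.

```python
# Minimal sketch of the regression workflow: learn f(X) from examples,
# then apply it to new feature sets. The data here is made up.
import numpy as np
from sklearn.linear_model import LinearRegression

# Training examples: feature vectors X paired with known outcomes y
X_train = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y_train = np.array([1.8, 4.1, 6.2, 7.9, 10.1])

# Fitting adjusts the model's internal parameters so f(X) approximates y
model = LinearRegression()
model.fit(X_train, y_train)

# Once trained, the learned function is applied to unseen feature sets
X_new = np.array([[6.0], [7.0]])
print(model.predict(X_new))  # numerical forecasts for the new inputs
```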

Now that we know what regression is, let’s explore the main types of regression.

Types of regression

Regression has several variations, each suited to different data patterns and problem requirements. A data scientist must understand these differences to pick the right tool for the job. We’ll explore common regression types, from simple linear models to those handling multiple factors and curved relationships in data.

Linear regression

Linear regression models the relationship between one or more input variables (features X) and a continuous output (y) by fitting a straight line. It achieves this by finding the slope (θ) and intercept (b) that minimize the squared differences between the predicted values and the actual targets.

Figure: A straight line is the best fit for the data points

In equation form:

y = θX + b

Here, y represents the predicted output, which is the numerical value we're trying to forecast. X is the input feature, the data we're using to make that prediction. θ (theta) is the slope of the line, telling us how much y changes for every one-unit change in X. Finally, ...
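
To see how the slope and intercept show up in code, here is a minimal sketch that fits scikit-learn's LinearRegression on a small synthetic dataset and reads off the learned parameters: coef_ holds the slope θ and intercept_ holds b. The numbers below are made up for illustration.

```python
# Minimal sketch: fit a simple linear regression and inspect the learned
# slope (theta) and intercept (b). The data is synthetic.
import numpy as np
from sklearn.linear_model import LinearRegression

# One input feature X and a roughly linear target y
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.1])

model = LinearRegression()
model.fit(X, y)

print("slope (theta):", model.coef_[0])    # change in y per one-unit change in X
print("intercept (b):", model.intercept_)  # predicted y when X is 0
print("prediction at X=6:", model.predict([[6.0]])[0])
```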