How to build machine learning regression models with Python

18 mins read
Jun 15, 2026

Marvel Comics introduced a fictional character Destiny in the 1980s, with the ability to foresee future occurrences. The exciting news is that predicting future events is no longer just a fantasy! With the progress made in machine learning, a machine can help forecast future events by utilizing the past.

Predicting future

Exciting, right? Let’s start this journey with a simple prediction model. Regression is a statistical technique that models the relationship between a dependent variable and one or more independent variables. In machine learning, it serves as a predictive modeling approach: an algorithm learns how the independent variables, or features, relate to a continuous outcome and then predicts that outcome for new inputs. Rather than delving into theory, the focus will be on creating different models for regression.

Understanding the input data#

Before building a Python regression model, one should examine the data. For instance, if an individual owns a fish farm and needs to predict a fish’s weight based on its dimensions, they can start by loading the dataset and displaying the top few rows of the DataFrame.

Fish.txt
Species Weight V-Length D-Length X-Length Height Width
Bream 290 24 26.3 31.2 12.48 4.3056
Bream 340 23.9 26.5 31.1 12.3778 4.6961
Bream 363 26.3 29 33.5 12.73 4.4555
Bream 430 26.5 29 34 12.444 5.134
Bream 450 26.8 29.7 34.7 13.6024 4.9274
Bream 500 26.8 29.7 34.5 14.1795 5.2785
Bream 390 27.6 30 35 12.67 4.69
Bream 450 27.6 30 35.1 14.0049 4.8438
Bream 500 28.5 30.7 36.2 14.2266 4.9594
Bream 475 28.4 31 36.2 14.2628 5.1042
Bream 500 28.7 31 36.2 14.3714 4.8146
Bream 500 29.1 31.5 36.4 13.7592 4.368
Bream 340 29.5 32 37.3 13.9129 5.0728
Bream 600 29.4 32 37.2 14.9544 5.1708
Bream 600 29.4 32 37.2 15.438 5.58
Bream 700 30.4 33 38.3 14.8604 5.2854
Bream 700 30.4 33 38.5 14.938 5.1975
Bream 610 30.9 33.5 38.6 15.633 5.1338
Bream 650 31 33.5 38.7 14.4738 5.7276
Bream 575 31.3 34 39.5 15.1285 5.5695
Bream 685 31.4 34 39.2 15.9936 5.3704
Bream 620 31.5 34.5 39.7 15.5227 5.2801
Bream 680 31.8 35 40.6 15.4686 6.1306
Bream 700 31.9 35 40.5 16.2405 5.589
Bream 725 31.8 35 40.9 16.36 6.0532
Bream 720 32 35 40.6 16.3618 6.09
Bream 714 32.7 36 41.5 16.517 5.8515
Bream 850 32.8 36 41.6 16.8896 6.1984
Bream 1000 33.5 37 42.6 18.957 6.603
Bream 920 35 38.5 44.1 18.0369 6.3063
Bream 955 35 38.5 44 18.084 6.292
Bream 925 36.2 39.5 45.3 18.7542 6.7497
Bream 975 37.4 41 45.9 18.6354 6.7473
Bream 950 38 41 46.5 17.6235 6.3705
Roach 40 12.9 14.1 16.2 4.1472 2.268
Roach 69 16.5 18.2 20.3 5.2983 2.8217
Roach 78 17.5 18.8 21.2 5.5756 2.9044
Roach 87 18.2 19.8 22.2 5.6166 3.1746
Roach 120 18.6 20 22.2 6.216 3.5742
Roach 0 19 20.5 22.8 6.4752 3.3516
Roach 110 19.1 20.8 23.1 6.1677 3.3957
Roach 120 19.4 21 23.7 6.1146 3.2943
Roach 150 20.4 22 24.7 5.8045 3.7544
Roach 145 20.5 22 24.3 6.6339 3.5478
Roach 160 20.5 22.5 25.3 7.0334 3.8203
Roach 140 21 22.5 25 6.55 3.325
Roach 160 21.1 22.5 25 6.4 3.8
Roach 169 22 24 27.2 7.5344 3.8352
Roach 161 22 23.4 26.7 6.9153 3.6312
Roach 200 22.1 23.5 26.8 7.3968 4.1272
Roach 180 23.6 25.2 27.9 7.0866 3.906
Roach 290 24 26 29.2 8.8768 4.4968
Roach 272 25 27 30.6 8.568 4.7736
Roach 390 29.5 31.7 35 9.485 5.355
Whitefish 270 23.6 26 28.7 8.3804 4.2476
Whitefish 270 24.1 26.5 29.3 8.1454 4.2485
Whitefish 306 25.6 28 30.8 8.778 4.6816
Whitefish 540 28.5 31 34 10.744 6.562
Whitefish 800 33.7 36.4 39.6 11.7612 6.5736
Whitefish 1000 37.3 40 43.5 12.354 6.525
Parkki 55 13.5 14.7 16.5 6.8475 2.3265
Parkki 60 14.3 15.5 17.4 6.5772 2.3142
Parkki 90 16.3 17.7 19.8 7.4052 2.673
Parkki 120 17.5 19 21.3 8.3922 2.9181
Parkki 150 18.4 20 22.4 8.8928 3.2928
Parkki 140 19 20.7 23.2 8.5376 3.2944
Parkki 170 19 20.7 23.2 9.396 3.4104
Parkki 145 19.8 21.5 24.1 9.7364 3.1571
Parkki 200 21.2 23 25.8 10.3458 3.6636
Parkki 273 23 25 28 11.088 4.144
Parkki 300 24 26 29 11.368 4.234
Perch 5.9 7.5 8.4 8.8 2.112 1.408
Perch 32 12.5 13.7 14.7 3.528 1.9992
Perch 40 13.8 15 16 3.824 2.432
Perch 51.5 15 16.2 17.2 4.5924 2.6316
Perch 70 15.7 17.4 18.5 4.588 2.9415
Perch 100 16.2 18 19.2 5.2224 3.3216
Perch 78 16.8 18.7 19.4 5.1992 3.1234
Perch 80 17.2 19 20.2 5.6358 3.0502
Perch 85 17.8 19.6 20.8 5.1376 3.0368
Perch 85 18.2 20 21 5.082 2.772
Perch 110 19 21 22.5 5.6925 3.555
Perch 115 19 21 22.5 5.9175 3.3075
Perch 125 19 21 22.5 5.6925 3.6675
Perch 130 19.3 21.3 22.8 6.384 3.534
Perch 120 20 22 23.5 6.11 3.4075
Perch 120 20 22 23.5 5.64 3.525
Perch 130 20 22 23.5 6.11 3.525
Perch 135 20 22 23.5 5.875 3.525
Perch 110 20 22 23.5 5.5225 3.995
Perch 130 20.5 22.5 24 5.856 3.624
Perch 150 20.5 22.5 24 6.792 3.624
Perch 145 20.7 22.7 24.2 5.9532 3.63
Perch 150 21 23 24.5 5.2185 3.626
Perch 170 21.5 23.5 25 6.275 3.725
Perch 225 22 24 25.5 7.293 3.723
Perch 145 22 24 25.5 6.375 3.825
Perch 188 22.6 24.6 26.2 6.7334 4.1658
Perch 180 23 25 26.5 6.4395 3.6835
Perch 197 23.5 25.6 27 6.561 4.239
Perch 218 25 26.5 28 7.168 4.144
Perch 300 25.2 27.3 28.7 8.323 5.1373
Perch 260 25.4 27.5 28.9 7.1672 4.335
Perch 265 25.4 27.5 28.9 7.0516 4.335
Perch 250 25.4 27.5 28.9 7.2828 4.5662
Perch 250 25.9 28 29.4 7.8204 4.2042
Perch 300 26.9 28.7 30.1 7.5852 4.6354
Perch 320 27.8 30 31.6 7.6156 4.7716
Perch 514 30.5 32.8 34 10.03 6.018
Perch 556 32 34.5 36.5 10.2565 6.3875
Perch 840 32.5 35 37.3 11.4884 7.7957
Perch 685 34 36.5 39 10.881 6.864
Perch 700 34 36 38.3 10.6091 6.7408
Perch 700 34.5 37 39.4 10.835 6.2646
Perch 690 34.6 37 39.3 10.5717 6.3666
Perch 900 36.5 39 41.4 11.1366 7.4934
Perch 650 36.5 39 41.4 11.1366 6.003
Perch 820 36.6 39 41.3 12.4313 7.3514
Perch 850 36.9 40 42.3 11.9286 7.1064
Perch 900 37 40 42.5 11.73 7.225
Perch 1015 37 40 42.4 12.3808 7.4624
Perch 820 37.1 40 42.5 11.135 6.63
Perch 1100 39 42 44.6 12.8002 6.8684
Perch 1000 39.8 43 45.2 11.9328 7.2772
Perch 1100 40.1 43 45.5 12.5125 7.4165
Perch 1000 40.2 43.5 46 12.604 8.142
Perch 1000 41.1 44 46.6 12.4888 7.5958
Pike 200 30 32.3 34.8 5.568 3.3756
Pike 300 31.7 34 37.8 5.7078 4.158
Pike 300 32.7 35 38.8 5.9364 4.3844
Pike 300 34.8 37.3 39.8 6.2884 4.0198
Pike 430 35.5 38 40.5 7.29 4.5765
Pike 345 36 38.5 41 6.396 3.977
Pike 456 40 42.5 45.5 7.28 4.3225
Pike 510 40 42.5 45.5 6.825 4.459
Pike 540 40.1 43 45.8 7.786 5.1296
Pike 500 42 45 48 6.96 4.896
Pike 567 43.2 46 48.7 7.792 4.87
Pike 770 44.8 48 51.2 7.68 5.376
Pike 950 48.3 51.7 55.1 8.9262 6.1712
Pike 1250 52 56 59.7 10.6863 6.9849
Pike 1600 56 60 64 9.6 6.144
Pike 1550 56 60 64 9.6 6.144
Pike 1650 59 63.4 68 10.812 7.48
Smelt 6.7 9.3 9.8 10.8 1.7388 1.0476
Smelt 7.5 10 10.5 11.6 1.972 1.16
Smelt 7 10.1 10.6 11.6 1.7284 1.1484
Smelt 9.7 10.4 11 12 2.196 1.38
Smelt 9.8 10.7 11.2 12.4 2.0832 1.2772
Smelt 8.7 10.8 11.3 12.6 1.9782 1.2852
Smelt 10 11.3 11.8 13.1 2.2139 1.2838
Smelt 9.9 11.3 11.8 13.1 2.2139 1.1659
Smelt 9.8 11.4 12 13.2 2.2044 1.1484
Smelt 12.2 11.5 12.2 13.4 2.0904 1.3936
Smelt 13.4 11.7 12.4 13.5 2.43 1.269
Smelt 12.2 12.1 13 13.8 2.277 1.2558
Smelt 19.7 13.2 14.3 15.2 2.8728 2.0672
Smelt 19.9 13.8 15 16.2 2.9322 1.8792
  • The pandas library is imported to read the data into a DataFrame.

  • The data is read from the Fish.txt file using the listed column names.

  • The top five rows of the DataFrame are printed. The three lengths are the vertical, diagonal, and cross lengths in cm.
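
The loading step described above can be sketched as follows. This is a minimal sketch, assuming the same column names and tab-separated layout that the complete example later in this article uses for Fish.txt. To keep it runnable without the file, a few rows of the data are inlined through io.StringIO; with the real file, 'Fish.txt' would be passed to read_csv instead.

```python
import io
import pandas as pd

# A few rows of Fish.txt inlined so the sketch runs without the file
# (assumption: the real file is tab-separated with this header row).
sample = (
    "Species\tWeight\tV-Length\tD-Length\tX-Length\tHeight\tWidth\n"
    "Bream\t290\t24\t26.3\t31.2\t12.48\t4.3056\n"
    "Bream\t340\t23.9\t26.5\t31.1\t12.3778\t4.6961\n"
    "Roach\t40\t12.9\t14.1\t16.2\t4.1472\t2.268\n"
    "Perch\t5.9\t7.5\t8.4\t8.8\t2.112\t1.408\n"
    "Pike\t200\t30\t32.3\t34.8\t5.568\t3.3756\n"
    "Smelt\t6.7\t9.3\t9.8\t10.8\t1.7388\t1.0476\n"
)

columns = ['Species', 'Weight', 'V-Length', 'D-Length', 'X-Length', 'Height', 'Width']

# With the real dataset: pd.read_csv('Fish.txt', sep='\t', usecols=columns)
Fish = pd.read_csv(io.StringIO(sample), sep='\t', usecols=columns)

# Display the top five rows of the DataFrame
print(Fish.head())
```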

Here, the fish’s length, height, and width are independent variables, with weight serving as the dependent variable. In machine learning, independent variables are often referred to as features and dependent variables as labels, and these terms will be used interchangeably throughout this blog.

Linear regression model#

Linear regression models, a fundamental concept you’ll encounter as you learn machine learning, are widely used in statistics and machine learning. These models use a straight line to describe the relationship between an independent variable and a dependent variable. For example, when analyzing the weight of fish, a linear regression model describes the relationship between the weight $y$ of the fish and one of the independent variables $X$ as follows:

\[
y = m \cdot X + c
\]

where $m$ is the slope of the line, which defines its steepness, and $c$ is the y-intercept, the point where the line crosses the y-axis.

Straight line

Selecting feature#

The dataset contains five independent variables. A simple linear regression model uses only one feature, so a sensible starting point is the feature most strongly related to the fish’s Weight. One way to find it is to calculate the correlation between Weight and each feature.

Python
# Finding the correlation matrix (numeric columns only, since Species is text)
print(Fish.corr(numeric_only=True))

After examining the first column, the following is observed:

  • There is a strong correlation between Weight and the feature X-Length.
  • Weight has the weakest correlation with Height.

Given this information, it is clear that if the individual is limited to using only one independent variable to predict the dependent variable, they should choose X-Length and not Height.
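
To make the comparison easier to read, the correlations with Weight can be pulled out and sorted. The sketch below is illustrative only: it uses a small excerpt of the fish data as a stand-in for the full Fish DataFrame, so the exact values and ordering on the full dataset may differ.

```python
import pandas as pd

# Small excerpt of the fish data (assumption: stands in for the full Fish DataFrame)
Fish = pd.DataFrame({
    'Species':  ['Bream', 'Bream', 'Roach', 'Roach', 'Perch', 'Perch', 'Pike', 'Pike', 'Smelt'],
    'Weight':   [290, 700, 40, 160, 100, 840, 300, 1250, 9.8],
    'V-Length': [24, 30.4, 12.9, 20.5, 16.2, 32.5, 31.7, 52, 10.7],
    'D-Length': [26.3, 33, 14.1, 22.5, 18, 35, 34, 56, 11.2],
    'X-Length': [31.2, 38.3, 16.2, 25.3, 19.2, 37.3, 37.8, 59.7, 12.4],
    'Height':   [12.48, 14.8604, 4.1472, 7.0334, 5.2224, 11.4884, 5.7078, 10.6863, 2.0832],
    'Width':    [4.3056, 5.2854, 2.268, 3.8203, 3.3216, 7.7957, 4.158, 6.9849, 1.2772],
})

# Correlation of every numeric feature with Weight, strongest first
weight_corr = (
    Fish.corr(numeric_only=True)['Weight']
    .drop('Weight')
    .sort_values(ascending=False)
)
print(weight_corr)
```

Sorting the single Weight column avoids scanning the full matrix by eye, which becomes more useful as the number of features grows.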

# Step 3: Separating the data into features and labels
X = Fish[['X-Length']]
y = Fish['Weight']

Splitting data#

With the features and labels in place, the data can now be divided into training and test sets. The training set is used to fit the model, while the test set evaluates its performance.

The train_test_split function is imported from the sklearn library to split the data.

Python
from sklearn.model_selection import train_test_split
# Step 4: Dividing the dataset into test and train data
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.3,
    random_state=10,
    shuffle=True
)

The arguments of the train_test_split function can be examined as follows:

  • X, y: Pass the feature and the label.
  • test_size=0.3: Hold out 30% of the data for testing and use the remaining 70% for training.
  • random_state=10 and shuffle=True: Shuffle the rows before splitting so that the test set is not simply the last rows of the file (the dataset is ordered by species), and fix the random seed so that the same split is reproduced on every run.

As a result, the training data is stored in X_train and y_train, and the test data in X_test and y_test.
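
A quick way to confirm what the split produced is to check the sizes of the resulting sets. The sketch below uses ten hypothetical samples rather than the fish data, purely to show the 70/30 proportion and the effect of random_state.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten hypothetical samples: one feature column and a matching label
X = np.arange(10).reshape(-1, 1)
y = np.arange(10) * 2

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=10, shuffle=True
)
print(len(X_train), len(X_test))  # 7 training rows and 3 test rows

# Using the same random_state again reproduces the same split
X_train_again, _, _, _ = train_test_split(
    X, y, test_size=0.3, random_state=10, shuffle=True
)
same_split = np.array_equal(X_train, X_train_again)
print(same_split)
```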

Applying model#

At this point, the linear regression model can be created.

Python
from sklearn.linear_model import LinearRegression
# Step 5: Selecting the linear regression method from scikit-learn library
model = LinearRegression().fit(X_train, y_train)
  • The LinearRegression class is imported from the sklearn library.
  • LinearRegression().fit(X_train, y_train) creates the model and trains it on the training data X_train and y_train.
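
Once fitted, the model exposes the slope and intercept of the line it learned, which correspond to $m$ and $c$ in the straight-line equation. The sketch below uses hypothetical points that lie exactly on $y = 3X + 5$, so the recovered values are easy to check; it is not fitted on the fish data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical points lying exactly on the line y = 3x + 5
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = 3 * X.ravel() + 5

model = LinearRegression().fit(X, y)

m = model.coef_[0]      # learned slope
c = model.intercept_    # learned y-intercept
print(round(m, 3), round(c, 3))

# Prediction for a new value of X, computed as m * X + c
prediction = model.predict(np.array([[10.0]]))[0]
print(round(prediction, 3))
```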

Model validation#

Remember, 30% of the data was set aside for testing. The Mean Absolute Error (MAE) can be calculated on this data. It is the average absolute difference between the predicted and actual values, so a lower MAE indicates more accurate predictions. Other validation metrics exist and are covered later in this article.

Here’s a complete running example that includes all of the steps above to perform a linear regression.

Python
# Step 1: Importing libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics
# Step 2: Defining the columns and reading the DataFrame
columns = ['Species', 'Weight', 'V-Length', 'D-Length', 'X-Length', 'Height', 'Width']
Fish = pd.read_csv('Fish.txt', sep='\t', usecols=columns)
# Step 3: Separating the data into features and labels
X = Fish[['X-Length']]
y = Fish['Weight']
# Step 4: Dividing the dataset into test and train data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=10, shuffle=True)
# Step 5: Selecting the linear regression method from the scikit-learn library
model = LinearRegression().fit(X_train, y_train)
# Step 6: Validation
# Evaluating the trained model on training data
y_prediction = model.predict(X_train)
print("MAE on train data= " , metrics.mean_absolute_error(y_train, y_prediction))
# Evaluating the trained model on test data
y_prediction = model.predict(X_test)
print("MAE on test data = " , metrics.mean_absolute_error(y_test, y_prediction))

In this instance, the model.predict() function is applied in Step 6, first to the training data and then to the test data. But what does it show?

Essentially, this approach compares the model’s performance on data it has already seen with its performance on unseen test data. If the two MAE values are close, as they are here, the model generalizes about as well as it fits the training data, which suggests it is not overfitting.

Note: Recall that X-Length was chosen as the feature because of its high correlation with the label. To verify this choice, replace it with Height in Step 3, rerun the linear regression, and compare the two MAE values.

How to evaluate regression models#

Building a regression model is only half of the machine learning process. The other half is determining whether the model actually makes accurate predictions on data it has never seen before.

This is where evaluation metrics become important. Different regression metrics measure prediction quality in different ways. Some focus on average error, others penalize large mistakes, and some evaluate how much of the underlying pattern in the data the model captures. Understanding these metrics helps you choose better models and avoid misleading conclusions.

Regression metrics at a glance#

| Metric | What it measures | Lower or higher is better? | Common use case |
|---|---|---|---|
| MAE | Average absolute error | Lower | General-purpose evaluation |
| MSE | Average squared error | Lower | Penalizing large mistakes |
| RMSE | Square root of MSE | Lower | Interpretable error measurement |
| R² Score | Variance explained | Higher | Overall model quality |

Mean Absolute Error (MAE)#

Mean Absolute Error measures the average size of prediction errors without considering whether the model overpredicted or underpredicted.

The formula is:

\[
\text{MAE} = \frac{\sum |y_{\text{true}} - y_{\text{pred}}|}{n}
\]

Suppose a model predicts fish weights:

| Actual weight | Predicted weight |
|---|---|
| 100g | 110g |
| 150g | 140g |
| 200g | 190g |

Absolute errors:

  • 10g

  • 10g

  • 10g

MAE:

\[
\frac{10 + 10 + 10}{3} = 10
\]

The model is off by an average of 10 grams.

Why practitioners like MAE:

  • Easy to understand

  • Uses the same units as the target variable

  • Less sensitive to extreme outliers
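
The same arithmetic can be reproduced in a few lines of NumPy. This sketch assumes the first value of each pair in the example is the actual weight and the second is the prediction; since the errors are taken as absolute values, swapping them would give the same result.

```python
import numpy as np

y_true = np.array([100, 150, 200])  # actual weights in grams
y_pred = np.array([110, 140, 190])  # predicted weights in grams

# Absolute error for each fish, ignoring the direction of the miss
abs_errors = np.abs(y_true - y_pred)

# MAE is the mean of the absolute errors
mae = abs_errors.mean()
print(abs_errors, mae)
```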

Mean Squared Error (MSE)#

Mean Squared Error squares each prediction error before averaging.

\[
\text{MSE} = \frac{\sum (y_{\text{true}} - y_{\text{pred}})^2}{n}
\]

Using the same errors:

  • 10² = 100

  • 10² = 100

  • 10² = 100

MSE:

\[
\frac{100 + 100 + 100}{3} = 100
\]

The key difference is that large mistakes are penalized much more heavily.

For example:

  • Error of 2 → 4

  • Error of 10 → 100

This makes MSE useful when large prediction errors are especially costly, such as:

  • Medical predictions

  • Financial forecasting

  • Resource planning systems
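
The effect of squaring is easiest to see when two sets of errors share the same MAE. The sketch below uses two hypothetical error profiles (not taken from the fish data): one where every prediction is off by the same amount, and one with three near misses and a single large miss.

```python
import numpy as np

# Hypothetical prediction errors with the same average size
profiles = {
    "steady": np.array([5, 5, 5, 5]),    # every prediction off by 5
    "spiky": np.array([1, 1, 1, 17]),    # three near misses, one large miss
}

results = {}
for name, errors in profiles.items():
    mae = float(np.abs(errors).mean())
    mse = float((errors ** 2).mean())
    results[name] = (mae, mse)
    print(name, "MAE:", mae, "MSE:", mse)
```

Both profiles have an MAE of 5, but the single large miss pushes the MSE of the second profile from 25 to 73.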

Root Mean Squared Error (RMSE)#

RMSE is simply the square root of MSE.

\[
\text{RMSE} = \sqrt{\text{MSE}}
\]

If MSE = 100:

\[
\text{RMSE} = \sqrt{100} = 10
\]

RMSE combines two useful properties:

  • Penalizes large errors like MSE

  • Returns values in the same units as the target variable

For this reason, RMSE is one of the most commonly reported regression metrics in industry and research.

Many machine learning practitioners prefer RMSE because it is easier to interpret than MSE while still highlighting significant prediction mistakes.

R² Score#

R² Score (Coefficient of Determination) measures how much of the variation in the target variable is explained by the model.

Unlike MAE, MSE, and RMSE, R² does not measure error directly.

Instead, it answers the question:

How well does the model explain the data?

Typical interpretations:

| R² value | Interpretation |
|---|---|
| 0.90 | Excellent fit; explains 90% of variation |
| 0.50 | Moderate fit; explains 50% of variation |
| 0.10 | Weak fit; explains little of the variation |
| 1.00 | Perfect prediction |
| 0.00 | No better than predicting the mean |

For example:

  • R² = 0.90 means the model explains most of the observed variation.

  • R² = 0.50 suggests useful predictive power but significant unexplained variance remains.

  • R² = 0.10 indicates the model may be missing important features or relationships.
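
To make these interpretations concrete, R² can be computed by hand from its standard definition, one minus the ratio of the residual sum of squares to the total sum of squares around the mean. The helper name r2 and the sample values below are illustrative assumptions, not part of the fish example.

```python
import numpy as np

def r2(y_true, y_pred):
    """R² = 1 - SS_res / SS_tot (hypothetical helper for illustration)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # variation the model fails to explain
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    return float(1 - ss_res / ss_tot)

y_true = [100, 150, 200, 250]

good = r2(y_true, [110, 145, 190, 260])      # close predictions
baseline = r2(y_true, [175, 175, 175, 175])  # always predict the mean
worse = r2(y_true, [250, 200, 150, 100])     # predictions in reverse order

print(round(good, 3), baseline, worse)
```

The last case shows that R² is not bounded below by zero: a model that does worse than predicting the mean produces a negative score.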

Comparing two regression models#

Imagine two models predicting fish weight.

| Metric | Model A | Model B |
|---|---|---|
| MAE | 12 | 10 |
| MSE | 400 | 650 |
| RMSE | 20 | 25.5 |
| R² | 0.82 | 0.80 |

At first glance, Model B appears better because it has a lower MAE.

However:

  • Model A has lower MSE

  • Model A has lower RMSE

  • Model A has slightly higher R²

This suggests that Model B performs well on average but occasionally makes very large mistakes. Model A is more consistent.

This example illustrates why relying on a single metric can be misleading.
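
A small numerical sketch shows how this pattern can arise. The error values below are hypothetical and chosen for clarity; they are not the errors behind the table above.

```python
import numpy as np

# Hypothetical absolute errors (grams) for two models on the same five fish
errors_a = np.array([10, 10, 10, 10, 10])  # consistently off by a moderate amount
errors_b = np.array([2, 2, 2, 2, 30])      # usually close, with one large miss

def summarize(errors):
    """Return (MAE, RMSE) for an array of prediction errors."""
    mae = float(np.abs(errors).mean())
    rmse = float(np.sqrt((errors ** 2).mean()))
    return mae, rmse

mae_a, rmse_a = summarize(errors_a)
mae_b, rmse_b = summarize(errors_b)

print("Model A  MAE:", mae_a, "RMSE:", round(rmse_a, 2))
print("Model B  MAE:", mae_b, "RMSE:", round(rmse_b, 2))
```

Here the second model wins on MAE but loses on RMSE, because its one large miss dominates the squared errors.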

Which metric should you use?#

Use the following decision guide:

| Goal | Recommended metric |
|---|---|
| Simple interpretation | MAE |
| Strong penalty for large errors | MSE |
| Real-world error units | RMSE |
| Overall explanatory power | R² |

In practice:

  • MAE is often a good starting point.

  • RMSE is widely used in production systems.

  • R² provides useful context about overall model quality.

  • MSE is valuable when large mistakes are particularly undesirable.

Common evaluation mistakes#

Evaluating only on training data#

A model may perform extremely well on training data but fail on unseen examples.

Always evaluate using a separate test set or cross-validation.
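
Cross-validation can be sketched with scikit-learn's cross_val_score, which fits and scores the model on several train/validation splits. The data below is synthetic and only stands in for real measurements; the choice of five folds is likewise an assumption for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data (assumption): a noisy linear relationship
rng = np.random.default_rng(0)
X = rng.uniform(10, 60, size=(100, 1))
y = 30 * X.ravel() - 250 + rng.normal(0, 40, size=100)

# scikit-learn reports MAE as a negative score so that higher is always better
scores = cross_val_score(
    LinearRegression(), X, y, cv=5, scoring="neg_mean_absolute_error"
)
mae_per_fold = -scores

print("MAE per fold:", mae_per_fold.round(1))
print("Mean MAE:", round(float(mae_per_fold.mean()), 1))
```

Looking at the spread across folds, and not only the mean, gives a sense of how much the score depends on which rows happen to land in the validation set.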

Comparing models using only one metric#

A model with the best MAE may not have the best RMSE or R².

Review multiple metrics before choosing a model.

Ignoring outliers#

A few extreme values can significantly affect MSE and RMSE.

Investigate unusual observations rather than blindly trusting metric values.

Misinterpreting R²#

A high R² does not guarantee a useful model.

A model can have a high R² while still producing prediction errors that are too large for a real-world application.

Calculating regression metrics with scikit-learn#

from sklearn.metrics import (
    mean_absolute_error,
    mean_squared_error,
    r2_score,
)
import numpy as np

y_true = [100, 150, 200, 250]
y_pred = [110, 145, 190, 260]

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_true, y_pred)

print("MAE:", mae)
print("MSE:", mse)
print("RMSE:", round(rmse, 2))
print("R²:", round(r2, 2))

Sample output:

MAE: 8.75
MSE: 81.25
RMSE: 9.01
R²: 0.97

Final takeaway#

No single metric tells the whole story. MAE provides an intuitive measure of average error, MSE and RMSE highlight the impact of large mistakes, and R² shows how much of the underlying variation your model explains.

Strong regression model evaluation typically combines MAE, RMSE, and R² to build a complete picture of prediction performance. Looking at multiple metrics helps you make better decisions, compare models more effectively, and avoid being misled by any single measurement.

Multiple linear regression model in Python#

So far, only one feature, X-Length, has been used to train the model. However, other features are available that can improve the predictions: the vertical length, diagonal length, height, and width of the fish. The linear regression model can be retrained using all of them.

# Step 3: Separating the data into features and labels
X = Fish[['V-Length', 'D-Length', 'X-Length', 'Height', 'Width']]
y = Fish['Weight']

Mathematically, the multiple linear regression model can be written as follows:

\[
y = m_1 \cdot X_1 + m_2 \cdot X_2 + \cdots + m_n \cdot X_n + C
\]

where $m_i$ is the weight (coefficient) of feature $X_i$ in predicting $y$, $C$ is the intercept, and $n$ denotes the number of features.

Following the same steps as earlier, the performance of the model can be calculated using all the features.
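
A minimal sketch of those steps with all five features is shown here. It assumes the same split settings as the single-feature example and, to stay runnable without Fish.txt, uses a 22-row excerpt of the data inlined as text. The MAE values it prints therefore describe the excerpt only and will differ from those on the full dataset.

```python
import io
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics

# Excerpt of the fish data (assumption: stands in for reading the full Fish.txt)
data = """Species Weight V-Length D-Length X-Length Height Width
Bream 290 24 26.3 31.2 12.48 4.3056
Bream 450 26.8 29.7 34.7 13.6024 4.9274
Bream 600 29.4 32 37.2 14.9544 5.1708
Bream 700 30.4 33 38.3 14.8604 5.2854
Bream 850 32.8 36 41.6 16.8896 6.1984
Bream 950 38 41 46.5 17.6235 6.3705
Roach 40 12.9 14.1 16.2 4.1472 2.268
Roach 120 18.6 20 22.2 6.216 3.5742
Roach 200 22.1 23.5 26.8 7.3968 4.1272
Roach 390 29.5 31.7 35 9.485 5.355
Whitefish 270 23.6 26 28.7 8.3804 4.2476
Whitefish 800 33.7 36.4 39.6 11.7612 6.5736
Parkki 55 13.5 14.7 16.5 6.8475 2.3265
Parkki 200 21.2 23 25.8 10.3458 3.6636
Perch 5.9 7.5 8.4 8.8 2.112 1.408
Perch 100 16.2 18 19.2 5.2224 3.3216
Perch 300 25.2 27.3 28.7 8.323 5.1373
Perch 840 32.5 35 37.3 11.4884 7.7957
Pike 300 31.7 34 37.8 5.7078 4.158
Pike 950 48.3 51.7 55.1 8.9262 6.1712
Smelt 9.8 10.7 11.2 12.4 2.0832 1.2772
Smelt 19.9 13.8 15 16.2 2.9322 1.8792
"""
Fish = pd.read_csv(io.StringIO(data), sep=r'\s+')

# Separating the data into features and labels, now with all five features
X = Fish[['V-Length', 'D-Length', 'X-Length', 'Height', 'Width']]
y = Fish['Weight']

# Dividing the dataset into test and train data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=10, shuffle=True
)

# Training one coefficient per feature plus an intercept
model = LinearRegression().fit(X_train, y_train)

# Validation on training and test data
mae_train = metrics.mean_absolute_error(y_train, model.predict(X_train))
mae_test = metrics.mean_absolute_error(y_test, model.predict(X_test))
print("MAE on train data =", mae_train)
print("MAE on test data  =", mae_test)
```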

The MAE values will be similar to the results obtained when using a single feature.

Polynomial regression#

Polynomial regression is used when the assumption of a linear relationship between the features and the label is not accurate. By allowing a more flexible fit to the data, it can capture more complex relationships and lead to more accurate predictions.

For example, if the relationship between the dependent variable and the independent variables is not a straight line, a polynomial regression model can describe it more closely.

Mathematically, the relationship between dependent and independent variables is described using the following equation:

\[
y = m_1 \cdot Z_1 + m_2 \cdot Z_2 + \cdots + m_n \cdot Z_n + C
\]

The above equation looks very similar to the one used earlier to describe multiple linear regression. However, it uses the transformed features $Z_i$, which are polynomial terms built from the $X_i$ used in multiple linear regression.

For example, two features $X_1$ and $X_2$ can be combined to create new features $Z_1 = X_1^2$, $Z_2 = X_2^2$, $Z_3 = X_1 X_2$, $Z_4 = X_1^3$, $Z_5 = X_2^3$, $Z_6 = X_1^2 X_2$, $Z_7 = X_1 X_2^2$, and so on.

The new polynomial features can be created based on trial and error or techniques like cross-validation. The degree of the polynomial can also be chosen based on the complexity of the relationship between the variables.
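
One way to choose the degree with cross-validation is sketched below: fit a pipeline of PolynomialFeatures and LinearRegression for several candidate degrees and compare the cross-validated MAE. The data is synthetic, generated from a quadratic relationship, and the candidate degrees and five folds are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data (assumption): the label grows with the square of the feature, plus noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(120, 1))
y = 2.0 * X.ravel() ** 2 + rng.normal(0, 5, size=120)

cv_mae = {}
for degree in (1, 2, 3, 4):
    # The pipeline generates the polynomial features inside each fold
    pipeline = make_pipeline(
        PolynomialFeatures(degree=degree, include_bias=False),
        LinearRegression(),
    )
    scores = cross_val_score(
        pipeline, X, y, cv=5, scoring="neg_mean_absolute_error"
    )
    cv_mae[degree] = float(-scores.mean())

# Degree with the lowest cross-validated MAE
best_degree = min(cv_mae, key=cv_mae.get)
print(cv_mae)
print("Best degree:", best_degree)
```

On data like this, the straight-line model (degree 1) shows a clearly higher error than the polynomial ones, while higher degrees bring little further gain. That flattening is a common signal to stop adding complexity.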

The following example presents a polynomial regression and validates the model’s performance.

Python
# Step 1: Importing libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics
from sklearn.preprocessing import PolynomialFeatures
# Step 2: Defining the columns and reading the DataFrame
columns = ['Species', 'Weight', 'V-Length', 'D-Length', 'X-Length', 'Height', 'Width']
Fish = pd.read_csv('Fish.txt', sep='\t', usecols=columns)
# Step 3: Separating the data into features and labels
X = Fish[['V-Length', 'D-Length', 'X-Length', 'Height', 'Width']]
y = Fish['Weight']
# Step 4: Generating polynomial features
Z = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)
# Dividing the dataset into test and train data
X_train, X_test, y_train, y_test = train_test_split(Z, y, test_size=0.3, random_state=10)
# Step 5: Selecting the linear regression method from the scikit-learn library
model = LinearRegression().fit(X_train, y_train)
# Step 6: Validation
# Evaluating the trained model on training data
y_prediction = model.predict(X_train)
print("MAE on train data= " , metrics.mean_absolute_error(y_train, y_prediction))
# Evaluating our trained model on test data
y_prediction = model.predict(X_test)
print("MAE on test data = " , metrics.mean_absolute_error(y_test, y_prediction))

The features were transformed in Step 4 using the PolynomialFeatures class, which is imported from the sklearn library in Step 1.

Note that the MAE in this case is lower than that of the linear regression models, which suggests that the linear assumption was not entirely accurate.

This blog has provided a quick introduction to machine learning regression models with Python. Don’t stop here! Explore and practice different techniques and libraries to build more accurate and robust models.

Deploying a regression model with FastAPI#

Training a regression model is only one part of the machine learning workflow. In real-world applications, a trained model often needs to serve predictions to websites, dashboards, mobile apps, or other services.

This is where model deployment comes in. Instead of manually running predictions inside a Jupyter notebook, you can expose your model through an API. Applications send input data to the API, and the API returns predictions in real time. FastAPI has become a popular choice for this task because it is lightweight, fast, and works seamlessly with Python machine learning libraries.

Basic deployment workflow#

A typical machine learning deployment pipeline looks like this:

Train model
↓
Evaluate model
↓
Save model
↓
Load model in API
↓
Accept user input
↓
Return prediction

Once a model has been trained and evaluated, it can be saved to disk and reused without retraining every time the application starts.

Saving a trained regression model#

After training a scikit-learn model, you can save it using joblib.

import joblib
joblib.dump(model, "fish_weight_model.pkl")

This creates a file named fish_weight_model.pkl that contains the trained model. Later, your API can load this file and use it to generate predictions.
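As a minimal sketch of the save-and-reload round trip, the snippet below trains a small model on made-up numbers, saves it with joblib, and checks that the reloaded model gives the same prediction. The training data and the temporary directory are assumptions for illustration only; they are not the Fish Market data used earlier.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up training data for illustration: weight = 3 * length + 2 * height
X = np.array([[10.0, 2.0], [20.0, 4.0], [30.0, 5.0], [40.0, 9.0]])
y = 3 * X[:, 0] + 2 * X[:, 1]

model = LinearRegression().fit(X, y)

# Save to a temporary directory so the sketch leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "fish_weight_model.pkl")
    joblib.dump(model, model_path)
    reloaded_model = joblib.load(model_path)

# The reloaded model should behave exactly like the original
sample = np.array([[25.0, 6.0]])
original_prediction = model.predict(sample)
reloaded_prediction = reloaded_model.predict(sample)
print(np.allclose(original_prediction, reloaded_prediction))
```

In a real project you would write the file to a stable location that the API process can read at startup.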

Creating a FastAPI prediction endpoint#

The following example demonstrates a simple FastAPI service that loads the trained model, accepts fish measurements as input, and returns a predicted fish weight.

from fastapi import FastAPI
from pydantic import BaseModel
import joblib

app = FastAPI()
model = joblib.load("fish_weight_model.pkl")

class FishFeatures(BaseModel):
    v_length: float
    d_length: float
    x_length: float
    height: float
    width: float

@

This example demonstrates several useful FastAPI features:

  • Pydantic models validate incoming data automatically.

  • Type hints improve readability and reduce errors.

  • JSON responses make integration with other applications straightforward.

  • Automatic API documentation is generated without additional work.

Example API request#

A client can send a POST request containing fish measurements:

{
  "v_length": 25.4,
  "d_length": 27.3,
  "x_length": 30.0,
  "height": 11.5,
  "width": 4.8
}

Example API response#

The API returns the model's prediction:

{"predicted_weight": 532.7}

The exact value will depend on the model and training data used.

Why FastAPI is useful for model serving#

FastAPI has become one of the most popular frameworks for deploying machine learning models because it offers a strong balance between simplicity and performance.

Some key advantages include:

  • Lightweight and easy to learn

  • Excellent performance for Python applications

  • Automatic OpenAPI and Swagger documentation

  • Native support for type hints and validation

  • Seamless integration with scikit-learn, pandas, NumPy, and other ML libraries

  • Easy deployment to cloud platforms and containerized environments

For many machine learning projects, FastAPI provides everything needed to move from experimentation to production.

Deployment best practices#

Before deploying a model, consider a few important practices:

  • Validate all incoming input data.

  • Keep feature ordering identical to the training process.

  • Save preprocessing logic alongside the model.

  • Monitor prediction quality after deployment.

  • Retrain models when underlying data changes.

  • Avoid exposing experimental or untested models directly to end users.

These practices help ensure that predictions remain reliable and consistent over time.
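The first two practices can be sketched with a small helper that checks a request payload and returns the values in a fixed order. The `FEATURE_ORDER` list, the `to_feature_row` helper, and the rule that measurements must be positive are assumptions for illustration, not part of any library.

```python
import math

# Hypothetical training-time column order for the fish weight model
FEATURE_ORDER = ["v_length", "d_length", "x_length", "height", "width"]


def to_feature_row(payload: dict) -> list:
    """Validate a request payload and return values in training order."""
    missing = [name for name in FEATURE_ORDER if name not in payload]
    if missing:
        raise ValueError(f"Missing features: {missing}")

    row = []
    for name in FEATURE_ORDER:
        value = float(payload[name])
        # Reject NaN, infinity, and non-positive measurements
        if not math.isfinite(value) or value <= 0:
            raise ValueError(f"Invalid value for {name}: {payload[name]!r}")
        row.append(value)
    return row


# Key order in the payload does not matter; the output order is fixed
payload = {"width": 4.8, "height": 11.5, "v_length": 25.4,
           "d_length": 27.3, "x_length": 30.0}
print(to_feature_row(payload))
```

Building the row from an explicit list, rather than from whatever order the client sends, is one simple way to guard against silently swapped columns.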

Keep preprocessing and prediction together#

For production systems, it is often best to save a complete scikit-learn pipeline instead of saving only the model.

For example:

from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ("scaler", scaler),
    ("model", regression_model),
])

Using a pipeline ensures that the same preprocessing steps applied during training are also applied during prediction, reducing the risk of inconsistent results.
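Here is a minimal end-to-end sketch of that idea, assuming a `StandardScaler` followed by a `LinearRegression` and using made-up data. The file name `fish_weight_pipeline.pkl` is hypothetical. Note that the reloaded pipeline receives raw, unscaled input.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Made-up data: two features on very different scales
rng = np.random.default_rng(42)
X = np.column_stack([rng.uniform(0, 1, 100), rng.uniform(0, 1000, 100)])
y = 4.0 * X[:, 0] + 0.01 * X[:, 1] + 1.0

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LinearRegression()),
])
pipeline.fit(X, y)

# Save and reload the whole pipeline, not just the final model
with tempfile.TemporaryDirectory() as tmp_dir:
    pipeline_path = os.path.join(tmp_dir, "fish_weight_pipeline.pkl")
    joblib.dump(pipeline, pipeline_path)
    loaded_pipeline = joblib.load(pipeline_path)

# Raw, unscaled input goes in; scaling happens inside the pipeline
raw_sample = np.array([[0.5, 500.0]])
prediction = loaded_pipeline.predict(raw_sample)
print(round(float(prediction[0]), 3))
```

Because the scaler's learned mean and variance travel inside the saved file, the API code does not need to know anything about how the features were preprocessed.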

Final takeaway#

A regression model becomes significantly more useful when it can serve predictions outside a notebook. FastAPI provides a simple and practical way to turn a trained scikit-learn model into a web API that other applications can consume.

As you continue learning machine learning, understanding how to deploy models is just as important as learning how to train them. Building APIs around your models helps bridge the gap between experimentation and real-world software systems.

Regression datasets to practice your machine learning skills#

Machine learning skills improve through experimentation. Once you understand the basics of regression, the best next step is to apply the same concepts across different datasets and compare how your models behave.

Each dataset teaches something slightly different. Some are clean and beginner-friendly, while others introduce categorical variables, missing values, nonlinear relationships, feature engineering, and overfitting risks. Practicing across multiple datasets helps you build stronger regression intuition.

Regression dataset comparison#

| Dataset | Source | Target Variable | Difficulty | Concepts Practiced |
| --- | --- | --- | --- | --- |
| California Housing | scikit-learn | Median house value | Beginner | Multiple linear regression, train-test split, model evaluation |
| Fish Market Dataset | Kaggle/UCI-style datasets | Fish weight | Beginner | Simple linear regression, feature selection, polynomial regression |
| Medical Insurance Cost Dataset | Kaggle | Insurance charges | Beginner | Multiple regression, categorical encoding, feature importance |
| Boston Housing | Historical reference only | House price | Beginner | Regression fundamentals, ethics discussion, legacy dataset awareness |
| Bike Sharing Demand | UCI/Kaggle | Bike rental count | Intermediate | Time-based features, seasonality, feature engineering |
| Automobile Price Dataset | UCI | Car price | Intermediate | Missing data, categorical encoding, feature selection |
| Energy Efficiency Dataset | UCI | Heating/cooling load | Advanced | Polynomial regression, nonlinear relationships |
| House Prices: Ames Housing | Kaggle | Sale price | Advanced | Feature engineering, cross-validation, model comparison |
| Student Performance Dataset | UCI | Student score/performance | Intermediate | Categorical encoding, correlation analysis, multiple regression |

Beginner-friendly datasets#

California Housing#

The California Housing dataset predicts median house value using features such as income, location, population, and household information. It is available directly through scikit-learn, which makes it easy to load and use.

This dataset is beginner-friendly because it is already structured and works well for multiple linear regression. It is a good next step after simple one-feature regression because you can practice using several input variables at once.

Good techniques to practice:

  • Multiple linear regression

  • Train-test splitting

  • MAE, MSE, RMSE, and R² evaluation

  • Feature scaling

Medical Insurance Costs#

The Medical Insurance Cost dataset predicts insurance charges based on features such as age, BMI, smoking status, region, and number of children.

This dataset is useful because it introduces categorical variables. You will need to encode features like sex, smoker, and region before training a model.

Good techniques to practice:

  • Multiple linear regression

  • One-hot encoding

  • Feature importance

  • Comparing numeric and categorical predictors
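A minimal sketch of one-hot encoding with pandas is shown below. The rows are made up, and the column names are assumed to resemble those in the insurance data; check them against the actual file before reusing this.

```python
import pandas as pd

# Hypothetical rows shaped like the insurance data; values are made up
df = pd.DataFrame({
    "age": [19, 45, 33],
    "bmi": [27.9, 30.1, 22.7],
    "smoker": ["yes", "no", "no"],
    "region": ["southwest", "northeast", "southwest"],
})

# drop_first=True drops one level per category to avoid redundant columns
encoded = pd.get_dummies(df, columns=["smoker", "region"], drop_first=True)
print(list(encoded.columns))
```

Dropping the first level of each category is a common choice for linear models, since the dropped level is implied when all the remaining indicator columns are zero.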

Fish Market Dataset#

The Fish Market dataset predicts fish weight using measurements such as length, height, and width.

This is a great beginner dataset because the relationship between physical measurements and weight is intuitive. It also works well for comparing simple linear regression, multiple linear regression, and polynomial regression.

Good techniques to practice:

  • Simple linear regression

  • Feature selection

  • Polynomial regression

  • Visualizing predictions

Intermediate regression projects#

Bike Sharing Demand#

Bike Sharing Demand predicts the number of bike rentals based on weather, season, time, and calendar-related variables.

This dataset introduces real-world complexity because demand is affected by time, temperature, humidity, holidays, and user behavior.

You can practice:

  • Time-based feature engineering

  • Handling seasonality

  • Comparing linear and tree-based models

  • Evaluating prediction errors across different conditions
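One common time-based technique, used here as an illustration rather than something specific to this dataset, is cyclical encoding: mapping the hour of day onto a circle so that 23:00 and 00:00 end up close together. The `encode_hour` and `distance` helpers below are hypothetical names for this sketch.

```python
import math


def encode_hour(hour: int) -> tuple:
    """Map an hour (0-23) onto a circle so 23:00 sits next to 00:00."""
    angle = 2 * math.pi * hour / 24
    return math.sin(angle), math.cos(angle)


def distance(a: tuple, b: tuple) -> float:
    """Straight-line distance between two encoded points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


# Raw hours make 23 and 0 look far apart; the encoded points are neighbors
late, midnight, noon = encode_hour(23), encode_hour(0), encode_hour(12)
print(distance(late, midnight) < distance(noon, midnight))
```

The same idea can be applied to the month or the day of the week by changing the period from 24 to 12 or 7.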

Student Performance Dataset#

The Student Performance dataset predicts student outcomes based on academic, demographic, and lifestyle-related features.

This dataset is useful for practicing categorical encoding and careful interpretation. Some features may appear correlated with performance, but correlation does not always imply causation.

You can practice:

  • Categorical encoding

  • Correlation analysis

  • Feature selection

  • Ethical interpretation of model results

Automobile Price Prediction#

The Automobile Price dataset predicts car prices using features such as engine size, horsepower, fuel type, body style, and brand.

This dataset is valuable because it often includes missing values and mixed data types. It teaches you that preprocessing is often just as important as model selection.

You can practice:

  • Missing value handling

  • Encoding categorical variables

  • Feature selection

  • Model comparison
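As a small sketch of missing value handling, the snippet below fills gaps with the column median using scikit-learn's `SimpleImputer`. The numbers are made up, and the median strategy is only one reasonable choice.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up rows: columns are engine size and horsepower; NaN marks a gap
X = np.array([
    [130.0, 111.0],
    [152.0, np.nan],
    [109.0, 102.0],
    [np.nan, 115.0],
])

# Replace each missing value with the median of its column
imputer = SimpleImputer(strategy="median")
X_filled = imputer.fit_transform(X)
print(X_filled)
```

When you split your data, fit the imputer on the training rows only and then apply it to the test rows, so that no information from the test set influences the fill values.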

Advanced regression projects#

Ames Housing Prices#

Ames Housing is a richer and more realistic housing dataset than many beginner examples. It includes many numeric and categorical features related to property size, location, condition, year built, and sale details.

This dataset is excellent for practicing end-to-end regression workflows.

You can practice:

  • Advanced feature engineering

  • Cross-validation

  • Handling many categorical features

  • Comparing linear models with tree-based models

  • Preventing overfitting
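Cross-validation can be sketched in a few lines with `cross_val_score`. The data below is synthetic, generated with `make_regression` as a stand-in so the example runs on its own; replace it with the real features and sale prices when working with Ames Housing.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data; swap in the real features and sale prices
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# 5-fold cross-validation: every row is used for validation exactly once
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
print(scores.mean(), scores.std())
```

Reporting the mean together with the spread across folds gives a more stable picture than a single train-test split, which can look better or worse depending on which rows happen to land in the test set.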

Energy Efficiency Dataset#

The Energy Efficiency dataset predicts heating and cooling loads based on building characteristics.

This dataset is useful because relationships between features and targets may be nonlinear. It is a strong fit for polynomial regression and model comparison.

You can practice:

  • Polynomial regression

  • Nonlinear feature relationships

  • Multi-output regression concepts

  • Model evaluation across different target variables
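A minimal sketch of multi-output polynomial regression follows. It uses a single made-up building feature and two made-up targets standing in for heating and cooling load; the real dataset has more features and different relationships.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Made-up data: one building feature, two targets with curved relationships
x = np.linspace(0.0, 4.0, 30).reshape(-1, 1)
heating = 2.0 * x[:, 0] ** 2 + 1.0
cooling = -1.0 * x[:, 0] ** 2 + 3.0 * x[:, 0]
Y = np.column_stack([heating, cooling])

# LinearRegression accepts a 2-D target, fitting one set of weights per column
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(x, Y)

# One input row produces one prediction per target
prediction = model.predict(np.array([[2.0]]))
print(prediction.shape)
```

Each target can then be evaluated separately, which is useful when a model predicts one load much better than the other.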

Other multi-feature datasets#

Once you are comfortable with beginner and intermediate datasets, try regression problems with many features, missing data, or domain-specific variables.

These datasets help you practice the full machine learning workflow: cleaning data, selecting features, training models, evaluating results, and explaining trade-offs.

What to practice with each dataset#

| Technique | Dataset |
| --- | --- |
| Simple Linear Regression | Fish Market |
| Multiple Linear Regression | California Housing |
| Polynomial Regression | Energy Efficiency |
| Feature Selection | Automobile Prices |
| Categorical Encoding | Student Performance |
| Model Evaluation | House Prices |

Suggested learning progression#

Step 1: Fish Market Dataset#

Start here if you want to reinforce the basics. Use one feature first, then add more features and compare performance.

Step 2: California Housing#

Move to California Housing when you are ready for multiple linear regression. This dataset helps you practice working with several numerical features.

Step 3: Medical Insurance Costs#

Use this dataset to learn categorical encoding and feature interpretation. It is especially useful for understanding how one feature, such as smoking status, can strongly affect predictions.

Step 4: Bike Sharing Demand#

This dataset introduces time-based patterns and more realistic feature engineering. It helps you think beyond simple columns and rows.

Step 5: Ames Housing#

Use Ames Housing when you are ready for a larger, more realistic regression challenge. This dataset is ideal for cross-validation, feature engineering, and model comparison.

Common regression mistakes when practicing#

Using all features without analysis#

More features do not always mean a better model. Some features may be noisy, redundant, or irrelevant.

Ignoring train-test separation#

Always split your data before training, and evaluate the model only on the held-out test set. Testing on the same data used for training gives overly optimistic results.

Overfitting polynomial models#

Polynomial regression can fit curves well, but higher-degree models can memorize training data instead of learning useful patterns.
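To see this effect, the sketch below fits polynomials of degree 1, 2, and 9 to made-up data with a quadratic trend and prints the training and test error for each. The data, noise level, and chosen degrees are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: a quadratic trend plus noise
x_train = np.linspace(-1.0, 1.0, 12)
y_train = x_train ** 2 + rng.normal(0.0, 0.1, x_train.size)
x_test = np.linspace(-0.95, 0.95, 50)
y_test = x_test ** 2 + rng.normal(0.0, 0.1, x_test.size)


def mse(coeffs, x, y):
    """Mean squared error of a fitted polynomial on the given points."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))


results = {}
for degree in (1, 2, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    results[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_test, y_test))
    print(degree, results[degree])
```

Training error can only go down as the degree increases, because each higher-degree model contains the lower-degree ones. The test error is what reveals whether the extra flexibility is capturing a real pattern or just the noise in the training points.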

Misinterpreting correlation#

A strong correlation does not prove that one feature causes the target variable to change. Be careful when explaining model results.

Using MAE alone for evaluation#

MAE is useful, but it does not tell the full story. Compare multiple metrics such as MAE, RMSE, and R² to better understand model performance.
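A small worked example, using made-up predictions, shows why a single metric can mislead. Both hypothetical models below have the same MAE, but one makes a single large error that only RMSE exposes.

```python
import math

y_true = [10.0, 20.0, 30.0, 40.0]
model_a = [12.0, 22.0, 28.0, 38.0]   # off by 2 on every prediction
model_b = [10.0, 20.0, 30.0, 48.0]   # perfect except for one miss of 8


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(actual))


print(mae(y_true, model_a), rmse(y_true, model_a))
print(mae(y_true, model_b), rmse(y_true, model_b))
```

Both models have an MAE of 2.0, yet the RMSE is 2.0 for the first and 4.0 for the second, because squaring the errors gives the single large miss more weight.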

Final recommendations#

If you are new to regression, do not jump directly into the largest dataset. Start with small, intuitive datasets where you can clearly understand the relationship between features and the target variable.

A strong practice path looks like this:

Fish Market
↓
California Housing
↓
Medical Insurance Costs
↓
Bike Sharing Demand
↓
Ames Housing

The fastest way to become comfortable with regression is to solve similar prediction problems across multiple datasets. Each new dataset introduces different challenges and helps you build stronger machine learning intuition over time.

Frequently Asked Questions

What are the 3 types of regression?

Three commonly cited types of regression are linear, multiple, and logistic. Simple linear regression models a straight-line relationship between a dependent variable and one independent variable. Multiple regression extends this to two or more independent variables. Logistic regression, despite its name, predicts the probability of a binary outcome using a logistic function, which makes it suitable for classification problems.

What are the regression models in Python?

Seven regression algorithms frequently used in Python and machine learning are linear regression, polynomial regression, ridge regression, lasso regression, elastic net regression, decision tree-based methods, and support vector regression (SVR).


Written By:
Najeeb Ul Hassan