Scikit-learn tutorial: How to implement linear regression

Oct 13, 2020

Content

What is Scikit-Learn?

Advantages of Scikit-Learn

Libraries used with Scikit-learn

Getting started with Scikit-learn

Import Scikit-learn

Datasets and import.sklearn

Generate synthetic regression data

Plotting data with matplotlib

Keep learning about Scikit-learn.

Data Preprocessing

Build leak-proof workflows with Pipeline and ColumnTransformer

MinMax

Standard

Scikit-learn Linear Regression: implement an algorithm

How to implement linear regression

Strengthen baselines with regularized linear models

Prefer modern datasets and evaluate with cross-validation

Ship it: persistence and reproducibility

Wrapping up and next steps

Continue reading about data science and machine learning

Machine learning is quickly becoming one of the most sought-after skills in the job market. Many employers look specifically for candidates with experience in Scikit-learn, one of the most popular ML libraries for Python. Scikit-learn provides machine learning developers with many unsupervised and supervised learning algorithms.

Today, we’ll explore this awesome library and show you how to implement its core functions. In the end, we’ll combine what we’ve learned to implement your own linear regression algorithm.

What is Scikit-Learn?#

Scikit-learn (or sklearn for short) is a free open-source machine learning library for Python. It is designed to cooperate with SciPy and NumPy libraries and simplifies data science techniques in Python with built-in support for popular classification, regression, and clustering machine learning algorithms.

Sklearn serves as a unifying point for many ML tools to work seamlessly together. It also gives data scientists a one-stop-shop toolkit to import, preprocess, plot, and predict data.

The project was started by David Cournapeau during the 2007 Google Summer of Code, and this library has grown over the last decade in both popularity and features. Scikit-learn is now one of the most popular machine learning libraries on GitHub.

Scikit-learn provides tools for:

Regression, including Linear Regression
Classification, including Logistic Regression and K-Nearest Neighbors
Model selection
Clustering, including K-Means (with K-Means++ initialization)
Preprocessing, including Min-Max Normalization

Advantages of Scikit-Learn#

Developers and machine learning engineers use Sklearn because:

It’s easy to learn and use.
It’s free and open-source.
It covers most classical machine learning tasks and algorithms, and even includes basic neural networks.
It’s very versatile and powerful.
It has detailed documentation and an active community.
It is one of the most widely used machine learning toolkits.

Libraries used with Scikit-learn#

Scikit-learn is a tool kit to expand the functions of the existing SciPy Stack (sometimes called the NumPy Stack). Below, we outline how Scikit-learn uses each library within the SciPy stack for data analysis.

NumPy: Advanced linear algebra and NumPy array operations
SciPy: Contains modules for optimization, linear algebra, and other essential data science functions.
Matplotlib: Visualization and data plotting in 2 or 3 dimensions.
IPython: Increasing console interactivity.
SymPy: Symbolic computation and computer algebra.
Pandas: Data manipulation and analysis, mainly through dataframes and tables.

Getting started with Scikit-learn#

Today, I’ll show you how to implement your own linear regression algorithm with Scikit-learn. Before we begin, you’ll need some foundational knowledge of:

The purpose of ML and data science.
How machine learning algorithms differ from one another.
Linear algebra and how it relates to ML.

Import Scikit-learn#

First, you’ll need to install Scikit-Learn. We’ll use pip for this, but you may also use conda if you prefer.

For Scikit-learn to work correctly, you’ll need a 64-bit version of Python 3, and the NumPy and SciPy libraries. For visual data plots, you’ll also need matplotlib.

To install Scikit-learn, enter the following line into your terminal:

 pip install -U scikit-learn

Then to verify the installation, enter:

python -m pip show scikit-learn # displays which version and where sklearn is installed
python -m pip freeze # displays all packages installed in virtualenv
python -c "import sklearn; sklearn.show_versions()"

Linux users: add 3 after pip and python in the above lines → pip3, python3.

Now, to install NumPy, SciPy, and matplotlib, enter:

pip install -U numpy
pip install -U scipy
pip install -U matplotlib

As we did before, we’ll confirm the installation of each with:

python -m pip show numpy
python -m pip show scipy
python -m pip show matplotlib

Now you’re ready to start using Scikit-learn! Let’s jump into our tutorial by importing a dataset.

Datasets and import.sklearn#

The starting point for all machine learning projects is to import your dataset. Scikit-learn includes three helpful options to get data to practice with.

First, the library contains famous datasets like the iris classification dataset or the diabetes regression set if you want to practice on a classic set. (The Boston housing price set that older tutorials use was removed in Scikit-learn 1.2.) You can also use Scikit-learn’s predefined functions to download real-world datasets directly from the internet, such as 20 newsgroups.

Finally, you can simply generate a random dataset to match a certain pattern using Scikit-learn’s data generator. Each of these options requires you to import the datasets module:

import sklearn.datasets as datasets

First, we’ll import the iris classification set to see how it’s stored in sklearn.

iris = datasets.load_iris()

The iris dataset is imported as a dictionary-like object with all the necessary data and metadata. The feature values are stored in the data field, a 2D array of shape (n_samples, n_features).

We can get descriptions of the data and its formatting from the DESCR, data.shape, feature_names, and target_names attributes. If we print these attributes, we’ll discover all the information we need to work with the iris set.
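As a minimal sketch of that inspection step, the snippet below prints the shape and metadata of the iris set. The choice to print only the first 200 characters of DESCR is ours, made to keep the output short.

```python
from sklearn import datasets

iris = datasets.load_iris()

# The feature matrix is a NumPy array of shape (n_samples, n_features).
print(iris.data.shape)      # (150, 4)
print(iris.target.shape)    # (150,)

# Names of the four measured features and the three flower species.
print(iris.feature_names)
print(iris.target_names)

# DESCR is a long plain-text description; show only the beginning.
print(iris.DESCR[:200])
```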

Targets and Features:

Supervised ML algorithms try to predict a certain variable, called the target variable. To do so, the algorithm attempts to uncover the relationship between the target variable and the feature variables passed in with it.

Generate synthetic regression data#

If you don’t want to use any of the built-in datasets, you can generate your own data to match a chosen distribution. Below, we’ll see how to generate regression data and plot it using matplotlib.

First, import matplotlib using:

import matplotlib.pyplot as plt

Now, we’ll generate a simple regression dataset with 1 feature, which is also the only informative feature.

X, y = datasets.make_regression(n_features=1, n_informative=1)

This generates our dataset and saves it to X, a 2D array of shape (n_samples, n_features), and y, a 1D array of targets. Changing the parameters of the make_regression function will alter the characteristics of the data generated. Here, we change n_features from its default of 100 and n_informative from its default of 10 to just 1 each.

Other popular parameters include n_samples, which controls the number of samples, and n_targets, which controls how many target variables are generated.

Informative vs non-informative feature:

An informative feature is one that provides useful, applicable information to the ML algorithm. In make_regression, these are the features that actually contribute to the target and so form the trend in regression analysis. Non-informative features have no effect on the target, so a good model should learn to give them little or no weight.
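To make the distinction concrete, here is a small sketch that asks make_regression to return the true coefficients it used. The parameter values (200 samples, 5 features, 2 informative, noise of 10) are arbitrary choices for illustration.

```python
import numpy as np
from sklearn import datasets

# random_state makes the data repeatable; noise adds Gaussian scatter to y.
X, y, coef = datasets.make_regression(
    n_samples=200,
    n_features=5,
    n_informative=2,
    noise=10.0,
    coef=True,          # also return the underlying true coefficients
    random_state=0,
)

print(X.shape, y.shape)   # (200, 5) (200,)

# Only the informative features have a non-zero true coefficient.
print(coef)
print(np.count_nonzero(coef))
```

The three zero entries in coef correspond to the non-informative features: they are still columns in X, but they play no part in generating y.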

Plotting data with matplotlib#

We’ll now plot this graph by entering:

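import os

os.makedirs("output", exist_ok=True)  # savefig fails if the folder is missing
fig, axe = plt.subplots(dpi=300)
axe.scatter(X, y, marker='o')
axe.set_title("Data generated from make_regression")
fig.savefig("output/img.png")
plt.close(fig)  # release the figure once it is saved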

Data Preprocessing#

Most ML engineers agree that data preprocessing is one of the most important steps in the project process. No dataset is perfect: there can be extraneous data points, reporting errors, and any number of other issues that interfere with an algorithm’s prediction.

To prevent this, data scientists spend many hours cleaning, normalizing, and scaling data long before it ever passes into an ML algorithm.

The most common functions you’ll use at this stage are scaling functions, namely MinMaxScaler and StandardScaler. This is because features in your data will vary in range, while many ML algorithms, such as K-Nearest Neighbors and K-Means, rely on the Euclidean distance between two data points. Without scaling, features with large ranges would dominate that distance.

Scaling functions allow algorithms to properly measure distance by putting all features on a comparable range.

Both will require you to first import sklearn.preprocessing and numpy:

import sklearn.preprocessing as preprocessing
import numpy as np

Build leak-proof workflows with Pipeline and ColumnTransformer#

Real-world projects mix numeric and categorical features, and applying scalers or encoders separately can leak information from the test set into training. A safer pattern in Scikit-learn is to define transformations by column and chain them with the predictor in a single Pipeline. With ColumnTransformer you specify which columns are scaled, which are one-hot encoded, and which pass through untouched. The Pipeline then fits only on the training folds and applies the exact same steps to validation and test data. This approach keeps preprocessing, model fitting, and prediction together, making your code easier to maintain and less error-prone. It also ensures that linear models see standardized inputs and that any categorical features are consistently expanded into the same set of columns at inference time. As your project grows, you can add imputers, feature selectors, and polynomial feature generators into the same Pipeline without changing your training loop.
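The pattern described above can be sketched as follows. The DataFrame, its column names (area, rooms, city, price), and all values are hypothetical and exist only to show numeric and categorical columns being handled in one Pipeline.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: two numeric columns and one categorical column.
df = pd.DataFrame({
    "area": [50, 80, 120, 65, 95, 140, 70, 110],
    "rooms": [2, 3, 4, 2, 3, 5, 3, 4],
    "city": ["A", "B", "A", "C", "B", "A", "C", "B"],
    "price": [150, 240, 330, 210, 280, 400, 225, 320],
})
X = df[["area", "rooms", "city"]]
y = df["price"]

# Scale the numeric columns and one-hot encode the categorical one.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["area", "rooms"]),
    # handle_unknown="ignore" avoids errors on categories unseen in training.
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("regressor", LinearRegression()),
])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# fit learns the scaling statistics and categories from the training rows only.
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print(predictions)
```

Because the scaler and encoder live inside the Pipeline, the test rows never influence the statistics learned during fit, which is what keeps the workflow leak-proof.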

MinMax#

MinMax shrinks the range of each feature to be between 0 and 1.


import sklearn.preprocessing as preprocessing
 
minmax = preprocessing.MinMaxScaler()
# X is a 2D float array of shape (n_samples, n_features)
minmax.fit(X)
X_minmax = minmax.transform(X)

Line 3 creates a MinMaxScaler named minmax.
Line 5 fits the scaler to the matrix X, learning the minimum and maximum of each feature.
Line 6 transforms the matrix X using those learned values.

Here’s an example of our MinMaxScaler in action!
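The matrix below is a small hypothetical example (3 samples, 2 features on very different ranges), chosen so the scaled values are easy to check by hand.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data: the second feature is 100x larger than the first.
X_demo = np.array([[1.0, 100.0],
                   [2.0, 300.0],
                   [3.0, 500.0]])

minmax = MinMaxScaler()
# fit_transform learns each column's min and max, then rescales in one call.
X_demo_minmax = minmax.fit_transform(X_demo)

print(minmax.data_min_, minmax.data_max_)
print(X_demo_minmax)
```

Each column is rescaled independently with (x - min) / (max - min), so both columns end up as 0, 0.5, and 1 even though their original ranges differ.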

Standard#

If you would rather center and rescale your data, you can use the StandardScaler instead. This scaler standardizes each feature by subtracting its mean and dividing by its standard deviation, so every feature ends up with a mean of 0 and a standard deviation of 1.

import sklearn.preprocessing as preprocessing

std = preprocessing.StandardScaler()
# X is a matrix
std.fit(X)
X_std = std.transform(X)

Like above, we first create the scaler on line 3, fit the current matrix on line 5, and finally transform the original matrix on line 6.

Let’s see how this scales our same example from above:
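As a sketch, here is StandardScaler applied to a small hypothetical matrix with two features on very different ranges. The values are made up so the result is easy to verify.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 3 samples, 2 features on very different ranges.
X_demo = np.array([[1.0, 100.0],
                   [2.0, 300.0],
                   [3.0, 500.0]])

std = StandardScaler()
# fit_transform learns each column's mean and standard deviation, then rescales.
X_demo_std = std.fit_transform(X_demo)

print(std.mean_)                  # per-feature means learned during fit
print(X_demo_std)
print(X_demo_std.mean(axis=0))    # approximately 0 for every feature
print(X_demo_std.std(axis=0))     # approximately 1 for every feature
```

Unlike MinMax, the output is not confined to the range 0 to 1; values are expressed as the number of standard deviations from each feature's mean.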

Scikit-learn Linear Regression: implement an algorithm#

Now we’ll implement the linear regression machine learning algorithm using a housing price sample dataset. As with all ML algorithms, we’ll start with importing our dataset and then train our algorithm using historical data.

Linear regression is a predictive model often used by real businesses. It models the relationship between a scalar response and one or more explanatory variables to output a value with real-world meaning, like product sales or housing prices.

This model is best used when you have a log of previous, consistent data and want to predict what will happen next if the pattern continues.

From a mathematical point of view, linear regression is about fitting data to minimize the sum of squared residuals between each data point and the predicted value. In other words, we are minimizing the discrepancy between the data and the estimation model.

In the usual picture of a fitted model, the red line is the model we solved for, the blue points are the original data, and the vertical distance between each point and the red line is the residual. Our goal is to minimize the sum of the squared residuals.
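The idea of residuals can be checked numerically on a tiny example. The five data points below are hypothetical and lie roughly on y = 2x + 1; the sketch fits them with LinearRegression and then solves the same least-squares problem directly with NumPy for comparison.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data lying roughly on the line y = 2x + 1.
X_small = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y_small = np.array([1.1, 2.9, 5.2, 6.8, 9.0])

lr_small = LinearRegression().fit(X_small, y_small)

# A residual is the gap between an observed value and the model's prediction.
residuals = y_small - lr_small.predict(X_small)
sse = np.sum(residuals ** 2)   # the quantity ordinary least squares minimizes

# Solve the same problem directly: a column of ones models the intercept.
A = np.hstack([X_small, np.ones((len(X_small), 1))])
(slope, intercept), *_ = np.linalg.lstsq(A, y_small, rcond=None)

print(lr_small.coef_[0], lr_small.intercept_)
print(slope, intercept)
print(residuals.sum(), sse)
```

Both approaches give the same slope and intercept. The raw residuals sum to roughly zero because positive and negative errors cancel when an intercept is fitted, which is why the squared residuals are what gets minimized.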

import sklearn.datasets as datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import sklearn.metrics as metrics

# load_boston was removed in Scikit-learn 1.2, so this version swaps in the
# California housing data; the first call downloads and caches it.
house = datasets.fetch_california_housing()
print("The data shape of house is {}".format(house.data.shape))
print("The number of features in this data set is {}".format(
    house.data.shape[1]))

# Hold out 20% of the samples for testing; random_state fixes the shuffle.
train_x, test_x, train_y, test_y = train_test_split(house.data,
                                                    house.target,
                                                    test_size=0.2,
                                                    random_state=42)
print("The first five samples {}".format(train_x[:5]))
print("The first five targets {}".format(train_y[:5]))
print("The number of samples in train set is {}".format(train_x.shape[0]))
print("The number of samples in test set is {}".format(test_x.shape[0]))

# Create the model and train it on the training set.
lr = LinearRegression()
lr.fit(train_x, train_y)

# Predict on unseen data and compare against the real targets.
pred_y = lr.predict(test_x)
print("The first five predictions {}".format(pred_y[:5]))
print("The real first five labels {}".format(test_y[:5]))
mse = metrics.mean_squared_error(test_y, pred_y)
print("Mean Squared Error {}".format(mse))

First, we load the dataset by calling fetch_california_housing, which stands in here for load_boston because that function was removed in Scikit-learn 1.2.
Next, we split the dataset into two parts with train_test_split: the train set (80%) and the test set (20%).
Then, a linear regression model is created and trained (in sklearn, training is done by calling fit).
Finally, we call mean_squared_error to evaluate the performance of this model.

Strengthen baselines with regularized linear models#

Linear regression is a strong baseline, but regularization often improves stability and accuracy when features are correlated or noisy. In Scikit-learn, Ridge adds an L2 penalty to shrink coefficients smoothly, while Lasso uses an L1 penalty that can drive some coefficients to zero, acting as embedded feature selection. Elastic Net blends both. Start with a small grid over the regularization strength and evaluate with cross-validation inside your Pipeline. Regularized linear models are still fast and interpretable, but they trade a bit of bias for lower variance, which is usually a win on real, messy data. After selecting the best regularized model, inspect coefficient magnitudes on the standardized feature space to understand drivers of the prediction; this keeps interpretability while improving generalization.
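A minimal sketch of that workflow is shown below. The synthetic data from make_regression, the three alpha values, and the step names "scale" and "model" are all assumptions made for illustration, not recommended settings.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 20 features, only 5 of which are informative.
X_reg, y_reg = make_regression(n_samples=200, n_features=20, n_informative=5,
                               noise=15.0, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", Ridge()),
])

# The grid swaps the estimator itself as well as its regularization strength.
param_grid = [
    {"model": [Ridge()], "model__alpha": [0.1, 1.0, 10.0]},
    {"model": [Lasso(max_iter=10000)], "model__alpha": [0.1, 1.0, 10.0]},
]

search = GridSearchCV(pipe, param_grid, cv=5,
                      scoring="neg_mean_squared_error")
search.fit(X_reg, y_reg)

print(search.best_params_)

# Coefficients are on the standardized feature scale, so they are comparable.
best_model = search.best_estimator_.named_steps["model"]
print(best_model.coef_.round(2))
```

If Lasso wins the search, several coefficients will typically be exactly zero, which is the embedded feature selection mentioned above.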

Prefer modern datasets and evaluate with cross-validation#

The Boston housing dataset was removed from Scikit-learn in version 1.2; choose a maintained alternative such as the California housing data or a CSV from your domain. After loading the data, split into train and test, then wrap your preprocessing and LinearRegression into a Pipeline. Instead of relying on a single train/test split, use K-fold cross-validation to estimate generalization. For regression tasks, report R² to understand explained variance and pair it with MAE or RMSE to communicate average error in target units. Cross-validation gives you a distribution of scores rather than a single number, which is more robust when feature engineering or comparing models. Finally, plot residuals versus predictions to visually check for heteroscedasticity and obvious misspecifications; linear regression assumes a roughly linear relationship and constant variance, so diagnostics are part of the learning loop, not an afterthought.
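The evaluation loop above might look like the sketch below. It uses the diabetes dataset purely as a stand-in because it ships with the library and needs no download; that choice, the 5 folds, and the random_state are our assumptions.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_cv, y_cv = load_diabetes(return_X_y=True)

# Scaling lives inside the pipeline, so it is re-fit on each training fold.
model_cv = make_pipeline(StandardScaler(), LinearRegression())

cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_validate(
    model_cv, X_cv, y_cv, cv=cv,
    scoring=("r2", "neg_mean_absolute_error", "neg_root_mean_squared_error"),
)

# Error scorers are negated so that higher is always better; flip the sign back.
r2 = scores["test_r2"]
mae = -scores["test_neg_mean_absolute_error"]
rmse = -scores["test_neg_root_mean_squared_error"]

print("R2   per fold:", r2.round(3), "mean:", r2.mean().round(3))
print("MAE  per fold:", mae.round(1), "mean:", mae.mean().round(1))
print("RMSE per fold:", rmse.round(1), "mean:", rmse.mean().round(1))
```

Reporting the per-fold values alongside the mean shows how much the score moves from split to split, which a single train/test split cannot reveal.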

Ship it: persistence and reproducibility#

Once satisfied with performance, persist the entire Pipeline so preprocessing and the estimator travel together. Use joblib to dump and load the fitted object for deployment. Record the version of Scikit-learn, the random_state used during training, and the exact feature names or column order. Reproducibility matters when you revisit experiments or compare new iterations, and keeping a single, versioned artifact reduces drift between training and production. For batch scoring, load the Pipeline and call predict on new data shaped with the same columns; for interactive apps, encapsulate preprocessing and prediction in a small function that accepts raw inputs and returns both predictions and residual diagnostics so users can trust and debug results.
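Here is a minimal sketch of saving and restoring a fitted Pipeline with joblib. The synthetic data, the file name model.joblib, and the use of a temporary directory are assumptions made to keep the example self-contained; in practice you would write to a versioned location.

```python
import os
import tempfile

import joblib
import numpy as np
import sklearn
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with a fixed random_state for repeatability.
X_p, y_p = make_regression(n_samples=100, n_features=3, noise=5.0,
                           random_state=42)
pipeline = make_pipeline(StandardScaler(), LinearRegression()).fit(X_p, y_p)

# Persist preprocessing and estimator together as one artifact.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.joblib")  # hypothetical file name
    joblib.dump(pipeline, path)
    restored = joblib.load(path)

# Record the library version alongside the artifact.
print("scikit-learn version:", sklearn.__version__)

# The restored pipeline should reproduce the original predictions.
same = np.allclose(pipeline.predict(X_p[:5]), restored.predict(X_p[:5]))
print(same)
```

Loading an artifact with a different Scikit-learn version than the one that saved it is not guaranteed to work, which is one reason to record the version with the file.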

Wrapping up and next steps#

You’ve just taken your first steps to master Scikit-Learn. Today, we covered the purpose of Sklearn, how to import or generate sample data, how to scale our data, and how to implement the popular linear regression algorithm.

As you continue your Scikit-learn journey, here are some next algorithms and topics to learn:

Support vector machines
Random Forest
Naive Bayes model
Unsupervised learning
Deep learning
Logistic regression

To help you get started, Educative has created the course Hands-on Machine Learning with Scikit-Learn. With in-depth explanations of all the Scikit-learn basics and popular ML algorithms, this course will give you everything you need in one place. By the end of this course, you’ll know how and when to use each algorithm and will have the Scikit skills to stand out to any interviewer.

Continue reading about data science and machine learning#

Written By:

Ryan Thelin
