How to perform sklearn Ridge regression

Sklearn is a powerful open-source machine learning library that provides efficient tools for data analysis, classification, regression, and more. In this Answer, we will look into applying Ridge regression using sklearn.

Ridge regression

Ridge regression is a commonly used regularization technique. It is a model tuning method that is used to perform analysis on data that has multicollinearity. Multicollinearity is a concept involved in statistics in which multiple independent columns are correlated with one another. It causes a large varianceA measure that how much a value differs from the mean value. and unbiased least-squaresA statistical method to find the best fit for a set of data points. resulting in predicted values being far off from the actual values.

Dataset

We will be using the fetch_california_housing dataset from Sklearn and converting it into a pandas DataFrame. The major information that the fetch_california_housing object returns is:

data: Each row contains the data corresponding to the eight features. (rows = 20640, columns = 8)
target: Each row contains the average house value in units of 100000. (rows = 20640, columns = 1)
feature_names: The array of the feature names. Its length is 8.

The code to import the dataset and create a data frame is given below:

Code explanation

Line 1: We import the fetch_california_housing dataset from the sklearn.datasets.
Line 4: We create an instance of the dataset.
Line 6: We create a DataFrame by passing housing_data["data"] and housing_data["feature_names"]. housing_data["data"] contains the array of data of all columns, and housing_data["feature_names"] contains an array of column names.
Line 7: We create a target column in our data frame by passing the housing_data["target"]. The housing_data["target"] is the output feature of our data set, which is not included in housing_data["data"].

Train-test data split

To perform the Ridge regression, we have to split our data into training and testing data. The code to perform the train-test split is given below:

from sklearn.datasets import fetch_california_housing
import pandas as pd
from sklearn.model_selection import train_test_split
import numpy as np
housing = fetch_california_housing()

housing_df = pd.DataFrame(housing["data"] , columns=housing["feature_names"])
housing_df["target"]=housing["target"]
print(housing_df.head())

np.random.seed(42)

X = housing_df.drop('target' , axis = 1)
Y = housing_df["target"]

x_train, x_test, y_train, y_test = train_test_split(X , Y , test_size = 0.2)

Code to split the data into training and testing data

Code explanation

Line 13: We create the X label (features) by removing the target column.
Line 14: We create the Y label (output) by considering the target column only.
Line 16: We divide our X and Y labels into train and test data using the train_test_split function. We set the test data size to 0.2 (20% data of the whole dataset).

Apply Ridge regression

Now that we have split our data into train and test data, we can apply the Ridge regression model to predict the house values. The code is given below:

from sklearn.datasets import fetch_california_housing
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
import numpy as np
housing = fetch_california_housing()

housing_df = pd.DataFrame(housing["data"] , columns=housing["feature_names"])
housing_df["target"]=housing["target"]
print(housing_df.head())

np.random.seed(42)

X = housing_df.drop('target' , axis = 1)
Y = housing_df["target"]

x_train, x_test, y_train, y_test = train_test_split(X , Y , test_size = 0.2)

model = Ridge()
model.fit(x_train , y_train)
score = model.score(x_test , y_test)
print("Accuracy of the model is: " + str(score*100) + "%")

Code to implement Ridge regression

Code explanation

Line 4: We import the Ridge model from sklearn.linear_model.
Line 19: We create an instance of the Ridge model.
Line 20: We train the model on the training data using the fit function.
Line 21: After training the model, we predict the house value by passing the test data. The score function performs prediction and calculates the accuracy of the model.

Conclusion

Ridge regression is a powerful technique to regularize linear regression models when dealing with multicollinearity. Sklearn allows to implement Ridge regression simply and concisely. By following the steps discussed in this Answer, we can easily incorporate Ridge regression into our machine-learning projects.

Free Resources