How to perform sklearn Ridge regression
Sklearn is a powerful open-source machine learning library that provides efficient tools for data analysis, classification, regression, and more. In this Answer, we will look into applying Ridge regression using sklearn.
Ridge regression
Ridge regression is a commonly used regularization technique. It is a model tuning method that is used to perform analysis on data that has multicollinearity. Multicollinearity is a concept involved in statistics in which multiple independent columns are correlated with one another. It causes a large
Dataset
We will be using the fetch_california_housing dataset from Sklearn and converting it into a pandas DataFrame. The major information that the fetch_california_housing object returns is:
data: Each row contains the data corresponding to the eight features. (rows = 20640, columns = 8)target: Each row contains the average house value in units of 100000. (rows = 20640, columns = 1)feature_names: The array of the feature names. Its length is 8.
The code to import the dataset and create a data frame is given below:
from sklearn.datasets import fetch_california_housing import pandas as pd housing_data = fetch_california_housing() housing_df = pd.DataFrame(housing_data["data"] , columns=housing_data["feature_names"]) housing_df["target"]=housing_data["target"] print(housing_df.head())
Code explanation
Line 1: We import the
fetch_california_housingdataset from thesklearn.datasets.Line 4: We create an instance of the dataset.
Line 6: We create a
DataFrameby passinghousing_data["data"]andhousing_data["feature_names"].housing_data["data"]contains the array of data of all columns, andhousing_data["feature_names"]contains an array of column names.Line 7: We create a
targetcolumn in our data frame by passing thehousing_data["target"]. Thehousing_data["target"]is the output feature of our data set, which is not included inhousing_data["data"].
Train-test data split
To perform the Ridge regression, we have to split our data into training and testing data. The code to perform the train-test split is given below:
from sklearn.datasets import fetch_california_housing
import pandas as pd
from sklearn.model_selection import train_test_split
import numpy as np
housing = fetch_california_housing()
housing_df = pd.DataFrame(housing["data"] , columns=housing["feature_names"])
housing_df["target"]=housing["target"]
print(housing_df.head())
np.random.seed(42)
X = housing_df.drop('target' , axis = 1)
Y = housing_df["target"]
x_train, x_test, y_train, y_test = train_test_split(X , Y , test_size = 0.2)Code explanation
Line 13: We create the
Xlabel (features) by removing thetargetcolumn.Line 14: We create the
Ylabel (output) by considering thetargetcolumn only.Line 16: We divide our
XandYlabels into train and test data using thetrain_test_splitfunction. We set the test data size to 0.2 (20% data of the whole dataset).
Apply Ridge regression
Now that we have split our data into train and test data, we can apply the Ridge regression model to predict the house values. The code is given below:
from sklearn.datasets import fetch_california_housing
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
import numpy as np
housing = fetch_california_housing()
housing_df = pd.DataFrame(housing["data"] , columns=housing["feature_names"])
housing_df["target"]=housing["target"]
print(housing_df.head())
np.random.seed(42)
X = housing_df.drop('target' , axis = 1)
Y = housing_df["target"]
x_train, x_test, y_train, y_test = train_test_split(X , Y , test_size = 0.2)
model = Ridge()
model.fit(x_train , y_train)
score = model.score(x_test , y_test)
print("Accuracy of the model is: " + str(score*100) + "%")
Code explanation
Line 4: We import the
Ridgemodel fromsklearn.linear_model.Line 19: We create an instance of the
Ridgemodel.Line 20: We train the model on the training data using the
fitfunction.Line 21: After training the model, we predict the house value by passing the test data. The
scorefunction performs prediction and calculates the accuracy of the model.
Conclusion
Ridge regression is a powerful technique to regularize linear regression models when dealing with multicollinearity. Sklearn allows to implement Ridge regression simply and concisely. By following the steps discussed in this Answer, we can easily incorporate Ridge regression into our machine-learning projects.
Free Resources