Regression with PyCaret

Let’s learn how to import necessary libraries and datasets for regression with PyCaret.

The linear regression model

A fundamental task in supervised machine learning is regression, where the goal is to predict a continuous value. This is achieved by modeling the relationship between the target variable $y$ and the feature variables $x$ in a given dataset. One of the most basic regression models is linear regression, defined by the following equation. The equivalent vectorized form is also provided, where the inner product of the transposed coefficient vector $\beta^{T}$ and the feature vector $X_n$ is calculated. For the intercept $\beta_0$ to be part of that product, $X_n$ is taken to have a leading entry of 1.

$$y_{n}=\beta_{0}+\beta_{1} x_{n1}+\cdots+\beta_{p} x_{np}+\epsilon_{n}=\beta^{T} X_{n}+\epsilon_{n}$$

  • $y_{n}$ is the target variable for the $n$th instance of the given dataset.
  • $x_{n1}$ to $x_{np}$ are the values of the $p$ feature variables for the $n$th instance.
  • $\beta_{0}$ is the intercept term.
  • $\beta_{1}$ to $\beta_{p}$ are the coefficients of the feature variables.
  • $\epsilon_{n}$ is the error term for the $n$th instance.
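
To make the equation concrete, the following is a minimal sketch in plain Python of how a prediction is computed as the inner product of the coefficients and the features. The function name predict and all numeric values are made up for illustration; they are not part of PyCaret or of the insurance dataset.

```python
# Minimal sketch of the linear regression equation (without the error term).
# The function name and all numbers below are hypothetical.

def predict(beta, features):
    """Return beta^T X_n, where X_n is the feature list with a leading 1."""
    x_n = [1.0] + list(features)  # the leading 1 multiplies the intercept beta_0
    return sum(b * x for b, x in zip(beta, x_n))

# Hypothetical coefficients: beta_0 (intercept), beta_1, beta_2
beta = [2.0, 0.5, -1.0]

# Hypothetical feature values x_{n1}, x_{n2} for a single instance
features = [4.0, 3.0]

y_hat = predict(beta, features)  # 2.0 + 0.5 * 4.0 + (-1.0) * 3.0
print(y_hat)
```

The observed target $y_n$ differs from this prediction by the error term $\epsilon_n$. Training a linear regression model amounts to choosing the coefficients so that these errors are small across the whole dataset.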

Regression methods in PyCaret

Besides linear regression, we have other regression models such as lasso, random forest, support vector machines, and gradient boosting. In the remaining lessons, we’ll see how PyCaret can help us choose and train the optimal regression model for a specific dataset. We’ll also learn about exploratory data analysis (EDA), a method that lets us examine and understand the basic statistical properties of a dataset.

Importing the necessary libraries

First, we import the Python libraries that are necessary for our project.

# Importing necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns
from pycaret.datasets import get_data
from pycaret.regression import *
mpl.rcParams['figure.dpi'] = 300

Some standard data science libraries are included, such as pandas, Matplotlib, and Seaborn. We also import all PyCaret functions that are related to regression. The last line sets the resolution of Matplotlib figures to 300 DPI. It is optional, so we can omit it if we wish.

Loading the dataset

Machine learning projects can only succeed if the appropriate data is available, so PyCaret includes a variety of datasets that can be used to test its features. In this chapter, we’ll use insurance.csv, a dataset that originates from the book Machine Learning with R by Brett Lantz. This is a health insurance dataset, where the features are various attributes including age, sex, body mass index (BMI), whether the person is a smoker or not, number of children, and US region. Furthermore, the dataset’s target variable is the billed charges for every individual. Real-world data is usually more complex, but working with so-called toy datasets will help us grasp the concepts and techniques before dealing with more difficult cases.

We use PyCaret’s get_data() function to load the dataset into a pandas DataFrame.

# Loading/Importing dataset
data = get_data('insurance')

As we can see, the output is equivalent to that of the pandas head() method, which returns the first five rows of the dataset. This lets us get a first glimpse of the data we are working with.
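
As a rough illustration of what head() does, the sketch below builds a small DataFrame by hand and takes its first rows. The column names are assumed to follow the attributes described above, and every value is invented for illustration; none of these rows come from the actual insurance dataset.

```python
import pandas as pd

# Hypothetical rows, invented for illustration only. The column names are
# an assumption based on the dataset description in this lesson.
sample = pd.DataFrame(
    {
        "age": [25, 41, 33, 58, 19, 47, 62],
        "sex": ["female", "male", "male", "female", "male", "female", "male"],
        "bmi": [22.4, 30.1, 27.9, 35.2, 24.0, 28.6, 31.3],
        "children": [0, 2, 1, 3, 0, 2, 0],
        "smoker": ["no", "yes", "no", "no", "yes", "no", "no"],
        "region": ["southwest", "southeast", "northwest", "northeast",
                   "southwest", "southeast", "northwest"],
        "charges": [3200.5, 21000.0, 4800.75, 13500.2, 16900.0, 9100.4, 14200.9],
    }
)

first_rows = sample.head()   # first five rows by default
first_two = sample.head(2)   # an explicit row count can be passed instead

print(first_rows)
```

Calling head() on the DataFrame returned by get_data() works the same way, which is useful when we want a different number of rows than the default preview.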

We use the pandas info() method to examine some basic information about the dataset.

# Getting dataset info
data.info()

As we can see in the output, there are 1338 rows and none of the columns have null values. Furthermore, the data type of each column has been automatically inferred by the pandas library.
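
The facts reported by info() can also be checked programmatically. The following sketch uses a tiny hand-made DataFrame with one deliberately missing value, so that the null count is visible. All values are hypothetical and unrelated to the insurance dataset.

```python
import pandas as pd

# Hypothetical data, invented for illustration; one bmi value is missing
df = pd.DataFrame(
    {
        "age": [25, 41, 33],
        "bmi": [22.4, None, 27.9],
        "smoker": ["no", "yes", "no"],
    }
)

n_rows, n_cols = df.shape        # number of rows and columns
null_counts = df.isnull().sum()  # missing values per column
column_types = df.dtypes         # data type inferred for each column

print(n_rows, n_cols)
print(null_counts)
print(column_types)
```

For a dataset without missing values, such as the one loaded above according to the info() output, every entry of the null count would be zero.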