# Regression with PyCaret

Let’s learn how to import necessary libraries and datasets for regression with PyCaret.

## The linear regression model

A fundamental task in supervised machine learning is regression, where the goal is to predict a continuous value. This is done by modeling the relationship between the target variable $y$ and the feature variables $x$ in a given dataset. One of the most basic regression models is linear regression, defined by the following equation. The equivalent vectorized form is also given: it is the inner product of the coefficient vector $\beta = (\beta_{0}, \beta_{1}, \ldots, \beta_{p})^{T}$ and the feature vector $X_{n} = (1, x_{n1}, \ldots, x_{np})^{T}$, whose leading $1$ multiplies the intercept.

$y_{n}=\beta_{0}+\beta_{1} x_{n1}+\cdots+\beta_{p} x_{np}+\epsilon_{n}=\beta^{T} X_{n}+\epsilon_{n}$

• $y_{n}$ is the target variable for the $n$th instance of the given dataset.
• $x_{n1}$ to $x_{np}$ are the values of the $p$ feature variables for the $n$th instance.
• $\beta_{0}$ is the intercept term.
• $\beta_{1}$ to $\beta_{p}$ are the coefficients of the feature variables.
• $\epsilon_{n}$ is the error term for the $n$th instance.
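
As a minimal sketch of the equation above, the following pure-Python snippet evaluates $\beta^{T} X_n$ for one instance and estimates the coefficients for a single feature. The function names `predict` and `fit_simple` and the sample points are made up for illustration. The closed-form least-squares fit is used here as one common way to estimate the coefficients; the lesson itself does not prescribe it.

```python
from statistics import mean


def predict(beta, x):
    """Return beta^T X_n, where X_n is x with a leading 1 for the intercept."""
    x_aug = [1.0, *x]
    return sum(b * v for b, v in zip(beta, x_aug))


def fit_simple(xs, ys):
    """Least-squares fit for a single feature; returns (beta_0, beta_1)."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    beta_1 = s_xy / s_xx
    beta_0 = y_bar - beta_1 * x_bar
    return beta_0, beta_1


# Made-up points that lie exactly on y = 2 + 3x, so every error term is zero.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [2.0, 5.0, 8.0, 11.0]

beta = fit_simple(xs, ys)
prediction = predict(beta, [4.0])
print(beta, prediction)
```

With real data the points do not fall on a line, and the error terms $\epsilon_{n}$ absorb the difference between each observed $y_{n}$ and the fitted value.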

## Regression methods in PyCaret

Besides linear regression, there are other regression models such as lasso, random forest, support vector machines, and gradient boosting. In the remaining lessons, we’ll see how PyCaret can help us choose and train the best-performing regression model for a specific dataset. We’ll also learn about exploratory data analysis (EDA), a method that lets us examine and understand the basic statistical properties of a dataset.

## Importing the necessary libraries

First, we import the Python libraries that are necessary for our project.

    # Importing necessary libraries
    import pandas as pd
    import matplotlib.pyplot as plt
    import matplotlib as mpl
    import seaborn as sns
    from pycaret.datasets import get_data
    from pycaret.regression import *
    mpl.rcParams['figure.dpi'] = 300

Some standard data science libraries are imported, such as pandas, Matplotlib, and Seaborn. We also import all PyCaret functions that are related to regression. The last line sets the resolution of Matplotlib figures to 300 DPI. This setting is optional and can be omitted.
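
Because the DPI setting is global, it affects every figure created afterwards. As a hedged alternative, the sketch below applies the same value only inside a block by using Matplotlib's `rc_context`. The `Agg` backend is an assumption made so the snippet runs without a display, and the variable names are illustrative.

```python
import matplotlib as mpl

mpl.use("Agg")  # assumption: non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

# Remember whatever DPI is configured before the temporary override.
before_dpi = mpl.rcParams["figure.dpi"]

# Inside the context manager, new figures pick up the overridden DPI.
with mpl.rc_context({"figure.dpi": 300}):
    fig_hi = plt.figure()
    hi_dpi = fig_hi.get_dpi()

# After the block, the previous global setting is back in effect.
restored_dpi = mpl.rcParams["figure.dpi"]
plt.close("all")

print(hi_dpi, restored_dpi)
```

Setting `mpl.rcParams['figure.dpi']` directly, as the lesson does, is simpler when every figure in the notebook should share the same resolution.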

Machine learning projects can only succeed if the appropriate data is available, so PyCaret includes a variety of datasets that can be used to test its features. In this chapter, we’ll use insurance.csv, a dataset that originates from the book Machine Learning with R by Brett Lantz. This is a health insurance dataset, where the features are various attributes including age, sex, body mass index (BMI), whether the person is a smoker or not, number of children, and US region. Furthermore, the dataset’s target variable is the billed charges for every individual. Real-world data is usually more complex, but working with so-called toy datasets will help us grasp the concepts and techniques before dealing with more difficult cases.

We use PyCaret's get_data() function to load the dataset into a pandas DataFrame.

    # Loading/Importing dataset
    data = get_data('insurance')

The output is equivalent to that of the pandas head() method, which returns the first five rows of the dataset. This lets us get a first glimpse of the data we are working with.

We use the pandas info() method to examine some basic information about the dataset.

    # Getting dataset info
    data.info()

As we can see in the output, there are $1338$ rows and none of the columns have null values. Furthermore, the data type of each column has been automatically inferred by the pandas library.
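
The facts that info() reports can also be checked programmatically. The sketch below does so on a small stand-in DataFrame so that it runs offline. The three rows are made-up placeholder values, not records from the insurance dataset, and the column names are assumed from the features described earlier.

```python
import pandas as pd

# Stand-in frame with made-up values; column names are an assumption based on
# the features described in the lesson (age, sex, BMI, children, smoker, region).
data = pd.DataFrame({
    "age": [25, 40, 33],
    "sex": ["female", "male", "male"],
    "bmi": [22.5, 30.1, 27.8],
    "children": [0, 2, 1],
    "smoker": ["no", "yes", "no"],
    "region": ["southwest", "northeast", "southeast"],
    "charges": [3000.0, 25000.0, 5200.0],
})

# Number of rows and columns, as shown in the info() summary.
n_rows, n_cols = data.shape

# Count missing values per column and check whether any column has one.
null_counts = data.isnull().sum()
has_nulls = bool(null_counts.any())

# Columns whose inferred data type is numeric.
numeric_cols = data.select_dtypes(include="number").columns.tolist()

print(n_rows, n_cols, has_nulls, numeric_cols)
```

Run against the DataFrame returned by get_data('insurance'), the same checks should agree with the info() output described above.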