What is regression in PyCaret?

Regression analysis is a statistical method used to model the relationship between a dependent variable and one or more independent variables. Regression measures whether there is a statistically significant association between the variables observed in a dataset and quantifies that association. It is employed in finance, investing, and many other fields.

One of the most commonly used types of regression analysis is linear regression.
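As a quick illustration, a simple linear regression can be fit with plain NumPy using ordinary least squares; the data here is made up purely for demonstration.

```python
import numpy as np

# Hypothetical data: years of experience vs. salary in $1000s (illustrative only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([30.0, 35.0, 41.0, 44.0, 50.0])

# Fit y = slope * x + intercept by ordinary least squares
slope, intercept = np.polyfit(x, y, deg=1)
print(f"salary ~ {slope:.2f} * years + {intercept:.2f}")
# salary ~ 4.90 * years + 25.30
```

The fitted slope tells us how much the dependent variable changes, on average, for each unit increase in the independent variable.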

What is PyCaret?

PyCaret is an open-source, low-code Python library that automates machine-learning workflows.

Importance of PyCaret

The significance of PyCaret is explained below:

  • PyCaret is an open-source, Python-based module that automates machine-learning workflows. As an end-to-end machine learning and model management tool, it greatly increases productivity and shortens the experiment cycle.

  • Because PyCaret is a low-code library, tasks that would otherwise take hundreds of lines of code can be accomplished with just a few, alongside other free and open-source machine-learning tools.

  • It is a machine-learning solution for data scientists of all levels who want to work more productively and build quick prototypes. It also integrates seamlessly with a variety of other systems, such as Microsoft Power BI, Tableau, Alteryx, and KNIME.

  • PyCaret is essentially a Python wrapper around several machine-learning libraries and frameworks, such as scikit-learn, XGBoost, LightGBM, CatBoost, spaCy, Optuna, Hyperopt, Ray, and a few others.

Regression in PyCaret

PyCaret's regression module is a supervised machine-learning module used to predict continuous values/outcomes using a variety of techniques and algorithms. Regression can be used to predict continuous quantities such as sales, units sold, temperature, or any other continuous outcome.

The regression module in pycaret offers ten graphs and more than 25 algorithms for analyzing model performance. The pycaret regression module has it all, including advanced methods like stacking, ensembling, and hyperparameter tuning.

We'll go through all the steps to successfully implement the regression model in pycaret.

Installation

Follow the steps below to install pycaret:

# create a conda environment
conda create --name yourenvname python=3.8
# activate conda environment
conda activate yourenvname
# install pycaret
pip install pycaret
# create notebook kernel
python -m ipykernel install --user --name yourenvname --display-name "display-name"

Import the library

After installation, we head over to our notebook and import the library.

# importing pandas and the pycaret regression module for our project
import pandas as pd
from pycaret.regression import *

Since we'll work mostly with the pycaret.regression module, we import all of its dependencies. We also import pandas to load the datasets.

Load the dataset

train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')
sample_data= pd.read_csv('sample_submission.csv')

The train_data, test_data, and sample_data are the three datasets that we'll use in this answer. The data contains house prices along with a benchmark submission based on a linear regression of the year and month of sale, the lot size, and the number of bedrooms.

We can quickly glance through our train dataset:

train_data.head()

Let's get more information about the training dataset:

train_data.info()
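Before preprocessing, it is often useful to see how many values are missing in each column, since that informs imputation choices. A quick sketch with pandas; the tiny frame and column names below are illustrative stand-ins for train_data:

```python
import numpy as np
import pandas as pd

# A tiny illustrative frame standing in for train_data
df = pd.DataFrame({
    'LotFrontage': [65.0, np.nan, 68.0, np.nan],
    'SalePrice':   [208500, 181500, 223500, 140000],
})

# Count missing values per column, most affected first
print(df.isnull().sum().sort_values(ascending=False))
```

Columns with a large share of missing values are candidates for the ignore_features list used in the next step.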

Data preprocessing

# Data preprocessing
dmw = setup(data = train_data,
            target = 'SalePrice',
            numeric_imputation = 'mean',
            categorical_features = ['MSZoning','Exterior1st','Exterior2nd','KitchenQual','Functional','SaleType',
                                    'Street','LotShape','LandContour','LotConfig','LandSlope','Neighborhood',
                                    'Condition1','Condition2','BldgType','HouseStyle','RoofStyle','RoofMatl',
                                    'MasVnrType','ExterQual','ExterCond','Foundation','BsmtQual','BsmtCond',
                                    'BsmtExposure','BsmtFinType1','BsmtFinType2','Heating','HeatingQC','CentralAir',
                                    'Electrical','GarageType','GarageFinish','GarageQual','GarageCond','PavedDrive',
                                    'SaleCondition'],
            ignore_features = ['Alley','PoolQC','MiscFeature','Fence','FireplaceQu','Utilities'],
            normalize = True,
            silent = True)
  • Line 2: We declare a variable dmw and call setup() with seven parameters; data holds our training dataset.

  • Line 3: We set the target parameter to the SalePrice column, which will be our target column for this short tutorial.

  • Line 4: We set the numeric_imputation parameter to 'mean' (the default). The other available options are 'median' and 'zero'.

  • Lines 5-11: We set the categorical_features parameter to a list of the dataset's categorical columns that will be useful for our ML model, excluding the target column.

  • Line 12: We set ignore_features to a list of columns in the training dataset that we choose to ignore when training the model.

  • Line 13: We set the normalize parameter to True. The purpose of normalization is to rescale the values of the dataset's numeric columns to a common scale without losing information or distorting the differences between the ranges of values.

  • Line 14: We set the silent parameter to True so that the confirmation prompt for inferred data types is skipped when setup is executed.

Compare different regression models

We can compare different regression models using the compare_models function:

compare_models()

This function uses cross-validation to train and assess the performance of every estimator in the model library. It returns a scoring grid with the averaged cross-validated scores for each model.

Create model

After successfully comparing the models, we have to then create a model:

byc = create_model('lightgbm')

This function uses cross-validation to train and assess an estimator's performance and returns a scoring grid with CV scores broken down by fold. We can list every available estimator using the models() function. The lightgbm ID stands for Light Gradient Boosting Machine.

Model tuning

Hyperparameter optimization is another name for model tuning, and we can do that using the tune_model function.

tuned_byc = tune_model(byc)

The tune_model is an optimization function in pycaret that adjusts the model's hyperparameters. It returns a scoring grid with cross-validated scores broken down by fold, and the best model is chosen based on the metric specified in the optimize parameter.

SHapley Additive exPlanations

The output of any machine learning model can be explained using SHAP (SHapley Additive exPlanations), a game-theoretic method.

interpret_model(tuned_byc)

This function analyzes the predictions made by a trained model. Most of the plots it produces are based on SHAP (SHapley Additive exPlanations). To use it, we must first install SHAP, which we can do via conda or pip.

# installing SHAP
conda install -c conda-forge shap
#or
pip install shap

Predictions

This function uses a trained model to generate predictions on the fresh/unseen dataset.

predictions = predict_model(tuned_byc, data = test_data)
sample_data['SalePrice'] = predictions['Label']
sample_data.to_csv('final_house_price.csv', index=False)
sample_data.head(10)

We call the predict_model function with the tuned model and test_data. To complete the process, we save the new predictions as a new .csv file. To check that it worked, we can look in our current directory or read the file back from the notebook.

ss = pd.read_csv("final_house_price.csv")
ss.head()

Conclusion

In this answer, we learned how to perform regression using pycaret. We can perform classification, NLP, association rules mining, time series analysis, and so much more with this library.


Copyright ©2026 Educative, Inc. All rights reserved