What is regression in PyCaret?
PyCaret is an open-source, low-code Python library that automates machine-learning workflows.
Importance of PyCaret
The significance of PyCaret is explained below:
- The pycaret module is an open-source, Python-based library that automates machine-learning workflows. This all-encompassing model-management and machine-learning system greatly increases productivity and shortens the experiment cycle.
- With just a few lines of code instead of hundreds, pycaret, a low-code library, can be used alongside other free and open-source machine-learning tools.
- It is a machine-learning solution for data scientists of all levels who want to work more productively and produce quick prototypes. It also connects seamlessly with a variety of other systems, such as Microsoft Power BI, Tableau, Alteryx, and KNIME.
- pycaret is, in effect, a Python wrapper around machine-learning libraries and frameworks such as scikit-learn, XGBoost, LightGBM, CatBoost, spaCy, Optuna, Hyperopt, Ray, and a few others.
Regression in PyCaret
The pycaret regression module is a supervised machine-learning module used to predict continuous values/outcomes with a variety of algorithms. Regression can be used to forecast continuous numbers like sales, units sold, temperature, or any other continuous outcome.
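To make the idea of "predicting a continuous value" concrete before we touch PyCaret, here is a minimal sketch of a one-variable least-squares regression in plain Python. The data is invented for the example; PyCaret's algorithms are far more sophisticated, but they share this core idea of fitting a function that maps features to a continuous target.

```python
# Minimal least-squares regression sketch: predict a continuous
# outcome from a single feature. Toy data, for illustration only.

xs = [1.0, 2.0, 3.0, 4.0]   # feature values (e.g., lot size)
ys = [2.0, 4.0, 6.0, 8.0]   # continuous target (here exactly 2*x)

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# slope = cov(x, y) / var(x); intercept follows from the means
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
intercept = mean_y - slope * mean_x

def predict(x):
    """Predict a continuous value for a new feature value x."""
    return intercept + slope * x

print(predict(5.0))  # -> 10.0 on this toy data
```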
The regression module in pycaret offers ten graphs and more than 25 algorithms for analyzing model performance. The pycaret regression module has it all, including advanced methods like stacking, ensembling, and hyperparameter tuning.
We'll go through all the steps to successfully implement the regression model in pycaret.
Installation
Follow the steps below to install pycaret:
# create a conda environment
conda create --name yourenvname python=3.8

# activate conda environment
conda activate yourenvname

# install pycaret
pip install pycaret

# create notebook kernel
python -m ipykernel install --user --name yourenvname --display-name "display-name"
Import the library
After installation, we head over to our notebook and import the library.
# importing the pycaret library for our project
from pycaret.regression import *
Since we'll work mostly with the pycaret.regression module, we import everything it exposes.
Load the dataset
import pandas as pd

train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')
sample_data = pd.read_csv('sample_submission.csv')
The train_data, test_data, and sample_data are the three datasets that we'll use in this response. The information includes home prices and a benchmark submission based on a linear regression of the year and month of sale, lot size, and the number of bedrooms.
We can quickly glance through our train dataset:
train_data.head()
train_data.info()
Data preprocessing
# Data preprocessing
dmw = setup(data = train_data,
            target = 'SalePrice',
            numeric_imputation = 'mean',
            categorical_features = ['MSZoning','Exterior1st','Exterior2nd','KitchenQual','Functional','SaleType',
                                    'Street','LotShape','LandContour','LotConfig','LandSlope','Neighborhood',
                                    'Condition1','Condition2','BldgType','HouseStyle','RoofStyle','RoofMatl',
                                    'MasVnrType','ExterQual','ExterCond','Foundation','BsmtQual','BsmtCond',
                                    'BsmtExposure','BsmtFinType1','BsmtFinType2','Heating','HeatingQC','CentralAir',
                                    'Electrical','GarageType','GarageFinish','GarageQual','GarageCond','PavedDrive',
                                    'SaleCondition'],
            ignore_features = ['Alley','PoolQC','MiscFeature','Fence','FireplaceQu','Utilities'],
            normalize = True,
            silent = True)
Line 2: We declare a variable dmw and pass seven parameters to setup(); data holds our training dataset.
Line 3: We set the target parameter to the SalePrice column, which will be our target column for this short tutorial.
Line 4: We set the numeric_imputation parameter to its default, 'mean'. The other available options are 'median' and 'zero'.
Lines 5-11: We set the categorical_features parameter to a list of the other columns in the dataset that will be useful for our ML model, excluding the target column.
Line 12: We set ignore_features to a list of columns in the training dataset that we choose to ignore when training the model.
Line 13: We set the normalize parameter to True because normalization rescales the values of the dataset's numeric columns without losing information or distorting the differences between the ranges of values.
Line 14: We set the silent parameter to True to skip the confirmation prompt for the inferred data types when setup is executed.
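To make the effect of numeric_imputation = 'mean' and normalize = True concrete, here is a small plain-Python sketch of mean imputation followed by z-score rescaling (one common normalization scheme). The column values are invented; setup() performs these steps, and much more, internally.

```python
# Sketch of mean imputation + z-score normalization on one numeric column.
# Toy values, for illustration of what setup() does internally.

column = [10.0, 20.0, None, 30.0]   # None marks a missing value

# 1) Mean imputation: replace missing entries with the column mean.
observed = [v for v in column if v is not None]
mean = sum(observed) / len(observed)                # 20.0
imputed = [mean if v is None else v for v in column]

# 2) Z-score normalization: rescale to mean 0 and unit variance,
#    preserving the relative differences between values.
mu = sum(imputed) / len(imputed)
std = (sum((v - mu) ** 2 for v in imputed) / len(imputed)) ** 0.5
normalized = [(v - mu) / std for v in imputed]

print(normalized)   # values now centered on 0 with unit spread
```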
Compare different regression models
We can compare different regression models using the compare_models function:
compare_models()
This function uses cross-validation to train and assess the performance of every estimator in the model library, and it returns a scoring grid of the averaged cross-validated scores.
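The idea behind compare_models can be sketched without PyCaret at all: score several candidate models with k-fold cross-validation on the same data and rank them by the averaged score. The two toy "estimators" and the data below are made up for the illustration; this is not PyCaret's implementation.

```python
# Sketch of "compare models by cross-validation": score two trivial
# estimators by k-fold CV on toy data and rank them by mean error.

data = [(x, 2.0 * x) for x in range(1, 9)]   # (feature, target) pairs

def mean_predictor(train):                   # ignores the feature entirely
    mean_y = sum(y for _, y in train) / len(train)
    return lambda x: mean_y

def slope_predictor(train):                  # fits y = a*x through the origin
    a = sum(x * y for x, y in train) / sum(x * x for x, _ in train)
    return lambda x: a * x

def cv_mae(make_model, data, k=4):
    """Mean absolute error averaged over k cross-validation folds."""
    folds = [data[i::k] for i in range(k)]
    errors = []
    for i in range(k):
        test = folds[i]
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        model = make_model(train)
        errors.append(sum(abs(model(x) - y) for x, y in test) / len(test))
    return sum(errors) / k

scores = {name: cv_mae(fn, data)
          for name, fn in [("mean", mean_predictor), ("slope", slope_predictor)]}
best = min(scores, key=scores.get)
print(best, scores)   # the slope model wins: the data really is y = 2x
```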
Create model
After successfully comparing the models, we then create a model:
byc = create_model('lightgbm')
This function uses cross-validation to train and assess an estimator's performance, and it produces a scoring grid with the CV scores broken down by fold. We can list every available estimator with the models function. Here, 'lightgbm' stands for Light Gradient Boosting Machine.
Model tuning
Hyperparameter optimization is another name for model tuning, and we can do that using the tune_model function.
tuned_byc = tune_model(byc)
The tune_model is an optimization function in pycaret. It adjusts the model's hyperparameters and produces a scoring grid with the cross-validated scores broken down by fold. The best model is chosen based on the metric specified in the optimize parameter.
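At its core, hyperparameter tuning means trying several candidate settings and keeping the one with the best validation score. The sketch below grid-searches a regularization strength for a toy ridge-style model; the data and grid are invented for the illustration, and PyCaret's tune_model uses a more powerful randomized search with cross-validation by default.

```python
# Sketch of hyperparameter tuning: grid-search a regularization
# strength and keep the value with the lowest validation error.

train = [(x, 3.0 * x) for x in range(1, 6)]   # clean y = 3x data
valid = [(6.0, 18.0), (7.0, 21.0)]            # held-out validation pairs

def fit_ridge_slope(data, lam):
    """Slope of a ridge regression through the origin: a = sum(xy) / (sum(x^2) + lam)."""
    return sum(x * y for x, y in data) / (sum(x * x for x, _ in data) + lam)

def mae(slope, data):
    """Mean absolute error of the line y = slope * x on data."""
    return sum(abs(slope * x - y) for x, y in data) / len(data)

grid = [0.0, 0.1, 1.0, 10.0]                  # candidate lambda values
best_lam = min(grid, key=lambda lam: mae(fit_ridge_slope(train, lam), valid))
print(best_lam)   # lambda = 0 wins here because the toy data is noise-free
```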
SHapley Additive exPlanations
Any machine learning model's output can be explained using a game theoretic method such as SHAP (SHapley Additive exPlanations).
interpret_model(tuned_byc)
This function examines the predictions made by a trained model. SHAP (SHapley Additive exPlanations) provides the foundation for the majority of the charts in this function. We must install SHAP for this to work; we can install the library via conda or pip.
# installing SHAP
conda install -c conda-forge shap
# or
pip install shap
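For a linear model, SHAP's attributions have a simple closed form that we can sketch by hand: each feature's contribution is its weight times the feature's deviation from the average input, and the contributions sum exactly to the gap between the prediction and the average prediction. The weights and data below are invented; interpret_model computes real attributions with the shap library.

```python
# Sketch of SHAP's idea for a linear model f(x) = 2*x1 - 1*x2 + 5.
# Toy numbers, for illustration of additive feature attributions.

weights = [2.0, -1.0]
bias = 5.0
background = [[1.0, 1.0], [3.0, 3.0]]   # reference data defining the "average" input
x = [4.0, 1.0]                          # the instance we want to explain

# Base value: the model's prediction on the average input.
means = [sum(col) / len(col) for col in zip(*background)]   # [2.0, 2.0]
base_value = sum(w * m for w, m in zip(weights, means)) + bias

# SHAP value of feature i (linear model, independent features):
# phi_i = w_i * (x_i - mean_i)
phi = [w * (xi - m) for w, xi, m in zip(weights, x, means)]

prediction = sum(w * xi for w, xi in zip(weights, x)) + bias
# Additivity: base value + attributions reproduce the prediction exactly.
print(phi, base_value + sum(phi) == prediction)
```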
Predictions
This function generates predictions by applying a trained model to the fresh/unseen dataset.
predictions = predict_model(tuned_byc, data = test_data)
sample_data['SalePrice'] = predictions['Label']
sample_data.to_csv('final_house_price.csv', index=False)
sample_data.head(10)
We use the predict_model function on the tuned model and test_data. To complete the process, we save the new predictions as a new .csv file. To confirm it worked, we can look in our current directory or read the file back in the notebook.
ss = pd.read_csv("final_house_price.csv")
ss.head()
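The save-and-verify step above can be sketched with only the standard library: write the predictions to a CSV file, then read it back to confirm the round trip. The column names, IDs, and file path below are made up for the example.

```python
# Sketch of saving predictions to CSV and verifying the round trip,
# using only the standard library. All values are illustrative.

import csv
import os
import tempfile

rows = [("Id", "SalePrice"), (1461, 169000.5), (1462, 187500.0)]

path = os.path.join(tempfile.gettempdir(), "final_house_price_demo.csv")
with open(path, "w", newline="") as f:
    csv.writer(f).writerows(rows)        # write header + prediction rows

with open(path, newline="") as f:
    read_back = list(csv.reader(f))      # read the file back to verify

print(read_back[0])   # the header row survived the round trip
```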
Conclusion
In this answer, we learned how to perform regression using pycaret. We can perform classification, NLP, association rules mining, time series analysis, and so much more with this library.