Scikit-learn cheat sheet: methods for classification & regression

Learn how to use scikit-learn in your ML projects.

Jun 10, 2026

Machine learning is one of today’s fastest-growing technologies, and it is already part of our daily lives through tools like face recognition, home assistants, resume scanners, and self-driving cars.

Scikit-learn is one of the most widely used Python libraries for classification, regression, and clustering. It is built on top of NumPy (for arrays and linear algebra) and SciPy (for scientific computing), and it works well alongside matplotlib (for graphs and visualization).

In our last article on Scikit-learn, we introduced the basics of this library alongside the most common operations. Today, we take our Scikit-learn knowledge one step further and teach you how to perform classification and regression, followed by the 10 most popular methods for each.

Supervised Learning#

In supervised learning, the model learns under the supervision of a teacher. The training data contains both the inputs and the known outputs (labels), so the teacher knows the correct answer during training and adjusts the model to reduce its prediction error. The two major types of supervised learning methods are Classification and Regression.

Unsupervised Learning#

Unsupervised Learning refers to models where there is no supervisor for the learning process. The model uses just input for training. The output is learned from the inputs only. The major type of unsupervised learning is Clustering, in which we cluster similar things together to find patterns in unlabeled datasets.
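
To make the clustering idea concrete, here is a minimal sketch using scikit-learn's KMeans. The data is synthetic (generated with make_blobs), and the choice of three clusters is an assumption made for illustration, not something taken from a real dataset.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic, unlabeled 2-D points grouped around 3 centers (illustrative data only)
X_blobs, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.6, random_state=0)

# n_clusters is the number of groups we ask KMeans to find
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X_blobs)  # no y is passed: only inputs are used

print(np.unique(cluster_ids))         # the distinct cluster ids assigned
print(kmeans.cluster_centers_.shape)  # one center per cluster
```

Note that fit_predict() receives only the inputs. The cluster ids it returns are arbitrary group numbers, not class labels.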

Reinforcement Learning#

Reinforcement Learning refers to models that learn to make decisions based on rewards or punishments and try to maximize the total reward. It is commonly used for game-playing algorithms and robotics, where the robot learns by performing tasks and receiving feedback.

In this post I am going to explain the two major methods of Supervised Learning:

Classification: In Classification, the output is discrete data. In simpler words, we categorize data based on certain features. For example, we can differentiate between apples and oranges based on their shape, color, and texture. Here shape, color, and texture are the features, and the outputs “Apple” and “Orange” are the classes. Since the output is a class, the method is called Classification.
Regression: In Regression, the output is continuous data. We predict a numeric value from the features by learning the trend in the training data. The result does not belong to a category or class; it is a real number. For example, we can predict house prices from features like the size of the house, its location, and the number of floors.
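
The difference between the two kinds of output can be checked directly. The sketch below uses synthetic data from make_classification and make_regression (an assumption for illustration) and scikit-learn's type_of_target helper to inspect each target.

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.utils.multiclass import type_of_target

# Synthetic datasets, used only to show what each kind of target looks like
X_cls, y_cls = make_classification(n_samples=100, n_features=4, random_state=0)
X_reg, y_reg = make_regression(n_samples=100, n_features=4, noise=10.0, random_state=0)

kind_cls = type_of_target(y_cls)  # discrete labels
kind_reg = type_of_target(y_reg)  # real-valued numbers

print(sorted(set(y_cls.tolist())), kind_cls)
print(y_reg[:3], kind_reg)
```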

Loading the Libraries#

#Numpy deals with large arrays and linear algebra
import numpy as np
# Library for data manipulation and analysis
import pandas as pd 
 
# Metrics for evaluating the model: accuracy and F1-score
from sklearn.metrics import f1_score, accuracy_score
 
#Importing the Decision Tree from scikit-learn library
from sklearn.tree import DecisionTreeClassifier
 
# For splitting of data into train and test set
from sklearn.model_selection import train_test_split

Loading the Dataset#

train=pd.read_csv("/input/hcirs-ctf/train.csv")
# read_csv function of pandas reads the data in CSV format
# from path given and stores in the variable named train
# the data type of train is DataFrame

Splitting into Train & Test set#

# First we split our data into input and output
# y is the output, stored in the "Class" column of the DataFrame
# X contains the remaining columns, which are the features (input)
y = train["Class"]
X = train.drop(columns=["Class"])
 
# Now we split the dataset into a train part and a test part
# Here the train set is 75% and the test set is 25%
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=2)
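
If you do not have the CSV file at hand, the same split can be tried on synthetic data. This is a sketch under assumptions: the DataFrame and its feature names (f0 to f4) are made up, and the stratify argument is an optional addition that keeps the class ratio similar in both parts.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the CSV: 200 rows, 5 features, imbalanced "Class" column
features, labels = make_classification(
    n_samples=200, n_features=5, weights=[0.8, 0.2], random_state=2
)
df = pd.DataFrame(features, columns=[f"f{i}" for i in range(5)])
df["Class"] = labels

X_demo = df.drop(columns=["Class"])
y_demo = df["Class"]

# stratify=y_demo keeps the share of each class roughly equal in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=2
)

print(X_tr.shape, X_te.shape)    # 75% / 25% of the rows
print(y_tr.mean(), y_te.mean())  # share of class 1 in each part
```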

Training the model#

# Training the model is as simple as this:
# create the estimator imported above and call fit() on it
DT = DecisionTreeClassifier()
DT.fit(X_train, y_train)

Evaluating the model#

# We use predict() on the trained model to predict the output
pred = DT.predict(X_test)
 
# For classification we use accuracy and F1 score
print(accuracy_score(y_test, pred))
# f1_score defaults to average="binary"; for more than two classes,
# pass an average such as "macro" or "weighted"
print(f1_score(y_test, pred))
 
# For regression we use R2 score and MAE (mean absolute error)
# All other steps are the same as above, but pred must come from
# a regressor trained on a numeric target
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import r2_score
print(mean_absolute_error(y_test, pred))
print(r2_score(y_test, pred))
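
For completeness, here is what the regression variant of the same steps might look like end to end. It is a sketch on synthetic data from make_regression, with DecisionTreeRegressor swapped in as the regression counterpart of the classifier used above. The variable names are illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (numeric target), for illustration only
X_r, y_r = make_regression(n_samples=300, n_features=5, noise=15.0, random_state=2)
Xr_train, Xr_test, yr_train, yr_test = train_test_split(
    X_r, y_r, test_size=0.25, random_state=2
)

# Same fit/predict pattern as the classifier, but with a regressor
reg = DecisionTreeRegressor(random_state=2)
reg.fit(Xr_train, yr_train)
reg_pred = reg.predict(Xr_test)

mae = mean_absolute_error(yr_test, reg_pred)  # average absolute error, in target units
r2 = r2_score(yr_test, reg_pred)              # 1.0 is a perfect fit
print(mae, r2)
```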

Now that we know the basic steps for Classification and Regression, let’s learn about the top methods for Classification and Regression that you can use in your ML systems. These methods will simplify your ML programming.

Note: Import these methods to use in place of the DecisionTreeClassifier().
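
As a sketch of what this swap looks like in practice, the loop below trains three of the classifiers listed in this article on the same split. The data is synthetic, and the hyperparameters (max_iter, n_neighbors) are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data, for illustration only
X_s, y_s = make_classification(n_samples=300, n_features=6, random_state=2)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_s, y_s, test_size=0.25, random_state=2
)

# Every scikit-learn estimator exposes the same fit()/predict() interface,
# so only the constructor changes
candidates = {
    "DecisionTree": DecisionTreeClassifier(random_state=2),
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(n_neighbors=5),
}

scores = {}
for name, model in candidates.items():
    model.fit(Xs_train, ys_train)
    scores[name] = accuracy_score(ys_test, model.predict(Xs_test))
    print(name, scores[name])
```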

10 popular classification methods#

Logistic Regression#

from sklearn.linear_model import LogisticRegression

Support Vector Machine#

from sklearn.svm import SVC

Naive Bayes (Gaussian, Multinomial)#

from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB

Stochastic Gradient Descent Classifier#

from sklearn.linear_model import SGDClassifier

KNN (k-nearest neighbor)#

from sklearn.neighbors import KNeighborsClassifier

Decision Tree#

from sklearn.tree import DecisionTreeClassifier

Random Forest#

from sklearn.ensemble import RandomForestClassifier

Gradient Boosting Classifier#

from sklearn.ensemble import GradientBoostingClassifier

LGBM Classifier#

from lightgbm import LGBMClassifier

XGBoost Classifier#

from xgboost.sklearn import XGBClassifier

10 popular regression methods#

Linear Regression#

from sklearn.linear_model import LinearRegression

LGBM Regressor#

from lightgbm import LGBMRegressor

XGBoost Regressor#

from xgboost.sklearn import XGBRegressor

CatBoost Regressor#

from catboost import CatBoostRegressor

Stochastic Gradient Descent Regression#

from sklearn.linear_model import SGDRegressor

Kernel Ridge Regression#

from sklearn.kernel_ridge import KernelRidge

Elastic Net Regression#

from sklearn.linear_model import ElasticNet

Bayesian Ridge Regression#

from sklearn.linear_model import BayesianRidge

Gradient Boosting Regression#

from sklearn.ensemble import GradientBoostingRegressor

Support Vector Machine#

from sklearn.svm import SVR

Which scikit-learn algorithm should I use?#

Choosing a machine learning algorithm can feel overwhelming when you're starting out. Scikit-learn gives you many options, but the right choice depends on your problem type, dataset size, interpretability needs, and performance goals.

A good rule of thumb is to start simple, build a baseline, and then move to more powerful models if the baseline is not good enough. There is no universally best algorithm—the best model is the one that performs well on your data and solves your business problem.

Classification algorithm guide#

Algorithm	Best For	Strengths	Weaknesses	Typical Dataset Size
Logistic Regression	Interpretable classification	Fast, simple, easy to explain	Struggles with complex non-linear patterns	Small to large
Random Forest	Strong general-purpose baseline	Handles non-linear data, robust, little tuning needed	Less interpretable than linear models	Small to large
XGBoost	High-accuracy tabular data	Excellent predictive performance	Requires tuning, external library	Medium to large
LightGBM	Large tabular datasets	Very fast, scalable, strong accuracy	Can overfit if not tuned carefully	Medium to very large
SVM	Small or medium datasets	Works well with clear margins	Slow on large datasets	Small to medium
KNN	Simple similarity-based classification	Easy to understand, no training phase	Slow prediction, sensitive to scaling	Small
Naive Bayes	Text classification	Very fast, works well for spam/NLP tasks	Strong independence assumptions	Small to large

Regression algorithm guide#

Algorithm	Best For	Strengths	Weaknesses	Typical Dataset Size
Linear Regression	Simple numeric prediction	Fast, interpretable, great baseline	Assumes mostly linear relationships	Small to large
ElasticNet	Many correlated features	Combines L1 and L2 regularization	Requires tuning alpha and l1_ratio	Small to medium
Random Forest Regressor	Non-linear regression	Handles complex patterns, robust	Can be slower and less interpretable	Small to large
XGBoost Regressor	High-performance tabular regression	Strong accuracy, handles feature interactions	Requires tuning, external library	Medium to large
LightGBM Regressor	Large-scale regression	Fast training, strong performance	Can overfit without tuning	Medium to very large
SVR	Small non-linear datasets	Flexible with kernels	Does not scale well to large datasets	Small to medium

For regression, start with Linear Regression as your baseline. Then try Random Forest Regressor if the data has non-linear patterns. For stronger performance on tabular datasets, test XGBoost Regressor or LightGBM Regressor.
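
The sketch below illustrates why a non-linear model can be worth trying after the baseline. It assumes a deliberately non-linear synthetic target (y is roughly x squared), so it shows the effect in a favorable case rather than benchmarking the models in general.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data with a clearly non-linear relationship: y = x^2 + noise
rng = np.random.default_rng(0)
X_nl = rng.uniform(-3, 3, size=(400, 1))
y_nl = X_nl[:, 0] ** 2 + rng.normal(scale=0.3, size=400)

Xn_train, Xn_test, yn_train, yn_test = train_test_split(
    X_nl, y_nl, test_size=0.25, random_state=0
)

# Baseline: a straight line cannot follow a parabola
linear = LinearRegression().fit(Xn_train, yn_train)
r2_linear = r2_score(yn_test, linear.predict(Xn_test))

# Random forest: piecewise-constant fit that can follow the curve
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(Xn_train, yn_train)
r2_forest = r2_score(yn_test, forest.predict(Xn_test))

print("Linear R2:", r2_linear)
print("Forest R2:", r2_forest)
```

On data that really is close to linear, the ordering can reverse, which is why the baseline comes first.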

Recommended default algorithms#

For classification, two strong starting points are:

Random Forest
XGBoost

Random Forest is beginner-friendly because it works well with minimal tuning and handles many real-world datasets effectively. XGBoost often performs better on structured tabular data, but it usually requires more tuning.

For regression, good defaults are:

Random Forest Regressor
XGBoost Regressor

These models capture non-linear relationships better than simple linear models and often provide strong performance without requiring deep mathematical assumptions.

Common beginner mistakes#

Using complex models too early#

Many beginners jump straight to advanced models before creating a simple baseline. Start with Logistic Regression or Linear Regression first so you know what performance level you need to beat.

Ignoring feature engineering#

A better model cannot always fix poor features. Cleaning data, encoding categorical variables, handling missing values, and scaling features often matter as much as the algorithm itself.
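
One common way to bundle these preprocessing steps with the model is a Pipeline with a ColumnTransformer. The sketch below uses a tiny made-up table; the column names (size, city, label) and the choice of median imputation are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny made-up dataset; column names are hypothetical
data = pd.DataFrame({
    "size": [50.0, 80.0, np.nan, 120.0, 65.0, 95.0],  # numeric, one missing value
    "city": ["A", "B", "A", "B", "B", "A"],           # categorical
    "label": [0, 1, 0, 1, 0, 1],
})
X_fe = data[["size", "city"]]
y_fe = data["label"]

# Numeric columns: fill missing values, then scale
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Apply each transformation only to the columns it is meant for
preprocess = ColumnTransformer([
    ("num", numeric, ["size"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])

# Preprocessing and model are fitted together, on the training data only
clf = Pipeline([("prep", preprocess), ("model", LogisticRegression())])
clf.fit(X_fe, y_fe)

fe_pred = clf.predict(X_fe)
print(fe_pred)
```

Keeping the preprocessing inside the pipeline means the same transformations are applied automatically at prediction time.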

Using accuracy alone#

Accuracy is not always the best metric, especially for imbalanced classification problems. Depending on the task, you may need precision, recall, F1-score, ROC-AUC, MAE, RMSE, or R².
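
A small constructed example shows the problem. Assume 95 negative and 5 positive samples, and a model that always predicts the majority class (DummyClassifier here, used purely as a stand-in for a useless model).

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Constructed imbalanced labels: 95 negatives, 5 positives
y_true = np.array([0] * 95 + [1] * 5)
X_dummy = np.zeros((100, 1))  # features are irrelevant for this baseline

# A "model" that always predicts the most frequent class
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_dummy, y_true)
y_hat = baseline.predict(X_dummy)

acc = accuracy_score(y_true, y_hat)
rec = recall_score(y_true, y_hat, zero_division=0)
f1 = f1_score(y_true, y_hat, zero_division=0)
print(acc, rec, f1)
```

Accuracy comes out at 0.95 even though the model never finds a single positive sample, so recall and F1 are both 0.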

Skipping cross-validation#

Testing on one train/test split can be misleading. Cross-validation gives you a more reliable estimate of how well your model generalizes.
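
A minimal sketch of cross-validation with cross_val_score follows. The data is synthetic, and the choices of 5 folds and the F1 metric are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary classification data, for illustration only
X_cv, y_cv = make_classification(n_samples=300, n_features=6, random_state=0)

# cv=5 trains and evaluates the model on 5 different train/validation splits
cv_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_cv, y_cv, cv=5, scoring="f1"
)

print(cv_scores)                          # one score per fold
print(cv_scores.mean(), cv_scores.std())  # average and spread across folds
```

The spread across folds gives a rough sense of how much a single split could mislead you.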

Practical recommendations#

If you're unsure where to start:

Identify whether the problem is classification or regression.
Build a simple baseline model.
Evaluate with the right metric.
Try a stronger model such as Random Forest.
Tune advanced models only after you understand the baseline.
Use cross-validation before trusting results.
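
The steps above can be sketched as a short script. It assumes a classification problem with synthetic data and F1 as the metric. A real project would substitute its own data and metric.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Step 1: a (synthetic) classification problem
X_p, y_p = make_classification(
    n_samples=400, n_features=10, n_informative=5, random_state=1
)

# Steps 2, 3 and 6: simple baseline, evaluated with a chosen metric and cross-validation
baseline_f1 = cross_val_score(
    LogisticRegression(max_iter=1000), X_p, y_p, cv=5, scoring="f1"
).mean()

# Step 4: a stronger model, evaluated in exactly the same way
forest_f1 = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=1), X_p, y_p, cv=5, scoring="f1"
).mean()

print("Baseline F1:", baseline_f1)
print("Random Forest F1:", forest_f1)
```

Tuning (step 5) only makes sense once these two numbers tell you how much room there is above the baseline.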

Final takeaway#

There is no single best scikit-learn algorithm for every problem. The best choice depends on your data, your goal, and your constraints.

Start simple, establish a baseline, and improve step by step. Model selection should be driven by evidence—not assumptions, popularity, or complexity.

What to learn next#

I hope this short tutorial and cheat sheet is helpful for your scikit-learn journey. These methods will make your data science work smoother and simpler as you continue to learn these powerful tools. There is still a lot to learn about Scikit-learn and the other Python ML libraries.

As you continue your Scikit-learn journey, here are the next algorithms and topics to learn:

Support Vector Machine
Random Forest
Cross-validation techniques
Grid search (GridSearchCV in sklearn.model_selection)
fit_transform
n_clusters
n_neighbors
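
As a preview of two of these topics together, here is a sketch of a grid search over n_neighbors for a KNN classifier. The data is synthetic, and the candidate values are arbitrary. The parameter name prefix comes from make_pipeline, which names each step after its lowercased class name.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data, for illustration only
X_g, y_g = make_classification(n_samples=300, n_features=6, random_state=0)

# KNN is sensitive to feature scale, so scale inside the pipeline
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())

# Candidate values to try for the n_neighbors hyperparameter
param_grid = {"kneighborsclassifier__n_neighbors": [3, 5, 7, 9]}

# Each candidate is scored with 5-fold cross-validation
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_g, y_g)

print(search.best_params_, search.best_score_)
```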

To advance your scikit-learn journey, Educative has created the course Hands-on Machine Learning with Scikit-Learn. With in-depth explanations of all the Scikit-learn basics and popular ML algorithms, this course offers everything you need in one place. By the end, you’ll know how and when to use each machine learning algorithm and will have the Scikit skills to stand out to any interviewer.

Happy learning!

Written By:

Aman Anand
