Dummy Estimators and Handling the Class Imbalance Problem

Explore how dummy estimators provide baseline models for classification tasks and understand the impact of imbalanced datasets on model performance. Learn techniques such as oversampling, undersampling, and cost-sensitive learning to effectively address class imbalance problems in machine learning.

Dummy estimators

Dummy estimators give us a baseline model for the problem at hand: any real model should at least beat them. We saw them in the case of regression problems, too. For classification, DummyClassifier supports the following strategies:

  • stratified: It predicts class labels at random, following the class distribution of the training set.

  • most_frequent: It always predicts the most common label in the training dataset.

  • prior: It predicts the class that maximizes the class prior.

  • uniform: It picks each prediction uniformly at random from the classes seen during training.

  • constant: It always predicts a constant label provided by the user.
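To make the list above concrete, here is a minimal sketch that fits one DummyClassifier per strategy on the iris dataset and prints its test accuracy. The split settings and the label 0 passed to the constant strategy are arbitrary choices for illustration, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, train_size=0.7
)

# Extra keyword arguments per strategy. Only "constant" needs one:
# the label to predict (0 is an arbitrary choice for this sketch).
strategies = {
    "stratified": {},
    "most_frequent": {},
    "prior": {},
    "uniform": {},
    "constant": {"constant": 0},
}

scores = {}
for name, extra in strategies.items():
    # random_state only matters for the random strategies
    # (stratified and uniform); it is harmless for the others.
    clf = DummyClassifier(strategy=name, random_state=0, **extra)
    clf.fit(X_train, y_train)
    scores[name] = clf.score(X_test, y_test)
    print("{0:>13}: {1:.2f}".format(name, scores[name]))
```

Since most_frequent and prior both predict the most common training label, they should report the same accuracy here.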

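Note that prior and most_frequent make identical predictions with predict: both always return the class that maximizes the class prior. They differ only in predict_proba: prior returns the class prior itself, whereas most_frequent returns a one-hot vector for the most frequent class.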
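The difference between the two strategies is easiest to see on a small example. The sketch below uses made-up, imbalanced toy labels (six samples of class 0, three of class 1, one of class 2); the feature matrix is a placeholder, since dummy estimators ignore the input features.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy labels invented for illustration: class frequencies 0.6, 0.3, 0.1.
y = np.array([0] * 6 + [1] * 3 + [2] * 1)
X = np.zeros((len(y), 1))  # Placeholder features; dummy estimators ignore them.

prior_clf = DummyClassifier(strategy="prior").fit(X, y)
freq_clf = DummyClassifier(strategy="most_frequent").fit(X, y)

# Both strategies predict the most frequent class (0).
prior_pred = prior_clf.predict(X[:1])
freq_pred = freq_clf.predict(X[:1])
print(prior_pred, freq_pred)

# prior exposes the empirical class distribution...
prior_proba = prior_clf.predict_proba(X[:1])
print(prior_proba)   # [[0.6 0.3 0.1]]

# ...while most_frequent puts all probability on the majority class.
freq_proba = freq_clf.predict_proba(X[:1])
print(freq_proba)    # [[1. 0. 0.]]
```

This matters when the baseline is scored with a probability-based metric such as log loss, where prior is usually the more sensible reference.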

Python 3.5
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, train_size=0.7)
# Fitting the baseline DummyClassifier
from sklearn.dummy import DummyClassifier
clf = DummyClassifier(strategy='most_frequent', random_state=0)
clf.fit(X_train, y_train)
print("The accuracy (DummyClassifier) on test set is {0:.2f}".format(clf.score(X_test, y_test)))
# Fitting the Support Vector Machine
from sklearn.svm import SVC
clf = SVC(kernel='linear', C=1).fit(X_train, y_train)
print("The accuracy (SVM) on test set is {0:.2f}".format(clf.score(X_test, y_test)))
  • On lines 1–2, we load the necessary modules.
  • On line 3, we load the iris dataset.
  • On line 4, we split the iris dataset into training and test datasets. Note that train_size=0.7 indicates that 70% of the rows are included in the training dataset and 30% in the test dataset.
  • On line 9
...