How to set weights for imbalanced classes

Overview

Classification is a supervised machine learning task in which each data point is assigned to one of a set of classes.

In binary classification, the data is divided into 2 classes. Often, far fewer samples are available for one class than for the other, which results in an imbalanced dataset.

The disadvantage of this is that the model learns mostly from the majority class and largely ignores the minority class. Although the model might report high accuracy, its ability to classify the minority class will be impaired.
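
To see why high accuracy can hide a poor minority-class result, here is a minimal sketch in plain Python. The labels are hypothetical (990 majority and 10 minority samples), and the "model" is an assumption made for illustration: it always predicts the majority class.

```python
# Hypothetical labels: 990 majority (0) and 10 minority (1) samples.
y_true = [0] * 990 + [1] * 10

# A stand-in "model" that ignores its input and always predicts class 0.
y_pred = [0] * len(y_true)

# Accuracy: fraction of predictions that match the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Minority recall: fraction of true minority samples that were found.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
minority_recall = true_positives / sum(t == 1 for t in y_true)

print(f"accuracy: {accuracy:.2%}")
print(f"minority recall: {minority_recall:.2%}")
```

The accuracy looks excellent even though not a single minority sample is detected, which is why accuracy alone is a weak guide on imbalanced data.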

How to set weights for imbalanced classes

One way to address class imbalance is to assign class weights. A larger weight on the minority class makes its misclassifications more costly during training, so the model pays more attention to the underlying patterns of that class and makes fewer errors on it.
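
As a rough illustration of how a class weight changes what the model is penalized for, the sketch below computes a weighted binary log loss by hand. It is a simplified picture of the general idea, not scikit-learn's actual implementation; the function name `weighted_log_loss`, the data, and the weight of 5 are all assumptions chosen for the example.

```python
import math

def weighted_log_loss(y_true, p_pred, class_weight):
    """Mean of per-sample log losses, each scaled by the weight of its class."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        # Standard binary log loss for one sample (p = predicted P(class 1)).
        loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))
        total += class_weight[y] * loss
    return total / len(y_true)

# Hypothetical data: three well-classified majority samples and
# one poorly classified minority sample.
y_true = [0, 0, 0, 1]
p_pred = [0.1, 0.1, 0.1, 0.2]

unweighted = weighted_log_loss(y_true, p_pred, {0: 1.0, 1: 1.0})
weighted = weighted_log_loss(y_true, p_pred, {0: 1.0, 1: 5.0})
print(f"unweighted: {unweighted:.3f}, weighted: {weighted:.3f}")
```

With the larger minority weight, the single minority error dominates the loss, so an optimizer minimizing it has a stronger incentive to fix that error.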

Use the built-in `class_weight` parameter

Many scikit-learn classifiers have a built-in parameter called `class_weight` that can be used to offset the class imbalance. Logistic regression is one such example. By default, this parameter is set to `None` (all classes get equal weight), but it can also be a dictionary mapping class labels to weights or the string `'balanced'`.

When set to `'balanced'`, the values of y (the target) are used to automatically set weights inversely proportional to the class frequencies in the input data:

n_samples / (n_classes * np.bincount(y))

Where:

n_samples is the total number of rows in the dataset.
n_classes is the number of classes in the dataset.
np.bincount(y) is an array holding the number of samples in each class, so the formula yields one weight per class.

For a dataset with 1000 rows and 2 classes, with 100 rows in the minority class and 900 in the majority class, the assigned weights are as follows:

$1000 / (2 \times 100) = 5$

$1000 / (2 \times 900) \approx 0.56$
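
The arithmetic above can be checked with a short sketch in plain Python. The helper name `balanced_class_weights` is made up for this example; it only re-implements the formula n_samples / (n_classes * count) and is not the library's own code.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Return {class: n_samples / (n_classes * count_of_that_class)}."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# The example from the text: 900 majority (0) and 100 minority (1) rows.
labels = [0] * 900 + [1] * 100
weights = balanced_class_weights(labels)
print({cls: round(w, 2) for cls, w in weights.items()})
```

In practice, scikit-learn provides `compute_class_weight` in `sklearn.utils.class_weight` for the same calculation, so a hand-written helper like this is mainly useful for understanding the formula.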

Use a customized weights function

Weights can also be assigned manually by passing a dictionary that maps each class label to a weight. This is especially useful if the results obtained with `'balanced'` weights are unsatisfactory.

Example

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
import numpy as np
import pandas as pd
from sklearn.metrics import f1_score, accuracy_score, confusion_matrix
X, y = make_classification(n_samples=100000, n_features=2, n_informative=2,
                           n_redundant=0, n_repeated=0, n_classes=2,
                           n_clusters_per_class=1,
                           weights=[0.995, 0.005],
                           class_sep=0.5, random_state=42)
# Convert the data from numpy array to a pandas dataframe
df = pd.DataFrame({'feature1': X[:, 0], 'feature2': X[:, 1], 'target': y})
# Show the class distribution as percentages, rounded to 1 decimal place
print(round(df.target.value_counts(normalize=True) * 100, 1))
X = df.drop(columns = 'target')
y = df['target']
X_train,X_test,y_train,y_test = train_test_split(X,
   y,test_size=0.2,random_state=42)
print(X_train.shape,y_train.shape)
model_1 = LogisticRegression(class_weight = 'balanced')
model_1.fit(X_train,y_train)
print(f'Accuracy score on balanced weights: {model_1.score(X_test,y_test)*100:.1f}%')
print(f'F1 score on balanced weights: {f1_score(y_test,model_1.predict(X_test)):.3f}')
conf_matrix_1=confusion_matrix(y_test,model_1.predict(X_test))
print(conf_matrix_1)

def class_weight(labels_dict,mu=0.15):
    # Log-scaled weights: ln(mu * total / class_count), with a floor of 1
    total = sum(labels_dict.values())
    weights = dict()
    for i in labels_dict:
        score = np.log((mu*total)/float(labels_dict[i]))
        # Classes whose score is not above 1 keep a weight of 1
        weights[i] = score if score > 1 else 1
    return weights

# Count classes on the training labels only, so the test set does not influence the weights
labels_dict = y_train.value_counts().to_dict()
weights = class_weight(labels_dict)

print('labels dictionary: ', labels_dict)
print('weights: ',weights)


model = LogisticRegression(class_weight = weights)
model.fit(X_train,y_train)
print(f'Accuracy score on manual weights: {model.score(X_test,y_test)*100:.1f}%')
print(f'F1 score on manual weights: {f1_score(y_test,model.predict(X_test)):.2f}')
conf_matrix = confusion_matrix(y_test,model.predict(X_test))
print(conf_matrix)
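
To make the custom rule in `class_weight` above easier to follow, here is a small standalone sketch of the same formula using only the standard library. The name `log_class_weights` and the class counts are hypothetical; they are not the actual output of the example.

```python
import math

def log_class_weights(label_counts, mu=0.15):
    """Same rule as class_weight above: max(1, ln(mu * total / count))."""
    total = sum(label_counts.values())
    return {
        label: max(1.0, math.log(mu * total / count))
        for label, count in label_counts.items()
    }

# Hypothetical counts, chosen only to illustrate the formula.
counts = {0: 99000, 1: 1000}
weights = log_class_weights(counts)
print(weights)
```

The majority class ends up at the floor of 1, while the minority class gets a weight of roughly ln(15). Because of the logarithm, this weight grows much more slowly than the inverse-frequency weight that `'balanced'` would assign, and `mu` is a tuning knob that shifts all scores up or down.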

Explanation

The code above generates an imbalanced two-class dataset using make_classification, converts the data into a data frame, splits it into training and testing sets, and trains a logistic regression model with two different class weight settings.

The first model uses balanced class weights and obtains 80% accuracy but a low f1_score of 0.057.

The second model passes the dictionary returned by the custom function as the class_weight parameter and obtains 99% accuracy and an improved f1_score of 0.39. The closer the f1_score is to 1, the better the model balances precision and recall on the minority class.
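
The sketch below shows how an F1 score follows from a 2x2 confusion matrix laid out as `[[tn, fp], [fn, tp]]`, which is the layout `confusion_matrix` uses for labels 0 and 1. The two matrices are hypothetical and are not the matrices printed by the example; they only illustrate one way a model can find most minority samples and still score a low F1, namely by raising many false alarms.

```python
def f1_from_confusion_matrix(conf_matrix):
    """F1 for the positive class from a 2x2 matrix [[tn, fp], [fn, tp]]."""
    (tn, fp), (fn, tp) = conf_matrix
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts, chosen only to illustrate the calculation.
many_false_alarms = [[800, 190], [2, 8]]    # high recall, very low precision
fewer_false_alarms = [[985, 5], [6, 4]]     # lower recall, higher precision

print(round(f1_from_confusion_matrix(many_false_alarms), 3))
print(round(f1_from_confusion_matrix(fewer_false_alarms), 3))
```

Comparing the printed confusion matrices of the two models in the same way helps explain where a difference in f1_score comes from.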
