Related Tags

python

# How to set weights for imbalanced classes

Joy Kareko


### Overview

Classification is a supervised machine learning task in which each data point is assigned to one of a set of classes.

In binary classification, there are 2 classes. Often, far fewer examples are available for one class than for the other, which results in an imbalanced dataset.

The disadvantage of this is that the model learns mostly from the majority class and largely overlooks the minority class. Although the model might report high accuracy, its ability to classify the minority class will be impaired.
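To make the accuracy problem concrete, here is a minimal sketch that assumes a hypothetical dataset of 900 majority and 100 minority labels and a "model" that always predicts the majority class. The label counts and variable names are illustrative only.

```python
# Hypothetical labels: 900 majority (0) and 100 minority (1) samples
y_true = [0] * 900 + [1] * 100

# A degenerate "model" that always predicts the majority class
y_pred = [0] * len(y_true)

# Fraction of all samples predicted correctly
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Fraction of minority samples that were actually found (recall for class 1)
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
minority_recall = true_positives / sum(t == 1 for t in y_true)

print(f"accuracy: {accuracy:.2f}, minority recall: {minority_recall:.2f}")
# prints: accuracy: 0.90, minority recall: 0.00
```

The accuracy looks respectable even though not a single minority sample is identified.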

### How to set weights for imbalanced classes

One way to address class imbalance is to assign class weights. A larger weight on the minority class makes its misclassifications more costly during training, so the model pays more attention to the underlying patterns of that class and makes fewer errors on it.
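The effect of a class weight on training can be sketched with a weighted log loss. This is an illustration of the general idea, not scikit-learn's internal implementation; the function name `weighted_log_loss` and the toy probabilities are assumptions made for the example.

```python
import math

def weighted_log_loss(y_true, p_pred, class_weight):
    """Average log loss, with each sample scaled by the weight of its true class.

    p_pred holds the predicted probability of class 1 for each sample.
    """
    total = 0.0
    for y, p in zip(y_true, p_pred):
        # Probability the model assigned to the sample's true class
        p_true_class = p if y == 1 else 1.0 - p
        total += -class_weight[y] * math.log(p_true_class)
    return total / len(y_true)

# Toy data: three majority samples predicted well, one minority sample predicted poorly
y_true = [0, 0, 0, 1]
p_pred = [0.1, 0.2, 0.1, 0.3]

unweighted = weighted_log_loss(y_true, p_pred, {0: 1.0, 1: 1.0})
weighted = weighted_log_loss(y_true, p_pred, {0: 1.0, 1: 5.0})

print(f"unweighted: {unweighted:.3f}, weighted: {weighted:.3f}")
```

With a weight of 5 on class 1, the single poorly predicted minority sample dominates the loss, so an optimizer has a stronger incentive to fix it.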

#### Use the built-in class weight parameter

Many scikit-learn classifiers have a built-in parameter called class_weight that can be used to offset class imbalance. Logistic regression is one such example. By default, this parameter is set to None, which gives every class a weight of 1, but it can also take a dictionary that maps class labels to weights or the string 'balanced'.

When set to 'balanced', the values of y (the target) are used to automatically adjust weights inversely proportional to class frequencies in the input data as:

n_samples / (n_classes * np.bincount(y))



Where:

• n_samples is the total number of rows in the dataset.
• n_classes is the number of classes in the dataset.
• np.bincount(y) is an array holding the number of samples in each class, so the formula yields one weight per class.

For a dataset with 1000 rows and 2 classes, made up of 100 minority and 900 majority samples, the weights assigned to the minority and majority class respectively are:

$1000 / (2 \times 100) = 5$

$1000 / (2 \times 900) \approx 0.56$
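The same arithmetic can be checked with a few lines of plain Python. This is a sketch of the formula only; the helper name `balanced_class_weights` is hypothetical, and `collections.Counter` stands in for np.bincount so that the snippet has no dependencies.

```python
from collections import Counter

def balanced_class_weights(y):
    """Apply n_samples / (n_classes * count) to every class found in y."""
    counts = Counter(y)          # samples per class
    n_samples = len(y)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in counts.items()}

# 100 minority (1) and 900 majority (0) samples, as in the example above
y = [1] * 100 + [0] * 900
weights = balanced_class_weights(y)

print({label: round(w, 2) for label, w in sorted(weights.items())})
# prints: {0: 0.56, 1: 5.0}
```

In practice, scikit-learn computes these values itself when class_weight is set to 'balanced', and `sklearn.utils.class_weight.compute_class_weight` exposes the same calculation directly.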

#### Use a customized weights function

Weights can also be assigned manually by passing a dictionary that maps each class label to its weight. This is especially useful if the results of the previous method are unsatisfactory.

### Example

    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.datasets import make_classification
    import numpy as np
    import pandas as pd
    from sklearn.metrics import f1_score, accuracy_score, confusion_matrix

    # Generate an imbalanced two-class dataset
    X, y = make_classification(n_samples=100000, n_features=2, n_informative=2,
                               n_redundant=0, n_repeated=0, n_classes=2,
                               n_clusters_per_class=1,
                               weights=[0.995, 0.005],
                               class_sep=0.5, random_state=42)

    # Convert the data from numpy array to a pandas dataframe
    df = pd.DataFrame({'feature1': X[:, 0], 'feature2': X[:, 1], 'target': y})

    # Show the class distribution as percentages, rounded to 1 decimal place
    print((df.target.value_counts(normalize=True) * 100).round(1))

    X = df.drop(columns='target')
    y = df['target']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    print(X_train.shape, y_train.shape)

    # Train with automatically balanced class weights
    model_1 = LogisticRegression(class_weight='balanced')
    model_1.fit(X_train, y_train)
    print(f'Accuracy score on balanced weights: {model_1.score(X_test, y_test)*100:.1f}%')
    print(f'F1 score on balanced weights: {f1_score(y_test, model_1.predict(X_test)):.3f}')
    conf_matrix_1 = confusion_matrix(y_test, model_1.predict(X_test))
    print(conf_matrix_1)
Use balanced weights

    # Continues from the previous example: reuses np, y, X_train, X_test,
    # y_train, y_test, LogisticRegression, f1_score, and confusion_matrix
    def class_weight(labels_dict, mu=0.15):
        """Return log-scaled class weights with a floor of 1."""
        total = sum(labels_dict.values())
        keys = labels_dict.keys()
        weights = dict()
        for i in keys:
            # Rarer classes get a larger score
            score = np.log((mu * total) / float(labels_dict[i]))
            weights[i] = score if score > 1 else 1
        return weights

    # Count the samples per class and derive the weights
    labels_dict = y.value_counts().to_dict()
    weights = class_weight(labels_dict)

    print('labels dictionary: ', labels_dict)
    print('weights: ', weights)

    # Train with the manually computed weights
    model = LogisticRegression(class_weight=weights)
    model.fit(X_train, y_train)
    print(f'Accuracy score on manual weights: {model.score(X_test, y_test)*100:.1f}%')
    print(f'F1 score on manual weights: {f1_score(y_test, model.predict(X_test)):.2f}')
    conf_matrix = confusion_matrix(y_test, model.predict(X_test))
    print(conf_matrix)

Use a function to manually set weights
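To see how the log-scaled weighting behaves, the sketch below re-implements the same formula with only the standard library and applies it to two hypothetical sets of class counts. The name `log_smoothed_weights` and both count dictionaries are assumptions for illustration; they are not the counts produced by make_classification above.

```python
import math

def log_smoothed_weights(class_counts, mu=0.15):
    """weight = max(1, ln(mu * total / count)) for each class."""
    total = sum(class_counts.values())
    return {label: max(1.0, math.log(mu * total / count))
            for label, count in class_counts.items()}

# Mild imbalance (90% / 10%): both scores fall below the floor of 1
mild = log_smoothed_weights({0: 900, 1: 100})

# Severe imbalance (99.5% / 0.5%): only the minority class is boosted
severe = log_smoothed_weights({0: 99500, 1: 500})

print({k: round(v, 2) for k, v in mild.items()})    # prints: {0: 1.0, 1: 1.0}
print({k: round(v, 2) for k, v in severe.items()})  # prints: {0: 1.0, 1: 3.4}
```

Because of the logarithm and the floor of 1, this scheme produces much gentler weights than the 'balanced' setting, and the mu parameter controls how rare a class must be before it receives any extra weight.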

### Explanation

The code above generates an imbalanced two-class dataset with make_classification, converts the data into a data frame, splits it into training and testing sets, and trains a logistic regression model with two different class-weight settings.

The first example uses balanced class weights and obtains 80% accuracy but a low f1_score of 0.057.

The second example passes a dictionary as the class_weight parameter and obtains 99% accuracy and an improved f1_score of 0.39. The closer f1_score is to 1, the better the model balances precision and recall on the minority class.
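The relationship between accuracy and the F1 score can be worked through from a confusion matrix. The counts below are hypothetical and chosen only for illustration; they are not the output of the models above.

```python
# Hypothetical confusion-matrix counts for a test set of 20,000 samples
tn, fp = 19700, 100   # actual majority class: correctly kept / wrongly flagged
fn, tp = 120, 80      # actual minority class: missed / correctly found

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # how many flagged samples are truly minority
recall = tp / (tp + fn)      # how many minority samples were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
# prints: accuracy=0.989 precision=0.444 recall=0.400 f1=0.421
```

Accuracy is dominated by the many true negatives, while the F1 score depends only on how the minority class is handled, which is why the two metrics can diverge so sharply on imbalanced data.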

Note: There are other ways of dealing with class imbalance such as oversampling, undersampling, and data augmentation.
