Joy Kareko

**Classification** is a supervised machine learning task in which data is assigned to predefined classes.

In binary classification, the data is divided into two classes. However, one class often has far fewer samples than the other, resulting in an imbalanced dataset.

The disadvantage of this is that the model learns mostly from the majority class and may miss patterns in the minority class. Although the model might report high accuracy, its ability to classify the minority class will be impaired.
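To make the point about misleading accuracy concrete, here is a minimal sketch. The labels are hypothetical (990 majority samples, 10 minority samples), and the "model" is assumed to do nothing except predict the majority class.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 990 majority-class (0) samples and 10 minority-class (1) samples.
y_true = np.array([0] * 990 + [1] * 10)

# A degenerate "model" that always predicts the majority class.
y_pred = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_pred)
# Recall on class 1: the fraction of true minority samples that were found.
minority_recall = recall_score(y_true, y_pred)

print(f"Accuracy: {accuracy:.2f}")
print(f"Minority recall: {minority_recall:.2f}")
```

The accuracy is high only because the majority class dominates the count; none of the minority samples are identified.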

One way to address class imbalance is to assign class weights. Weights make errors on the minority class count for more during training, so the model pays more attention to that class's underlying patterns and misclassifies it less often.

The `class_weight` parameter

Many scikit-learn classifiers have a built-in parameter called `class_weight` that can be used to offset the class imbalance. Logistic regression is one such example. By default, this parameter is set to `None`, but it can also take a dictionary or the string `balanced`.
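The three forms of `class_weight` described above can be sketched as follows. The dataset, the variable names, and the manual weights `{0: 1, 1: 5}` are illustrative assumptions, not values from this article.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Small synthetic dataset requested as roughly 90% class 0 and 10% class 1
# (illustrative values only).
X, y = make_classification(
    n_samples=1000, n_features=4, weights=[0.9, 0.1], random_state=0
)

# The three accepted forms: None (default), the string 'balanced', or a dict.
settings = {
    "default": None,
    "balanced": "balanced",
    "manual": {0: 1, 1: 5},  # hypothetical hand-picked weights
}

predicted_minority = {}
for name, class_weight in settings.items():
    model = LogisticRegression(class_weight=class_weight).fit(X, y)
    # Count how many samples each model assigns to the minority class.
    predicted_minority[name] = int(model.predict(X).sum())
    print(f"{name}: predicted {predicted_minority[name]} samples as class 1")
```

Comparing the counts gives a quick feel for how strongly each setting pushes the model toward the minority class.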

When set to `balanced`, the values of `y` (target) are used to automatically adjust weights inversely proportional to class frequencies in the input data as:

```
n_samples / (n_classes * np.bincount(y))
```

Where:

- `n_samples` is the total number of rows in the dataset.
- `n_classes` is the number of classes in the dataset.
- `np.bincount(y)` is an array holding the number of samples in each class.

For a dataset with 1000 rows and 2 classes, with 100 rows in the minority class and 900 in the majority class, the assigned weights are as follows:

$1000 / (2 \times 100) = 5$

$1000 / (2 \times 900) \approx 0.56$
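The arithmetic above can be checked with a short sketch. It assumes labels `0` (majority, 900 rows) and `1` (minority, 100 rows), and compares the formula with scikit-learn's `compute_class_weight` helper.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Labels matching the worked example: 900 samples of class 0, 100 of class 1.
y = np.array([0] * 900 + [1] * 100)

n_samples = len(y)
n_classes = len(np.unique(y))

# np.bincount returns one count per class, here [900, 100].
manual_weights = n_samples / (n_classes * np.bincount(y))

# scikit-learn's helper applies the same 'balanced' heuristic.
sklearn_weights = compute_class_weight(
    class_weight="balanced", classes=np.unique(y), y=y
)

print(manual_weights)
print(sklearn_weights)
```

Both arrays list the weight for class `0` first (about 0.556) and the weight for class `1` second (5.0).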

Weights can also be assigned manually by passing a dictionary, especially if the results of the `balanced` setting are unsatisfactory.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.model_selection import train_test_split

# Generate a heavily imbalanced two-class dataset
X, y = make_classification(
    n_samples=100000, n_features=2, n_informative=2,
    n_redundant=0, n_repeated=0, n_classes=2,
    n_clusters_per_class=1, weights=[0.995, 0.005],
    class_sep=0.5, random_state=42,
)

# Convert the data from NumPy arrays to a pandas DataFrame
df = pd.DataFrame({'feature1': X[:, 0], 'feature2': X[:, 1], 'target': y})

# Class distribution as percentages, rounded to one decimal place
print((df.target.value_counts(normalize=True) * 100).round(1))

X = df.drop(columns='target')
y = df['target']

# Hold out 20% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, y_train.shape)

# Train with weights computed automatically from class frequencies
model_1 = LogisticRegression(class_weight='balanced')
model_1.fit(X_train, y_train)

y_pred_1 = model_1.predict(X_test)
print(f'Accuracy score on balanced weights: {model_1.score(X_test, y_test) * 100:.1f}%')
print(f'F1 score on balanced weights: {f1_score(y_test, y_pred_1):.3f}')

conf_matrix_1 = confusion_matrix(y_test, y_pred_1)
print(conf_matrix_1)
```

Use balanced weights

```python
# Continues from the previous example: reuses np, y, the train/test split,
# LogisticRegression, f1_score, and confusion_matrix.

def class_weight(labels_dict, mu=0.15):
    """Compute a log-scaled weight per class, with a minimum weight of 1."""
    total = sum(labels_dict.values())
    keys = labels_dict.keys()
    weights = dict()
    for i in keys:
        score = np.log((mu * total) / float(labels_dict[i]))
        weights[i] = score if score > 1 else 1
    return weights

# Map each class label to its sample count
labels_dict = y.value_counts().to_dict()
weights = class_weight(labels_dict)
print('labels dictionary: ', labels_dict)
print('weights: ', weights)

# Train with the manually computed weights
model = LogisticRegression(class_weight=weights)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print(f'model_score for manual weights,{model.score(X_test, y_test) * 100:.1f}%')
print(f'F1_score for manual weights {f1_score(y_test, y_pred):.2f}')

conf_matrix = confusion_matrix(y_test, y_pred)
print(conf_matrix)
```

Use a function to manually set weights
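To see what the log-scaled weighting does to individual classes, the sketch below restates the same formula so that it runs on its own. The function name `log_class_weights` and the class counts are hypothetical and chosen only for illustration.

```python
import math

def log_class_weights(label_counts, mu=0.15):
    """Return ln(mu * total / count) per class, with a minimum weight of 1."""
    total = sum(label_counts.values())
    weights = {}
    for label, count in label_counts.items():
        score = math.log(mu * total / count)
        weights[label] = score if score > 1 else 1
    return weights

# Hypothetical counts: 99,000 majority samples and 1,000 minority samples.
label_counts = {0: 99_000, 1: 1_000}

# A larger mu increases the weight given to rare classes.
for mu in (0.15, 0.5):
    print(mu, log_class_weights(label_counts, mu))
```

With these counts, the majority class stays at the minimum weight of 1, while the minority class receives a weight of about 2.71 for `mu=0.15`. The logarithm keeps the minority weight far smaller than the raw frequency ratio would.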

The code above generates an imbalanced dataset with two classes using `make_classification`, converts the data into a DataFrame, splits it into training and test sets, and trains a logistic regression model with different class weights.

The first example uses `balanced` class weights and obtains 80% accuracy but a low `f1_score` of `0.057`.

The second example passes a dictionary as the `class_weight` parameter and obtains 99% accuracy and an improved `f1_score` of `0.39`. The closer the `f1_score` is to `1`, the better the model balances precision and recall on the minority class.
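As a reminder of what `f1_score` measures, the sketch below derives it from a confusion matrix. The labels and predictions are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
from sklearn.metrics import confusion_matrix, f1_score

# Hypothetical labels and predictions, used only to illustrate the arithmetic.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that were found
f1_manual = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f}, recall={recall:.2f}, F1={f1_manual:.2f}")
print(f"f1_score from scikit-learn: {f1_score(y_true, y_pred):.2f}")
```

Because F1 is the harmonic mean of precision and recall, it ignores true negatives. That is why it can stay low on an imbalanced dataset even when accuracy looks high.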

Note: There are other ways of dealing with class imbalance, such as oversampling, undersampling, and data augmentation.
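As a rough illustration of one of these alternatives, random oversampling can be sketched with `sklearn.utils.resample`. The data and variable names here are hypothetical. This is one simple way to oversample, not a recommendation over class weights.

```python
import numpy as np
from sklearn.utils import resample

rng = np.random.default_rng(0)

# Hypothetical imbalanced data: 95 majority rows and 5 minority rows.
X = rng.normal(size=(100, 2))
y = np.array([0] * 95 + [1] * 5)

X_major, X_minor = X[y == 0], X[y == 1]

# Sample the minority rows with replacement until they match the majority count.
X_minor_upsampled = resample(
    X_minor, replace=True, n_samples=len(X_major), random_state=0
)

X_balanced = np.vstack([X_major, X_minor_upsampled])
y_balanced = np.array([0] * len(X_major) + [1] * len(X_minor_upsampled))

print(np.bincount(y_balanced))
```

In practice, resampling is usually applied to the training split only, so that the test set keeps the original class distribution.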

CONTRIBUTOR

Joy Kareko

Copyright ©2022 Educative, Inc. All rights reserved
