
Undersampling

This lesson covers several ways to balance a dataset with undersampling, a technique that reduces the number of majority-class samples, using code snippets and illustrations.

Python 3.8

import os
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler
# generate an imbalanced dataset: about 95% of the samples belong to class 0
x_data, y_data = make_classification(n_samples=10000, n_features=3, n_redundant=0, n_clusters_per_class=1, weights=[0.95], flip_y=0, random_state=1)
data = pd.DataFrame({'Feature 1': x_data[:, 0], 'Feature 2': x_data[:, 1], 'Feature 3': x_data[:, 2], 'Target': y_data})
# check the class distribution before resampling
print(data['Target'].value_counts())
# create the random undersampler; random_state makes the result reproducible
under_sampler = RandomUnderSampler(random_state=1)
# fit the sampler and resample the data in a single step
x_data_under, y_data_under = under_sampler.fit_resample(x_data, y_data)
print('shape of the previous dataset: ', x_data.shape)
print('shape of the new dataset: ', x_data_under.shape)
# make sure the output directory exists before saving the figures
os.makedirs('output', exist_ok=True)
# show the result visually (x and y are passed as keywords, as recent seaborn versions require)
sns.scatterplot(x=x_data_under[:, 0], y=x_data_under[:, 1], hue=y_data_under)
plt.title("RandomUnderSampler", size=24)
plt.savefig('output/random.png')
# plot the original data
plt.clf()
sns.scatterplot(x=x_data[:, 0], y=x_data[:, 1], hue=y_data)
plt.title("Original Data", size=24)
plt.savefig('output/original.png')
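
To make the idea behind random undersampling concrete, here is a minimal NumPy sketch of the core step: keep a random subset of every class, sized to the smallest class. The helper name random_undersample and the toy arrays are hypothetical and exist only for illustration. This is not how imblearn implements RandomUnderSampler, just one way to express the same idea.

```python
import numpy as np

def random_undersample(X, y, seed=0):
    """Hypothetical helper: randomly keep as many rows per class as the smallest class has."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()  # size of the minority class
    kept = []
    for label in classes:
        idx = np.flatnonzero(y == label)  # row positions of this class
        kept.append(rng.choice(idx, size=n_keep, replace=False))
    kept = np.sort(np.concatenate(kept))  # preserve the original row order
    return X[kept], y[kept]

# Toy data (hypothetical): 17 majority rows and 3 minority rows
X = np.arange(40, dtype=float).reshape(20, 2)
y = np.array([0] * 17 + [1] * 3)

X_bal, y_bal = random_undersample(X, y)
print(np.bincount(y_bal))  # → [3 3]
```

Every minority row survives, while 14 of the 17 majority rows are discarded. That loss of information is the trade-off to keep in mind when undersampling.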

Here are the main parameters you can pass to the RandomUnderSampler object (most of them also apply to the other undersamplers in the imblearn library):

sampling_strategy: Defines how undersampling is performed on the dataset. 'majority' resamples only the majority class, 'not minority' resamples all classes except the minority class, and 'auto', the default, is equivalent to 'not minority'.
return_indices: A Boolean that controls whether the indices of the selected records are returned as well. Recent imbalanced-learn releases removed this parameter; the fitted sampler exposes the same information through its sample_indices_ attribute.
random_state: An integer that controls the randomness of the method, allowing you to reproduce the results.

NearMiss undersampling

Near Miss refers to a collection of undersampling methods that select ...
