Undersampling

This lesson covers several ways to balance a dataset using a technique called undersampling, with code snippets and a few illustrations along the way.

Random undersampling

Random undersampling involves randomly selecting examples from the majority class and deleting them from the training dataset. Because data is discarded, this may lead to information loss. However, if the majority-class examples lie close to one another in terms of distance, so that many of them carry similar information, the examples that remain can still represent the class well and the method might yield good results.
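To make the idea concrete, here is a minimal from-scratch sketch of random undersampling that uses only the Python standard library. The function name random_undersample, the seed argument, and the toy data are assumptions made for this illustration; this is not how imblearn implements it internally.

```python
import random
from collections import Counter


def random_undersample(rows, labels, seed=0):
    """Randomly drop rows until every class is as small as the rarest one."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = min(counts.values())  # size of the minority class
    kept = []
    for cls in counts:
        idx = [i for i, label in enumerate(labels) if label == cls]
        # sample without replacement; the minority class is kept whole
        kept.extend(rng.sample(idx, target))
    kept.sort()  # preserve the original row order
    return [rows[i] for i in kept], [labels[i] for i in kept]


# toy data: 8 majority examples (class 0) and 2 minority examples (class 1)
rows = [[float(i)] for i in range(10)]
labels = [0] * 8 + [1] * 2

rows_under, labels_under = random_undersample(rows, labels)
print('before:', Counter(labels))
print('after: ', Counter(labels_under))
```

After resampling, both classes have as many examples as the minority class had originally, and six of the eight majority examples are gone. That is the information loss mentioned above.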

Run the following code:

Python 3.8
import os
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler
# generate an imbalanced dataset (about 95% of the samples fall in class 0)
x_data, y_data = make_classification(n_samples=10000, n_features=3, n_redundant=0, n_clusters_per_class=1, weights=[0.95], flip_y=0, random_state=1)
data = pd.DataFrame({'Feature 1': x_data[:, 0], 'Feature 2': x_data[:, 1], 'Feature 3': x_data[:, 2], 'Target': y_data})
# create the random undersampler (fixed seed so the result is reproducible)
under_sampler = RandomUnderSampler(random_state=1)
# fit the sampler and resample the data
x_data_under, y_data_under = under_sampler.fit_resample(x_data, y_data)
print('shape of the previous dataset: ', x_data.shape)
print('shape of the new dataset: ', x_data_under.shape)
# make sure the output folder exists before saving figures
os.makedirs('output', exist_ok=True)
# show the result visually (recent seaborn versions require x and y as keyword arguments)
sns.scatterplot(x=x_data_under[:, 0], y=x_data_under[:, 1], hue=y_data_under)
plt.title("RandomUnderSampler", size=24)
plt.savefig('output/random.png')
# clear the figure, then plot the original data
plt.clf()
sns.scatterplot(x=x_data[:, 0], y=x_data[:, 1], hue=y_data)
plt.title("Original Data", size=24)
plt.savefig('output/original.png')

Here are the main parameters you can pass to the RandomUnderSampler object (sampling_strategy is also accepted by the other samplers in the imblearn library):

  • sampling_strategy: Defines how undersampling is performed on the dataset. Use 'majority' to resample only the majority class, or 'not minority' to resample all classes except the minority class. The default, 'auto', is equivalent to 'not minority' for undersamplers. Note that the value is written with a space, not an underscore.
  • return_indices: In older versions of imblearn, a Boolean indicating whether to return the indices of the selected (kept) records. Recent versions have removed this parameter; read the sample_indices_ attribute of the fitted sampler instead.
  • random_state: An integer that controls the randomness of the method, allowing you to reproduce the results.

NearMiss undersampling

Near Miss refers to a collection of undersampling methods that select ...
