
Oversampling

This lesson covers oversampling, another technique for handling imbalanced datasets. We will also look at the different ways it can be achieved, with code snippets to try out in practice.

Random oversampling

Random oversampling involves randomly selecting examples from the minority class and adding them to the training dataset. Simply put, it replicates random minority class examples. Because it only duplicates existing examples rather than adding new information, it increases the risk of overfitting, which is its main weakness.

Run this code snippet to apply random oversampling:

Python 3.8
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler

# generate an imbalanced dataset (about 95% of samples in the majority class)
x_data, y_data = make_classification(n_samples=10000, n_features=3, n_redundant=0, n_clusters_per_class=1, weights=[0.95], flip_y=0, random_state=1)
data = pd.DataFrame({'Feature 1': x_data[:, 0], 'Feature 2': x_data[:, 1], 'Feature 3': x_data[:, 2], 'Target': y_data})

# create the RandomOverSampler object
over_sampler = RandomOverSampler()

# fit the object to the data and resample it
x_data_over, y_data_over = over_sampler.fit_resample(x_data, y_data)
print('shape of the previous dataset: ', x_data.shape)
print('shape of the new dataset: ', x_data_over.shape)

# show the resampled data visually
sns.scatterplot(x=x_data_over[:, 0], y=x_data_over[:, 1], hue=y_data_over)
plt.title("RandomOverSampler", size=24)
plt.savefig('output/random.png')

# plot the original data for comparison
plt.clf()
sns.scatterplot(x=x_data[:, 0], y=x_data[:, 1], hue=y_data)
plt.title("Original Data", size=24)
plt.savefig('output/original.png')
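
To verify that random oversampling balanced the classes, you can compare the class counts before and after resampling. The snippet below is a minimal check that assumes the y_data and y_data_over variables from the code above:

Python 3.8
from collections import Counter
# class distribution before and after random oversampling
print('original class counts:    ', Counter(y_data))
print('oversampled class counts: ', Counter(y_data_over))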

SMOTE oversampling

SMOTE stands for Synthetic Minority Oversampling Technique.
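
Before going into the details, here is a minimal sketch of how SMOTE can be applied with imbalanced-learn. It reuses the x_data and y_data generated in the snippet above; the variable names smote_sampler, x_data_smote, and y_data_smote are illustrative:

Python 3.8
from imblearn.over_sampling import SMOTE
# create the SMOTE object
smote_sampler = SMOTE(random_state=1)
# fit the object to the data and generate synthetic minority examples
x_data_smote, y_data_smote = smote_sampler.fit_resample(x_data, y_data)
print('shape of the previous dataset: ', x_data.shape)
print('shape of the SMOTE dataset: ', x_data_smote.shape)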
