Oversampling
This lesson covers another technique for handling imbalanced datasets: oversampling. We will look at several ways to perform it and see code snippets that put each one into practice.
Random oversampling
Random oversampling involves randomly selecting examples from the minority class and adding them to the training dataset; in other words, it duplicates randomly chosen minority class examples. Because the duplicated examples carry no new information, this technique can increase the risk of overfitting, which is its main weakness.
Run this code snippet to apply random oversampling:
Python 3.8
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler

# generate an imbalanced dataset (95% majority class)
x_data, y_data = make_classification(n_samples=10000, n_features=3,
                                     n_redundant=0, n_clusters_per_class=1,
                                     weights=[0.95], flip_y=0, random_state=1)
data = pd.DataFrame({'Feature 1': x_data[:, 0],
                     'Feature 2': x_data[:, 1],
                     'Feature 3': x_data[:, 2],
                     'Target': y_data})

# create the over-sampler object
over_sampler = RandomOverSampler()

# fit the object to the training data and resample it
x_data_over, y_data_over = over_sampler.fit_resample(x_data, y_data)

print('shape of the previous dataset: ', x_data.shape)
print('shape of the new dataset: ', x_data_over.shape)

# show the resampled data visually
sns.scatterplot(x=x_data_over[:, 0], y=x_data_over[:, 1], hue=y_data_over)
plt.title("RandomOverSampler", size=24)
plt.savefig('output/random.png')

# plot the original data for comparison
plt.clf()
sns.scatterplot(x=x_data[:, 0], y=x_data[:, 1], hue=y_data)
plt.title("Original Data", size=24)
plt.savefig('output/original.png')
SMOTE oversampling
SMOTE stands for Synthetic Minority Oversampling Technique. Rather than duplicating existing minority examples, SMOTE creates new synthetic ones by interpolating between a minority class example and one of its nearest minority class neighbors.