Oversampling
This lesson covers another technique for handling imbalanced datasets: oversampling. We will look at several ways to perform it and see code snippets that put each one into practice.
Random oversampling
Random oversampling involves randomly selecting examples from the minority class and adding them to the training dataset; in other words, it duplicates randomly chosen minority class examples. Because the duplicated examples carry no new information, this technique can increase the risk of overfitting, which is its main weakness.
Run this code snippet to apply random oversampling:
Python 3.8
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler

# generate an imbalanced dataset (95% majority class)
x_data, y_data = make_classification(n_samples=10000, n_features=3,
                                     n_redundant=0, n_clusters_per_class=1,
                                     weights=[0.95], flip_y=0, random_state=1)
data = pd.DataFrame({'Feature 1': x_data[:, 0],
                     'Feature 2': x_data[:, 1],
                     'Feature 3': x_data[:, 2],
                     'Target': y_data})

# create the over-sampler object
over_sampler = RandomOverSampler()

# fit the object to the training data and resample it
x_data_over, y_data_over = over_sampler.fit_resample(x_data, y_data)

print('shape of the previous dataset: ', x_data.shape)
print('shape of the new dataset: ', x_data_over.shape)

# show the resampled data visually
sns.scatterplot(x=x_data_over[:, 0], y=x_data_over[:, 1], hue=y_data_over)
plt.title("RandomOverSampler", size=24)
plt.savefig('output/random.png')

# plot the original data for comparison
plt.clf()
sns.scatterplot(x=x_data[:, 0], y=x_data[:, 1], hue=y_data)
plt.title("Original Data", size=24)
plt.savefig('output/original.png')
SMOTE oversampling
SMOTE stands for Synthetic Minority Oversampling Technique. Rather than duplicating existing minority examples, SMOTE creates new synthetic ones by interpolating between a minority class example and one of its nearest minority class neighbors.