Exercise: Generating and Modeling Synthetic Classification Data
Learn how overfitting happens by using a synthetic dataset with many candidate features and relatively few samples.
Overfitting in binary classification
Consider a situation in which you are given a binary classification dataset with many candidate features (200), and you don't have time to look through all of them individually. It's possible that some of these features are highly correlated or related in some other way, but with this many variables it's difficult to explore them all effectively. Additionally, the dataset has relatively few samples: only 1,000. We are going to generate this challenging dataset using a feature of scikit-learn that lets you create synthetic datasets for conceptual explorations such as this. Perform the following steps to complete the exercise:
- Import the make_classification, train_test_split, LogisticRegression, and roc_auc_score functions and classes using the following code:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
# numpy and matplotlib are used in later steps; import them here if not already loaded
import numpy as np
import matplotlib.pyplot as plt
Notice that we've imported several familiar items from scikit-learn, in addition to a new one that we haven't seen before: make_classification. This function does just what its name indicates: it makes data for a classification problem. Using its various keyword arguments, you can specify how many samples and features to include, and how many classes the response variable will have. There is also a range of other options that effectively control how "easy" the problem will be to solve.

Note: For more information, refer to the scikit-learn documentation on make_classification.

Suffice to say that we've selected options here that make a reasonably easy-to-solve problem, with some curveballs thrown in. In other words, we expect high model performance, but we'll have to work a little bit to get it.
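To get a feel for the interface before the full call in the next step, here is a minimal sketch (not part of the exercise) of the basic pattern: make_classification returns a feature matrix and a label vector, and any arguments you don't specify fall back to defaults. The argument values below are illustrative only.

# Minimal illustration: a small, two-class dataset with mostly default settings
X_demo, y_demo = make_classification(
    n_samples=100,    # number of rows (samples)
    n_features=5,     # number of columns (candidate features)
    n_informative=2,  # how many features actually carry signal
    n_classes=2,      # binary response variable
    random_state=1)   # illustrative seed for reproducibility
print(X_demo.shape, y_demo.shape)  # (100, 5) and (100,)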
- Generate a dataset with two variables, X_synthetic and y_synthetic. The variable X_synthetic holds the 200 candidate features, and y_synthetic holds the response variable, for all 1,000 samples. Use the following code:

X_synthetic, y_synthetic = make_classification(
    n_samples=1000, n_features=200, n_informative=3,
    n_redundant=10, n_repeated=0, n_classes=2,
    n_clusters_per_class=2, weights=None, flip_y=0.01,
    class_sep=0.8, hypercube=True, shift=0.0, scale=1.0,
    shuffle=True, random_state=24)
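As an optional aside, recall from the introduction that some of the candidate features may be highly correlated: the n_redundant argument above creates features that are linear combinations of the informative ones. One possible way to confirm this yourself (not part of the exercise steps) is to compute the feature correlation matrix with NumPy and look at the largest off-diagonal entry:

# Correlation matrix of the 200 candidate features (rowvar=False treats columns as variables)
corr_matrix = np.corrcoef(X_synthetic, rowvar=False)
# Zero out the diagonal, since each feature is perfectly correlated with itself
np.fill_diagonal(corr_matrix, 0)
# The redundant features should typically produce at least a few large absolute correlations
print('Largest absolute off-diagonal correlation: {:.2f}'.format(np.max(np.abs(corr_matrix))))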
- Examine the shape of the dataset and the class fraction of the response variable using the following code:

print(X_synthetic.shape, y_synthetic.shape)
print(np.mean(y_synthetic))
You will obtain the following output:
(1000, 200) (1000,)
0.501
After checking the shape of the output, note that we've generated an almost perfectly balanced dataset: close to a 50/50 class balance. It is also important to note that we've generated all the features with the same shift and scale, that is, a mean of 0 with a standard deviation of 1. Making sure that the features are on the same scale, or have roughly the same range of values, is a key point for using regularization methods, and we'll see why later. If the features in a raw dataset are on widely different scales, it is advisable to normalize them so that they are on the same scale. Scikit-learn has functionality to make this easy, which we'll learn about in the challenge at the end of this section.
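As a quick sanity check, and as a hedged preview of the kind of scikit-learn functionality alluded to above, the sketch below verifies that the generated features have roughly comparable scales and shows how a raw dataset could be standardized. StandardScaler is one common option; the challenge at the end of this section may use a different scaler, so treat this only as an illustration.

from sklearn.preprocessing import StandardScaler

# Features were generated with shift=0.0 and scale=1.0, so the per-column
# means should be close to 0 and the standard deviations roughly comparable
print(X_synthetic.mean(axis=0)[:5].round(2))  # first few column means
print(X_synthetic.std(axis=0)[:5].round(2))   # first few column standard deviations

# If a raw dataset were on widely different scales, one way to normalize it:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_synthetic)  # each column now has mean 0, std 1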
- Plot the first few features as histograms to show that the range of values is the same using the following code:

for plot_index in range(4):
    plt.subplot(2, 2, plot_index+1)
    plt.hist(X_synthetic[:, plot_index])
    plt.title('Histogram for feature {}'.format(plot_index+1))
plt.tight_layout()
You will obtain a 2-by-2 grid of histograms as output, one for each of the first four features, showing that they all cover roughly the same range of values.
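Finally, as a hedged preview of why train_test_split, LogisticRegression, and roc_auc_score were imported in the first step, the sketch below shows one way you might fit a model to this data and compare training versus test performance; with 200 candidate features and only 1,000 samples, the gap between the two scores is where overfitting would show up. The specific settings (test_size, C, max_iter) are illustrative assumptions, not the course's prescribed values.

# Hold out a test set so training and test performance can be compared
X_train, X_test, y_train, y_test = train_test_split(
    X_synthetic, y_synthetic, test_size=0.2, random_state=24)

# Weak regularization (large C) so that overfitting, if present, is easier to see
model = LogisticRegression(C=1000, max_iter=1000)
model.fit(X_train, y_train)

# Predicted probabilities for the positive class, scored with ROC AUC
train_auc = roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])
test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print('Training AUC: {:.3f}, Test AUC: {:.3f}'.format(train_auc, test_auc))
# A training AUC noticeably higher than the test AUC would indicate overfitting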