Data Generation
Explore how to create synthetic datasets for linear regression by selecting parameters and adding Gaussian noise using NumPy. Understand the importance of reproducible randomness and proper train-validation splitting. This lesson helps you prepare data to visualize and improve gradient descent model training.
The data generation process
We already know our model. To generate synthetic data for it, we need to pick values for its parameters. In our case, we chose b = 1 and w = 2 (in thousands of dollars).
First, let us generate our feature (x): we use NumPy's rand method to randomly generate 100 (N) points between 0 and 1.
Then, we plug our feature (x) and our parameters b and w into our equation to compute our labels (y). But we need to add some Gaussian noise (epsilon) as well; otherwise, our synthetic dataset would be a perfectly straight line. We can generate noise using NumPy's randn method, which draws samples from a normal distribution (of mean 0 and variance 1), and then multiply it by a factor to adjust the level of noise. Since we do not want to add too much noise, we pick 0.1 as our factor.
Synthetic data generation
The following code generates our synthetic data:
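A minimal sketch of this generation process follows; the variable names (true_b, true_w) and the noise factor 0.1 reflect the description above:

```python
import numpy as np

# True parameters of our synthetic model
true_b = 1
true_w = 2
N = 100

# Fix the seed so the "random" numbers are reproducible
np.random.seed(42)

# Feature: N points uniformly drawn from [0, 1)
x = np.random.rand(N, 1)
# Gaussian noise, scaled down by a factor of 0.1
epsilon = 0.1 * np.random.randn(N, 1)
# Labels: linear model plus noise
y = true_b + true_w * x + epsilon
```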
Did you notice the call to np.random.seed(42)? This line of code is actually more important than it looks. It guarantees that the same random numbers will be generated every time we run this code.
“Wait, what? Aren’t the numbers supposed to be random? How could they possibly be the same numbers?” You may be asking, and you’re perhaps even a bit annoyed by it.
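Computers do not produce truly random numbers; they generate pseudo-random sequences that are fully determined by an initial seed. A quick sketch of what seeding buys us:

```python
import numpy as np

# Same seed, same sequence: re-seeding restarts the generator
np.random.seed(42)
first = np.random.rand(3)

np.random.seed(42)
second = np.random.rand(3)

print(np.array_equal(first, second))  # prints True: the two draws are identical
```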
Splitting data
Next, let us split our synthetic data into train and validation sets by shuffling the array of indices and using the first 80 shuffled points for training.
“Why do you need to shuffle randomly generated data points? Aren’t they random enough?”
In this case, yes: the points were drawn independently, so their order is already random. But shuffling before splitting is a habit worth keeping, since real datasets often arrive sorted by some attribute, and splitting them unshuffled would give training and validation sets with different distributions.
There is one exception to the “always shuffle” rule, though: time series problems, where shuffling can lead to data leakage.
Train-validation-test split
It is beyond the scope of this course to explain the reasoning behind the train-validation-test split, but there are two points we would like to make:

- The split should always be the first thing you do: no preprocessing and no transformations should happen before it. That is why we perform it immediately after the synthetic data generation.
- In this chapter, we will only use the training set, so we did not bother to create a test set. We did perform a split nonetheless, to highlight the first point.
You can see this in the code below:
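A sketch of the split described above (it regenerates x, y, and N so the snippet stands alone; the 80/20 ratio comes from using the first 80 of the 100 points for training):

```python
import numpy as np

# Regenerate the data so this snippet is self-contained
true_b, true_w, N = 1, 2, 100
np.random.seed(42)
x = np.random.rand(N, 1)
y = true_b + true_w * x + 0.1 * np.random.randn(N, 1)

# Shuffle the indices, not the data itself
idx = np.arange(N)
np.random.shuffle(idx)

# First 80 shuffled indices for training, the rest for validation
train_idx = idx[: int(N * 0.8)]
val_idx = idx[int(N * 0.8):]

x_train, y_train = x[train_idx], y[train_idx]
x_val, y_val = x[val_idx], y[val_idx]
```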
“Why didn’t we use train_test_split from Scikit-Learn?”
The following figure shows the subplots of both the training (x_train, y_train) and validation sets (x_val, y_val) of the generated data:
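A minimal matplotlib sketch that produces this kind of side-by-side figure (it regenerates and splits the data so the snippet stands alone; the backend choice and figure size are assumptions for scripted use):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the script runs headless
import matplotlib.pyplot as plt

# Regenerate and split the data so this snippet is self-contained
true_b, true_w, N = 1, 2, 100
np.random.seed(42)
x = np.random.rand(N, 1)
y = true_b + true_w * x + 0.1 * np.random.randn(N, 1)
idx = np.arange(N)
np.random.shuffle(idx)
train_idx, val_idx = idx[:80], idx[80:]

# One subplot per split, sharing the y-axis for easy comparison
fig, axes = plt.subplots(1, 2, figsize=(12, 4), sharey=True)
axes[0].scatter(x[train_idx], y[train_idx])
axes[0].set(title="Training set", xlabel="x", ylabel="y")
axes[1].scatter(x[val_idx], y[val_idx])
axes[1].set(title="Validation set", xlabel="x")
fig.savefig("train_val_split.png")
```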
We know that b = 1 and w = 2. But now, let us see how close we can get to the true values by using gradient descent and the 80 points in the training set (so, for training, N = 80).