Data Generation
Explore how to create synthetic datasets for linear regression by selecting parameters and adding Gaussian noise using NumPy. Understand the importance of reproducible randomness and proper train-validation splitting. This lesson helps you prepare data to visualize and improve gradient descent model training.
The data generation process
We already know our model. To generate synthetic data for it, we need to pick values for its parameters. In our case, we choose b = 1 and w = 2 (in thousands of dollars).
First, we generate our feature (x): we use NumPy's rand method to draw 100 (N) points uniformly at random between 0 and 1.
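As a minimal sketch of this step, the feature could be generated as shown here. The seed value 42 and the column-vector shape (N, 1) are assumptions made for illustration; the text only specifies 100 points drawn with rand.

```python
import numpy as np

# Assumption: fix the seed so the same points are drawn on every run.
# The value 42 is arbitrary and not taken from the lesson.
np.random.seed(42)

N = 100  # number of points, as stated in the text

# np.random.rand draws samples uniformly from the half-open interval [0, 1).
# Assumption: shape (N, 1) keeps x as a column vector, one row per point.
x = np.random.rand(N, 1)

print(x.shape)  # (100, 1)
```

Fixing the seed before drawing is what makes the randomness reproducible: rerunning the snippet yields the same x each time.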
Then, we plug our feature (x) and our parameters b and w into our equation to compute our labels (y). We also need to add some Gaussian noise (epsilon); otherwise, our synthetic dataset would be a perfectly straight line. We can generate noise using NumPy's randn method, which draws samples from a standard normal distribution (mean 0 and variance 1), and then ...
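The label computation can be sketched as follows, assuming the model has the usual simple linear regression form y = b + w x + epsilon. The names true_b, true_w, and noise_scale are hypothetical, as are the seed and the value 0.1 used to scale the noise; none of these is given in the text above.

```python
import numpy as np

# Assumption: arbitrary seed for reproducibility (not specified in the lesson).
np.random.seed(42)

true_b, true_w = 1.0, 2.0  # parameter values chosen in the text
N = 100                    # number of points, as stated in the text

# Hypothetical noise level; the text does not state how the noise is scaled.
noise_scale = 0.1

# Feature: N points drawn uniformly from [0, 1).
x = np.random.rand(N, 1)

# np.random.randn draws from a standard normal (mean 0, variance 1);
# multiplying by noise_scale sets the standard deviation of the noise.
epsilon = noise_scale * np.random.randn(N, 1)

# Assumed model form: intercept b, slope w, plus Gaussian noise.
y = true_b + true_w * x + epsilon

print(x.shape, y.shape)  # (100, 1) (100, 1)
```

With noise_scale set to 0, every point would fall exactly on the line y = 1 + 2x, which is the perfectly straight line the noise is meant to avoid.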