Data Generation

Learn how the synthetic data used in this example is generated.

The data generation process

We already know our model. To generate synthetic data for it, we need to pick values for its parameters. Here, we choose b = 1 and w = 2 (in thousands of dollars).
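For reference, the model can be written out as follows. This is a sketch that assumes the simple linear regression form implied by this lesson (one feature x, an intercept b, a slope w, and a Gaussian noise term epsilon); the second expression simply substitutes the chosen values:

```latex
y = b + w x + \epsilon
\quad\Longrightarrow\quad
y = 1 + 2x + \epsilon
```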

First, let us generate our feature (x): we use NumPy’s rand function to draw 100 (N) points uniformly at random between 0 and 1.

Then, we plug our feature (x) and our parameters b and w into our equation to compute our labels (y). We also need to add some Gaussian noise (epsilon); otherwise, our synthetic dataset would be a perfectly straight line. We can generate noise using NumPy’s randn function, which draws samples from a standard normal distribution (mean 0, variance 1), and then multiply those samples by a factor to adjust the noise level. Since we do not want to add too much noise, we pick 0.1 as our factor.

Synthetic data generation

The following code generates our synthetic data:
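A minimal sketch of the steps described above, using NumPy. The variable names, the column-vector shape (N, 1), and the seed value 42 are assumptions made for illustration; only b = 1, w = 2, N = 100, and the noise factor 0.1 come from the text.

```python
import numpy as np

# Parameter values chosen in the text
true_b = 1
true_w = 2
N = 100  # number of data points

# Assumption: a fixed seed so the run is reproducible (value is arbitrary)
np.random.seed(42)

# Feature: N points drawn uniformly from [0, 1), stored as a column vector
x = np.random.rand(N, 1)

# Gaussian noise: standard normal samples scaled by the noise factor 0.1
epsilon = 0.1 * np.random.randn(N, 1)

# Labels: plug the feature and parameters into the model, then add noise
y = true_b + true_w * x + epsilon
```

Without the epsilon term, every (x, y) pair would fall exactly on the line y = 1 + 2x; scaling the noise by 0.1 keeps the points close to that line without collapsing onto it.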
