Example: Train/Test Split with SQL
Learn to split a dataset for testing and training using SQL.
We'll cover the following...
A common task when analyzing data is to split a dataset for training and testing. The training dataset is used to train the model, and the test dataset is used to evaluate the model.
Creating a transactions table
To practice what we’ve learned so far, let’s start by generating a transaction table with some random data:
The transaction table includes the date and amount of the transaction, and an indication whether the transaction was reported as fraudulent.
To generate the table, use the techniques we learned in this chapter:
Produce a random date in 2021 by adding a random number of days between 1 and 365 to January 1st, 2021.
Next, produce a random charged amount between 11 and 100:
Finally, produce the parameter we want to predict. Let’s see whether the transaction will be reported as a fraud or not:
In our fake data, we want to have 40% fraudulent transactions. Using an expression we produce a boolean value which will evaluate to true ~40% of the times.
This is what the data looks like:
To test a model that classifies transactions as fraudulent, we want to create two tables—one for training and another for testing.
Splitting the data to test and train
To create a table similar to an existing table in PostgreSQL, we can use the following commands:
This syntax is really useful. We simply tell PostgreSQL to create a table similar to another table, but with no data. In this case, we created two tables, transaction_train and transaction_test with a similar structure to transaction, but with no data.
Next, we want to split the data in the transaction table between transaction_training and transaction_test. We want our training set to include 80% of the rows—in this case 8 rows:
To populate data for training we select from the transaction table, use ORDER BY random() to shuffle the rows, and then insert into transaction_training just the first 8 rows.
To insert only the remaining rows into the test table, we keep the IDs of the training rows by specifying RETURNING id in a common table expression. We then insert rows into transaction_test and exclude rows in training_transaction_ids.
This is the result:
And there you have it, a training dataset and a test dataset with SQL, directly in the database!