Example: Train/Test Split with SQL

Learn to split a dataset for testing and training using SQL.

A common task when analyzing data is to split a dataset for training and testing. The training dataset is used to train the model, and the test dataset is used to evaluate the model.

Creating a transactions table

To practice what we’ve learned so far, let’s start by generating a transaction table with some random data:

CREATE TABLE transaction AS
SELECT
    id,
    '2021-01-01'::date + interval '1 day' * ceil(random() * 365) as billed_at,
    10 + ceil(90 * random()) as charged_amount,
    random() > 0.6 as reported_as_fraud
FROM
    generate_series(1, 10) AS id
ORDER BY
    1;

The transaction table includes the date and amount of each transaction, and an indication of whether the transaction was reported as fraudulent.
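
By the way, if you want the "random" data to come out the same on every run, you can seed PostgreSQL's random number generator in the session before creating the table (setseed accepts a value between -1 and 1):

-- Seed the session's random() generator so re-running the script gives the same data
SELECT setseed(0.42);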

To generate the table, we combined a few of the techniques we learned in this chapter. First, produce a random billing date:

SELECT '2021-01-01'::date + interval '1 day' * ceil(random() * 365) as billed_at;

Produce a random date in 2021 by adding a random number of days between 1 and 365 to January 1st, 2021.
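
You can spot-check the generated dates by looking at the earliest and latest values in the table:

SELECT min(billed_at), max(billed_at) FROM transaction;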

Next, produce a random charged amount between 11 and 100:

SELECT 10 + ceil(90 * random()) as charged_amount;
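
The same kind of check works for the amounts; both values should fall between 11 and 100:

SELECT min(charged_amount), max(charged_amount) FROM transaction;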

Finally, produce the label we want to predict, which is whether the transaction will be reported as fraudulent:

SELECT random() > 0.6 as reported_as_fraud;

In our fake data, we want roughly 40% of the transactions to be fraudulent. Since random() returns a value between 0 and 1, the expression random() > 0.6 produces a boolean that evaluates to true roughly 40% of the time.
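
With only 10 rows the actual share will vary quite a bit from run to run, but you can check it with a quick aggregate:

SELECT
    round(100.0 * count(*) FILTER (WHERE reported_as_fraud) / count(*), 1) AS fraud_pct
FROM transaction;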

This is what the data looks like:

SELECT * FROM transaction;

To test a model that classifies transactions as fraudulent, we want to create two tables—one for training and another for testing.

Splitting the data into training and test sets

To create a table similar to an existing table in PostgreSQL, we can use the following commands:

CREATE TABLE transaction_training AS TABLE transaction WITH NO DATA;
CREATE TABLE transaction_test AS TABLE transaction WITH NO DATA;

This syntax is really useful: we simply tell PostgreSQL to create a table with the same structure as another table, but with no data. In this case, we created two empty tables, transaction_training and transaction_test, with the same structure as transaction.
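
As a side note, PostgreSQL has a closely related form, CREATE TABLE ... (LIKE ...), which can also copy column defaults, constraints, and indexes if the source table has any:

-- Alternative: copy the structure along with defaults, constraints and indexes
CREATE TABLE transaction_training (LIKE transaction INCLUDING ALL);

For this example, the AS TABLE ... WITH NO DATA form is all we need.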

Next, we want to split the data in the transaction table between transaction_training and transaction_test. We want our training set to include 80% of the rows—in this case 8 rows:

WITH
    training_transaction_ids AS (
        INSERT INTO transaction_training
        SELECT * FROM transaction
        ORDER BY random()
        LIMIT 8
        RETURNING id
    )
INSERT INTO transaction_test
SELECT * FROM transaction
WHERE id NOT IN (
    SELECT id
    FROM training_transaction_ids
);

To populate the training data, we select from the transaction table, use ORDER BY random() to shuffle the rows, and insert only the first 8 rows into transaction_training.

To insert only the remaining rows into the test table, we capture the IDs of the training rows with RETURNING id inside a common table expression. We then insert into transaction_test only the rows whose IDs do not appear in training_transaction_ids.
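
If you prefer not to hardcode the number 8, you can derive the training set size from the row count instead. The following is just a sketch: it ranks the shuffled rows and keeps the top 80%, with the 0.8 fraction as an assumption you can adjust:

-- Rank the rows in random order and keep the first 80% as training candidates
WITH shuffled AS (
    SELECT
        id,
        row_number() OVER (ORDER BY random()) AS rn,
        count(*) OVER () AS total_rows
    FROM transaction
)
SELECT id
FROM shuffled
WHERE rn <= ceil(total_rows * 0.8);

The IDs it returns can then take the place of the ORDER BY random() LIMIT 8 step above.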

This is what the two tables look like after the split:

SELECT * FROM transaction_training;
SELECT * FROM transaction_test;
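
To confirm the 80/20 split, count the rows in each table:

SELECT
    (SELECT count(*) FROM transaction_training) AS training_rows,
    (SELECT count(*) FROM transaction_test) AS test_rows;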

And there you have it, a training dataset and a test dataset with SQL, directly in the database!