
Sampling

Explore how to extract data samples efficiently using SQL techniques like LIMIT with random ordering and PostgreSQL's TABLESAMPLE methods such as SYSTEM and BERNOULLI. Understand when to apply each method based on dataset size and performance needs, and learn how to use repeatable sampling with seeds for consistent experimental results. This lesson helps you select the right sampling approach to speed up data analysis on large datasets.

Extracting a small subset of a table is often called sampling. There are various reasons to use sampling, for example:

  1. Performing estimations on large datasets: When working on large tables, we are sometimes willing to compromise accuracy in favor of speed. By sampling a portion of the table, we can produce approximate results more quickly (see the sketch after this list).

  2. Producing a training set: When doing data analysis using machine learning models, it is often necessary to train the model on a portion of the data. This portion is known as a training set. The training set can be produced by sampling the table.
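
As a rough sketch of the first use case, an aggregate computed on a sample approximates the full-table result at a fraction of the cost. The table and column names below (orders, amount) are hypothetical, and the LIMIT-based sampling it uses is the technique covered in the next section:

PostgreSQL
-- Estimate the average order amount from a 10,000-row sample instead of
-- scanning the whole table. The orders table and amount column are
-- hypothetical; the sampling technique is the LIMIT-based one shown below.
WITH sample AS (
  SELECT amount
  FROM orders
  ORDER BY random()
  LIMIT 10000
)
SELECT avg(amount) AS estimated_avg_amount
FROM sample;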

Sampling with LIMIT

A simple way to fetch a random portion of a table is to combine random() with LIMIT:

PostgreSQL
WITH sample AS (
  SELECT *
  FROM users
  ORDER BY random()
  LIMIT 10000
)
SELECT count(*) FROM sample;

 count
───────
 10000
(1 row)
Time: 205.643 ms

To sample 10,000 random rows from the users table, we do the following:

  1. Sort the table in a random order using random().
  2. Take the first 10,000 rows using LIMIT 10000.

This method of sampling forces the database to sort the entire table and then pick the first N rows. It is fine for small datasets, but on very large tables it can be very inefficient, resulting in high memory consumption and CPU usage.
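
To see where the time goes, we can inspect the query plan. This is a minimal sketch against the same users table; the exact plan depends on the PostgreSQL version and table size, but it will typically show a sequential scan of the whole table feeding a sort (or top-N heapsort) node:

PostgreSQL
-- Inspect the plan of the LIMIT-based sample. Expect a Seq Scan over all of
-- users feeding a Sort / top-N heapsort node: the work grows with the table
-- size, not with the sample size.
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM users
ORDER BY random()
LIMIT 10000;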

...