
Sampling

Explore how to extract data samples efficiently using SQL techniques like LIMIT with random ordering and PostgreSQL's TABLESAMPLE methods such as SYSTEM and BERNOULLI. Understand when to apply each method based on dataset size and performance needs, and learn how to use repeatable sampling with seeds for consistent experimental results. This lesson helps you select the right sampling approach to speed up data analysis on large datasets.

Extracting a small subset of a table is often called sampling. There are various reasons to use sampling, for example:

  1. Performing estimations on large datasets: When working on large tables, we are sometimes willing to compromise accuracy in favor of speed. By sampling a portion of the table, we can produce approximate results more quickly (see the sketch after this list).

  2. Producing a training set: When doing data analysis using machine learning models, it is often necessary to train the model on a portion of the data. This portion is known as a training set. The training set can be produced by sampling the table.
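
As a rough sketch of the first use case, an aggregate computed on a sample approximates the full-table result at a fraction of the cost. The table and column names below (orders, amount) are hypothetical, and the LIMIT-based sampling it uses is the technique covered in the next section:

PostgreSQL
-- Estimate the average order amount from a 10,000-row sample instead of
-- scanning the whole table. The orders table and amount column are
-- hypothetical; the sampling technique is the LIMIT-based one shown below.
WITH sample AS (
  SELECT amount
  FROM orders
  ORDER BY random()
  LIMIT 10000
)
SELECT avg(amount) AS estimated_avg_amount
FROM sample;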

Sampling with LIMIT

A simple way to fetch a random portion of a table is to combine random() with LIMIT:

PostgreSQL
WITH sample AS (
  SELECT *
  FROM users
  ORDER BY random()
  LIMIT 10000
)
SELECT count(*) FROM sample;

 count
───────
 10000
(1 row)
Time: 205.643 ms

To sample 10,000 random rows from the users table, we do the following:

  1. Sort the table in a random order using random().
  2. Take the first 10,000 rows using LIMIT 10000.

This method of sampling forces the database to sort the entire table and then pick the first N rows. It is fine for small datasets, but on very large tables it can be very inefficient, resulting in high memory consumption and CPU usage.
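
To see where the time goes, we can inspect the query plan. This is a minimal sketch against the same users table; the exact plan depends on the PostgreSQL version and table size, but it will typically show a sequential scan of the whole table feeding a sort (or top-N heapsort) node:

PostgreSQL
-- Inspect the plan of the LIMIT-based sample. Expect a Seq Scan over all of
-- users feeding a Sort / top-N heapsort node: the work grows with the table
-- size, not with the sample size.
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM users
ORDER BY random()
LIMIT 10000;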

...