What is partitioning and clustering in BigQuery?

Partitioning and clustering are crucial to maximizing BigQuery performance and cost when querying a specific data range. It results in scanning fewer data per query. Let’s have a look at them in detail.

Partitioning

Table partitioning is a method to break larger tables into smaller tables. The different partitions of the table are stored separately at the physical level. They are accessed and managed independently. We create a partitioned table based on a column, also known as the partitioning key.

Let’s consider a table with five columns (c1, c2, c3, c4, c5) and c4 as a partitioning key:

All partitions will have the same structure and columns as the initial table. Partitioning is specified while creating the table. We can also set partition expiration. BigQuery will retain only the data within the partition range.

There are two ways to create a partition table.

Ingestion time partitioned tables

These tables are partitioned based on the data arrival date. In other words, BigQuery loads the data into a particular partition each day.

These tables include a pseudo column - _PARTITIONTIME. This column is not a part of the schema definition but can be used in the queries to access data of a particular partition.

select * from `project.dataset.table` where DATE(_PARTITIONTIME) = "2022-01-15"

Partitioned tables

The partitioned table is based on the particular column. The column will have the data type as a timestamp or a date.

There can be instances when the partition column can have null values for some rows. These rows are available in a separate partition called the __NULL__ partition. Similarly, the __UNPARTITIONED__ partition is for the values outside the date range. While querying partition tables, we have to use the partition column in the “where” clause.

Here are the benefits of a partitioned table:

It enhances the query performance because only particular partitions are scanned/queried instead of the whole table.
BigQuery pricing is based on the amount of data processed in a query. We limit the amount of data processed using a partitioned table. This leads to reduced costs.

Clustering

BigQuery supports clustering over a partitioned table. Clustering on a particular column needs to be filtered out against a key to that particular column in our query. Clustering supports all partitioned table types discussed above. Let’s use the same table from the previously discussed example:

Let’s assume that the query is frequently built by specifying columns c3 and c1 in the same order. Hence, we define the cluster key as c3, c1. BigQuery will store data associated with the keys together. Because of Clustering, BigQuery takes less time to process the data as the required columns are kept together.

Here’s the syntax for creating a clustered table:

CREATE TABLE
`project.dataset.table`
PARTITION BY
DATE (c4)
CLUSTER BY
c3,
c1

We will build a query with the clustering keys in the same order:

SELECT SUM(c2) FROM `project.dataset.table` WHERE c3 = 10000 AND c2 LIKE 'GCP*'

Clustering improves efficiency, but there are some limitations:

Clustering is only supported for partitioned tables.
We can specify the clustering column only while creating a table. We can’t modify it later.
We can specify a maximum of four non-repeated columns for clustering.
Clustering can only be used with standard SQL.

You can try BigQuery for free using a sandbox. Here are some examples of creating partitioned and clustered tables.

Free Resources

License: Creative Commons-Attribution-ShareAlike 4.0 (CC-BY-SA 4.0)