Trusted answers to developer questions

What are the best practices while using BigQuery?

Get Started With Machine Learning

Learn the fundamentals of Machine Learning with this free course. Future-proof your career by adding ML skills to your toolkit — or prepare to land a job in AI or Data Science.

BigQuery is affordable and fast, but it can quickly consume a lot of processing power, increasing the bills if not used properly. In BigQuery, slots decide the amount of processing power that directly impacts cost. First, let’s understand slots.

Slots

In BigQuery, slots are the computational capacity required to execute a SQL query. Slots play a crucial role in pricing and resource allocation. BigQuery manages the slot allocation, and the number of slots allocated to a job depends on

Query size: The amount of data processed
Complexity: The amount of information shuffled

In supplement to consumption-based pricing, BigQuery also provides flat-rate pricing, that is, you can buy a fixed number of slots over some period which provides more granularity in planning capital.

BigQuery is fast because it uses heavy parallel processing by converting the declarative SQL into stages.

Without further delay, let’s see the best practices in BigQuery for:

Cost controls
Query performance
Storage optimization

Cost controls

BigQuery is a columnar database. In other words, all columns from a table are stored separately at the physical level. Hence, we should always avoid using select * in our queries. This practice will limit the data processed.
For looking at the sample data, we should use the preview option instead of the query. The preview feature in BigQuery is free.
We should use a pricing calculator to check the costs before pushing the query to production. On-demand queries are charged based on the number of bytes read, and we can calculate the price based on bytes read.
We should use dashboard for viewing costs and logs. The dashboard can provide us with good insights on our usage of BigQuery.
Partitioning in BigQuery allows us to scan only the required data and reduce costs.

Query performance

We should avoid querying non-required partitions.
We should use external data sources suitably. There’s no guarantee of good performance when using external sources.
We should avoid excessive use of wildcard tables, for example, FROM bigquery-public-data.noaa_gsod.good*.
JavaScript user-defined functions reduce overall performance. We should avoid them.
Certain joins, such as a cartesian product, can create more output than input, and should be avoided.
We should avoid DML statements with updating or inserting single rows.

Storage optimization

The expiration time for datasets/tables/partitions should be set. It allows us to control storage expenditures and optimize the usage of BigQuery storage.
Using BigQuery to process the data, we can also use it to store in tables and partitions for an extended period. It has lower charges for the data, which is not modified in the last 90 days.
Lastly, we can use the pricing calculator to check storage costs.

RELATED TAGS

serverless

cloud

CONTRIBUTOR

Kedar Kodgire

License: Creative Commons-Attribution-ShareAlike 4.0 (CC-BY-SA 4.0)

Learn in-demand tech skills in half the time

PRODUCTS

Mock Interview

New

Courses

Skill Paths

Projects

Assessments