Snapshots
Explore how to create and manage snapshots in dbt to capture historical changes in mutable tables. Learn the differences between timestamp and check strategies, and understand how dbt tracks inserts and updates for effective data versioning.
Mutable tables
Sometimes, rows in a table can be updated or deleted. In that case, it can be tricky to access information based on a previous state of that table.
Imagine we have a table of orders that looks like this:
Orders
|
|
|
1 | Payment Accepted | 2023-01-01 |
2 | Sent | 2023-01-01 |
After some time, we take another look at our table and it now looks like this:
Orders
|
|
|
1 | Sent | 2023-01-02 |
2 | Sent | 2023-01-01 |
3 | Payment Accepted | 2023-01-02 |
If we want to know how long it took for order_id “1” to be delivered, we lost that information. Luckily, dbt provides a feature called snapshots. Every time they are run, snapshots track and record the changes that happened since the last snapshot.
Understanding snapshots
In our case, if we run the snapshot on the first of January and on the second, we would have a snapshot like this:
Orders Snapshot
|
|
|
|
|
|
1 | Payment Accepted | 2023-01-01 | 2023-01-01 | 2023-01-02 | 260b3e575635a5515b8a3af019fa3573 |
1 | Sent | 2023-01-02 | 2023-01-02 | 8361e7812ccc2614533d85b3fe4e6ba1 | |
2 | Sent | 2023-01-01 | 2023-01-02 | e8e880f6f30f74f6743705c4c8c097b3 | |
3 | Payment Accepted | 2023-01-02 | 2023-01-02 | 26d498a4661ec3f4a9571b2a03ce93b2 |
We have two rows for order_id “1” because it has been in two different states: Payment Accepted and Sent.
If we want to have an overview of our orders at a precise date, we can run the following query:
The frequency at which we run our table is crucial for the quality of the snapshot. If a row is changed twice between two snapshot executions, dbt will only be able to record the second change.
Creating a snapshot
✅ It’s considered a best practice to create a snapshot from your source tables, and to select all columns. This will avoid missing data in the future. ...