How are vector clocks used in Dynamo?

Ordering events

A single machine is the absolute or wall clock time:

Suppose we perform a write to key k with timestamp t1 and then perform another write to k with timestamp t2. Since t2 > t1, the second write must be newer than the first write, and therefore the database can safely overwrite the original value.

In a distributed system, this assumption does not hold. The problem is clock skew, such as, different clocks tend to run at different rates, so we cannot assume that time t on node a happened before time t + 1 on node b.

The most practical techniques that help with synchronizing clocks, like NTPNTP stands for Network Time Protocol and is used to synchronize clock time between computer systems in a network. NTP is one of the oldest parts of the TCP/IP protocol suite., still do not guarantee that every clock in a distributed system is synchronized at all times. So, using just the wall clock timestamps is not enough without special hardware like GPS units and atomic clocks.

Vector clocks in Dynamo

Instead of employing tight synchronization mechanics, Dynamo uses vector clocks to capture causality between different versions of the same object. A vector clock is effectively a (node, counter) pair. Every version of an object stored in Dynamo is associated with a vector clock. To determine causality between two versions, one can examine their vector clocks. If the first object’s counters are strictly less than or equal to the second clock, the first object happened before the second. Otherwise, it suggests that the vector clocks do not have a causal connection and require further conflict reconciliation.

Dynamo resolves these conflicts at read-time. Let’s understand this with an example:

Server A serves a write to key k1, with the value foo. It assigns it a version of [A:1]. This write gets replicated to server B.
Server A serves a write to key k1, with value bar. It assigns it a version of [A:2]. This write also gets replicated to server B.
A network partition occurs. A and B cannot talk to each other.
Server A serves a write to key k1, with the value baz. It assigns it a version of [A:3]. It cannot replicate it to server B, but it gets stored in a hinted handoff buffer on another server.
Server B sees a write to key k1, with the bax value. It assigns it a version of [B:1]. It cannot replicate it to server A, but it gets stored in a hinted handoff buffer on another server.
The network heals. Server A and B can talk to each other again.
Either server gets a read request for key k1. It sees the same key with different versions [A:3] and [A:2][B:1], but it does not know which one is newer. It returns both and tells the client to figure out the version and write the newer version back into the system.

In the above example, most of the time, new versions subsume the previous version(s), and the system itself can determine the correct version (For example, [A:2] is newer than [A:1]). However, there are cases where the system cannot conclude on the correct version among the multiple versions of the same object, so this responsibility is given to the client. This is known as semantic reconciliation.

Such a situation can arise in the presence of node failures, combined with concurrent updates leading to conflicting versions of an object. An example of semantic reconciliation is the shopping cart feature provided by Amazon. This mechanism guarantees that an ‘Add to cart’ operation is never lost, but it is possible that deleted items will resurface.

Free Resources

How are vector clocks used in Dynamo?

Overview

Ordering events

Vector clocks in Dynamo