How are vector clocks used in Dynamo?
Overview
Vector clocks are an extension of Lamport clocks used to capture causality between events in a distributed system. Vector clocks are widely used in large-scale distributed systems to resolve conflicting data.
To enhance performance, Dynamo uses optimistic replication. This means that there is a possibility of conflicting data. Vector clocks are used in Dynamo to resolve multiple conflicting values against the same key.
Ordering events
A single machine is the absolute or wall clock time:
Suppose we perform a write to key k with timestamp t1 and then perform another write to k with timestamp t2. Since t2 > t1, the second write must be newer than the first write, and therefore the database can safely overwrite the original value.
In a distributed system, this assumption does not hold. The problem is clock skew, such as, different clocks tend to run at different rates, so we cannot assume that time t on node a happened before time t + 1 on node b.
The most practical techniques that help with synchronizing clocks, like
Vector clocks in Dynamo
Instead of employing tight synchronization mechanics, Dynamo uses vector clocks to capture causality between different versions of the same object. A vector clock is effectively a (node, counter) pair. Every version of an object stored in Dynamo is associated with a vector clock. To determine causality between two versions, one can examine their vector clocks. If the first object’s counters are strictly less than or equal to the second clock, the first object happened before the second. Otherwise, it suggests that the vector clocks do not have a causal connection and require further conflict reconciliation.
Dynamo resolves these conflicts at read-time. Let’s understand this with an example:
- Server
Aserves a write to keyk1, with the valuefoo. It assigns it a version of[A:1]. This write gets replicated to serverB. - Server
Aserves a write to keyk1, with valuebar. It assigns it a version of[A:2]. This write also gets replicated to serverB. - A network partition occurs.
AandBcannot talk to each other. - Server
Aserves a write to keyk1, with the valuebaz. It assigns it a version of[A:3]. It cannot replicate it to serverB, but it gets stored in a hinted handoff buffer on another server. - Server
Bsees a write to keyk1, with thebaxvalue. It assigns it a version of[B:1]. It cannot replicate it to serverA, but it gets stored in a hinted handoff buffer on another server. - The network heals. Server
AandBcan talk to each other again. - Either server gets a read request for key
k1. It sees the same key with different versions[A:3]and[A:2][B:1], but it does not know which one is newer. It returns both and tells the client to figure out the version and write the newer version back into the system.
In the above example, most of the time, new versions subsume the previous version(s), and the system itself can determine the correct version (For example, [A:2] is newer than [A:1]). However, there are cases where the system cannot conclude on the correct version among the multiple versions of the same object, so this responsibility is given to the client. This is known as semantic reconciliation.
Such a situation can arise in the presence of node failures, combined with concurrent updates leading to conflicting versions of an object. An example of semantic reconciliation is the shopping cart feature provided by Amazon. This mechanism guarantees that an ‘Add to cart’ operation is never lost, but it is possible that deleted items will resurface.
Free Resources