Working with Time Issues


Some distributed systems may rely on node clocks to be synchronized to function. It is easy for unsynchronized clocks or misconfigured NTP clients to go undetected for long periods of time, since a drifting clock is unlikely to cause a system crash but a subtle and gradual data loss. For systems that must work with synchronized clocks, it is important that a check is kept on the offset of clocks for all nodes in a cluster. If any one of the clocks drifts too far away from the rest of clocks in the system, then the node with the offending clock should be considered dead and removed from the cluster.

To see how clocks in a distributed system can cause data loss, we’ll work through an example of a database that is written to by multiple clients and the strategy used to handle conflicts is last write wins. The node in the database that receives the write request assigns the timestamp to the write request and not the client which sends the write request. The “last” write is determined based on the timestamps of the write requests. The database node uses the time of day clock to assign the timestamp when recording the write. The writes to the system are ordered using the timestamps, even though the timestamps may have been from different nodes of the distributed database. Let’s see how such an attempt to order writes by timestamp is prone to data loss.

Say, the database comprises of three nodes and the clocks for each of the nodes are slightly ahead or behind of each other. In the following sequence of events we’ll denote the time of each node’s clock as nodex(hour:milliseconds_past_the_hourt) e.g. node1(5:61000), would mean that node1’s time of day clock has a reading of 5:01:01 AM. Given this context, consider two clients that attempt to write the same value to the database for a key K.

  1. At the start the three nodes have their clocks read as:

  2. node1(3:5510)

  3. node2(3:6000)

  4. node3(3:5002)

  5. Client1 sends a write request K=5 to node1, which is recorded at node1 with the timestamp 3:5510. We’ll represent the write as a tuple (K=5, 3:5510). The timestamp in the tuple is that of the node recording

  6. Node1 replicates client1’s write to other nodes. It sends the tuple (K=5, 3:5510) to nodes 2 and 3.

  7. Node3 receives the tuple from node1 and records it but node2 doesn’t receive the tuple due to a network delay just yet. Furthermore, say network delay takes 3 milliseconds so that node3 receives the tuple from node1 at 3:5005 according to its own internal clock.

  8. Client2 initiates a write K=7 that gets routed to node3 and is recorded as (K=7,5008). Note that at this point, the internal clock of node1 reads 3:5516. Node3 will overwrite the key K’s value from 5 to 7 and propagate the change to the other nodes as the tuple (K=7, 5008)

  9. Node 2 receives both the changes (K=5, 3:5510) and (K=7, 5008) and it applies the last write wins LWW strategy and keeps (K=5, 3:5510) on account of a later timestamp and discards (K=7, 5008). This is clearly incorrect since the change on node3 occurs AFTER the change on node1 on the absolute timeline.

  10. Node 1 receives (K=7, 5008) and discards the tuple because the timestamp is lower than the timestamp for the write node1 processed for client1.

Get hands-on with 1200+ tech skills courses.