...

Data Replication

Understand the models through which data is replicated across several nodes.

Data is a valuable asset for an organization because it drives the entire business.

Data provides critical business insights into what’s important and what needs to be changed. Organizations also need to securely save and serve their clients’ data on demand. Timely access to the required data under varying conditions (including increasing reads and writes, disk and node failures, network and power outages, and others) is necessary to successfully run an online business.

We need the following characteristics from our data store:

Availability under faults (failure of one or more disks, nodes, network, or power outages).
Scalability (with increasing reads, writes, and other operations).
Performance (low latency and high throughput for the clients).

It’s challenging, or even impossible, to achieve the above characteristics on a single node.

Replication

Replication refers to keeping multiple copies of the data at various nodes (preferably geographically distributed) to achieve availability, scalability, and performance. In this lesson, we assume that a single node is enough to hold our entire data. We won’t use this assumption while discussing the partitioning of data in multiple nodes. Often, the concepts of replication and partitioning go together.

Replication provides several key benefits for distributed systems:

Keeps data geographically close to your users, thus reducing access latency.
Allows the system to continue working even if some of its parts have failed, thus increasing availability.
Scales out the number of machines that can serve read queries, thus increasing read throughput.

However, with many benefits, like availability, replication comes with its complexities. Replication is relatively simple if the replicated data doesn’t require frequent changes. The main problem in replication arises when we have to maintain changes in the replicated data over time.

If the data that you are replicating does not change over time, then the replication process is a one-time thing. Frequently changing data is a real challenge; it requires careful thinking about concurrency and all the things that can go wrong so that we can deal with the consequences of those faults.

Additional complexities that could arise due to replication are as follows:

How do we keep multiple copies of data consistent with each other?
How do we deal with failed replica nodes?
Should we replicate synchronously or asynchronously?
- How do we deal with replication lag in case of asynchronous replication?
How do we handle concurrent writes?
What consistency model needs to be exposed to the end programmers?

We’ll explore the answer to these questions in this lesson.

Before we explain the different types of replication, let’s understand the synchronous and asynchronous approaches of replication.

Synchronous vs. asynchronous replication

There are two ways to disseminate changes to the replica nodes:

Synchronous replication
Asynchronous replication

In synchronous replication, the primary node waits for acknowledgments from secondary nodes to confirm that the data has been updated.

After receiving acknowledgment from all secondary nodes, the primary node reports success to the client. In asynchronous replication, the primary node doesn’t wait for acknowledgments from the secondary nodes and reports success to the client after updating itself. The advantage of synchronous replication is that all secondary nodes are completely up-to-date with the primary node.

However, there’s a disadvantage to this approach.

If one of the secondary nodes fails to acknowledge due to a network failure or fault, the primary node will be unable to acknowledge the client until it receives a successful acknowledgment from the crashed node. This causes high latency in the response from the primary node to the client. On the other hand, the advantage of asynchronous replication is that the primary node can continue its work even if all the secondary nodes are down.

However, if the primary node fails, the writes that weren’t copied to the secondary nodes will be lost.

Often, leader-based replication is configured to be completely asynchronous. In this case, if the master fails and is not recoverable, any writes that have not yet been replicated to slaves are lost. This means that a write is not guaranteed to be durable, and weakening durability is a significant trade-off that must be carefully considered. The above paragraph explains a trade-off between data consistency and availability when different components of the system can fail.