...

Data Replication

Understand why replication is used to improve availability and read performance in distributed systems. Compare the trade-offs between synchronous and asynchronous replication. Compare single-leader, multi-leader, and leaderless replication models. Learn how quorum reads/writes can help manage consistency and concurrent updates.

Data drives business decisions and operations. Organizations must securely and reliably store and serve client data. To run successfully, systems require timely access to data despite increasing load, hardware failures, or network outages.

We require the following characteristics from a data store:

  • Availability: Resilience against faults (disk, node, network, or power failures).

  • Scalability: Ability to handle increasing reads, writes, and traffic.

  • Performance: Low latency and high throughput.

A single node rarely delivers all three: it is a single point of failure, and its capacity is bounded by one machine's hardware.

Replication

Replication refers to maintaining multiple copies of data across different nodes, often geographically distributed, to improve availability, scalability, and performance. In this lesson, we assume the entire dataset fits on a single node. This assumption no longer holds when we introduce data partitioning. In production systems, replication and partitioning are typically combined.

Replication offers several benefits in distributed systems:

  • Places data closer to users, reducing latency.

  • Allows the system to operate despite node failures, improving availability.

  • Enables multiple nodes to serve read requests, increasing read throughput.
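The availability and read-throughput benefits listed above can be illustrated with a minimal sketch. It assumes an in-memory setup where every replica holds a full copy of the dataset; the `Replica` and `ReplicatedReader` classes, the round-robin policy, and the key names are hypothetical choices made for illustration, not part of any particular database.

```python
import itertools


class Replica:
    """Hypothetical in-memory replica holding a full copy of the dataset."""

    def __init__(self, name, data):
        self.name = name
        self.data = dict(data)  # each replica keeps its own copy
        self.alive = True

    def read(self, key):
        if not self.alive:
            raise ConnectionError(f"{self.name} is unreachable")
        return self.data[key]


class ReplicatedReader:
    """Spreads reads over replicas round-robin and skips failed ones."""

    def __init__(self, replicas):
        self.replicas = replicas
        self._next = itertools.cycle(range(len(replicas)))

    def read(self, key):
        # Try each replica at most once, starting from the next in rotation.
        for _ in range(len(self.replicas)):
            replica = self.replicas[next(self._next)]
            try:
                return replica.name, replica.read(key)
            except ConnectionError:
                continue  # this copy is down; another replica can serve the read
        raise RuntimeError("no replica available")


dataset = {"user:1": "Ada"}
replicas = [Replica(f"replica-{i}", dataset) for i in range(3)]
reader = ReplicatedReader(replicas)

replicas[1].alive = False  # simulate a node failure
served_by = [reader.read("user:1")[0] for _ in range(4)]
print(served_by)  # ['replica-0', 'replica-2', 'replica-0', 'replica-2']
```

Reads keep succeeding while `replica-1` is down, and the load is shared by the remaining copies. A real system would also need health checks and a way to bring the failed replica back up to date, which this sketch leaves out.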

These benefits come with added complexity. Replication is simple when data changes infrequently. The main challenge arises when updates must be consistently propagated across replicas. For immutable data, replication is a one-time process. Mutable data requires careful handling of concurrency, failures, and inconsistencies.
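To make the difficulty with mutable data concrete, here is a small sketch of an update that reaches one copy before the other. It assumes a simplified setup in which writes go to one node and are shipped to the other copies later; the `Node` and `WritableNode` classes and the explicit `propagate()` step are hypothetical and stand in for whatever mechanism a real system uses.

```python
class Node:
    """Hypothetical node that stores one copy of the data."""

    def __init__(self, name):
        self.name = name
        self.store = {}


class WritableNode(Node):
    """Accepts writes and ships them to the other copies later."""

    def __init__(self, name, others):
        super().__init__(name)
        self.others = others
        self.pending = []  # changes not yet applied on the other copies

    def write(self, key, value):
        self.store[key] = value
        self.pending.append((key, value))

    def propagate(self):
        # Apply every pending change to every other copy, in write order.
        for key, value in self.pending:
            for node in self.others:
                node.store[key] = value
        self.pending.clear()


copy_b = Node("copy-b")
copy_a = WritableNode("copy-a", [copy_b])

copy_a.write("balance", 100)
copy_a.propagate()

copy_a.write("balance", 80)       # update applied on copy-a only
stale = copy_b.store["balance"]   # copy-b still returns the old value
copy_a.propagate()
fresh = copy_b.store["balance"]   # copies agree again after propagation

print(stale, fresh)  # 100 80
```

Between the second write and the second `propagate()` call, a client reading from `copy-b` sees an outdated value. With immutable data this window never opens, because nothing changes after the initial copy.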

Additional challenges introduced by replication include:

  • How do we keep multiple copies consistent?

  • How do we handle replica failures?

  • Should replication be synchronous or asynchronous?

    • How do we manage replication lag in asynchronous replication?

  • How do we handle concurrent writes?

  • What consistency guarantees should be exposed to application developers?

We’ll explore these questions in this lesson.

Replication in action

Replication strategies generally fall into two categories based on how changes propagate to replicas: synchronous and asynchronous.

Synchronous vs. asynchronous replication

There are two ways to disseminate changes to replica nodes:

    ...