Data Partitioning and Replication

Explore data partitioning and replication techniques to manage scalability and fault tolerance in distributed systems. Understand various partitioning strategies and replication models, their trade-offs, and how they combine to optimize system performance and availability.

In the previous lesson, we introduced scaling strategies as the response to growth, focusing on two core approaches: vertical scaling (scaling up a single machine) and horizontal scaling (scaling out across multiple machines). These form the foundation for all capacity-related design decisions.

In this lesson, we build on that foundation by exploring what happens after the compute layer has been scaled horizontally, and how the data layer must evolve through partitioning and replication to handle continued demands for scale, reliability, and performance.

Data partitioning strategies

Partitioning divides a dataset so that each node owns a distinct subset of rows or documents. The partitioning method determines how evenly the load is distributed and how efficiently different query patterns execute.

A database with two partitions to distribute the data and associated read/write load

Several strategies exist, each with distinct trade-offs in distribution uniformity, query flexibility, and operational complexity.

  • Range-based partitioning: The data is split into non-overlapping ranges of the partition key. Older partitions can be archived easily, which lets queries over newer ranges run more efficiently. This approach is simple to implement and supports efficient range queries, but it is vulnerable to hotspots when access patterns cluster around specific ranges. A hotspot is a partition that receives disproportionately high read or write traffic compared to other partitions, causing a performance bottleneck.

Invoice table is horizontally partitioned using Customer_Id
  • Hash-based partitioning: A hash function is applied to each key, and the resulting hash value determines which shard stores the record. With a well-chosen hash function, this spreads keys roughly evenly across shards, reducing hotspot risk significantly. The trade-off is that range queries ...
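
To make the two routing rules above concrete, here is a minimal sketch in Python. It is illustrative only: the boundary values in `RANGE_BOUNDS`, the shard count, and the function names `range_partition` and `hash_partition` are assumptions made for this example and do not come from any particular database.

```python
import bisect
import hashlib

# Hypothetical split points for a numeric key such as Customer_Id.
# Partition 0 holds keys < 1000, partition 1 holds 1000-1999,
# partition 2 holds 2000-2999, and partition 3 holds keys >= 3000.
RANGE_BOUNDS = [1000, 2000, 3000]


def range_partition(customer_id: int) -> int:
    """Return the index of the range partition that owns customer_id."""
    # bisect_right counts how many split points are <= customer_id,
    # which is exactly the index of the owning range.
    return bisect.bisect_right(RANGE_BOUNDS, customer_id)


def hash_partition(key: str, num_shards: int = 4) -> int:
    """Return the shard index for key using a stable hash modulo num_shards."""
    # hashlib is used instead of the built-in hash() because Python
    # randomizes string hashes per process, which would break routing.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


if __name__ == "__main__":
    for customer_id in (1000, 1001, 1002, 1003):
        print(
            customer_id,
            "range ->", range_partition(customer_id),
            "hash ->", hash_partition(str(customer_id)),
        )
```

In this sketch, the neighboring IDs 1000 through 1003 all land in the same range partition. That keeps a scan over them local to one node, but it also concentrates a burst of traffic on those keys onto that node. The hash-based router ignores key order, so neighboring keys are generally assigned to shards independently of one another.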