Data Partitioning and Replication
Explore data partitioning and replication techniques to manage scalability and fault tolerance in distributed systems. Understand various partitioning strategies and replication models, their trade-offs, and how they combine to optimize system performance and availability.
In the previous lesson, we introduced scaling strategies as the response to growth, focusing on two core approaches: vertical scaling (scaling up a single machine) and horizontal scaling (scaling out across multiple machines). These form the foundation for all capacity-related design decisions.
In this lesson, we build on that foundation by exploring what happens after the compute layer has been scaled horizontally, and how the data layer must evolve through partitioning and replication to meet continued scale, reliability, and performance demands.
Data partitioning strategies
Partitioning divides a dataset so that each node owns a distinct subset of rows or documents. The partitioning method determines how evenly the load is distributed and how efficiently different query patterns execute.
Several strategies exist, each with distinct trade-offs in distribution uniformity, query flexibility, and operational complexity.
Range-based partitioning: The data is split into contiguous, non-overlapping ranges of the partition key, and each range is assigned to one partition. Old partitions can easily be archived, so queries over newer ranges are served more efficiently. This approach is simple to implement and supports efficient range queries, but it is vulnerable to hotspots when access patterns cluster around specific ranges. A hotspot is a partition that receives disproportionately high read or write traffic compared to other partitions, causing a performance bottleneck.
Hash-based partitioning: A hash function is applied to each key, and the resulting hash value determines the shard that stores the key. This distributes keys roughly uniformly, which significantly reduces hotspot risk. The trade-off is that range queries ...
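The contrast between the two strategies above can be made concrete with a small sketch. It assumes integer keys, three partitions, and hypothetical range boundaries. The names `RANGE_BOUNDS`, `range_partition`, and `hash_partition` are illustrative and do not come from any particular database.

```python
import bisect
import hashlib

# Hypothetical exclusive upper bounds for three range partitions:
# partition 0 owns keys below 1000, partition 1 owns 1000-1999,
# partition 2 owns 2000-2999.
RANGE_BOUNDS = [1000, 2000, 3000]


def range_partition(key: int) -> int:
    """Return the index of the range partition that owns `key`."""
    idx = bisect.bisect_right(RANGE_BOUNDS, key)
    if idx >= len(RANGE_BOUNDS):
        raise ValueError(f"key {key} is outside every configured range")
    return idx


def hash_partition(key: int, num_shards: int = 3) -> int:
    """Return the shard index for `key` using a stable hash.

    SHA-256 is used here only because it is deterministic across
    processes; real systems often choose a faster non-cryptographic hash.
    """
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


if __name__ == "__main__":
    adjacent_keys = range(1500, 1510)
    # Adjacent keys share one range partition, so a range scan touches
    # a single node, but heavy traffic on this range hits only that node.
    print("range:", [range_partition(k) for k in adjacent_keys])
    # The same keys are scattered by the hash, which spreads load but
    # means a range scan may have to consult several shards.
    print("hash: ", [hash_partition(k) for k in adjacent_keys])
```

Under these assumptions, every key from 1500 to 1509 maps to the same range partition, while the hash function assigns them to shards with no regard for key order. This is the same property that protects hash partitioning from range-driven hotspots.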