Data Partitioning

We'll cover the following...

Why do we partition data?
Sharding
Request routing
- ZooKeeper
Conclusion

Why do we partition data?

Data is a critical asset for any organization. As data volumes and concurrent traffic grow, traditional single-node databases reach scalability limits, leading to degraded latency and throughput. While traditional databases offer robust features, such as range queries, secondary indexes, and transactions with ACID properties, maintaining these in a distributed environment is challenging. A range query is a common database operation that retrieves all records where some value is between an upper and lower boundary. A secondary index is a way to efficiently access records in a database by means of some piece of information other than the primary key. A transaction is a single logical unit of work that accesses and possibly modifies the contents of a database.

At some point, a single node becomes a bottleneck. We need to distribute data across multiple nodes while preserving the guarantees of the relational model. Migrating to a NoSQL system is one option, but it can be costly because many existing codebases are tightly coupled to relational databases. Similarly, adopting third-party scaling tools can add operational and architectural complexity.

Data partitioning (or sharding) addresses this bottleneck directly. It distributes data across multiple nodes, with each node managing a subset of the data. The goal is to achieve balanced partitions that efficiently handle increasing query rates and data volume.

In this lesson, we will discuss partitioning strategies, their challenges, and solutions.

Sharding

To divide the load among multiple nodes, we use partitioning or sharding. This process splits a large dataset into smaller chunks stored across different nodes in the network.

Partitioning must be balanced. If one partition receives significantly more data or queries, it becomes a bottleneck, known as a hotspot. This degrades the efficacy of the system, as a majority of traffic is routed to a single congested node. We generally use two methods to shard data:

Vertical sharding
Horizontal sharding
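
To make the balance requirement described above concrete, here is a minimal sketch of how a hotspot might be detected. It is an illustration only: the function names, the range-based routing rule, and the threshold of twice the mean load are all assumptions chosen for the example, not part of any particular database.

```python
from collections import Counter

def partition_load(requests, route):
    """Count how many requests each partition receives under a routing function."""
    return Counter(route(key) for key in requests)

def hotspots(load, num_partitions, threshold=2.0):
    """Return partitions whose load exceeds `threshold` times the mean load.

    The threshold is an arbitrary, illustrative cutoff (assumption).
    """
    mean = sum(load.values()) / num_partitions
    return sorted(p for p, count in load.items() if count > threshold * mean)

# Hypothetical setup: 4 partitions, user IDs 0..999 routed by range.
def route(user_id):
    return user_id // 250

# Skewed traffic: 700 requests hit IDs 0..99, 100 requests go to each other range.
requests = (
    [i % 100 for i in range(700)]
    + [250 + i for i in range(100)]
    + [500 + i for i in range(100)]
    + [750 + i for i in range(100)]
)

load = partition_load(requests, route)
print(dict(load))         # per-partition request counts
print(hotspots(load, 4))  # partitions receiving far more than their share
```

In this sketch, partition 0 receives most of the requests, so it is the only one flagged. A real system would typically also track data size per partition, not just request counts.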

Vertical sharding

Vertical sharding involves moving specific tables or columns to different database instances or physical servers. This is often used to separate columns containing large text or binary data (blobs) from the main table to improve retrieval speed.

For example, if an Employee table contains a large photo blob, we can split it into two tables: a lighter Employee table (metadata) and an EmployeePicture table (blob data). As shown in the figure below, both tables retain the primary key EmployeeID to allow efficient reconstruction of the data.
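
The split described above can be sketched with Python's built-in sqlite3 module. This is a minimal sketch under stated assumptions: two in-memory SQLite databases stand in for separate database instances, and the Name and Title columns and the helper functions are hypothetical additions for illustration. Only the Employee and EmployeePicture tables and the shared EmployeeID key come from the example.

```python
import sqlite3

# Two in-memory databases stand in for two separate instances (assumption).
meta_db = sqlite3.connect(":memory:")
blob_db = sqlite3.connect(":memory:")

# Lightweight metadata table; Name and Title are hypothetical columns.
meta_db.execute(
    "CREATE TABLE Employee (EmployeeID INTEGER PRIMARY KEY, Name TEXT, Title TEXT)"
)
# Blob table; it keeps the same primary key so rows can be matched up again.
blob_db.execute(
    "CREATE TABLE EmployeePicture (EmployeeID INTEGER PRIMARY KEY, Picture BLOB)"
)

def add_employee(employee_id, name, title, picture):
    """Write the metadata and the picture to their respective stores."""
    meta_db.execute(
        "INSERT INTO Employee VALUES (?, ?, ?)", (employee_id, name, title)
    )
    blob_db.execute(
        "INSERT INTO EmployeePicture VALUES (?, ?)", (employee_id, picture)
    )

def get_employee(employee_id, with_picture=False):
    """Read metadata only by default; fetch the blob only when asked."""
    row = meta_db.execute(
        "SELECT EmployeeID, Name, Title FROM Employee WHERE EmployeeID = ?",
        (employee_id,),
    ).fetchone()
    if row is None:
        return None
    record = {"EmployeeID": row[0], "Name": row[1], "Title": row[2]}
    if with_picture:
        pic = blob_db.execute(
            "SELECT Picture FROM EmployeePicture WHERE EmployeeID = ?",
            (employee_id,),
        ).fetchone()
        record["Picture"] = pic[0] if pic else None
    return record

add_employee(1, "Ada", "Engineer", b"\x89PNG-placeholder-bytes")
print(get_employee(1))                     # metadata only, no blob read
print(get_employee(1, with_picture=True))  # record reassembled via EmployeeID
```

Most reads touch only the small Employee table, and the picture store is consulted only when the blob is actually needed. Note that in this sketch the two writes are not atomic across the two databases; a production design would need to handle a failure between them.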

Vertical sharding is often manual and static. In contrast, horizontal sharding is better suited for automation and dynamic scaling.