How to shard a database vertically

Introduction

Often, we have slow-moving data in the distributed databases that need to be separated from the other data. Similarly, there might be some sensitive data like passwords that need extra security controls. Vertical sharding of databases plays a crucial role in all of these cases by separating desired data from the rest of the data or enabling extra security controls over some data.

Vertical sharding

We can put different tables in different database instances. These databases might be running on a different physical server. We might break a table into multiple tables, such that some columns are in one table, while the rest are in the other. We should be careful when there are joins between multiple tables. We would like to keep such tables together on one shard.

Vertical sharding is used to increase the speed of data retrieval from a table consisting of columns containing very wide text or Binary Large Object (BLOB). In this case, the column with large text or BLOB is split into a different table.

As shown in the following figure, the Employee table is divided into two tables: a reduced Employee table and an EmployeePicture table. The EmployePicture table has just two columns, EmployeID and Picture, separated from the original table. Moreover, the primary key, EmpoloyeeID, of the Employee table is added in both partitioned tables. This makes the data read and write easier and the reconstruction of the table is performed efficiently.

A well-partitioned database architecture evenly distributes data and the workload across all the distributed nodes. To maintain an expected level of performance consistency, queries are accurately directed to the desired node and the miss rate is minimized. There should be a balance in keeping the number of distributed database nodes. A large number of nodes would unnecessarily consume the database engine resources along with producing contention spots even on large-scale hardware. Therefore, there should be a balance between the commitment to optimize query performance and the goal to consolidate, to utilize resources with a better approach for cost (latency and resource consumption) reduction.

Let's understand the join operation in two different ways in distributed database systems.

Distributed join operations

In relational algebra, as well as in RDBMS, the join operations are considered the most expensive operations. When joining two relations (or tables), suppose we have table A and B, and every tuple in A is compared with every tuple in B to check if the join condition is satisfied. If the condition is satisfied, the rows are concatenated and copied into the result relation. A distributed join takes two or more partitioned tables and performs the join operations on them using one of the two strategies, explained below.

There are two basic join strategies in partitioned (distributed) tables: join and semi-join. Join tries to directly optimize the ordering of its operation, while semi-join is an alternative approach that tries to minimize the communication costs during join operations.

Join operation

Join operations minimize the query processing cost (execution time, the number of disk accesses, and the communication costs across different nodes) by optimizing the ordering of join operations of relations. As an example, we assume two relations, A and B, which are stored on two nodes, say node A and node B. Moreover, we assume that relations transfer among different nodes of the distributed system and are performed on a set-at-a-time basis instead of tuple-at-a-time bases.

Let's consider a simple query that involves the joining of two relations, A and B, based on some attribute Attr or condition, which is stored at nodes A and B, respectively. In performing $A\Join_{Attr} B$ , the relation with the smaller size needs to be transferred to the node of the relationship with a larger size. Therefore, it is necessary to calculate the size of the relations A and B. If the query involves one more relation C, then the obvious choice is to transfer the intermediate result $A \Join_{Attr} B$ or the relation C, whichever is smaller in size.

The simple join operation $A \Join_{Attr} B$ can be implemented in the following way with the assumption that $size(A) < size(B)$ :

$A \Join_{Attr} B$ :

The relation A is to be transferred to node B.
The join operation is performed at node B.

The SQL statement of the join operation explained in the following illustration can be written as:

$(A\ltimes_{Attr} B) \Join_{Attr}B$ :

The projection operationIn relational algebra, the projection operation returns a specific number of attributes/columns from the table along with the data upon which the projection is applied. $\Pi_{Attr}(B)$ is performed at node B, and the result is sent to node A.
Node A computes $T=(A\ltimes_{Attr} B)$ .
The join operation $T\Join_{Attr}B$ is computed at node B.

The title of tables should be descriptive—for example, in the above query the title of table A should be Employee and B should be Salary. However, their titles are kept as A and B to make the relational algebra easy to understand.

The semi-join operation is illustrated in the following sequence of figures.

How to shard a database vertically

Introduction

Vertical sharding

Distributed join operations

Join operation

Semi-join operation