Often, we have slow-moving data in the distributed databases that need to be separated from the other data. Similarly, there might be some sensitive data like passwords that need extra security controls. Vertical sharding of databases plays a crucial role in all of these cases by separating desired data from the rest of the data or enabling extra security controls over some data.
We can put different tables in different database instances. These databases might be running on a different physical server. We might break a table into multiple tables, such that some columns are in one table, while the rest are in the other. We should be careful when there are joins between multiple tables. We would like to keep such tables together on one shard.
Vertical sharding is used to increase the speed of data retrieval from a table consisting of columns containing very wide text or Binary Large Object (BLOB). In this case, the column with large text or BLOB is split into a different table.
As shown in the following figure, the
Employee table is divided into two tables: a reduced
Employee table and an
EmployeePicture table. The
EmployePicture table has just two columns,
Picture, separated from the original table. Moreover, the primary key,
EmpoloyeeID, of the
Employee table is added in both partitioned tables. This makes the data read and write easier and the reconstruction of the table is performed efficiently.
A well-partitioned database architecture evenly distributes data and the workload across all the distributed nodes. To maintain an expected level of performance consistency, queries are accurately directed to the desired node and the miss rate is minimized. There should be a balance in keeping the number of distributed database nodes. A large number of nodes would unnecessarily consume the database engine resources along with producing contention spots even on large-scale hardware. Therefore, there should be a balance between the commitment to optimize query performance and the goal to consolidate, to utilize resources with a better approach for cost (latency and resource consumption) reduction.
Let's understand the join operation in two different ways in distributed database systems.
In relational algebra, as well as in RDBMS, the join operations are considered the most expensive operations. When joining two relations (or tables), suppose we have table A and B, and every tuple in A is compared with every tuple in B to check if the join condition is satisfied. If the condition is satisfied, the rows are concatenated and copied into the result relation. A distributed join takes two or more partitioned tables and performs the join operations on them using one of the two strategies, explained below.
There are two basic join strategies in partitioned (distributed) tables: join and semi-join. Join tries to directly optimize the ordering of its operation, while semi-join is an alternative approach that tries to minimize the communication costs during join operations.
Join operations minimize the query processing cost (execution time, the number of disk accesses, and the communication costs across different nodes) by optimizing the ordering of join operations of relations. As an example, we assume two relations, A and B, which are stored on two nodes, say node A and node B. Moreover, we assume that relations transfer among different nodes of the distributed system and are performed on a set-at-a-time basis instead of tuple-at-a-time bases.
Let's consider a simple query that involves the joining of two relations, A and B, based on some attribute
Attr or condition, which is stored at nodes A and B, respectively. In performing
The simple join operation
The SQL statement of the join operation explained in the following illustration can be written as:
SELECT A.ID, A.Rank, A.Salary, B.City FROM A JOIN B ON A.salary <= 120K
The join operation is explained in the following illustration.
The main drawback of the simple join strategy is that the entire (resultant) relation has to be transmitted between the sites. The objective of semi-join is to minimize the communication cost by replacing simple join operations of a query with semi-join operations. The join operation between two relations A and B over the column or condition
Attr, which are stored at different sites of the distributed system, can be replaced by semi-join operations as follows:
The title of tables should be descriptive—for example, in the above query the title of table A should be
Employeeand B should be
Salary. However, their titles are kept as A and B to make the relational algebra easy to understand.
The semi-join operation is illustrated in the following sequence of figures.
View all Courses