Secondary Indexes

We'll cover the following

Document-based partitioning of secondary indexes
Term-based partitioning of secondary indexes

So far we have covered material on partitioning data. If the access patterns for the stored data also use columns/fields other than the primary key, then we may want to create secondary indexes on those columns/fields. Let’s understand the difference between a primary and secondary index first.

A primary index is based on the primary key of a table. The primary key comprises a set of fields in the table that together represent a unique value for each record. Furthermore, the records in a table can be thought of as being laid out (or sorted) in the order of the primary key, so a search for a record using the primary index can be done with binary search. In contrast, a secondary index can be based on any fields of a table, and the values of those fields may or may not be unique across the records.

For instance, in our songs example, if the records are often searched by the length of the song, then we can create an index on the column that captures the song’s length in seconds. Note that a secondary index doesn’t uniquely identify a record, e.g., there could be several hundred songs that are, say, 300 seconds (5 minutes) long. In contrast, we can assign a monotonically increasing long id to every song that serves as the primary key and thus makes up the primary index.
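To make the contrast concrete, here is a minimal single-node sketch in Python. The song records, the helper names (`find_by_id`, `find_by_length`), and the in-memory layout are all hypothetical and chosen only for illustration; a real database would keep these structures on disk.

```python
from bisect import bisect_left
from collections import defaultdict

# Hypothetical records: (song_id, filename, length_seconds), kept sorted by song_id.
songs = [
    (1, "Amber.mp3", 300),
    (2, "Blue.mp3", 215),
    (3, "Coral.mp3", 300),
    (4, "Dusk.mp3", 180),
]

# Primary index: ids are unique and sorted, so a lookup can use binary search.
song_ids = [record[0] for record in songs]

def find_by_id(song_id):
    """Return the single record with this primary key, or None."""
    pos = bisect_left(song_ids, song_id)
    if pos < len(song_ids) and song_ids[pos] == song_id:
        return songs[pos]
    return None

# Secondary index: one length can map to many songs, so each entry
# holds a list of primary keys rather than a single record.
length_index = defaultdict(list)
for song_id, _filename, length in songs:
    length_index[length].append(song_id)

def find_by_length(length):
    """Return every record whose length matches (possibly none)."""
    return [find_by_id(song_id) for song_id in length_index.get(length, [])]
```

Here `find_by_id(3)` resolves to at most one record, whereas `find_by_length(300)` can return several, which is exactly the non-uniqueness described above.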

Secondary indexes on data that lives on a single node are simple enough, but when the data is partitioned, there are two ways to create them:

Document-based partitioning of secondary indexes
Term-based partitioning of secondary indexes

We discuss them next:

Document-based partitioning of secondary indexes

With document-based partitioning, each node maintains a secondary index only for the documents/records that live on that same node. A secondary index partitioned by document is also known as a local index, as opposed to a global index, which we cover next. Each partition is responsible only for the documents that live on that partition and doesn’t care about the documents/records in other partitions.

In our songs example, we could range partition the original data using the filename and then create a secondary index on the length of the song. We could assign the first partition the range [A-E], the next partition the range [F-J], and so on. Each of these partitions will have its own local index.
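The layout described above can be sketched as follows. This is an illustrative in-memory model, not how any particular database implements it: the `Partition` class, the `insert` and `find_by_length` helpers, the sample songs, and the ranges beyond [F-J] are all assumptions made for the example.

```python
from collections import defaultdict

class Partition:
    """One range partition (by first letter of the filename) with its local index."""

    def __init__(self, first, last):
        self.first, self.last = first, last
        self.songs = {}                        # filename -> length in seconds
        self.length_index = defaultdict(list)  # local secondary index: length -> filenames

    def owns(self, filename):
        return self.first <= filename[0].upper() <= self.last

    def add(self, filename, length):
        # A write touches only this partition's data and its own local index.
        self.songs[filename] = length
        self.length_index[length].append(filename)

    def find_by_length(self, length):
        return list(self.length_index.get(length, []))

# [A-E] and [F-J] come from the text; the remaining ranges are assumed.
partitions = [Partition("A", "E"), Partition("F", "J"), Partition("K", "O"),
              Partition("P", "T"), Partition("U", "Z")]

def insert(filename, length):
    for partition in partitions:
        if partition.owns(filename):
            partition.add(filename, length)
            return
    raise ValueError(f"no partition owns {filename!r}")

def find_by_length(length):
    # A local index knows nothing about other partitions, so a query on the
    # secondary key has to ask every partition and merge the answers.
    results = []
    for partition in partitions:
        results.extend(partition.find_by_length(length))
    return results

insert("Amber.mp3", 300)
insert("Hello.mp3", 300)
insert("Island.mp3", 215)
```

Note the trade-off this sketch exposes: inserting a song updates a single partition, but `find_by_length(300)` must consult all of them because matching songs ("Amber.mp3" and "Hello.mp3" here) can live on different partitions.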
