Hash Index

Learn the mechanics of the hash index, a basic data structure for fast key-value lookups.

Introduction

A hash index is a widely used basic data structure that provides fast key-value lookups. It is comparable to a dictionary or a hash map in programming languages, where the key acts as a lookup index pointing to the location where the value is stored. The search and insertion time complexity of a hash index is O(1) on average.
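As a quick, minimal sketch in Python (the key and value names are purely illustrative), a built-in dictionary already behaves this way: inserts and lookups take constant time on average.

```python
# A Python dict is a hash map: the key hashes to the bucket where the
# value is stored, so inserts and lookups are O(1) on average.
index = {}
index["user:1"] = "Alice"    # insert
print(index["user:1"])       # lookup -> Alice
```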

Many databases use hash indexes as auxiliary structures for lookup operations alongside other data structures such as the B+ tree. For example, hash indexes are supported in databases like MySQL and PostgreSQL.

Data structure

Hash indexes have two primary data structures:

  • In-memory HashMap

  • Disk-resident append-only storage structure

HashMap

A hash index keeps a HashMap data structure in main memory that maps each key to the corresponding value offset. The value offset is the actual location of the data on disk. A hash index requires that all the keys fit in RAM. However, since the database loads the value from disk, the size of the value is immaterial to the HashMap.
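The sketch below illustrates this split, under assumed names (a single data.log file and a "key,value" line layout): the in-memory HashMap stores only the byte offset of each value, and the value itself is read from disk on demand.

```python
hashmap = {}                                   # in-memory: key -> value offset on disk
DATA_FILE = "data.log"                         # assumed on-disk data file

def put(key, value):
    with open(DATA_FILE, "ab") as f:
        offset = f.tell()                      # record will start at the end of the file
        f.write(f"{key},{value}\n".encode())
    hashmap[key] = offset                      # only the offset is kept in RAM

def get(key):
    offset = hashmap[key]                      # O(1) average lookup in memory
    with open(DATA_FILE, "rb") as f:
        f.seek(offset)                         # jump straight to the record on disk
        record = f.readline().decode().rstrip("\n")
    return record.split(",", 1)[1]             # return just the value

put("user:1", "Alice")
put("user:2", "Bob")
print(get("user:2"))                           # Bob
```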

Storage structures

Storage structures are disk-resident, append-only data structures that write every modification request to a sequential log file called a segment file.
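As a minimal append-only sketch (reusing the assumed "key,value" line format from above), even an update to an existing key is written as a new record at the end of the segment file; nothing is modified in place.

```python
def append_record(segment_path, key, value):
    # Every modification request is appended to the end of the segment file.
    with open(segment_path, "ab") as segment:
        offset = segment.tell()                # the new record starts here
        segment.write(f"{key},{value}\n".encode())
    return offset                              # the caller stores this offset in the HashMap

append_record("segment-000.log", "user:1", "Alice")
append_record("segment-000.log", "user:1", "Alicia")   # update: appended, not overwritten
# The HashMap now points "user:1" at the second record; the first record stays
# on disk as stale data until compaction removes it.
```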

A single segment file is not scalable. The file keeps growing with every write and eventually exhausts the disk space. A single-file approach also makes disk seeks expensive on sufficiently large files. Therefore, a monolithic segment file is broken down into multiple smaller segment files of a fixed size.

The collection of segment files forms the on-disk storage structure. Every storage structure has one active segment file that receives all writes. When the active segment file exceeds a fixed size threshold, it is closed, and a new file is created for subsequent writes. Accordingly, the HashMap is updated to store the segment file metadata along with the value offset.
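The sketch below shows one way this rotation might look, using assumed names and a small, illustrative SEGMENT_MAX_BYTES threshold: when the active segment crosses the threshold, writes move to a new segment file, and the HashMap records both the segment name and the offset.

```python
import os

SEGMENT_MAX_BYTES = 1024          # assumed threshold, chosen small for illustration

class SegmentedLog:
    def __init__(self):
        self.segment_id = 0
        self.hashmap = {}         # key -> (segment_name, offset)

    def _active_segment(self):
        return f"segment-{self.segment_id:03d}.log"

    def put(self, key, value):
        path = self._active_segment()
        # Rotate: close the active segment once it is over the threshold.
        if os.path.exists(path) and os.path.getsize(path) >= SEGMENT_MAX_BYTES:
            self.segment_id += 1
            path = self._active_segment()
        with open(path, "ab") as segment:
            offset = segment.tell()
            segment.write(f"{key},{value}\n".encode())
        self.hashmap[key] = (path, offset)     # segment metadata + value offset

    def get(self, key):
        path, offset = self.hashmap[key]
        with open(path, "rb") as segment:
            segment.seek(offset)
            return segment.readline().decode().rstrip("\n").split(",", 1)[1]
```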

The adoption of multiple segment files creates a different problem: over time, the segment files accumulate redundant data, because every update to a key leaves stale records behind in older segments. The database employs a compaction strategy to tackle this challenge. Periodically, smaller segment files are merged and compacted into larger segment files in the background, keeping only the latest entry for each key and discarding duplicates.
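A minimal compaction sketch follows, assuming the same "key,value" line format and that closed segments are passed oldest first: replaying the segments into a dictionary keeps only the newest value per key, which is then written to a single compacted segment whose offsets replace the old entries in the HashMap.

```python
def compact(segment_paths, compacted_path):
    latest = {}
    for path in segment_paths:                  # oldest segment first
        with open(path, "rb") as segment:
            for line in segment:
                key, value = line.decode().rstrip("\n").split(",", 1)
                latest[key] = value             # later records overwrite earlier ones
    offsets = {}
    with open(compacted_path, "wb") as out:
        for key, value in latest.items():
            offsets[key] = out.tell()
            out.write(f"{key},{value}\n".encode())
    return offsets                              # new offsets to install in the HashMap

# Usage: merge two closed segments into one compacted segment.
# new_offsets = compact(["segment-000.log", "segment-001.log"], "segment-compacted.log")
```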
