Log-Structured Merge-Tree

Understand the semantics of the log-structured merge-tree.

Introduction

The Log Structured Merge Tree (LSM) is a disk-resident data structure to persist key-value pairs optimized for write-heavy query patterns.

LSM includes an append-only storage structure as the primary data structure to handle high write volumes. These storage structures are periodically merged and compacted in the background to eliminate duplicate and deleted records. LSM is a prominent data structure used in multiple NoSQL datastores such as Cassandra and Hbase.

Data structure

LSM includes four main data structures:

  • Memtable

  • SSTable

  • WAL

  • Bloom filter

Memtable

Memtable is an in-memory data structure that acts as the frontier for all the client requests handling reads and writes. Memtable implements a balanced binary search tree such as AVL tree, guaranteeing that the time complexity of insert, delete, update, and read requests be O(log2M)O(log_2M).

SSTable

Sorted String Tables (SSTable) are on-disk data structures that are a collection of files referred to as segment files. Each segment file stores multiple key-value pairs sorted by key. These segment files are periodically scanned in the background and merged to create large segment files. The resulting segment files eliminate, duplicate, and deleted records.

Each segment file includes an additional index structure to speed up lookups in the file system implemented in B+ tree.

Since SSTables store key-value pairs sorted by key, the implementation of SSTables groups multiple key-value pairs into a block and compresses them for efficient storage. This compression enables Memtable to persist only spatial keys leading to range scans.

Get hands-on with 1200+ tech skills courses.