Indexing in a Distributed Search
Discover the role of indexing in distributed search, focusing on the Inverted Index structure for fast information retrieval. Evaluate the design trade-offs and understand why centralized indexing systems cannot scale to handle massive data sets.
This lesson explores the mechanics of indexing and the transition from centralized to distributed architectures.
Indexing
Indexing organizes data to facilitate fast and accurate information retrieval.
Build a searchable index
A basic approach assigns a unique ID to each document and stores the text in a database table (a forward index). The first column holds the document ID, and the second contains the text.
Simple Document Index
ID | Document Content |
1 | Elasticsearch is the distributed and analytics engine that is based on REST APIs. |
2 | Elasticsearch is a Lucene library-based search engine. |
3 | Elasticsearch is a distributed search and analytics engine built on Apache Lucene. |
In production, documents are significantly larger than single sentences. Storing full text in a table creates a massive dataset. Searching this document-level index is slow because the system must scan every document to count occurrences of the search string.
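To make the cost of this scan concrete, here is a minimal sketch of a search over the forward index above. The names forward_index and scan_search are hypothetical and chosen only for illustration. An in-memory dictionary stands in for the database table, and matching is a simple case-insensitive substring count.

```python
# Forward index: document ID -> full text (the three sample documents above).
# A dict stands in for the database table; this is an assumption for illustration.
forward_index = {
    1: "Elasticsearch is the distributed and analytics engine that is based on REST APIs.",
    2: "Elasticsearch is a Lucene library-based search engine.",
    3: "Elasticsearch is a distributed search and analytics engine built on Apache Lucene.",
}


def scan_search(index, term):
    """Return {doc_id: occurrence_count} for documents containing `term`.

    Every document is read in full on every query, so the work grows
    with the total amount of stored text.
    """
    needle = term.lower()
    hits = {}
    for doc_id, text in index.items():
        count = text.lower().count(needle)  # case-insensitive substring count
        if count:
            hits[doc_id] = count
    return hits


print(scan_search(forward_index, "lucene"))  # {2: 1, 3: 1}
```

Even for a term that appears in only two documents, the loop still reads all three. With billions of large documents, that per-query full scan is what makes the document-level index impractical.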
Note: A fuzzy search, which uses approximate string matching rather than exact matching, adds complexity. The system must identify unique candidate strings across all documents, determine approximate matches, and locate them, significantly increasing latency.
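The three steps in the note (collect candidate strings, find approximate matches, locate them) can be sketched as follows. This is an illustration under stated assumptions, not how any particular search engine implements fuzzy search. The function name fuzzy_search, the tokenization rule, and the 0.8 similarity cutoff are all hypothetical, and Python's standard difflib module stands in for a real approximate-matching algorithm.

```python
import difflib
import re

# Same three sample documents as in the table above.
documents = {
    1: "Elasticsearch is the distributed and analytics engine that is based on REST APIs.",
    2: "Elasticsearch is a Lucene library-based search engine.",
    3: "Elasticsearch is a distributed search and analytics engine built on Apache Lucene.",
}


def fuzzy_search(docs, query, cutoff=0.8):
    """Return {matched_term: [doc_ids]} for terms similar to `query`."""
    # Step 1: identify the unique candidate strings across all documents.
    vocabulary = {}
    for doc_id, text in docs.items():
        for token in re.findall(r"[a-z0-9]+", text.lower()):
            vocabulary.setdefault(token, set()).add(doc_id)

    # Step 2: determine approximate matches (difflib similarity ratio >= cutoff).
    matches = difflib.get_close_matches(
        query.lower(), list(vocabulary), n=5, cutoff=cutoff
    )

    # Step 3: locate the documents that contain each matched term.
    return {term: sorted(vocabulary[term]) for term in matches}


print(fuzzy_search(documents, "lucen"))  # {'lucene': [2, 3]}
```

Note that step 1 rebuilds the candidate list by reading every document on each query, and step 2 compares the query against every candidate. Both passes sit on top of the scan cost of exact matching, which is where the extra latency comes from.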
Search query response time depends on:
Data organization strategy
Data volume
Hardware resources (processing speed and RAM)
Scanning billions of documents in a document-level index is inefficient. To reduce ...