Design of a Distributed Search
Define the architecture of a high-volume distributed search system that separates offline indexing from online searching. Learn how document partitioning and MapReduce facilitate parallel index construction and query processing. Implement replication strategies to achieve fault tolerance and scale system availability globally.
High-level design
A distributed search system operates in two phases. The offline phase handles data crawling and indexing in the background. The online phase serves user search queries.
The crawler collects content from sources (e.g., YouTube videos), extracts text and metadata to enable intelligent search based on titles, descriptions, and video content, formats it as JSON, and stores it in a distributed storage system.
The indexer fetches documents and builds indexes using
on a distributed cluster. The resultingMapReduce As stated by Wikipedia, “MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster of commodity machines.” is stored in a distributed storage system.index table The index table consists of terms and their mappings. Distributed storage holds the raw documents and the constructed index.
The user submits a search string containing one or more words.
The searcher parses the query, maps terms to the index, and returns ranked results. It also handles spell correction and relevance ranking.
API design
The API is straightforward since users send string requests.
Search: The search function executes when a user queries the system.
search(query)
Parameter | Description |
| This is the textual query entered by the user in the search bar, based on which the results are found. |
Detailed discussion
As the indexer is the core component of a search system, we discussed indexing techniques and the problems associated with centralized indexing in the previous lesson. Let’s now consider a distributed solution for indexing and searching.
Distributed indexing and searching
The input to the indexing system consists of documents collected during crawling. To build the index at scale, the system uses a cluster of commodity nodes and partitions documents across them according to capacity and load.
To achieve cost efficiency, we split documents across many small nodes. We must decide how to partition this data. Two common techniques are:
Document partitioning: Documents are split into subsets. Each node indexes its assigned subset.
Term partitioning: The dictionary of terms is split. A single node handles a specific set of terms (e.g., all terms starting with “search”).
Term partitioning sends queries only to nodes handling specific terms. While this increases concurrency, multi-word queries require expensive data transfer between nodes to merge results. ...