Design of a Distributed Search

Get an overview of the design of a distributed search system that manages a large number of queries per second.

We'll cover the following...

High-level design
API design
Detailed discussion
Summary

High-level design

The crawler collects content from the intended resource. For example, if we build a search for a YouTube application, the crawler crawls all of the videos on YouTube and extracts the textual content of each one. This content could be the video’s title, its description, the channel name, or even the video’s annotations, which enables an intelligent search based not only on the title and description but also on the content of the video. The crawler formats the extracted content for each video as a JSON document and stores these documents in distributed storage.
The indexer fetches the documents from distributed storage and indexes them using a distributed data processing system such as MapReduce, which runs on a cluster of commodity machines and constructs the index in parallel. (As stated by Wikipedia, “MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster of commodity machines.”) The constructed index table, which consists of terms and their mappings, is stored in the distributed storage.
The distributed storage is used to store the documents and the index.
The user enters a search string, which may contain multiple words, in the search bar.
The searcher parses the search string, looks up the mappings in the index stored in the distributed storage, and returns the best-matching results to the user. The searcher maps incorrectly spelled words in the search string to the closest vocabulary words. It also finds the documents that include all of the words and ranks them.
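
The flow described above can be sketched on a single machine to make the roles of the indexer and the searcher more concrete. The following is a minimal illustration under stated assumptions, not the actual design: the sample documents, the field names (`id`, `title`, `description`), the helper functions, and the ranking by summed term frequency are all hypothetical, and the map and reduce steps run in one process rather than on a cluster. Spelling correction is approximated with Python's standard `difflib.get_close_matches`.

```python
import difflib
from collections import Counter, defaultdict

# Hypothetical crawler output: one JSON-like document per video.
documents = [
    {"id": 1, "title": "Distributed search basics",
     "description": "How an inverted index works"},
    {"id": 2, "title": "MapReduce tutorial",
     "description": "Parallel index construction on a cluster"},
    {"id": 3, "title": "Cooking pasta",
     "description": "A quick dinner recipe"},
]


def tokenize(text):
    """Lowercase the text, split on whitespace, and strip basic punctuation."""
    words = (w.strip(".,!?").lower() for w in text.split())
    return [w for w in words if w]


def map_phase(doc):
    """Map step: emit (term, (doc_id, term_frequency)) pairs for one document."""
    counts = Counter(tokenize(doc["title"] + " " + doc["description"]))
    return [(term, (doc["id"], freq)) for term, freq in counts.items()]


def reduce_phase(pairs):
    """Reduce step: group the emitted pairs by term to form the index table."""
    index = defaultdict(dict)
    for term, (doc_id, freq) in pairs:
        index[term][doc_id] = freq
    return dict(index)


def build_index(docs):
    """Run the map step over every document, then reduce the combined output."""
    pairs = [pair for doc in docs for pair in map_phase(doc)]
    return reduce_phase(pairs)


def search(index, query):
    """Return ids of documents that contain every query term, best match first."""
    terms = []
    for word in tokenize(query):
        if word not in index:
            # Map a misspelled word to the closest term in the vocabulary.
            close = difflib.get_close_matches(word, list(index), n=1)
            if not close:
                return []
            word = close[0]
        terms.append(word)
    if not terms:
        return []
    # Keep only the documents that include all of the query terms.
    matching = set.intersection(*(set(index[t]) for t in terms))
    # Assumed ranking: higher summed term frequency first, ties broken by id.
    return sorted(matching, key=lambda d: (-sum(index[t][d] for t in terms), d))


index = build_index(documents)
print(search(index, "indx construction"))  # prints [2]
```

In this sketch, the misspelled word "indx" is mapped to the vocabulary term "index", and only document 2 contains both "index" and "construction". In a real deployment, the map and reduce steps would run across many machines, and the index would be read from distributed storage rather than held in memory.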

API design

...
