System Design: The Distributed Search
Discover why modern search systems are essential for navigating massive data volumes. Identify the key components (crawler, indexer, and searcher) that form the foundation of a search engine. Learn the structured, five-step approach for designing a scalable distributed search system.
Why do we need a search system?
Search bars are standard on modern websites because they allow users to filter vast amounts of content instantly. Without search, users would have to manually scroll through paginated lists to find specific items, whether on an education platform or in a store. This manual discovery is inefficient and results in a poor user experience.
Consider platforms like YouTube or Google. With billions of videos and web pages, manual navigation is impossible. Search engines solve this by acting as filters, retrieving relevant information from massive datasets in milliseconds. Behind every search interface lies a complex distributed system designed to handle this scale.
What is a search system?
A search system accepts a user’s text query and returns relevant content within strict latency constraints. It typically consists of three main components:
A crawler, which fetches content and creates documents.
An indexer, which builds a searchable index (typically an inverted index) to organize data for efficient retrieval.
A searcher, which executes queries against the index created by the indexer to return results.
For a search engine, a document consists of the text extracted from a web page. For a movie store’s web page, a document could be a JSON object containing the titles, descriptions, and other metadata of the videos we want to search. Documents can be in JSON or any other suitable format, and they are stored in distributed storage such as S3 or HDFS.
Note: We have a separate chapter dedicated to the crawler component. In this chapter, we focus on indexing.
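To make the indexer and searcher roles concrete, here is a minimal single-machine sketch of an inverted index. It is an illustration only, not the design developed in this chapter: the sample documents, the whitespace tokenizer, and the names `build_index` and `search` are all assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical sample documents, standing in for what a crawler would produce.
DOCUMENTS = {
    1: "distributed search systems scale",
    2: "search engines use an inverted index",
    3: "an index maps terms to documents",
}


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def build_index(documents):
    """Indexer: map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Searcher: return the sorted IDs of documents containing every query term."""
    terms = tokenize(query)
    if not terms:
        return []
    # Use .get so that looking up an unknown term does not add it to the index.
    postings = [index.get(term, set()) for term in terms]
    return sorted(set.intersection(*postings))


index = build_index(DOCUMENTS)
print(search(index, "search"))          # [1, 2]
print(search(index, "inverted index"))  # [2]
```

The searcher never scans the documents themselves. It only looks up each query term in the index and intersects the resulting sets of document IDs, which is why an index makes retrieval efficient.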
How will we design a distributed search system?
We divide the design process into five lessons:
Requirements: Define functional and non-functional requirements. We also estimate system resources, including servers, storage, and bandwidth.
Indexing: Explore the fundamentals of indexing and examine a centralized architecture for distributed search.
Initial design: Construct a high-level design, define the API, and detail the indexing and searching workflows.
Final design: Evaluate the initial design and refine the architecture to improve scalability.
Evaluation: Assess how the final distributed search system meets the defined requirements.
Let’s start by understanding the requirements.