Scaling Search and Indexing

This lesson introduces an efficient way to scale indexing and search in a search system.

Problem with the proposed design

Although the proposed design in the previous lesson seems reasonable, there are a couple of serious drawbacks that we discuss below.

  1. Colocated indexing and searching: We’ve created a system that colocates indexing and searching on the same node. Although this looks like an efficient use of resources, it has downsides. Searching and indexing are both resource-intensive operations, so they compete for the same node's resources and a spike in one degrades the performance of the other. The colocated design also can't scale the two workloads independently. When indexing load and search load vary differently over time, adding nodes to absorb one over-provisions the other, which leaves the cluster imbalanced.
  2. Index recomputation: We assumed that each replica computes the index individually, which uses resources inefficiently. Index computation is a resource-intensive task with possibly hundreds of stages of pipelined operations, so recomputing the same index on every replica means every replica must be a powerful machine. It is more logical to compute the index once and replicate it across availability zones.

For these reasons, we look at an alternative approach to distributed indexing and searching.

Solution

Rather than recomputing the index on each replica, we compute the inverted index on the primary node only. Next, we send the inverted index (as a binary blob/file) to the replicas. The key benefit of this approach is that the replicas no longer spend CPU and memory repeating the indexing work.
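
The idea of computing the index once and shipping it as a blob can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the lesson's actual implementation: the names build_inverted_index, serialize_index, load_index, and Replica are hypothetical, the tokenizer is a plain whitespace split, the blob format (compressed JSON) is chosen only for simplicity, and the network transfer is replaced by a direct method call.

```python
import json
import zlib
from collections import defaultdict


def build_inverted_index(documents):
    """Runs on the primary only: map each term to the sorted IDs of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        # Assumption: lowercase + whitespace split stands in for a real analysis pipeline.
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def serialize_index(index):
    """Pack the index into a binary blob (assumed format: zlib-compressed JSON)."""
    return zlib.compress(json.dumps(index, sort_keys=True).encode("utf-8"))


def load_index(blob):
    """Unpack a blob produced by serialize_index. No indexing work happens here."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))


class Replica:
    """Hypothetical replica that only loads a ready-made index and serves searches."""

    def __init__(self):
        self.index = {}

    def receive(self, blob):
        # In a real system the blob would arrive over the network or from shared storage.
        self.index = load_index(blob)

    def search(self, term):
        return self.index.get(term.lower(), [])


# The primary builds the index once, then the same blob goes to every replica.
documents = {1: "search engine design", 2: "distributed search index"}
blob = serialize_index(build_inverted_index(documents))

replicas = [Replica() for _ in range(3)]
for replica in replicas:
    replica.receive(blob)
```

Each replica pays only the cost of decompressing and loading the blob, while the tokenizing and posting-list construction happen a single time on the primary.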
