Semantic Search: Model Architecture

Explore the architecture of semantic search systems by understanding dual encoder models, hybrid retrieval techniques combining dense and sparse signals, and precise cross-encoder re-ranking. This lesson guides you through making trade-offs between latency and accuracy to build effective search pipelines.

We'll cover the following...

Dual encoder models for dense retrieval
- How dual encoders work
- Comparing DPR, E5, and BGE
Hybrid retrieval with learned fusion
- The score normalization challenge
- Fusion strategies
Cross-encoder re-ranking for top-K precision
- Why a second stage is necessary
- ColBERT as a middle ground
Bridging to evaluation and indexing

With high-quality training triplets of queries, relevant passages, and hard negatives already flowing from the data pipeline, the next design decision is the model architecture that actually performs retrieval. MAANG-style semantic search interviews typically expect a two-stage pipeline: a fast first-stage retriever that scores millions of documents in milliseconds, followed by a precise but expensive re-ranker applied only to the top-K candidates. This lesson covers the three architectural decisions that define that pipeline (dual encoder selection, hybrid retrieval fusion, and cross-encoder re-ranking) and the latency-accuracy trade-offs that govern each choice.

Dual encoder models for dense retrieval

How dual encoders work

A dual encoder architecture uses two encoder towers: one encodes the query into a fixed-size vector, and the other encodes the document into a vector of the same dimensionality. The towers can be two separate transformer networks, as in DPR, or a single shared network applied to both inputs, as in E5 and BGE. Relevance between a query and a document is scored by computing the dot product or cosine similarity between their embeddings. The critical design advantage is that document embeddings are precomputed offline and stored in an approximate nearest neighbor (ANN) index. At query time, only the query encoder runs a forward pass, and the ANN index returns the nearest document vectors in milliseconds.
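The offline/online split described above can be sketched in a few lines. This is a minimal illustration, not a production retriever: `make_toy_encoder` is a hypothetical bag-of-words stand-in for a trained transformer encoder, and `search` does an exact brute-force scan where a real system would query an ANN index.

```python
import math


def make_toy_encoder(vocab):
    """Return a hypothetical stand-in for a trained encoder.

    It maps text to an L2-normalized bag-of-words vector over `vocab`.
    A real dual encoder would run a transformer forward pass here.
    """
    index = {token: i for i, token in enumerate(vocab)}

    def encode(text):
        vec = [0.0] * len(index)
        for token in text.lower().split():
            if token in index:
                vec[index[token]] += 1.0
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        return [v / norm for v in vec]

    return encode


def build_index(docs, encode):
    """Offline step: embed every document once and store the vectors."""
    return [(doc, encode(doc)) for doc in docs]


def search(query, index, encode, k=2):
    """Online step: embed only the query, then rank documents by dot product.

    Vectors are unit length, so the dot product equals cosine similarity.
    This is an exact scan; an ANN index would approximate it at scale.
    """
    q = encode(query)
    scored = [(sum(a * b for a, b in zip(q, emb)), doc) for doc, emb in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


docs = [
    "how to reset a password",
    "best hiking trails in colorado",
    "password recovery steps",
]
vocab = sorted({token for doc in docs for token in doc.split()})
encode = make_toy_encoder(vocab)
index = build_index(docs, encode)      # done once, offline
results = search("reset password", index, encode, k=2)  # done per query
```

Note that `build_index` never sees the query and `search` never re-encodes a document. That separation is what lets the document side be precomputed.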

Think of it like a library where every book has already been assigned a GPS coordinate on a map. When a reader walks in with a question, the system converts that question into its own GPS coordinate and finds the nearest books without ever opening a single page.

Comparing DPR, E5, and BGE

Three production-grade dual encoders dominate the design space, each with distinct training strategies and trade-offs.

- DPR (Dense Passage Retrieval): This model uses two separate BERT-base encoders trained with a contrastive loss over in-batch negatives and BM25-mined hard negatives. It is the simplest to fine-tune on domain-specific data but is limited by its single-task training, which weakens zero-shot generalization to unseen query types.
- E5 (EmbEddings from bidirEctional Encoder rEpresentations): E5 uses a unified encoder with task-specific prefixes such as "query: " and "passage: " prepended to inputs. It is pre-trained on a large corpus of weakly supervised contrastive pairs before fine-tuning, which gives it significantly stronger zero-shot performance across diverse retrieval tasks.
- BGE (BAAI General Embedding): BGE follows a similar unified encoder approach but adds a RetroMAE pre-training stage and instruction-based fine-tuning. This combination placed it at or near the top of the MTEB benchmark at release, though it requires careful prompt formatting during inference.
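All three models rely on a contrastive objective, which can be sketched as the negative log-likelihood of the positive passage among the positive and its negatives. The function below is an illustrative sketch rather than any model's exact training code. The name `contrastive_loss` and the `temperature` parameter are assumptions for this example; the default of 1.0 reduces to a plain softmax over dot products.

```python
import math


def contrastive_loss(query_vec, positive_vec, negative_vecs, temperature=1.0):
    """Negative log-likelihood of the positive passage (illustrative sketch).

    negative_vecs can mix in-batch negatives and mined hard negatives;
    the loss treats them identically.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    # Index 0 holds the positive; the rest are negatives.
    logits = [dot(query_vec, positive_vec) / temperature]
    logits += [dot(query_vec, neg) / temperature for neg in negative_vecs]

    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[0]


query = [1.0, 0.0]
# The positive is aligned with the query and the negative is orthogonal.
easy = contrastive_loss(query, [1.0, 0.0], [[0.0, 1.0]])
# A hard negative outscores the positive, so the loss is larger.
hard = contrastive_loss(query, [0.0, 1.0], [[1.0, 0.0]])
```

The loss grows when a negative scores close to or above the positive, which is why hard negatives provide a stronger training signal than random ones.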

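The variants compared here produce embeddings of 768 to 1024 dimensions. Index memory grows linearly with dimension, and so does the cost of each similarity computation, so this choice directly impacts ANN index memory footprint and query latency.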
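A back-of-the-envelope calculation shows how dimension drives memory. The sketch below assumes uncompressed float32 vectors (4 bytes per value) and ignores any index overhead. The corpus size of 10 million documents is a hypothetical figure chosen for illustration.

```python
def raw_vector_bytes(num_vectors, dim, bytes_per_value=4):
    """Bytes needed to store the raw vectors, excluding index overhead."""
    return num_vectors * dim * bytes_per_value


def to_gib(num_bytes):
    """Convert bytes to gibibytes (2**30 bytes)."""
    return num_bytes / 2**30


NUM_DOCS = 10_000_000  # hypothetical corpus size

bytes_768 = raw_vector_bytes(NUM_DOCS, 768)
bytes_1024 = raw_vector_bytes(NUM_DOCS, 1024)

print(f"768-dim:  {to_gib(bytes_768):.1f} GiB")
print(f"1024-dim: {to_gib(bytes_1024):.1f} GiB")
```

Moving from 768 to 1024 dimensions raises raw storage by a third, before any compression such as quantization is applied.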

Practical tip: If you have abundant domain-specific labeled data, DPR’s simplicity makes it the fastest path to a strong fine-tuned model. If you need strong out-of-the-box performance with minimal labeled data, E5 or BGE is the better starting point.

The following table summarizes the key differences across these models.

Model	Encoder Architecture	Pre-training Strategy	Hard Negative Strategy	Embedding Dimension	Zero-shot Strength	Fine-tuning Complexity
DPR	Dual BERT-base	Contrastive learning on NQ	BM25-mined negatives	768	Moderate	Low
E5	Unified encoder with prefixes	Weakly supervised contrastive at scale	In-batch + mined negatives	1024	Strong	Medium
BGE	Unified encoder with instructions	RetroMAE + instruction tuning	In-batch + cross-encoder mined	1024	State-of-the-art	Medium-high
