
Semantic Search: Model Architecture

Explore the architecture of semantic search systems by understanding dual encoder models, hybrid retrieval techniques combining dense and sparse signals, and precise cross-encoder re-ranking. This lesson guides you through making trade-offs between latency and accuracy to build effective search pipelines.

With high-quality training triplets of queries, relevant passages, and hard negatives already flowing from the data pipeline, the next design decision is the model architecture that actually performs retrieval. MAANG-style semantic search interviews typically expect a two-stage pipeline: a fast first-stage retriever that scores millions of documents in milliseconds, followed by a precise but expensive re-ranker applied only to the top-K candidates. This lesson covers three architectural decisions that define that pipeline: dual encoder selection, hybrid retrieval fusion, and cross-encoder re-ranking, along with the latency-accuracy trade-offs that govern each choice.

Dual encoder models for dense retrieval

How dual encoders work

A dual encoder architecture uses two transformer encoders, which may be separate networks or a single network with shared weights: one encodes the query into a fixed-size vector, and the other encodes the document into a vector of the same dimensionality. Relevance between a query and a document is scored by computing the dot product or cosine similarity between their embeddings. The critical design advantage is that document embeddings are precomputed offline and stored in an approximate nearest neighbor (ANN) index. At query time, only the query encoder runs a forward pass, and the ANN index returns the nearest document vectors in milliseconds.

Think of it like a library where every book has already been assigned a GPS coordinate on a map. When a reader walks in with a question, the system converts that question into its own GPS coordinate and finds the nearest books without ever opening a single page.
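The offline/online split described above can be sketched in a few lines. This is a minimal illustration, not a production retriever: `toy_encode` is a hypothetical stand-in for a transformer encoder (it only hashes tokens into buckets), and an exact brute-force scan stands in for the ANN index.

```python
import math
import zlib

DIM = 64  # toy embedding size; real encoders use hundreds of dimensions


def toy_encode(text):
    """Hypothetical stand-in for a transformer encoder.

    Hashes each token into one of DIM buckets and L2-normalizes the result,
    so the dot product of two outputs equals their cosine similarity.
    """
    vec = [0.0] * DIM
    for token in text.lower().split():
        vec[zlib.crc32(token.encode("utf-8")) % DIM] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Offline step: embed every document once and store the vectors.
documents = [
    "dual encoders embed queries and documents separately",
    "cross-encoders score a query and document jointly",
    "bm25 ranks documents by term overlap",
]
doc_index = [toy_encode(d) for d in documents]


def search(query, k=2):
    """Online step: encode only the query, then rank the stored vectors."""
    q = toy_encode(query)
    scored = sorted(((dot(q, v), i) for i, v in enumerate(doc_index)), reverse=True)
    return [(documents[i], score) for score, i in scored[:k]]


for doc, score in search("how do dual encoders embed queries"):
    print(f"{score:.3f}  {doc}")
```

Note that no document is re-encoded at query time; only `toy_encode(query)` and the similarity scan run. In a real system the scan would be replaced by an ANN lookup so that latency stays low as the corpus grows.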

Comparing DPR, E5, and BGE

Three production-grade dual encoders dominate the design space, each with distinct training strategies and trade-offs.

  • DPR (Dense Passage Retrieval): This model uses two separate BERT-base encoders trained with contrastive loss on BM25-mined hard negatives. It is the simplest to fine-tune on domain-specific data but is limited by its single-task training, which weakens zero-shot generalization to unseen query types.

  • E5 (EmbEddings from bidirEctional Encoder rEpresentations): E5 uses a unified encoder with task-specific prefixes such as query: and passage: prepended to inputs. It is pre-trained on massive weakly supervised contrastive pairs before fine-tuning, which gives it significantly stronger zero-shot performance across diverse retrieval tasks.

  • BGE (BAAI General Embedding): BGE follows a similar unified encoder approach but adds a RetroMAE pre-training stage and instruction-tuned fine-tuning. This combination produced top-ranked results on the MTEB benchmark suite at the time of its release, though it requires careful prompt formatting during inference.
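The contrastive loss mentioned for DPR can be sketched as the negative log-likelihood of the positive passage under a softmax over similarity scores. The sketch below assumes the scores are plain dot products and uses no temperature scaling; the function name `contrastive_loss` is illustrative, and the E5 and BGE training recipes differ in their details.

```python
import math


def contrastive_loss(pos_score, neg_scores):
    """Negative log-likelihood of the positive passage.

    pos_score: similarity between the query and its relevant passage.
    neg_scores: similarities between the query and each negative passage.
    """
    scores = [pos_score] + list(neg_scores)
    m = max(scores)  # subtract the max before exponentiating for stability
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum_exp - pos_score


# An easy negative (low score) contributes little to the loss...
easy = contrastive_loss(pos_score=5.0, neg_scores=[0.5])
# ...while a hard negative that scores close to the positive contributes much more.
hard = contrastive_loss(pos_score=5.0, neg_scores=[4.8])
print(easy < hard)
```

This is why hard negatives matter: a negative that already scores far below the positive produces almost no gradient, while one that scores close to it forces the encoder to separate the two.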

In the configurations compared here, all three models produce embeddings in the 768–1024 dimension range. The dimension directly determines the ANN index memory footprint and also affects query latency.
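To make the memory impact concrete, here is a back-of-the-envelope calculation. It assumes an uncompressed index that stores one float32 (4 bytes) per dimension and ignores any graph or metadata overhead; the 10-million-document corpus size is a hypothetical figure chosen for illustration.

```python
def flat_index_bytes(num_vectors, dim, bytes_per_value=4):
    """Raw vector storage for an uncompressed index (float32 by default)."""
    return num_vectors * dim * bytes_per_value


NUM_DOCS = 10_000_000  # hypothetical corpus size

for dim in (768, 1024):
    gb = flat_index_bytes(NUM_DOCS, dim) / 1e9
    print(f"{dim} dims: {gb:.2f} GB")
# prints:
# 768 dims: 30.72 GB
# 1024 dims: 40.96 GB
```

Under these assumptions, moving from 768 to 1024 dimensions increases raw vector storage by about a third, before any quantization or compression is applied.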

Practical tip: If you have abundant domain-specific labeled data, DPR’s simplicity makes it the fastest path to a strong fine-tuned model. If you need strong out-of-the-box performance with minimal labeled data, E5 or BGE are better starting points.
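If E5 or BGE is the starting point, input formatting matters at inference time. The sketch below shows the `query: ` and `passage: ` prefixes described for E5, plus a generic helper for instruction-style models. The helper names are hypothetical, and the instruction string is only an example; the exact text a given BGE checkpoint expects should be taken from its model card.

```python
E5_QUERY_PREFIX = "query: "
E5_PASSAGE_PREFIX = "passage: "


def format_e5_query(text):
    """Prepend the E5 query prefix before encoding a search query."""
    return E5_QUERY_PREFIX + text.strip()


def format_e5_passage(text):
    """Prepend the E5 passage prefix before encoding a document chunk."""
    return E5_PASSAGE_PREFIX + text.strip()


def format_with_instruction(instruction, query):
    """Prepend a retrieval instruction to a query for instruction-tuned models.

    The instruction text is model-specific; treat the value used below as an
    assumption, not as the string any particular checkpoint requires.
    """
    return instruction + query.strip()


print(format_e5_query("  what is dense retrieval? "))
print(format_e5_passage("Dense retrieval encodes text into vectors."))
print(format_with_instruction(
    "Represent this sentence for searching relevant passages: ",
    "what is dense retrieval?",
))
```

Applying the query prefix to documents, or omitting prefixes entirely, puts inputs outside the distribution the model was trained on, which typically degrades retrieval quality.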

The following table summarizes the key differences across these models.

Comparison of Dense Retrieval Models

| Model | Encoder Architecture | Pre-training Strategy | Hard Negative Strategy | Embedding Dimension | Zero-shot Strength | Fine-tuning Complexity |
| --- | --- | --- | --- | --- | --- | --- |
| DPR | Dual BERT-base | Contrastive learning on NQ | BM25-mined negatives | 768 | Moderate | Low |
| E5 | Unified encoder with prefixes | Weakly supervised contrastive at scale | In-batch + mined negatives | 1024 | Strong | Medium |
| BGE | Unified encoder with instructions | RetroMAE + instruction tuning | In-batch + cross-encoder mined | 1024 | State-of-the-art | Medium-high |

With a dual encoder selected, the next question is whether dense retrieval alone is sufficient or whether sparse signals need to fill the gaps. ...