Semantic Search: Model Architecture
Explore the architecture of semantic search systems by understanding dual encoder models, hybrid retrieval techniques combining dense and sparse signals, and precise cross-encoder re-ranking. This lesson guides you through making trade-offs between latency and accuracy to build effective search pipelines.
With high-quality training triplets of queries, relevant passages, and hard negatives already flowing from the data pipeline, the next design decision is the model architecture that actually performs retrieval. MAANG semantic search interviews typically expect a two-stage pipeline: a fast first-stage retriever that scores millions of documents in milliseconds, followed by a precise but expensive re-ranker applied only to the top-K candidates. This lesson covers three architectural decisions that define that pipeline (dual encoder selection, hybrid retrieval fusion, and cross-encoder re-ranking), along with the latency-accuracy trade-offs that govern each choice.
Dual encoder models for dense retrieval
How dual encoders work
A dual encoder architecture uses two transformer encoders, which may be separate networks or a single network with shared weights: one encodes the query into a fixed-size vector, and the other encodes the document into a vector of the same dimensionality. Relevance between a query and a document is scored by computing the dot product or cosine similarity between their embeddings. The critical design advantage is that document embeddings are precomputed offline and stored in an approximate nearest neighbor (ANN) index. At query time, only the query encoder runs a forward pass, and the ANN index returns the nearest document vectors in milliseconds.
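The offline/online split described above can be sketched in a few lines of Python. This is a minimal illustration under loud assumptions, not a production retriever: `toy_encode` is a hypothetical stand-in for a trained transformer encoder (it only counts words from a tiny fixed vocabulary), the documents are made up, and the brute-force scan stands in for an ANN index.

```python
import math

# Hypothetical vocabulary; a real encoder learns dense features instead.
VOCAB = ["python", "install", "pasta", "recipe", "error", "cook"]

def toy_encode(text):
    """Stand-in for a transformer encoder: returns a unit-length vector."""
    tokens = text.lower().split()
    vec = [float(tokens.count(word)) for word in VOCAB]
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def dot(a, b):
    """Dot product; equals cosine similarity because vectors are unit length."""
    return sum(x * y for x, y in zip(a, b))

# Offline step: encode every document once and store the vectors.
documents = [
    "how to install python on linux",
    "easy pasta recipe to cook tonight",
    "fixing a python install error",
]
doc_index = [toy_encode(d) for d in documents]

def search(query, k=2):
    """Online step: encode the query once, then score it against stored vectors."""
    q = toy_encode(query)
    scored = [(dot(q, d_vec), i) for i, d_vec in enumerate(doc_index)]
    scored.sort(reverse=True)  # highest similarity first
    return [(documents[i], round(score, 3)) for score, i in scored[:k]]

results = search("python install error")
for text, score in results:
    print(score, text)
```

Note that no document text is re-encoded inside `search`; only the query is. In a real system the linear scan over `doc_index` would be replaced by an ANN lookup, which is what keeps latency low at millions of documents.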
Think of it like a library where every book has already been assigned a GPS coordinate on a map. When a reader walks in with a question, the system converts that question into its own GPS coordinate and finds the nearest books without ever opening a single page.
Comparing DPR, E5, and BGE
Three production-grade dual encoders dominate the design space, each with distinct training strategies and trade-offs.
DPR (Dense Passage Retrieval): This model uses two separate BERT-base encoders trained with contrastive loss on BM25-mined hard negatives. It is the simplest to fine-tune on domain-specific data but is limited by its single-task training, which weakens zero-shot generalization to unseen query types.
E5 (EmbEddings from bidirEctional Encoder rEpresentations): E5 uses a unified encoder with task-specific prefixes such as "query: " and "passage: " prepended to inputs. It is pre-trained on massive weakly supervised contrastive pairs before fine-tuning, which gives it significantly stronger zero-shot performance across diverse retrieval tasks.
BGE (BAAI General Embedding): BGE follows a similar unified encoder approach but adds a RetroMAE pre-training stage and instruction-tuned fine-tuning. This combination has produced state-of-the-art results on the MTEB benchmark suite, though it requires careful prompt formatting during inference.
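Because prefix and instruction formatting is easy to get wrong at inference time, it can help to centralize it in one helper. The sketch below is illustrative only: `add_prefix` is a hypothetical helper, the E5 prefixes are the ones named above, and the BGE instruction string is a placeholder assumption that should be replaced with the exact text from the model card of the checkpoint in use.

```python
E5_QUERY_PREFIX = "query: "
E5_PASSAGE_PREFIX = "passage: "

# Placeholder only (assumption): take the real instruction from the model card.
BGE_QUERY_INSTRUCTION = "<instruction from model card> "

def add_prefix(texts, prefix):
    """Prepend a fixed prefix to each text before it is sent to the encoder."""
    return [prefix + text.strip() for text in texts]

queries = ["what is dense retrieval"]
passages = ["Dense retrieval maps text to vectors and searches by similarity."]

# E5-style formatting: both sides receive a role prefix.
e5_queries = add_prefix(queries, E5_QUERY_PREFIX)
e5_passages = add_prefix(passages, E5_PASSAGE_PREFIX)

# Instruction-style formatting, applied here to queries only (an assumption;
# confirm against the model card whether passages also need one).
bge_queries = add_prefix(queries, BGE_QUERY_INSTRUCTION)

print(e5_queries[0])
print(e5_passages[0])
```

Applying the same helper at indexing time and at query time avoids a silent mismatch between how documents were embedded and how queries are embedded.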
In their commonly used base and large variants, all three models produce embeddings of 768 to 1024 dimensions. The dimension directly determines the ANN index memory footprint and also affects query latency.
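To make the memory impact concrete, the raw vector storage can be estimated as number of vectors × dimension × bytes per value. The sketch below assumes a hypothetical corpus of 10 million passages stored as uncompressed float32 values; it ignores the extra overhead of ANN graph or cluster structures and any savings from quantization.

```python
def index_size_gb(num_vectors, dim, bytes_per_value=4):
    """Raw storage for uncompressed vectors, in decimal gigabytes."""
    return num_vectors * dim * bytes_per_value / 1e9

NUM_PASSAGES = 10_000_000  # assumed corpus size, for illustration only

size_768 = index_size_gb(NUM_PASSAGES, 768)
size_1024 = index_size_gb(NUM_PASSAGES, 1024)

print(f"768-dim:  {size_768:.2f} GB")   # 30.72 GB
print(f"1024-dim: {size_1024:.2f} GB")  # 40.96 GB
```

Under these assumptions, moving from 768 to 1024 dimensions adds about a third more memory for the same corpus, which is worth weighing against any accuracy gain from the larger model.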
Practical tip: If you have abundant domain-specific labeled data, DPR’s simplicity makes it the fastest path to a strong fine-tuned model. If you need strong out-of-the-box performance with minimal labeled data, E5 or BGE are better starting points.
The following table summarizes the key differences across these models.
Comparison of Dense Retrieval Models
Model | Encoder Architecture | Pre-training Strategy | Hard Negative Strategy | Embedding Dimension | Zero-shot Strength | Fine-tuning Complexity |
DPR | Dual BERT-base | Contrastive learning on NQ | BM25-mined negatives | 768 | Moderate | Low |
E5 | Unified encoder with prefixes | Weakly supervised contrastive at scale | In-batch + mined negatives | 1024 | Strong | Medium |
BGE | Unified encoder with instructions | RetroMAE + instruction tuning | In-batch + cross-encoder mined | 1024 | State-of-the-art | Medium-high |
With a dual encoder selected, the next question is whether dense retrieval alone is sufficient or whether sparse signals need to fill the gaps. ...