Re-Ranking and Relevance Scoring

Explore techniques for refining search results in vector databases by applying two-stage retrieval that combines fast approximate searches with precise re-ranking methods. Understand how cross-encoders capture token-level relevance, how Reciprocal Rank Fusion merges multiple ranked lists without training, and how LLM-based re-rankers enhance semantic precision. Learn to balance accuracy and computational cost to improve output quality in large language model applications.

The query transformation techniques covered in the previous lesson improve what goes into the vector index lookup, but the results that come back still need refinement. A bi-encoder embedding model enables fast approximate nearest neighbor search over millions of chunks, yet it compresses both the query and each document into independent fixed-size vectors. That independence is the source of both its speed and its weakness. Because the query vector and the document vector never “see” each other during encoding, the model cannot capture fine-grained token-level interactions between them. The top-k results from a single ANN pass often contain near-misses that are topically adjacent but not truly relevant to the query’s intent.

When these imprecise results fill the LLM’s context window, generation quality degrades. The model may hallucinate, contradict itself, or produce vague answers because the supporting evidence was only loosely related. The solution is a pattern called two-stage retrieval. A cheap, high-recall first stage casts a wide net over the full corpus, and a computationally heavier second stage reorders those candidates by true relevance before they reach the LLM. This pattern is standard in production RAG systems on AWS and elsewhere, and the rest of this lesson breaks down exactly how it works.
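The shape of the pattern can be sketched in a few lines of plain Python. This is an illustrative sketch, not a production implementation: the brute-force cosine scan stands in for a real ANN index (which would search approximately, without touching every vector), and `rerank_score` stands in for the heavier second-stage scorer described in the rest of this lesson.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def two_stage_retrieve(query_vec, corpus, rerank_score, first_k=100, final_k=5):
    """Stage 1: cheap, high-recall candidate retrieval.
    Stage 2: expensive, high-precision re-ranking of only those candidates."""
    # Stage 1: rank every chunk by embedding similarity. In production this
    # full scan is replaced by an ANN index over millions of vectors.
    candidates = sorted(
        corpus,
        key=lambda doc: cosine(query_vec, doc["vec"]),
        reverse=True,
    )[:first_k]
    # Stage 2: reorder only the shortlist with the heavier scorer,
    # so its cost is bounded by first_k rather than the corpus size.
    reranked = sorted(candidates, key=rerank_score, reverse=True)
    return reranked[:final_k]
```

The key property is that the expensive scorer never sees more than `first_k` documents, which is what makes the second stage affordable regardless of corpus size.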

The following diagram illustrates the full two-stage pipeline from query to final context window.

Two-stage retrieval pipeline with bi-encoder recall and cross-encoder re-ranking for RAG systems

Cross-encoder re-ranking

A cross-encoder (a transformer model that takes a query and a document as a single concatenated input and outputs a relevance score, unlike a bi-encoder, which encodes them independently) works differently from the bi-encoder used in the first pass. Instead of embedding the query and document separately, it concatenates them into a single input sequence separated by a special [SEP] token and passes both through one transformer forward pass. The model’s output is a single relevance score, typically a logit passed through a sigmoid function.
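That interface can be sketched as follows. Only the pair concatenation and the sigmoid mapping are taken from the description above; `forward_pass` is a hypothetical stand-in for the real transformer, which would tokenize the sequence and return a logit.

```python
import math

def sigmoid(logit):
    """Map a raw logit to a relevance score in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))

def cross_encode(query, document, forward_pass):
    # Unlike a bi-encoder, the pair is scored as ONE sequence, so the
    # model's self-attention can relate query tokens to document tokens.
    pair = f"{query} [SEP] {document}"
    logit = forward_pass(pair)  # one transformer forward pass -> one logit
    return sigmoid(logit)
```

In practice `forward_pass` would be a fine-tuned re-ranking model; the point of the sketch is that the scoring unit is a (query, document) pair, not two independent embeddings.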

How cross-encoders capture relevance

Because every token in the query can attend to every token in the document through the transformer’s self-attention mechanism, cross-encoders capture fine-grained semantic relationships that independent embeddings miss entirely. A bi-encoder might rank a passage about “bank erosion along rivers” highly for the query “bank account interest rates” because the word “bank” pulls the vectors closer. A cross-encoder, processing both texts together, resolves this ambiguity through direct token interaction.

Why cross-encoders cannot replace the first stage

Scoring every chunk in a million-document corpus with a cross-encoder is computationally infeasible. If each forward pass takes 10 milliseconds, scoring one million candidates would take nearly three hours. The first-pass ANN search narrows candidates to a manageable set, typically 20 to 100 chunks, making cross-encoder scoring practical.
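The arithmetic behind that infeasibility claim is easy to verify, assuming the 10 ms per-pair latency stated above and sequential scoring:

```python
PER_PASS_MS = 10  # assumed cross-encoder latency per query-document pair

def sequential_scoring_seconds(num_candidates, per_pass_ms=PER_PASS_MS):
    """Wall-clock time to score candidates one forward pass at a time."""
    return num_candidates * per_pass_ms / 1000

full_corpus = sequential_scoring_seconds(1_000_000)  # 10,000 s, about 2.8 hours
shortlist = sequential_scoring_seconds(100)          # 1 s
```

Batching and GPU parallelism shrink these numbers, but the gap of four orders of magnitude between scoring the corpus and scoring a shortlist is what makes the two-stage split necessary.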

Two metrics are commonly used to evaluate re-ranker quality. ...