RTEB is redefining how we measure model retrieval accuracy

RTEB, the Retrieval Embedding Benchmark, introduces a new way to evaluate how well models retrieve information across real-world and unseen domains. By combining open and private datasets, it measures true generalization — helping researchers and practitioners identify which models actually perform in practice, not just on public leaderboards.
Oct 27, 2025

Why do today’s search and retrieval systems still leave us frustrated? We’ve all experienced it: you ask a question or enter a query, and the system returns results that miss the mark entirely. Large language models (LLMs) can produce incorrect answers for several reasons, but one common retrieval-induced failure happens when the system can’t fetch the right supporting evidence. When that happens, the model often fills the gaps, producing confident but fabricated details.

Hallucinations can also stem from reasoning or training limitations; retrieval failure is just one major trigger.

The implications are serious. Everything from search engines and chatbots to recommendation systems and enterprise Q&A relies on robust retrieval. If our benchmarks aren’t capturing true retrieval quality, users pay the price in irrelevant results and LLMs that hallucinate answers. Improving this starts with better evaluation. Enter RTEB: the Retrieval Embedding Benchmark, a new initiative designed to bridge that generalization gap and focus on what matters for search in the real world. Before we explore RTEB, let’s quickly look at how we got here with previous benchmarks.

From BEIR to MTEB: Benchmarking retrieval and beyond#

A few years ago, in 2021, the research community introduced BEIR (Benchmarking Information Retrieval) in response to the narrow evaluation of retrieval models. BEIR was a heterogeneous benchmark of 18 datasets spanning diverse IR tasks and domains. The idea was to test zero-shot retrieval: could a model trained on one domain (or on general data) retrieve relevant documents in completely different domains without fine-tuning? BEIR's tasks included answering biomedical questions, fact-checking claims, and duplicate question detection, primarily in English. It used NDCG@10 (Normalized Discounted Cumulative Gain at rank 10) to measure the ranking quality of results. BEIR's broad coverage quickly made it a standard for evaluating general-purpose retrievers. However, because all BEIR datasets were publicly available, models could eventually "see" them during training, raising concerns about overfitting to the benchmark rather than achieving true generalization.

BEIR was a milestone as it standardized zero-shot retrieval evaluation across diverse domains. But it had two clear limitations: narrow scope and monolingual bias. All 18 BEIR datasets were in English, and the benchmark focused only on retrieval, ignoring the growing variety of embedding applications (like classification, clustering, or reranking).

In 2023, the scope expanded with MTEB (Massive Text Embedding Benchmark). MTEB pushed beyond just retrieval and beyond English. In fact, MTEB spans 8 different embedding tasks with a total of 58 datasets in 112 languages, though only a subset of its datasets are multilingual. It evaluates not only retrieval, but also tasks like text classification, clustering, pair similarity (STS), reranking, etc., essentially a comprehensive workout for embedding models.

You can think of MTEB as an umbrella: it largely subsumed the retrieval tasks of BEIR while adding many others. The MTEB leaderboard gives an overall score averaged across tasks, so no single model dominates everywhere. By covering multilingual data (over 100 languages) and varied task types, MTEB helped identify truly versatile embeddings. Yet MTEB, too, had a weakness for our purposes: it still relied on publicly available datasets for evaluation. In other words, some models end up "taught to the test," meaning they perform well because their training data quietly includes pieces of the test sets, allowing them to memorize the right answers rather than demonstrate any real-world understanding. Moreover, MTEB's breadth meant that pure retrieval performance could get diluted in the mix of tasks.

Both BEIR and MTEB moved the needle for benchmarking, but the community felt something was missing: a benchmark focused on real-world retrieval scenarios with a high bar for generalization. This is the gap that RTEB aims to fill.

What is RTEB, and why does it matter?#

RTEB stands for the Retrieval Embedding Benchmark, introduced in late 2024 as a new standard for evaluating embedding-based search. Its goal is simple: measure a model's true retrieval accuracy on data it hasn't seen before. RTEB is built differently from prior benchmarks in several ways:

  • Hybrid open/private evaluation: To prevent overfitting, RTEB combines public datasets (corpus, queries, relevance labels) with secret private datasets. Models are evaluated on both, with private-set performance indicating generalization: a drop in private-set scores suggests overfitting to known data. The leaderboard displays both scores for transparency and impartial measurement.

  • Real-world domain coverage: RTEB goes beyond academic or Wikipedia-style datasets by including retrieval tasks drawn from real user and enterprise domains, such as legal, finance, scientific, and multilingual news collections. These datasets reflect the variety and messiness of real-world search environments, where queries are ambiguous, documents vary in style, and answers may span multiple sources. This domain diversity forces embedding models to generalize beyond clean, curated text.

  • Unified retrieval metric (NDCG@10): Like MTEB's retrieval tasks, RTEB scores every dataset with NDCG@10, rewarding models that rank relevant results higher and enabling consistent comparisons across all submissions. To prevent leaderboard manipulation, none of the benchmark data may be used for training. Its datasets are authentic and domain-specific, covering 20 languages (English, Japanese, Bengali, Finnish, etc.) and enterprise sectors like legal, finance, healthcare, and coding.

  • Community and transparency: RTEB is designed as an open, community-driven benchmark hosted on Hugging Face. Researchers and developers can submit models, inspect dataset composition, and compare results transparently through a public leaderboard. Its evaluation pipeline is open-sourced, with private datasets managed through trusted partners to ensure fairness while maintaining confidentiality.

In short, RTEB matters because it provides better evaluation to identify models that will succeed in real applications. It forces models to prove themselves on unseen challenges and domains that people care about. Next, let’s compare RTEB with BEIR and MTEB to see how they differ in design and focus.

BEIR vs. MTEB vs. RTEB: How do they compare?#

To crystallize the differences, here’s a quick comparison of these three benchmarks:

| Benchmark | Year | Scope | Datasets and Tasks | Languages | Notable Features | Primary Metric |
| --- | --- | --- | --- | --- | --- | --- |
| BEIR (Heterogeneous IR Benchmark) | 2021 | Retrieval only (zero-shot) | 18 public datasets, diverse domains (e.g. Wikipedia QA, scientific, finance, fact-checking) | Mostly English (monolingual tasks) | Established zero-shot evaluation for robust IR; all datasets are public | NDCG@10 (per dataset), with Recall@100 as secondary |
| MTEB (Massive Text Embedding Benchmark) | 2023 | Multi-task embedding evaluation | 58 public datasets across 8 tasks (retrieval, classification, clustering, STS, reranking, etc.) | 112 languages (many multilingual tasks) | Broad coverage of embedding use cases; leaderboard aggregates performance across tasks | Varies by task (e.g. accuracy/F1 for classification, NDCG for retrieval) |
| RTEB (Retrieval Embedding Benchmark) | 2024 (beta) | Retrieval only (real-world focus) | 28 datasets (15 open + 13 private in the initial version) covering domains like legal, finance, health, and code | 20 languages (incl. English, Japanese, French, Bengali, etc.) | Mix of open and hidden test sets to ensure generalization; domain-specific, enterprise use cases; community-driven updates | NDCG@10 (leaderboard's main metric) |

BEIR pioneered the idea of a one-stop benchmark for retrieval, making it easy to test a model on many domains with one script. MTEB broadened that horizon to basically all embedding tasks and a ton of languages, which is great for finding a well-rounded model. RTEB, meanwhile, doubles down on search quality in practice, i.e., it’s narrower in task (only retrieval) but deeper in ensuring that the model works where it counts (with hidden tests and domain relevance).

Why the emphasis on NDCG@10? In all three benchmarks, ranking metrics matter a lot because in search, the order of results is everything. Let’s take a brief detour to simplify NDCG@K, since it’s central to RTEB.

Understanding NDCG@10 (Normalized Discounted Cumulative Gain)   #

Suppose that you have a set of relevant documents for a query, and you want them, ideally, at the top of your search results. Discounted Cumulative Gain (DCG) is a way to score a list of results by giving higher-ranked items more credit. For a result at rank i with relevance score relᵢ, DCG adds up each item's relevance divided by the log of its rank:

DCG@K = Σᵢ₌₁ᴷ relᵢ / log₂(i + 1)

Here, the denominator log₂(i + 1) is a discount factor that reduces the contribution of results lower in the ranking (so a relevant document at rank 5 contributes less than one at rank 1).

In practice, relᵢ is a graded relevance score assigned by human annotators or heuristics, typically on a small scale like 0–3, where 0 = irrelevant, 1 = partially relevant, 2 = mostly relevant, and 3 = highly relevant. Some datasets use binary labels (0/1), but graded scores capture how strongly a document answers the query.

Now, NDCG@K is just the DCG of your model’s ranking normalized by the ideal DCG (IDCG),  i.e. the DCG you’d get if all the truly relevant items were ranked in perfect order in the top K. Normalization scales the score between 0 and 1. An NDCG@10 of 1.0 would mean your top 10 results are a perfect ordering of the 10 most relevant items, while 0.0 would be the worst-case (no relevant items in the top 10). In practice, a strong retriever might achieve, say, NDCG@10 = 0.65 on a task, meaning it’s getting a lot right, but not perfectly. NDCG is “rank-aware,” rewarding you for not just finding relevant documents, but ranking them well. 

This makes it more informative for search quality than a simpler metric like precision. By using NDCG@10 as the primary metric, benchmarks like BEIR and RTEB ensure that models learn to put the best answers up-front, which is exactly what users need.
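To make the metric concrete, here is a minimal NDCG@K implementation. As a simplification, it computes the ideal DCG from the retrieved list's own relevance labels; real evaluators use the full set of judged documents for the query:

```python
import math

def dcg_at_k(relevances, k=10):
    """DCG: each result's relevance divided by log2(rank + 1),
    summed over the top-k results (ranks start at 1)."""
    return sum(rel / math.log2(i + 2)  # i is 0-based, so i + 2 == rank + 1
               for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    """Normalize DCG by the ideal DCG (the same labels sorted best-first)."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Same documents retrieved, different orderings: ranking the rel=3 doc
# first scores far higher than burying it at the bottom.
good = ndcg_at_k([3, 2, 1, 0, 0])   # already in ideal order
bad = ndcg_at_k([0, 0, 1, 2, 3])    # best docs ranked last
```

Because both lists contain the same documents, a set-based metric like precision would score them identically; NDCG's rank discount is what separates them.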

Let’s now talk about the RTEB leaderboard and what we can learn from today’s top-performing embedding models.

Leaderboard insights: Which models actually generalize?#

One of RTEB’s most exciting features is its public leaderboard, where embedding models are ranked by how well they retrieve information across open and closed domains. But RTEB adds a twist that changes how we read those rankings.

Models that look stellar on older public benchmarks such as BEIR or MTEB often stumble when tested on RTEB’s private, unseen datasets. For example, the launch blog for RTEB reports that models with strong open-data scores still experience a significant drop on hidden private subsets. That’s exactly what RTEB was built to measure by using a hybrid open-and-private evaluation framework that tests not just benchmark performance, but real-world generalization.

The truly robust ones are those that show only a small drop from open (public) to closed (private) evaluations.

| Model / Family | MTEB Avg | BEIR Avg | RTEB (Open) | RTEB (Closed) | Open→Closed Drop | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| OpenAI text-embedding-ada-002 (2022) | ~61.0% | mid-50s | ≈ MTEB-like | Slightly lower | Small | Stable zero-shot; good generalization. |
| OpenAI text-embedding-3-large (2024) | ~64.6% | Top-tier | Top-tier | — | Small | Successor to ada-002; higher generalization. |
| Cohere embed-multilingual-v3.0 | ~64.0% | ~54.6% | High | High but lower | Moderate | Strong multilingual; small–moderate closed-set dip. |
| E5-large (open-source) | — (strong) | High (0.5–0.7 on some tasks) | — | Lower | Moderate–significant | Broad contrastive training; generally robust. |
| BGE-large-en-v1.5 (open-source) | MTEB #1 (mid-2023) | High | — | Lower | Moderate–significant | Very strong on public sets; some drop on private. |
| Instructor-XL (instr-tuned) | — | SOTA on some BEIR | High | Notably lower | Significant | Shows overfitting risk (bigger private-set drop). |

Key takeaways:

  • RTEB recalibrates expectations: Even state-of-the-art models now cluster around the mid-0.60s in NDCG@10 (combined open and closed). That's currently the state of the art for real-world retrieval.

  • Look beyond leaderboard peaks: When choosing a retriever, prioritize models with smaller open→closed gaps, as those are your true generalizers.

  • Think of RTEB as MTEB 2.0 for retrieval: This has the same spirit of broad benchmarking, but with guardrails against overfitting and a deeper focus on unseen, multilingual, and enterprise-style data.
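The second takeaway can be sketched as a simple selection rule: rank candidates by their open→closed gap after filtering out models whose closed-set score is too low to be useful. The model names and scores below are purely illustrative, not actual RTEB numbers:

```python
# Hypothetical leaderboard rows: (model, open NDCG@10, closed NDCG@10).
scores = [
    ("model-a", 0.72, 0.58),  # strong open score, big private-set drop
    ("model-b", 0.68, 0.64),  # slightly lower open score, small drop
    ("model-c", 0.61, 0.60),  # modest but very stable
]

def generalization_rank(rows, min_closed=0.55):
    """Sort viable models by open->closed drop, smallest first.
    Models below the minimum closed-set score are excluded entirely."""
    viable = [(name, o, c, o - c) for name, o, c in rows if c >= min_closed]
    return sorted(viable, key=lambda r: r[3])

for name, open_s, closed_s, drop in generalization_rank(scores):
    print(f"{name}: open={open_s:.2f} closed={closed_s:.2f} drop={drop:.2f}")
```

Under this rule the stable generalizer wins even though it never tops the open leaderboard, which is exactly the reading RTEB encourages.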

Real-world use case#

How does RTEB actually help in practice? Let’s walk through a scenario. Suppose you’re building an enterprise search system or a Retrieval-Augmented Generation (RAG) pipeline for a global company. Your system will take user questions and retrieve documents from an internal knowledge base to feed into an LLM (so it can ground its answers and reduce hallucinations). This involves typical RAG components: an embedding model to vectorize queries and documents, a vector database for search, and the LLM to generate answers using retrieved information.

In an enterprise setting, your data isn’t just Wikipedia articles. It could include legal contracts, financial reports, customer support tickets, technical documents, and it may span multiple languages (offices around the world). You need an embedding model that retrieves accurately across all those domains and languages. This is exactly where RTEB’s design is beneficial.

  • Relevant domains: RTEB includes domain-specific evaluation sets like legal case retrieval, finance Q and A, code search, and healthcare FAQs. If your use case is, say, legal document search, you can check the leaderboard for which models perform best in the “legal” category of RTEB. As RTEB’s legal datasets are tailored (e.g. finding relevant case law from queries), a model that excels there is more likely to handle your proprietary legal documents well, compared to a model that might have just been good at trivia QA. In essence, RTEB lets you benchmark models on tasks analogous to your application, increasing confidence in the chosen model.

  • Multilingual capability: RTEB's coverage of 20 languages means it tests whether a model trained on, say, English and German can also retrieve well in French or Japanese. If your company operates in Europe and Asia, you likely have content in English, French, German, Japanese, etc. You'd want a single model that can handle queries and documents in all those languages (or at least a plan per language). On RTEB, you can see, for example, that a model like LaBSE or a multilingual DistilUSE variant might do consistently well across languages, whereas an English-centric model might score high on English tasks but fail on Japanese. This insight prevents costly mistakes like deploying a model that only works for half of your users. In fact, RTEB's early results showed that multilingual models retained stronger performance on the hidden (multilingual) datasets, hinting that they're more robust for a global audience.

  • RAG grounding and hallucination reduction: The whole point of RAG is to ground LLM outputs in real data. A known challenge is that if the retrieval step fails, LLMs will just make up an answer (the dreaded hallucination). By using RTEB to pick a top-tier retriever, you are effectively minimizing those retrieval failures. For instance, if RTEB shows Model X has NDCG@10 = 0.80 on a medical FAQ dataset (meaning it surfaces relevant info very reliably), using Model X in your medical chatbot will supply the LLM with the correct facts most of the time, so the LLM doesn't have to improvise an answer out of thin air. In contrast, a weaker model with NDCG@10 = 0.50 might miss important documents, leading the LLM to fill gaps with fiction. And because NDCG rewards good answer ranking, optimizing for it also tends to improve question-answering quality in RAG.

  • Enterprise evaluation: RTEB’s inclusion of private datasets also mirrors an enterprise’s situation where you have internal data not seen during model pre-training. It tests how a model deals with truly novel data. If a model’s performance doesn’t drop much from open to closed sets, it’s a good sign that it can handle your private corpora. This gives a more realistic evaluation of “enterprise readiness,” rather than benchmarks that only use public data. In an enterprise RAG evaluation, you might even treat some of your internal data as a “private eval set,” akin to RTEB’s approach to benchmark candidate models before deployment.
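The last point can be put into practice with a small harness: treat a held-out slice of internal queries, documents, and relevance judgments as your own "private eval set," and score candidate embedding models on it, RTEB-style. Everything below (the `evaluate` signature, the bag-of-words toy embedding) is an illustrative sketch, not RTEB's actual pipeline:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def ndcg_at_10(ranked_ids, relevant):
    """NDCG@10 with binary labels: rel = 1 if the doc id was judged relevant."""
    dcg = sum(1 / math.log2(rank + 2)
              for rank, doc_id in enumerate(ranked_ids[:10]) if doc_id in relevant)
    ideal = sum(1 / math.log2(rank + 2) for rank in range(min(10, len(relevant))))
    return dcg / ideal if ideal else 0.0

def evaluate(embed, queries, docs, qrels):
    """Average NDCG@10 of an embedding model over a private eval set.
    `embed` is any callable mapping text -> vector (the model under test)."""
    doc_vecs = {doc_id: embed(text) for doc_id, text in docs.items()}
    total = 0.0
    for q_id, q_text in queries.items():
        q_vec = embed(q_text)
        ranked = sorted(doc_vecs, key=lambda d: cosine(q_vec, doc_vecs[d]),
                        reverse=True)
        total += ndcg_at_10(ranked, qrels[q_id])
    return total / len(queries)

# Toy sanity check: a bag-of-words "embedding" over a tiny internal corpus.
vocab = ["refund", "invoice", "policy"]
toy_embed = lambda text: [text.lower().split().count(w) for w in vocab]
score = evaluate(
    toy_embed,
    queries={"q1": "refund", "q2": "invoice"},
    docs={"d1": "refund policy refund", "d2": "invoice invoice"},
    qrels={"q1": {"d1"}, "q2": {"d2"}},
)
```

Swapping `toy_embed` for each candidate model's encode function lets you compare them on data you know none of them have seen, which is the same generalization signal RTEB's closed sets provide.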

To sum up, RTEB isn’t just an academic exercise; it directly addresses everyday needs of systems that search, whether it’s a multilingual helpdesk chatbot, a legal research assistant, or a coding helper. By following RTEB’s lead (and even using it as a validation tool for your own model evaluations), you’re more likely to choose embeddings that make your search actually work for your users.

Future outlook#

RTEB is a significant advancement in retrieval benchmarks, but future improvements are needed. Key areas for development are outlined below.

  • Hybrid search evaluation: Incorporating hybrid methods that combine keyword (lexical) search with embeddings, along with more diverse query types.

  • Multimodal retrieval: Expanding beyond text to include various data types, requiring suitable datasets and relevance definitions.

  • Dynamic and continual evaluation: Benchmarks should evolve with “living” data and user-driven evaluation to measure model adaptation, with RTEB already moving in this direction.
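As a sketch of what hybrid evaluation might target, here is a toy scorer that blends lexical overlap with embedding similarity. The overlap function is a crude stand-in for BM25, and `alpha` is a hypothetical mixing weight, neither is part of RTEB itself:

```python
import math

def keyword_score(query, doc):
    """Lexical overlap: fraction of query terms that appear in the doc
    (a crude stand-in for BM25 in this sketch)."""
    q_terms = set(query.lower().split())
    d_terms = set(doc.lower().split())
    return len(q_terms & d_terms) / len(q_terms) if q_terms else 0.0

def hybrid_score(query, doc, embed, alpha=0.5):
    """Weighted blend of lexical and embedding similarity.
    alpha = 1.0 is pure keyword search; alpha = 0.0 is pure vector search."""
    q_vec, d_vec = embed(query), embed(doc)
    dot = sum(a * b for a, b in zip(q_vec, d_vec))
    norm = (math.sqrt(sum(a * a for a in q_vec))
            * math.sqrt(sum(b * b for b in d_vec)))
    semantic = dot / norm if norm else 0.0
    return alpha * keyword_score(query, doc) + (1 - alpha) * semantic
```

A benchmark covering hybrid retrieval would need relevance judgments that reward both exact-term matches (IDs, error codes) and semantic matches (paraphrased questions), which is part of why it remains an open design problem.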

The future of retrieval evaluation is holistic: benchmarks should be more representative, harder to game, and constantly evolving, much like real-world information. RTEB addresses the limitations of static benchmarks and aims to become a trusted standard for search evaluation, building on lessons from BEIR and MTEB, with multimodal retrieval and dynamic evaluation on its roadmap.

Written By:
Fahim