Why do today’s search and retrieval systems still leave us frustrated? We’ve all experienced it: you ask a question or enter a query, and the system returns results that miss the mark entirely. Large language models (LLMs) can produce incorrect answers for several reasons, including reasoning and training limitations, but one common trigger is retrieval failure: the system can’t fetch the right supporting evidence, so the model fills the gaps with confident but fabricated details.
The implications are serious. Everything from search engines and chatbots to recommendation systems and enterprise Q&A relies on robust retrieval. If our benchmarks aren’t capturing true retrieval quality, users pay the price in irrelevant results and LLMs that hallucinate answers. Improving this starts with better evaluation. Enter RTEB, the Retrieval Embedding Benchmark: a new initiative designed to bridge the generalization gap between benchmark scores and real-world retrieval quality, focusing on what matters for search in practice. Before we explore RTEB, let’s quickly look at how we got here with previous benchmarks.