Enterprise RAG: Evaluation

Explore the evaluation framework for enterprise RAG systems focusing on three critical dimensions: faithfulness to detect hallucinations, relevance to assess retrieval quality, and groundedness for per-claim citation accuracy. Understand automated LLM-as-judge pipelines combined with human calibration to create scalable, production-ready evaluations. This lesson helps you design rigorous evaluation methods essential for reliable enterprise ML systems.

In a MAANG ML system design interview, you have just finished walking through your RAG pipeline: retrieval, reranking, context window packing, guardrails. The interviewer leans forward and asks: “How do you know this system actually works?” Most candidates stumble here. They mention BLEU or ROUGE, n-gram overlap metrics designed for machine translation and summarization, respectively, which tell you almost nothing about whether a RAG system is hallucinating, retrieving the wrong documents, or fabricating citations. The evaluation framework you present in the next two minutes will determine whether the interviewer sees you as someone who builds production systems or someone who stopped at the tutorial.

The previous lesson designed reranking, context window management, guardrails, and agentic RAG. None of those components can be trusted without rigorous evaluation. Production experience reveals a counterintuitive pattern in RAG failures: the primary failure mode is often not generation quality but inadequate retrieval, which leaves the generator without the evidence it needs and leads to ungrounded or hallucinated responses. Consider an enterprise legal Q&A system, where a hallucinated contract clause could trigger compliance violations. Evaluation must cover three distinct dimensions: faithfulness, relevance, and groundedness. Conflating them is one of the most common interview mistakes. This lesson defines each dimension, designs automated LLM-as-judge pipelines, then layers in human evaluation as the calibration mechanism.

Three evaluation dimensions

Each dimension answers a different question about a different stage of the RAG pipeline. Getting precise about these distinctions signals design maturity.

  • Faithfulness asks whether the generated answer accurately reflects the information in the retrieved context without adding unsupported claims. This dimension catches hallucination. A model that invents a contract clause not present in any retrieved chunk fails on faithfulness.

  • Relevance asks whether the retrieved context actually contains information needed to answer the user’s query. This dimension catches retrieval failures that occur upstream of generation. If retrieved chunks discuss a different product line than the one queried, relevance is low regardless of how well the generator summarizes those chunks.

  • Groundedness asks whether every individual claim in the answer is supported by a specific, citable passage in the context. This is stricter than ...
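
To make the distinction between the three dimensions concrete, here is a minimal Python sketch that scores one answer from hand-written judgment labels. It is an illustration under stated assumptions, not a reference implementation: the `ClaimJudgment` and `ChunkJudgment` types, the field names, and the choice to aggregate each dimension as a simple fraction are all hypothetical and do not come from the lesson. In a real pipeline the boolean labels would come from a judge (LLM or human) rather than being typed in by hand.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ClaimJudgment:
    """One claim extracted from the answer, with hypothetical judge labels."""
    text: str
    supported_by_context: bool             # is the claim backed by ANY retrieved chunk?
    cited_chunk_id: Optional[str] = None   # the specific passage the answer cites, if any
    citation_supports_claim: bool = False  # does that cited passage actually back the claim?


@dataclass
class ChunkJudgment:
    """One retrieved chunk, labeled for usefulness to the query."""
    chunk_id: str
    relevant_to_query: bool


def faithfulness(claims: List[ClaimJudgment]) -> float:
    """Fraction of answer claims supported by the retrieved context (assumed aggregation)."""
    if not claims:
        return 0.0  # assumption: an answer with no claims scores zero
    return sum(c.supported_by_context for c in claims) / len(claims)


def relevance(chunks: List[ChunkJudgment]) -> float:
    """Fraction of retrieved chunks judged useful for the query (assumed aggregation)."""
    if not chunks:
        return 0.0  # assumption: retrieving nothing scores zero
    return sum(ch.relevant_to_query for ch in chunks) / len(chunks)


def groundedness(claims: List[ClaimJudgment]) -> float:
    """Fraction of claims that cite a specific passage which supports them."""
    if not claims:
        return 0.0
    grounded = sum(
        1 for c in claims
        if c.cited_chunk_id is not None and c.citation_supports_claim
    )
    return grounded / len(claims)


# Hypothetical labels for a legal Q&A answer containing three claims.
claims = [
    ClaimJudgment("Termination requires 30 days notice.", True, "c1", True),
    ClaimJudgment("Notice must be given in writing.", True, None, False),  # supported, but uncited
    ClaimJudgment("Early termination incurs a 5% fee.", False, "c2", False),  # invented clause
]
chunks = [
    ChunkJudgment("c1", True),
    ChunkJudgment("c2", True),
    ChunkJudgment("c3", False),
    ChunkJudgment("c4", False),
]

print(f"faithfulness={faithfulness(claims):.2f}")
print(f"relevance={relevance(chunks):.2f}")
print(f"groundedness={groundedness(claims):.2f}")
```

In this toy data, the second claim is supported by the context but carries no citation, so it counts toward faithfulness and not toward groundedness. The third claim is an invented clause and fails both. Relevance is scored on the retrieved chunks alone, independently of anything the generator wrote.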