Evaluation for Generative AI Systems
Explore how to evaluate generative AI systems by mastering the RAG triad framework, which separates retrieval and generation errors through faithfulness, context relevance, and groundedness. Understand automated LLM judge methods and their biases, and learn to design robust human evaluation with clear rubrics and inter-annotator agreement. This lesson helps you propose effective, layered evaluation strategies essential for senior ML system design interviews.
You are designing a RAG-based customer support assistant for an e-commerce platform. A user asks, “Can I return my opened headphones?” The system retrieves a return-policy document and generates an answer. The answer may sound fluent and confident. How do you know it is correct? How do you know the system retrieved the correct policy document and did not hallucinate a 90-day return window that is not in the policy? This evaluation problem is central to many generative AI system design interviews, and classification and ranking metrics alone are not enough here.
Traditional metrics like BLEU and ROUGE were designed for a world where one correct reference answer exists. BLEU measures n-gram precision between a generated output and a reference translation, while ROUGE measures recall of reference n-grams in the generated text. Both assume that lexical overlap with a gold-standard answer signals quality. For open-ended generation tasks such as summarization, conversational agents, and question answering over retrieved documents, this assumption collapses. Many valid outputs exist for the same input. A response can score high on ROUGE yet hallucinate facts simply because the hallucinated text happens to share n-grams with the reference. Conversely, a perfectly accurate answer phrased differently scores low.
This is not a minor inconvenience. It is a fundamental mismatch between the metric and the task.
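To make the mismatch concrete, here is a minimal sketch of n-gram overlap scoring. It is a simplified stand-in for BLEU-style precision and ROUGE-style recall, not either metric in full: it omits BLEU's brevity penalty and multi-order averaging, and ROUGE's variants. The function names and the three sentences, including the 30-day reference, are made up for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def overlap_scores(candidate, reference, n=1):
    """Return (precision, recall) of clipped n-gram overlap.

    Precision is the BLEU-style view (matched / candidate n-grams);
    recall is the ROUGE-style view (matched / reference n-grams).
    """
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    matched = sum((cand & ref).values())  # Counter intersection clips counts
    precision = matched / max(sum(cand.values()), 1)
    recall = matched / max(sum(ref.values()), 1)
    return precision, recall


# Illustrative sentences only; the reference policy text is an assumption.
reference = "opened headphones can be returned within 30 days"
hallucinated = "opened headphones can be returned within 90 days"
paraphrase = "you have a month to send back headphones after opening them"

for name, text in [("hallucinated", hallucinated), ("paraphrase", paraphrase)]:
    p, r = overlap_scores(text, reference)
    print(f"{name}: precision={p:.3f} recall={r:.3f}")
# hallucinated: precision=0.875 recall=0.875
# paraphrase: precision=0.091 recall=0.125
```

The answer with the wrong return window scores far higher than the accurate paraphrase, because only one token differs from the reference while the paraphrase shares a single word with it.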
Generative AI evaluation must assess the retrieval component and the generation component separately. If the system produces a bad answer, you need to know whether the retriever surfaced the wrong documents or the generator fabricated claims beyond what was retrieved. This diagnostic separation matters in production because the two failures call for different fixes: a retrieval failure points to the index, chunking, or ranking, while a generation failure points to the prompt or the model. The structured framework that replaces single-score metrics is the RAG triad, which evaluates faithfulness, context relevance, and groundedness as independent dimensions.
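The diagnostic separation described above can be sketched as a small triage rule. This is an illustrative sketch, not a standard API: it assumes each answer already has a context-relevance score and a faithfulness score in [0, 1] (from an LLM judge or human annotators), and the `diagnose` name and the 0.5 threshold are hypothetical choices.

```python
def diagnose(context_relevance: float, faithfulness: float,
             threshold: float = 0.5) -> str:
    """Attribute a failure to the retriever, the generator, or both.

    context_relevance: how well the retrieved documents match the question.
    faithfulness: how well the answer is supported by the retrieved documents.
    threshold: hypothetical cutoff; tune it on labeled examples.
    """
    retrieval_ok = context_relevance >= threshold
    generation_ok = faithfulness >= threshold
    if retrieval_ok and generation_ok:
        return "pass"
    if not retrieval_ok and generation_ok:
        # The answer sticks to the context, but the context is the wrong one.
        return "retrieval failure"
    if retrieval_ok and not generation_ok:
        # The right policy was retrieved, but the answer goes beyond it.
        return "generation failure"
    return "retrieval and generation failure"


# Right document retrieved, but the answer invents a 90-day window.
print(diagnose(context_relevance=0.9, faithfulness=0.2))
```

A single blended score would hide this distinction: both failure types lower the overall quality, but only the per-dimension view shows which component to fix.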
The following table clarifies where traditional metrics break down and when the RAG triad becomes necessary:
Comparison of Evaluation Metrics for Generative AI
Metric | What It Measures | Failure Mode for Generative AI | When Still Useful |
BLEU | N-gram precision against a reference | Penalizes valid paraphrases that use different vocabulary | Machine translation with constrained outputs |
ROUGE | Recall of reference n-grams in generated text | Rewards hallucinated text that happens to overlap with the reference | Extractive summarization |
BERTScore | Semantic similarity via contextual embeddings | Misses factual errors when semantics are similar but claims are wrong | Semantic similarity screening as a first pass |
RAG Triad | Faithfulness, context relevance, and groundedness, each scored separately | Requires careful prompt design or human annotation to compute | RAG-based systems and open-ended generation |
With this foundation in place, the next step is to unpack each dimension of the RAG triad and understand exactly what it measures in the context of a live system.