Evaluation for Generative AI Systems

Explore how to evaluate generative AI systems by mastering the RAG triad framework, which separates retrieval and generation errors through faithfulness, context relevance, and groundedness. Understand automated LLM judge methods and their biases, and learn to design robust human evaluation with clear rubrics and inter-annotator agreement. This lesson helps you propose effective, layered evaluation strategies essential for senior ML system design interviews.

You are designing a RAG-based customer support assistant for an e-commerce platform. A user asks, “Can I return my opened headphones?” The system retrieves a return-policy document and generates an answer. The answer may sound fluent and confident. How do you know it is correct? How do you know the system retrieved the correct policy document and did not hallucinate a 90-day return window that is not in the policy? This evaluation problem is central to many generative AI system design interviews, and classification and ranking metrics alone are not enough here.

Traditional metrics like BLEU and ROUGE were designed for a world where one correct reference answer (or a small set of them) exists. BLEU measures n-gram precision: the fraction of n-grams in the generated output that also appear in a reference translation. ROUGE measures recall: the fraction of reference n-grams that appear in the generated text. Both assume that lexical overlap with a gold-standard answer signals quality. For open-ended generation tasks such as summarization, conversational agents, and question answering over retrieved documents, this assumption collapses because many valid outputs exist for the same input. A response can score high on ROUGE yet hallucinate facts, because a response that copies the reference but changes a single number still matches almost all of its n-grams. Conversely, a perfectly accurate answer phrased differently scores low.

This is not a minor inconvenience. It is a fundamental mismatch between the metric and the task.
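The mismatch is easy to reproduce with a few lines of code. The sketch below computes clipped unigram precision and recall, a simplified stand-in for BLEU-1 and ROUGE-1 rather than either official implementation. The reference sentence and its 30-day window are hypothetical values invented for illustration; they are not taken from any real policy.

```python
import re
from collections import Counter


def tokens(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def unigram_overlap(candidate, reference):
    """Return (precision, recall) of clipped unigram overlap.

    Precision is a simplified BLEU-1 (no brevity penalty);
    recall is a simplified ROUGE-1.
    """
    cand = Counter(tokens(candidate))
    ref = Counter(tokens(reference))
    overlap = sum((cand & ref).values())  # clipped matching counts
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    return precision, recall


# Hypothetical reference answer (assumed policy: 30-day window).
reference = "Opened headphones can be returned within 30 days of delivery."
# Wrong on the one fact that matters, but lexically almost identical.
hallucinated = "Opened headphones can be returned within 90 days of delivery."
# Consistent with the reference, but worded differently.
paraphrase = "Yes, you have a month after it arrives to send back an opened headset."

hall_p, hall_r = unigram_overlap(hallucinated, reference)
para_p, para_r = unigram_overlap(paraphrase, reference)

print(f"hallucinated: precision={hall_p:.2f} recall={hall_r:.2f}")
# hallucinated: precision=0.90 recall=0.90
print(f"paraphrase:   precision={para_p:.2f} recall={para_r:.2f}")
# paraphrase:   precision=0.07 recall=0.10
```

The hallucinated answer wins on both scores even though it is the only one that would mislead the customer, which is the failure that overlap metrics cannot detect.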

Generative AI evaluation must separately assess the retrieval component and the generation component. If the system produces a bad answer, you need to know whether the retriever surfaced the wrong documents or the generator fabricated claims beyond what was retrieved. Practitioners emphasize this diagnostic separation for production systems because the two failures call for different fixes: a retrieval failure points to the index, the chunking, or the retriever, while a generation failure points to the prompt or the model. The structured framework that replaces single-score metrics is the RAG triad, which evaluates faithfulness, context relevance, and groundedness as independent dimensions.
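The diagnostic separation described above can be sketched as a small triage function. This is a minimal illustration, not a standard API: the `RagScores` class, the `diagnose` function, the 0 to 1 score range, and the 0.5 threshold are all assumptions made for this example, and the scores themselves are assumed to come from an LLM judge or human annotators.

```python
from dataclasses import dataclass


@dataclass
class RagScores:
    """Hypothetical per-response scores in [0, 1]."""
    context_relevance: float  # did the retriever surface useful passages?
    groundedness: float       # are the answer's claims supported by them?


def diagnose(scores, threshold=0.5):
    """Attribute a bad response to the retriever or the generator.

    Retrieval is checked first: if the context is irrelevant, the
    generator had nothing correct to ground its answer in, so the
    groundedness score says little about the generator itself.
    """
    if scores.context_relevance < threshold:
        return "retrieval"
    if scores.groundedness < threshold:
        return "generation"
    return "ok"


# The wrong policy document was retrieved.
print(diagnose(RagScores(context_relevance=0.2, groundedness=0.9)))  # retrieval
# The right document was retrieved, but the answer adds unsupported claims.
print(diagnose(RagScores(context_relevance=0.9, groundedness=0.3)))  # generation
```

Aggregating these labels over a test set shows which component to fix first, something a single end-to-end quality score cannot tell you.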

The following table clarifies where traditional metrics break down and when the RAG triad becomes necessary:

Comparison of Evaluation Metrics for Generative AI

| Metric | What It Measures | Failure Mode for Generative AI | When Still Useful |
| --- | --- | --- | --- |
| BLEU | N-gram precision against a reference | Penalizes valid paraphrases that use different vocabulary | Machine translation with constrained outputs |
| ROUGE | Recall of reference n-grams in generated text | Rewards hallucinated text that happens to overlap with the reference | Extractive summarization |
| BERTScore | Semantic similarity via contextual embeddings | Misses factual errors when semantics are similar but claims are wrong | Semantic similarity screening as a first pass |
| RAG Triad | Faithfulness plus relevance plus groundedness | Requires careful prompt design or human annotation to compute | RAG-based systems and open-ended generation |

With this foundation in place, the next step is to unpack each dimension of the RAG triad and understand exactly what it measures in the context of a live system.

The RAG triad as an evaluation framework

...