Evaluation for Generative AI Systems

Explore how to evaluate generative AI systems by mastering the RAG triad framework, which separates retrieval and generation errors through faithfulness, context relevance, and groundedness. Understand automated LLM judge methods and their biases, and learn to design robust human evaluation with clear rubrics and inter-annotator agreement. This lesson helps you propose effective, layered evaluation strategies essential for senior ML system design interviews.

We'll cover the following...

The RAG triad as an evaluation framework
LLM-as-judge evaluation frameworks
- How LLM judges work
- Known biases in LLM judges
Human evaluation design for generative systems
Conclusion

You are designing a RAG-based customer support assistant for an e-commerce platform. A user asks, “Can I return my opened headphones?” The system retrieves a return-policy document and generates an answer. The answer may sound fluent and confident. How do you know it is correct? How do you know the system retrieved the correct policy document and did not hallucinate a 90-day return window that is not in the policy? This evaluation problem is central to many generative AI system design interviews, and classification and ranking metrics alone are not enough here.

Traditional metrics like BLEU and ROUGE were designed for a world where one correct reference answer exists. BLEU measures n-gram precision between a generated output and a reference translation, while ROUGE measures recall of reference n-grams in the generated text. Both assume that lexical overlap with a gold-standard answer signals quality. For open-ended generation tasks such as summarization, conversational agents, and question answering over retrieved documents, this assumption collapses. Many valid outputs exist for the same input. A response can score high on ROUGE yet hallucinate facts simply because the hallucinated text happens to share n-grams with the reference. Conversely, a perfectly accurate answer phrased differently scores low.

This is not a minor inconvenience. It is a fundamental mismatch between the metric and the task.
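The mismatch can be seen in a minimal sketch. The code below computes a simplified unigram overlap (precision in the spirit of BLEU, recall in the spirit of ROUGE-1); it is not the full BLEU or ROUGE definition, which add brevity penalties, multiple n-gram orders, and stemming options. The reference sentence and its 30-day window are invented for illustration and do not come from any real policy.

```python
from collections import Counter


def ngrams(text, n):
    """Count the n-grams of a whitespace-tokenized, lowercased string."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def overlap(candidate, reference, n=1):
    """Return (precision, recall) of clipped n-gram overlap.

    Simplified stand-in for BLEU-style precision and ROUGE-N recall.
    """
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    matched = sum((cand & ref).values())  # Counter intersection clips counts
    precision = matched / max(sum(cand.values()), 1)
    recall = matched / max(sum(ref.values()), 1)
    return precision, recall


# Hypothetical gold answer (assumption: the policy allows 30 days).
reference = "opened headphones can be returned within 30 days of delivery"
# Fluent but wrong: the return window is hallucinated.
hallucinated = "opened headphones can be returned within 90 days of delivery"
# Correct, but phrased with different vocabulary.
paraphrase = "yes you have a month after it arrives to send back an opened headset"

for name, text in [("hallucinated", hallucinated), ("paraphrase", paraphrase)]:
    p, r = overlap(text, reference)
    print(f"{name}: precision={p:.2f} recall={r:.2f}")
```

Under these assumptions the hallucinated answer scores 0.90 on both precision and recall because only one token differs, while the correct paraphrase scores 0.07 precision and 0.10 recall. The metric ranks the wrong answer far above the right one.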

Generative AI evaluation must assess the retrieval component and the generation component separately. If the system produces a bad answer, you need to know whether the retriever surfaced the wrong documents or the generator fabricated claims beyond what was retrieved. Without this diagnostic separation, a single end-to-end score cannot tell you which component to fix, which is why the separation matters so much in production systems. The RAG triad is the structured framework that replaces single-score metrics: it evaluates faithfulness, context relevance, and groundedness as independent dimensions.
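As a rough sketch of why separate scores are useful for diagnosis, the snippet below maps two per-response scores to a likely failure source. Everything here is hypothetical: the `TriadScores` class, the `diagnose` function, and the 0.5 threshold are illustrative names and values, and the scores themselves are assumed to come from an LLM judge or human annotators on a 0 to 1 scale.

```python
from dataclasses import dataclass


@dataclass
class TriadScores:
    # Did the retriever surface documents relevant to the question?
    context_relevance: float
    # Are the answer's claims supported by the retrieved text?
    groundedness: float


def diagnose(scores, threshold=0.5):
    """Attribute a bad answer to retrieval, generation, or both.

    The threshold is an arbitrary illustrative cutoff, not a recommended value.
    """
    retrieval_ok = scores.context_relevance >= threshold
    generation_ok = scores.groundedness >= threshold
    if retrieval_ok and generation_ok:
        return "no failure detected"
    if not retrieval_ok and generation_ok:
        # The answer sticks to the retrieved text, but the text was wrong.
        return "retrieval failure"
    if retrieval_ok and not generation_ok:
        # The right policy was retrieved, but the answer went beyond it.
        return "generation failure"
    return "retrieval and generation failure"


# Right policy document retrieved, but the answer invents a 90-day window.
print(diagnose(TriadScores(context_relevance=0.9, groundedness=0.2)))
```

With a single end-to-end score, the two middle cases would be indistinguishable; keeping the dimensions separate points at the component to fix.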

The following table clarifies where traditional metrics break down and when the RAG triad becomes necessary:

Comparison of Evaluation Metrics for Generative AI

Metric	What It Measures	Failure Mode for Generative AI	When Still Useful
BLEU	N-gram precision against a reference	Penalizes valid paraphrases that use different vocabulary	Machine translation with constrained outputs
ROUGE	Recall of reference n-grams in generated text	Rewards hallucinated text that happens to overlap with the reference	Extractive summarization
BERTScore	Semantic similarity via contextual embeddings	Misses factual errors when semantics are similar but claims are wrong	Semantic similarity screening as a first pass
RAG Triad	Faithfulness, context relevance, and groundedness as separate scores	Costly to compute: requires careful judge-prompt design or human annotation	RAG-based systems and open-ended generation

The RAG triad as an evaluation framework