RAGAS: Evaluating RAG Pipelines End-to-End
Understand how to evaluate retrieval-augmented generation pipelines end-to-end by combining automated metrics with human judgment. Learn to define distinct evaluation dimensions, design effective Likert scales with anchors, write clear annotation guidelines, and measure inter-annotator reliability using Cohen's Kappa. This lesson helps you apply rigorous evaluation methods critical for high-stakes and nuanced AI applications.
Automated metrics like BERTScore and G-Eval offer scalable ways to evaluate LLM outputs, but they remain approximations of human judgment. An embedding-based similarity score cannot detect a subtle factual error buried in an otherwise fluent paragraph. A model-based evaluator might rate culturally inappropriate phrasing as perfectly coherent because the grammar is correct. These metrics correlate with human ratings on average, yet they diverge precisely where it matters most: on edge cases involving nuanced factual errors, awkward phrasing that embeddings treat as natural, or outputs that are technically correct but miss the user’s intent entirely.
Human evaluation remains the gold standard whenever subjective judgment is required, when deploying in high-stakes domains like healthcare or law, and when validating that automated metrics actually track quality for a specific task. By the end of this lesson, you will be able to design a complete human evaluation study with defined quality dimensions, Likert scales, annotation guidelines, and reliability measurement using Cohen’s Kappa.
Evaluation dimensions and Likert scales
Before asking a human annotator to rate an LLM output, you need to specify exactly what they are rating. Vague instructions like “rate the quality” produce noisy, inconsistent data because each annotator interprets “quality” differently. Instead, evaluation studies decompose quality into distinct dimensions, each targeting a specific aspect of the output.
Four dimensions appear most frequently in LLM evaluation research.
Fluency measures whether the text is grammatically correct and reads naturally, as a native speaker would expect.
Coherence captures the logical flow and consistency across sentences, ensuring the output does not contradict itself or jump between unrelated ideas.
Faithfulness assesses whether every claim in the output is grounded in a provided source document, making it critical for summarization and RAG pipelines.
Relevance determines whether the output actually addresses the user’s prompt rather than producing well-written but off-topic content.
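To make these definitions operational, the four dimensions can be encoded as a simple rubric handed to annotators or to an annotation tool. The sketch below is one illustrative way to do this in Python; the dictionary layout and the exact question wording are assumptions, not a standard format.

```python
# Illustrative rubric: each dimension paired with the question an
# annotator answers. Field names and phrasing are assumptions.
EVALUATION_DIMENSIONS = {
    "fluency": "Is the text grammatically correct and natural-sounding?",
    "coherence": "Do sentences flow logically, without contradictions or topic jumps?",
    "faithfulness": "Is every claim grounded in the provided source document?",
    "relevance": "Does the output actually address the user's prompt?",
}
```

Keeping the rubric in a single structure like this makes it easy to render the same questions consistently for every annotator and every output.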
Designing Likert scales with anchors
Each dimension is measured using a Likert scale: an ordered rating scale in which every point is defined by an anchor, a concrete description of what that score means. Anchors are what make ratings comparable across annotators.
Consider a 5-point Likert scale for faithfulness. A score of 1 means “multiple fabricated facts with no basis in the source.” A score of 3 means “some claims are supported but others are unverifiable.” A score of 5 means “every claim is directly traceable to the source document.” Without these anchors, one annotator’s 3 might be another annotator’s 4.
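Written down as data, that anchored scale might look like the sketch below. Only scores 1, 3, and 5 are anchored in the text above; the wording for 2 and 4 is an illustrative assumption added to complete the scale.

```python
# Anchored 5-point Likert scale for faithfulness.
# Scores 1, 3, and 5 quote the lesson text; 2 and 4 are assumed
# interpolations included only to show a fully anchored scale.
FAITHFULNESS_ANCHORS = {
    1: "Multiple fabricated facts with no basis in the source.",
    2: "At least one clearly fabricated or contradicted claim.",  # assumption
    3: "Some claims are supported but others are unverifiable.",
    4: "All claims are supported; minor details lack explicit grounding.",  # assumption
    5: "Every claim is directly traceable to the source document.",
}
```

Because every score maps to an observable behavior rather than a vague label, two annotators looking at the same output have a shared reference for choosing between a 3 and a 4.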
The number of scale points also matters. Five-point scales balance granularity with cognitive load, making them the most common choice. Seven-point scales can increase sensitivity for fine-grained distinctions but tend to increase annotator fatigue on large batches. For most LLM evaluation tasks, five points provide sufficient resolution.
Practical tip: Write your anchors as observable behaviors (“multiple fabricated facts”) rather than subjective judgments (“poor faithfulness”). Observable anchors reduce ambiguity and improve agreement between annotators.
These dimensions map directly ...