Hands-On Lab: Multi-Metric Evaluation
Explore how to evaluate large language models using diverse metrics across tasks such as summarization, open-domain question answering, and retrieval-augmented generation. Understand metric roles, implement practical evaluations, and analyze complementary scores to build reliable, multi-metric pipelines for production LLM systems.
Evaluating LLM outputs is not a single-metric problem. A summarization model might score well on word overlap but completely miss the meaning of the source text. A question-answering system might produce a factually correct response that shares almost no words with the reference answer. A RAG pipeline might retrieve perfect context but still hallucinate in its final response. Each of these failure modes requires a different evaluation lens, which is exactly why production systems combine multiple metrics rather than relying on any one score.
In the previous lesson, you explored RAGAS metrics for RAG pipeline evaluation. This lab extends that foundation across two additional task types, summarization and open-domain QA, giving you hands-on experience with five distinct evaluation approaches. By the end, you will have computed ROUGE, METEOR, and BERTScore for summarization, used G-Eval as an LLM judge for QA, run RAGAS on a RAG pipeline, and compared what each metric reveals about model quality.
The environment setup requires installing a handful of Python packages, listed below along with an example install command.
rouge-score: Provides the ROUGE metric implementation for measuring n-gram overlap between generated and reference text.
nltk: Contains the METEOR scoring function, which extends overlap matching with synonym and stemming support.
bert-score: Computes semantic similarity using contextual embeddings from pretrained transformer models.
openai: Powers the G-Eval LLM-as-judge calls through the GPT-4 API.
ragas: Runs the full RAGAS evaluation suite for RAG pipelines.
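All five packages are available on PyPI, so a single pip command along these lines should cover the setup:

```bash
pip install rouge-score nltk bert-score openai ragas
```

Note that METEOR additionally relies on NLTK's WordNet data, which can be fetched at runtime with `nltk.download("wordnet")`.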
All sample data is defined inline within the code so you can focus entirely on evaluation logic. Each section produces numeric scores, and the final section synthesizes everything into a comparison table.
Note: In production, Amazon SageMaker’s built-in evaluation capabilities support ROUGE, METEOR, and BERTScore natively. This lab uses standalone Python for full transparency into how each metric works under the hood.
Let’s start with the first evaluation track: summarization.
Evaluating summarization with ROUGE, METEOR, and BERTScore
Summarization evaluation answers a deceptively simple question: does the generated summary capture the same information as the reference? The challenge is that “same information” can mean surface-level word overlap or deeper semantic equivalence, and different metrics target different definitions.
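To make that concrete before the refresher, here is a minimal sketch of how all three scores might be computed on a single summary pair. The sample texts are illustrative placeholders (the lab defines its own inline data), and the calls use the rouge-score, nltk, and bert-score packages installed earlier.

```python
from rouge_score import rouge_scorer
from nltk.translate.meteor_score import meteor_score
from bert_score import score as bert_score
import nltk

# METEOR's synonym matching requires WordNet data.
nltk.download("wordnet", quiet=True)
nltk.download("omw-1.4", quiet=True)

# Illustrative sample pair; substitute the lab's inline data here.
reference = ("The central bank raised interest rates by half a percentage "
             "point to curb inflation.")
generated = ("To fight rising inflation, the central bank increased rates "
             "by 0.5 points.")

# ROUGE: n-gram overlap. The scorer expects the reference (target) first,
# then the prediction.
rouge = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
rouge_scores = rouge.score(reference, generated)

# METEOR: overlap plus stemming and synonym support. Recent NLTK versions
# expect pre-tokenized input, so split both texts into token lists.
meteor = meteor_score([reference.split()], generated.split())

# BERTScore: semantic similarity from contextual embeddings. Candidates come
# first, references second; the first run downloads a pretrained model.
P, R, F1 = bert_score([generated], [reference], lang="en")

print({name: round(s.fmeasure, 3) for name, s in rouge_scores.items()})
print("METEOR:", round(meteor, 3))
print("BERTScore F1:", round(F1.item(), 3))
```

Note the differing argument orders: rouge-score takes the reference before the prediction, while bert-score takes candidates before references, a common source of silently wrong scores.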
What each metric captures
Here is a quick refresher on the three metrics you will compute, without repeating the full theory from earlier lessons.
ROUGE measures n-gram overlap between the generated and ...