BERTScore and Model-Based Evaluation
Explore how BERTScore measures semantic similarity using contextual embeddings to evaluate LLM outputs beyond simple word overlap. Understand G-Eval's use of GPT-4 for reference-free, human-aligned evaluation on coherence, fluency, and relevance. This lesson equips you with tools to apply multi-metric LLM evaluation pipelines for reliable quality assessment.
In the previous lesson, we saw that BLEU, ROUGE, and METEOR all share a fundamental limitation: they measure quality by counting overlapping words or n-grams between a candidate and a reference. Consider the sentence pair “The automobile was repaired” vs. “The car got fixed.” A human reader immediately recognizes these as semantically identical, yet BLEU scores this pair near zero because the sentences share almost no exact tokens. This gap between surface-level matching and actual meaning becomes a serious problem in production LLM applications, where outputs are creative, paraphrastic, and rarely mirror a reference word for word. N-gram metrics simply cannot keep up.
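To make that failure concrete, here is a quick sketch using NLTK's `sentence_bleu` on that exact pair. The smoothing function is only there to suppress zero-count warnings; it does not rescue the score.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "The automobile was repaired".lower().split()
candidate = "The car got fixed".lower().split()

# Smoothing avoids zero-count warnings for the unmatched higher-order
# n-grams; the score still lands near zero because the pair shares a
# single token ("the").
bleu = sentence_bleu([reference], candidate,
                     smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {bleu:.4f}")  # near zero despite identical meaning
```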
This lesson introduces two evaluation methods that operate at the semantic level. BERTScore uses contextual embeddings from pretrained transformer models to measure meaning-level similarity between tokens, giving full credit to paraphrases. G-Eval takes a different path entirely, using GPT-4 as an automated evaluator that judges outputs on human-aligned quality dimensions like coherence, fluency, and relevance. Together, these methods fill the gap that n-gram metrics leave open, and they form the backbone of modern automated evaluation pipelines. AWS SageMaker Model Monitor can track these richer metrics alongside traditional ones for continuous performance monitoring in deployed systems.
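As a preview of the LLM-as-judge pattern that G-Eval builds on, the sketch below asks a GPT-4-class model to score coherence against a rubric. The rubric wording, model name, and single-number output format are illustrative assumptions, not the official G-Eval template, which adds auto chain-of-thought instructions and probability-weighted scoring.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative rubric; not the official G-Eval prompt template.
RUBRIC = """You are an expert evaluator. Rate the summary below for
coherence on a scale from 1 (incoherent) to 5 (perfectly coherent).
Reply with the number only.

Source document:
{source}

Summary:
{summary}"""

def judge_coherence(source: str, summary: str) -> int:
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed model name; any GPT-4-class model works
        messages=[{"role": "user",
                   "content": RUBRIC.format(source=source, summary=summary)}],
        temperature=0,
    )
    return int(response.choices[0].message.content.strip())
```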
How BERTScore works
BERTScore is a metric that computes semantic similarity between a candidate sentence and a reference sentence by leveraging contextual embeddings from a pretrained transformer model such as BERT, rather than counting exact token matches.
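Before unpacking the mechanics, here is a minimal usage sketch with the `bert-score` package (installable via `pip install bert-score`); `lang="en"` selects the library's default English encoder, which is downloaded on first use.

```python
from bert_score import score  # pip install bert-score

candidates = ["The car got fixed."]
references = ["The automobile was repaired."]

# lang="en" picks the default English model; pass model_type=... to
# choose a specific encoder instead.
P, R, F1 = score(candidates, references, lang="en")
print(f"P={P.item():.3f}  R={R.item():.3f}  F1={F1.item():.3f}")
```

Unlike the near-zero BLEU result above, the F1 here should land high, because the embeddings of "car"/"automobile" and "repaired"/"fixed" sit close together in the encoder's vector space.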
The computation follows three distinct steps.
Step 1: Generate contextual embeddings
Both the candidate and reference sentences are tokenized and passed through a pretrained transformer encoder. Each token receives a contextual embedding vector. The word “bank” in “river bank” produces a different vector than “bank” in “savings bank,” which is precisely why these embeddings capture meaning rather than just spelling. ...
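To see this polysemy effect directly, the following minimal sketch (assuming Hugging Face `transformers` and the `bert-base-uncased` checkpoint) compares the contextual vectors that "bank" receives in two different sentences.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def token_vector(sentence: str, word: str) -> torch.Tensor:
    """Return the contextual embedding of `word`'s first occurrence."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index(word)]

river = token_vector("He sat on the river bank.", "bank")
savings = token_vector("She opened an account at the savings bank.", "bank")

sim = torch.nn.functional.cosine_similarity(river, savings, dim=0).item()
print(f"Cosine similarity between the two 'bank' vectors: {sim:.3f}")
# Noticeably below 1.0: the surrounding context reshapes the vector.
```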