BERTScore and Model-Based Evaluation
Explore how BERTScore measures semantic similarity using contextual embeddings to evaluate LLM outputs beyond simple word overlap. Understand G-Eval's use of GPT-4 for reference-free, human-aligned evaluation on coherence, fluency, and relevance. This lesson equips you with tools to apply multi-metric LLM evaluation pipelines for reliable quality assessment.
In the previous lesson, we saw that BLEU, ROUGE, and METEOR all share a fundamental limitation: they measure quality by counting overlapping words or n-grams between a candidate and a reference. Consider the sentence pair “The automobile was repaired” vs. “The car got fixed.” A human reader immediately recognizes these as semantically identical, yet BLEU scores this pair near zero because the sentences share almost no exact tokens. This gap between surface-level matching and actual meaning becomes a serious problem in production LLM applications, where outputs are creative, paraphrastic, and rarely mirror a reference word for word. N-gram metrics simply cannot keep up.
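To make that failure concrete, here is a quick sketch using NLTK's `sentence_bleu` on that exact pair. The smoothing function is only there to suppress zero-count warnings; it does not rescue the score.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "The automobile was repaired".lower().split()
candidate = "The car got fixed".lower().split()

# Smoothing avoids zero-count warnings for the unmatched higher-order
# n-grams; the score still lands near zero because the pair shares a
# single token ("the").
bleu = sentence_bleu([reference], candidate,
                     smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {bleu:.4f}")  # near zero despite identical meaning
```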
This lesson introduces two evaluation methods that operate at the semantic level. BERTScore uses contextual embeddings from pretrained transformer models to measure meaning-level similarity between tokens, giving full credit to paraphrases. G-Eval takes a different path entirely, using GPT-4 as an automated evaluator that judges outputs on human-aligned quality dimensions like coherence, fluency, and relevance. Together, these methods fill the gap that n-gram metrics leave open, and they form the backbone of modern automated evaluation pipelines. AWS SageMaker Model Monitor can track these richer metrics alongside traditional ones for continuous performance monitoring in deployed systems.
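As a preview of the LLM-as-judge pattern that G-Eval builds on, the sketch below asks a GPT-4-class model to score coherence against a rubric. The rubric wording, model name, and single-number output format are illustrative assumptions, not the official G-Eval template, which adds auto chain-of-thought instructions and probability-weighted scoring.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative rubric; not the official G-Eval prompt template.
RUBRIC = """You are an expert evaluator. Rate the summary below for
coherence on a scale from 1 (incoherent) to 5 (perfectly coherent).
Reply with the number only.

Source document:
{source}

Summary:
{summary}"""

def judge_coherence(source: str, summary: str) -> int:
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed model name; any GPT-4-class model works
        messages=[{"role": "user",
                   "content": RUBRIC.format(source=source, summary=summary)}],
        temperature=0,
    )
    return int(response.choices[0].message.content.strip())
```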
How BERTScore works
BERTScore is a metric that computes semantic similarity between a candidate sentence and a reference sentence by leveraging contextual embeddings from a pretrained transformer model such as BERT, rather than counting exact token matches.
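Before unpacking the mechanics, here is a minimal usage sketch with the `bert-score` package (installable via `pip install bert-score`); `lang="en"` selects the library's default English encoder, which is downloaded on first use.

```python
from bert_score import score  # pip install bert-score

candidates = ["The car got fixed."]
references = ["The automobile was repaired."]

# lang="en" picks the default English model; pass model_type=... to
# choose a specific encoder instead.
P, R, F1 = score(candidates, references, lang="en")
print(f"P={P.item():.3f}  R={R.item():.3f}  F1={F1.item():.3f}")
```

Unlike the near-zero BLEU result above, the F1 here should land high, because the embeddings of "car"/"automobile" and "repaired"/"fixed" sit close together in the encoder's vector space.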
The computation follows three distinct steps.
Step 1: Generate contextual embeddings
Both the candidate and reference sentences are tokenized and passed through a pretrained transformer encoder. Each token receives a contextual embedding vector. The word “bank” in “river bank” produces a different vector than “bank” in “savings bank,” which is precisely why these embeddings capture meaning rather than just spelling. ...
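To see this polysemy effect directly, the following minimal sketch (assuming Hugging Face `transformers` and the `bert-base-uncased` checkpoint) compares the contextual vectors that "bank" receives in two different sentences.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def token_vector(sentence: str, word: str) -> torch.Tensor:
    """Return the contextual embedding of `word`'s first occurrence."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index(word)]

river = token_vector("He sat on the river bank.", "bank")
savings = token_vector("She opened an account at the savings bank.", "bank")

sim = torch.nn.functional.cosine_similarity(river, savings, dim=0).item()
print(f"Cosine similarity between the two 'bank' vectors: {sim:.3f}")
# Noticeably below 1.0: the surrounding context reshapes the vector.
```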