BLEU and ROUGE
Explore the BLEU and ROUGE evaluation metrics to understand how they measure the quality of language model outputs. Learn the core concepts of n-gram precision and recall, their calculation methods, and their appropriate applications in machine translation and summarization. This lesson also covers the limitations of these metrics in open-ended tasks and how to implement them practically, preparing you to discuss these techniques effectively in AI interviews.
Top AI and tech companies now expect candidates to understand language model evaluation metrics beyond just perplexity, which was previously discussed. Metrics like BLEU and ROUGE are commonly discussed in interviews, as they evaluate the quality of generated outputs rather than just how well a model predicts the next token.
Interviewers want to see if candidates grasp what BLEU and ROUGE measure, when they are appropriate to use, and their limitations, especially in open-ended tasks like conversational AI. This demonstrates the difference between internal model confidence (perplexity) and output quality (BLEU/ROUGE).
Strong candidates explain how to choose the right metric for each task and critically assess their strengths and weaknesses, showing they don’t apply BLEU/ROUGE everywhere. While perplexity indicates how well a model learns language patterns, BLEU and ROUGE are crucial for evaluating the relevance and usefulness of actual model outputs in real-world applications.
What is BLEU, and how does it evaluate machine translation?
BLEU (Bilingual Evaluation Understudy) is an automatic metric originally designed for evaluating machine translation. Intuitively, BLEU scores a candidate translation by checking how many n-grams (contiguous word sequences) it shares with one or more human reference translations. More formally, BLEU computes n-gram precision: for each n (typically 1≤n≤4), it counts the fraction of n-grams in the model’s output that appear in the reference(s). These precisions are then typically combined (using the geometric mean) across n-gram orders. BLEU also includes a brevity penalty: if the generated translation is too short compared to the reference, BLEU penalizes it to avoid “cheating” by omitting content.
The BLEU score ranges from 0 to 1, where 1 indicates a perfect match to the references. For example, if the reference is “The cat is on the mat,” and the model output is “The cat sits on the mat,” most words and bigrams overlap, so the BLEU score would be relatively high.
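As a quick sanity check, here is a minimal sketch of plain n-gram precision for that example. The `ngrams` helper and whitespace tokenization are illustrative assumptions; clipping and the brevity penalty, both part of real BLEU, appear in the worked example below.

```python
# Minimal sketch: plain n-gram precision for the "cat sits on the mat" example.
# Assumes lowercased, whitespace-tokenized input; clipping and the brevity
# penalty (both part of full BLEU) are covered in the worked example below.

def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

reference = "the cat is on the mat".split()
candidate = "the cat sits on the mat".split()

for n in (1, 2):
    ref_set = set(ngrams(reference, n))
    cand_ngrams = ngrams(candidate, n)
    matched = sum(1 for g in cand_ngrams if g in ref_set)
    print(f"{n}-gram precision: {matched}/{len(cand_ngrams)}")
# 1-gram precision: 5/6
# 2-gram precision: 3/5
```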
Let’s now make this concrete with a small example:
Reference: “The cat is on the mat.”
Candidate: “The cat sat on the mat.”
Step 1: Tokenize both sentences:
Reference tokens: [the, cat, is, on, the, mat]
Candidate tokens: [the, cat, sat, on, the, mat]
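A tiny tokenizer is enough to reproduce Step 1 in code. The lowercasing regex below is an assumption for this example only; real BLEU tooling such as sacrebleu applies its own standard tokenization.

```python
import re

def tokenize(text):
    """Lowercase and keep alphabetic words only; sufficient for this example."""
    return re.findall(r"[a-z]+", text.lower())

print(tokenize("The cat is on the mat."))  # ['the', 'cat', 'is', 'on', 'the', 'mat']
print(tokenize("The cat sat on the mat."))  # ['the', 'cat', 'sat', 'on', 'the', 'mat']
```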
Step 2: Extract 1-grams:
| Position | Candidate 1-gram | In Reference? |
|----------|------------------|---------------|
| 1 | the | yes |
| 2 | cat | yes |
| 3 | sat | no |
| 4 | on | yes |
| 5 | the | yes |
| 6 | mat | yes |
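Step 2 can be checked with a few lines of Python. This is a rough sketch: it only tests membership, not how many times a repeated word may be credited (that is what clipping handles in the next step).

```python
reference = ["the", "cat", "is", "on", "the", "mat"]
candidate = ["the", "cat", "sat", "on", "the", "mat"]

# For each candidate unigram, note whether it occurs anywhere in the reference.
for position, token in enumerate(candidate, start=1):
    print(position, token, "yes" if token in reference else "no")
```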
Step 3: Count candidate and reference unigrams:
| 1-gram | Candidate count | Reference count | Clipped count |
|--------|-----------------|-----------------|---------------|
| the | 2 | 2 | 2 |
| cat | 1 | 1 | 1 |
| sat | 1 | 0 | 0 |
| on | 1 | 1 | 1 |
| mat | 1 | 1 | 1 |
Total matched 1-grams (clipped): 2 + 1 + 0 + 1 + 1 = 5
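One way to reproduce the clipped counts is with `collections.Counter`. This is a sketch, not a reference implementation.

```python
from collections import Counter

ref_counts = Counter(["the", "cat", "is", "on", "the", "mat"])
cand_counts = Counter(["the", "cat", "sat", "on", "the", "mat"])

# Each candidate unigram is credited at most as often as it appears in the reference.
clipped = {w: min(count, ref_counts[w]) for w, count in cand_counts.items()}
print(clipped)                # {'the': 2, 'cat': 1, 'sat': 0, 'on': 1, 'mat': 1}
print(sum(clipped.values()))  # 5
```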
Step 4: Calculate modified precision:
Total candidate 1-grams: 6
Matched (clipped): 5
1-gram precision = 5/6 ≈ 0.833
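In code, Step 4 is a single division, continuing the sketch above.

```python
clipped_matches = 5     # clipped 1-gram matches from Step 3
candidate_unigrams = 6  # total unigrams in the candidate
precision_1 = clipped_matches / candidate_unigrams
print(round(precision_1, 3))  # 0.833
```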
Step 5: Apply the brevity penalty (BP):
Reference length: 6
Candidate length: 6
As candidate length = reference length → BP = 1
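The standard brevity penalty is 1 when the candidate is longer than the reference and exp(1 − r/c) otherwise. A small sketch (the function name is illustrative):

```python
import math

def brevity_penalty(candidate_len, reference_len):
    """BP = 1 if the candidate is longer than the reference, else exp(1 - r/c)."""
    if candidate_len > reference_len:
        return 1.0
    return math.exp(1 - reference_len / candidate_len)

print(brevity_penalty(6, 6))            # 1.0 -- equal lengths, no penalty
print(round(brevity_penalty(3, 6), 3))  # 0.368 -- a half-length candidate is penalized
```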
Step 6: Compute final BLEU-1 score:
BLEU-1 = BP × Precision = 1 × 0.833 = 0.833
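Putting Steps 4 and 5 together in code is a one-line multiplication; libraries such as NLTK (`nltk.translate.bleu_score.sentence_bleu`) or sacrebleu can serve as a cross-check if you want one.

```python
# Final BLEU-1 for the worked example: brevity penalty times clipped unigram precision.
bp = 1.0
precision_1 = 5 / 6
bleu_1 = bp * precision_1
print(round(bleu_1, 3))  # 0.833
```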
Interview trap: An interviewer might ask, “Why does BLEU use ‘clipped’ counts instead of raw counts?” and candidates sometimes struggle to explain this.
The clipping prevents gaming the metric! Without clipping, a candidate could repeat a single matching word many times (e.g., “the the the the the the”) and achieve high precision. Clipping caps each n-gram’s count at its maximum occurrence in any reference, so repeating “the” 100 times when the reference only has 2 occurrences yields only 2 matches. This “modified precision” is crucial for BLEU’s validity.
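A quick demonstration of why clipping matters, using the same Counter-based sketch as above:

```python
from collections import Counter

reference = ["the", "cat", "is", "on", "the", "mat"]
degenerate = ["the"] * 6  # "the the the the the the"

ref_counts = Counter(reference)
cand_counts = Counter(degenerate)

# Raw precision: every "the" counts, so the degenerate output looks perfect.
raw = sum(count for word, count in cand_counts.items() if word in ref_counts)
# Clipped precision: "the" is credited at most twice (its count in the reference).
clipped = sum(min(count, ref_counts[word]) for word, count in cand_counts.items())

print(raw / len(degenerate))                # 1.0   (gamed)
print(round(clipped / len(degenerate), 3))  # 0.333 (honest)
```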
Interpretation:
The candidate matched 5 out of 6 unigrams from the reference.
The one mismatch (“sat” vs. “is”) reduced precision.
Final BLEU-1 score: ~83.3%, reflecting strong surface similarity with one substitution.
In practice, BLEU is computed with a formula like:

$$\text{BLEU} = BP \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)$$

Where:

- $p_n$ is the modified (clipped) n-gram precision for n-grams of order $n$,
- $w_n$ is the weight for each n-gram order (typically uniform, $w_n = 1/N$, with $N = 4$),
- $BP$ is the brevity penalty: $1$ if the candidate is longer than the reference, and $e^{1 - r/c}$ otherwise, where $c$ and $r$ are the candidate and reference lengths.
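Here is a compact end-to-end sketch of that formula (clipped precisions, uniform weights, brevity penalty). The `bleu` function is illustrative, handles a single reference for simplicity (with multiple references, clipping takes the maximum count across them), and production work would normally use a library such as NLTK or sacrebleu.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(reference, candidate, n):
    """Clipped n-gram precision against a single reference."""
    ref_counts = Counter(ngrams(reference, n))
    cand_counts = Counter(ngrams(candidate, n))
    clipped = sum(min(count, ref_counts[g]) for g, count in cand_counts.items())
    total = max(len(candidate) - n + 1, 1)
    return clipped / total

def bleu(reference, candidate, max_n=4):
    precisions = [modified_precision(reference, candidate, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:  # the geometric mean collapses to 0 if any p_n is 0
        return 0.0
    log_avg = sum(math.log(p) for p in precisions) / max_n  # uniform weights w_n = 1/N
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / c)               # brevity penalty
    return bp * math.exp(log_avg)

reference = "the cat is on the mat".split()
candidate = "the cat sat on the mat".split()
print(round(bleu(reference, candidate, max_n=1), 3))  # 0.833 -- matches the worked example
print(round(bleu(reference, candidate, max_n=2), 3))  # 0.707
print(bleu(reference, candidate, max_n=4))            # 0.0 -- no 4-gram overlap here
```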