BLEU and ROUGE

Learn how BLEU and ROUGE evaluate LLM outputs by matching n-grams to reference texts, and why they often fall short in real-world generative tasks.

Top AI and tech companies now expect candidates to understand language model evaluation metrics beyond perplexity, which we covered previously. Metrics like BLEU and ROUGE come up frequently in interviews because they assess the quality of generated outputs rather than just how well a model predicts the next token.

Interviewers want to see if candidates grasp what BLEU and ROUGE measure, when they are appropriate to use, and their limitations, especially in open-ended tasks like conversational AI. This demonstrates the difference between internal model confidence (perplexity) and output quality (BLEU/ROUGE).

Strong candidates explain how to choose the right metric for each task and critically assess their strengths and weaknesses, showing they don’t apply BLEU/ROUGE everywhere. While perplexity indicates how well a model learns language patterns, BLEU and ROUGE are crucial for evaluating the relevance and usefulness of actual model outputs in real-world applications.

What exactly is BLEU?

BLEU (Bilingual Evaluation Understudy) is an automatic metric originally designed for evaluating machine translation. Intuitively, BLEU scores a candidate translation by checking how many n-grams (contiguous word sequences) it shares with one or more human reference translations. More formally, BLEU computes n-gram precision: for each n (typically 1 ≤ n ≤ 4), it counts the fraction of n-grams in the model’s output that appear in the reference(s). These precisions are then combined across n-gram orders, typically via a geometric mean. BLEU also includes a brevity penalty: if the generated translation is too short compared to the reference, BLEU penalizes it to avoid “cheating” by omitting content.
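In practice, BLEU is rarely computed by hand; libraries such as NLTK provide an implementation. The snippet below is a minimal sketch (assuming nltk is installed) that scores one candidate against a single reference using a simple whitespace tokenization; the exact number depends on the smoothing method chosen, which matters for sentences this short.

```python
# A minimal sketch using NLTK's BLEU implementation (assumes `pip install nltk`).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "the cat is on the mat".split()
candidate = "the cat sat on the mat".split()

# Standard BLEU-4: uniform weights over 1- to 4-gram precisions.
# Smoothing avoids a score of 0 when some higher-order n-gram has no match,
# which happens easily on very short sentences.
smooth = SmoothingFunction().method1
score = sentence_bleu([reference], candidate,
                      weights=(0.25, 0.25, 0.25, 0.25),
                      smoothing_function=smooth)
print(f"BLEU-4: {score:.3f}")
```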

The BLEU score ranges from 0 to 1, where 1 indicates a perfect match to the references. For example, if the reference is “The cat is on the mat” and the model output is “The cat sits on the mat,” most unigrams and bigrams overlap, so the BLEU score would be relatively high.

Let’s now make this concrete with a small example:

Reference: “The cat is on the mat.”
Candidate: “The cat sat on the mat.”

  • Step 1: Tokenize both sentences:

    • Reference tokens: [the, cat, is, on, the, mat]

    • Candidate tokens: [the, cat, sat, on, the, mat]

  • Step 2: Extract 1-grams:

| Position | Candidate 1-gram | In Reference? |
|----------|------------------|---------------|
| 1        | the              | yes           |
| 2        | cat              | yes           |
| 3        | sat              | no            |
| 4        | on               | yes           |
| 5        | the              | yes           |
| 6        | mat              | yes           |
  • Step 3: Count candidate and reference unigrams and clip the candidate counts:

| Unigram | Count in Candidate | Count in Reference | Clipped Count |
|---------|--------------------|--------------------|---------------|
| the     | 2                  | 2                  | 2             |
| cat     | 1                  | 1                  | 1             |
| sat     | 1                  | 0                  | 0             |
| on      | 1                  | 1                  | 1             |
| mat     | 1                  | 1                  | 1             |

Total matched 1-grams (clipped): 2 + 1 + 0 + 1 + 1 = 5

  • Step 4: Calculate modified precision:

    • Total candidate 1-grams: 6

    • Matched (clipped): 5

    • 1-gram precision = 5/6 ≈ 0.833

  • Step 5: Apply the brevity penalty (BP):

    • Reference length: 6

    • Candidate length: 6

    • As candidate length = reference length → BP = 1

  • Step 6: Compute final BLEU-1 score:

    • BLEU-1 = BP × Precision = 1 × 0.833 = 0.833

Interpretation:

  • The candidate matched 5 out of 6 unigrams from the reference.

  • The one mismatch (“sat” vs. “is”) reduced precision.

  • Final BLEU-1 score: ~83.3%, reflecting strong surface similarity with one substitution. (The sketch below reproduces this calculation in code.)
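Here is a minimal from-scratch sketch that reproduces the BLEU-1 calculation above. It is illustrative only: it handles a single reference and unigrams only, whereas full BLEU implementations also compute higher-order n-gram precisions and support multiple references.

```python
# From-scratch BLEU-1 for the worked example (illustrative sketch).
from collections import Counter
import math

reference = "the cat is on the mat".split()
candidate = "the cat sat on the mat".split()

ref_counts = Counter(reference)
cand_counts = Counter(candidate)

# Clipped matches: each candidate unigram counts at most as often as it
# appears in the reference.
clipped = sum(min(count, ref_counts[tok]) for tok, count in cand_counts.items())
precision = clipped / len(candidate)          # 5 / 6

# Brevity penalty: 1 if the candidate is at least as long as the reference.
c, r = len(candidate), len(reference)
bp = 1.0 if c >= r else math.exp(1 - r / c)

bleu_1 = bp * precision
print(f"Clipped matches: {clipped}, precision: {precision:.3f}, BLEU-1: {bleu_1:.3f}")
# Expected: Clipped matches: 5, precision: 0.833, BLEU-1: 0.833
```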

In practice, BLEU is computed with a formula like:

$$\text{BLEU} = \text{BP} \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)$$

Where:

  • $p_n$ is the modified n-gram precision for n-grams of order $n$

  • $w_n$ is the weight assigned to each n-gram order, typically uniform ($w_n = 1/N$, with $N = 4$ for standard BLEU)

  • $\text{BP}$ is the brevity penalty: 1 if the candidate is at least as long as the reference, and $e^{1 - r/c}$ otherwise, where $r$ is the reference length and $c$ is the candidate length
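To connect the formula to code, here is a small, hypothetical helper (not from any particular library) that applies the weighted geometric mean and brevity penalty to precisions that have already been computed. The precision values in the example call are made up purely for illustration.

```python
# Sketch of the BLEU combining formula: BP * exp(sum_n w_n * log p_n).
import math

def combine_bleu(precisions, brevity_penalty, weights=None):
    """Combine modified n-gram precisions into a BLEU score."""
    n = len(precisions)
    weights = weights or [1.0 / n] * n       # uniform weights by default
    if any(p == 0 for p in precisions):      # log(0) is undefined; unsmoothed BLEU is 0
        return 0.0
    weighted_log = sum(w * math.log(p) for w, p in zip(weights, precisions))
    return brevity_penalty * math.exp(weighted_log)

# Hypothetical precisions p_1..p_4, chosen only for illustration.
print(round(combine_bleu([0.833, 0.600, 0.400, 0.250], brevity_penalty=1.0), 3))
```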