
Evaluating AI and LLM Models

Learn to evaluate AI and large language models by understanding metrics like perplexity, BLEU, ROUGE, and benchmarking. Discover their strengths, limitations, and how to combine multiple metrics smartly. Understand issues with leaderboards, the use of human and LLM evaluations, and techniques to critique and implement scoring systems like BLEU. This lesson equips you to assess model quality beyond surface scores and prepare for interview questions on model evaluation.

Evaluation is the unglamorous half of AI engineering that separates teams that ship reliable systems from teams that are perpetually surprised by production failures. Every metric in this lesson has a failure mode. Every benchmark has been gamed. The skill interviewers are testing is not whether you can recite BLEU’s formula, but whether you understand what each metric actually measures, what it misses, and when you would choose one over another.

Important: Model evaluation is fundamentally unsolved. There is no metric that perfectly captures “this model is good.” Every metric is a proxy. Your job as an AI engineer is to pick the right combination of proxies for your specific task, understand their blind spots, and layer them to catch what individual metrics miss. A candidate who treats any single metric as ground truth will raise red flags.

What is perplexity and what are its limitations?

Perplexity measures how surprised a language model is by a held-out test corpus. The model assigns a probability to each token in the test sequence conditioned on all previous tokens, and perplexity is the exponentiation of the average negative log-probability:

PPL = exp( -(1/N) * sum_{i=1}^{N} log p(x_i | x_{<i}) )

Intuitively, perplexity is the model’s effective branching factor: the number of equally likely options it is choosing from at each step. A perplexity of 10 means the model is, on average, as uncertain as if it were choosing uniformly among 10 options. Lower is better. A model that perfectly predicts the test set would have perplexity 1.
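The definition above can be sketched in a few lines. This is a minimal illustration computing perplexity from the probability the model assigned to each observed token (the function name and inputs are illustrative, not a specific library’s API):

```python
import math

def perplexity(token_probs):
    """Perplexity from the model's probability p(x_i | x_<i) for each
    token in a held-out sequence: exp of the average negative log-prob."""
    avg_neg_log_prob = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log_prob)

# Perfect prediction: every token assigned probability 1 -> perplexity 1.
print(perplexity([1.0, 1.0, 1.0]))   # 1.0

# Uniform uncertainty over 10 options at every step -> perplexity ~10,
# matching the "effective branching factor" intuition.
print(perplexity([0.1] * 5))         # ~10.0
```

In practice the per-token log-probabilities come directly from the model’s output distribution over the vocabulary; the arithmetic is the same.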

Perplexity is fast to compute, does not require human annotation, and is a reliable signal for comparing two models of the same architecture on the same distribution. It is the primary metric used during pretraining to track learning progress.

The limitations are significant. Perplexity is distribution-sensitive: a model trained on web text will have low perplexity on web text and high perplexity on medical records, which reflects distribution match rather than overall quality, so you cannot meaningfully compare perplexity scores across different test sets. Perplexity also says nothing about correctness, coherence, factual accuracy, or helpfulness: a model that confidently makes up plausible-sounding facts can still have excellent perplexity. For generative tasks, perplexity correlates weakly with what users actually care about.

How do BLEU and ROUGE work and when should you use them?

Both BLEU and ROUGE measure the n-gram overlap between a generated text and one or more human reference texts. They are the workhorses of automatic evaluation for machine translation and summarization, respectively.
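The shared idea behind both metrics can be sketched directly: extract n-grams from candidate and reference, count clipped matches, and divide by either the candidate length (a precision-style view, as in BLEU) or the reference length (a recall-style view, as in ROUGE). The helper names here are illustrative, and this omits the full BLEU formula (multiple n-gram orders, brevity penalty) covered below:

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clipped_overlap(candidate, reference, n):
    """Each candidate n-gram counts at most as often as it appears in the
    reference, so repeating a matching word cannot inflate the score."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    return sum(min(count, ref[g]) for g, count in cand.items())

cand = "the cat sat on the mat".split()
ref  = "the cat is on the mat".split()

matches = clipped_overlap(cand, ref, 1)
precision_1 = matches / sum(ngrams(cand, 1).values())  # BLEU-style: / candidate
recall_1    = matches / sum(ngrams(ref, 1).values())   # ROUGE-style: / reference
print(matches, precision_1, recall_1)  # 5 of 6 unigrams overlap either way
```

The same counting generalizes to bigrams, trigrams, and beyond; only the denominator changes between the precision and recall orientations.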

BLEU (Bilingual Evaluation Understudy) is precision-oriented. For each n-gram size from 1 to 4, it counts ...