
Evaluating Large Language Models

Explore how systematic evaluation using intrinsic and extrinsic metrics reveals large language models’ capabilities, trade-offs, and limitations.

We’ve seen how foundation models are built through large-scale pretraining, refined with post-training and fine-tuning techniques, and optimized for real-world deployment. But how do we know if all these steps actually work? That’s where evaluation comes in: it measures how well models perform across tasks, revealing both their strengths and their limitations.

Why do we evaluate LLMs?

Imagine you’ve built an incredible car. It’s sleek, powerful, and packed with features. But how do you know if it’s the best on the market? You’d test it on the road and check its fuel efficiency, safety ratings, acceleration, etc. In the same way, evaluating LLMs is like putting them through a series of tests to see how well they perform. Evaluation helps us:

  • Measure performance: We want to know how accurately a model predicts or generates text.

  • Compare models: With so many LLMs available, metrics let us compare one model against another.

  • Understand trade-offs: Some models might be excellent at understanding context but not as good at generating text, and vice versa.

  • Guide improvements: Knowing where a model falls short helps researchers improve it.

Evaluations can be split into two broad categories: intrinsic and extrinsic metrics. Intrinsic metrics focus on the model’s language modeling capabilities, while extrinsic metrics assess performance on specific downstream tasks.

What are intrinsic evaluation metrics?

Intrinsic evaluation measures how well a model performs on the tasks it was trained on, often by comparing its predictions to a reference or gold standard. Two of the most commonly used intrinsic metrics are perplexity and BLEU.

What is perplexity?

Perplexity measures how well a probability model (like a language model) predicts a sample. It tells us how surprised the model is when it sees the test data. A lower perplexity means the model is less surprised, generally indicating that it predicts the next word in a sentence better. Imagine you’re reading a mystery novel, and every time you turn the page, you’re trying to predict what will happen next. If you can almost always guess correctly, you’re not very surprised by what happens—you have low perplexity. But if the story throws unexpected twists at you, you’re constantly caught off guard—this is high perplexity.

In a language model, perplexity is calculated using the probabilities the model assigns to the words in a sentence. The overall perplexity is low if the model is confident in its predictions (assigning high probabilities to the correct words). Mathematically, if a model assigns a probability P(w_i) ...
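To make the calculation concrete, here is a minimal sketch of perplexity as the exponential of the average negative log-probability the model assigned to each observed token. The probability values are made up purely for illustration; in practice they would come from a language model's output distribution.

```python
import math

def perplexity(token_probs):
    """Perplexity = exp of the average negative log-probability
    assigned to each observed token."""
    n = len(token_probs)
    avg_neg_log_prob = -sum(math.log(p) for p in token_probs) / n
    return math.exp(avg_neg_log_prob)

# A confident model assigns high probabilities to the correct words...
confident = perplexity([0.9, 0.8, 0.95, 0.85])
# ...while a "surprised" model assigns low probabilities to them.
surprised = perplexity([0.1, 0.05, 0.2, 0.15])

print(confident)            # close to 1: the model is rarely surprised
print(surprised)            # much larger: the model is often caught off guard
print(confident < surprised)  # True: lower perplexity means better prediction
```

Note the sanity check this formula passes: a model that assigns probability 0.5 to every token has perplexity exactly 2, as if it were choosing between two equally likely options at each step.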