
Evaluating Large Language Models

Explore how systematic evaluation using intrinsic and extrinsic metrics reveals large language models’ capabilities, trade-offs, and limitations.

We’ve seen how foundation models are built through large-scale pretraining, refined with post-training and fine-tuning techniques, and optimized for real-world deployment. But how do we know if all these steps actually work? That’s where evaluation comes in: it measures how well models perform across tasks, revealing both their strengths and their limitations.

Why do we evaluate LLMs?

Imagine you’ve built an incredible car. It’s sleek, powerful, and packed with features. But how do you know if it’s the best on the market? You’d test it on the road and check its fuel efficiency, safety ratings, acceleration, etc. In the same way, evaluating LLMs is like putting them through a series of tests to see how well they perform. Evaluation helps us:

  • Measure performance: We want to know how accurately a model predicts or generates text.

  • Compare models: With so many LLMs available, metrics let us compare one model against another.

  • Understand trade-offs: Some models might be excellent at understanding context but not as good at generating text, and vice versa.

  • Guide improvements: Knowing where a model falls short helps researchers improve it.

Evaluations can be split into two broad categories: intrinsic and extrinsic metrics. Intrinsic metrics focus on the model’s language modeling capabilities, while extrinsic metrics assess performance on specific downstream tasks.

What are intrinsic evaluation metrics?

Intrinsic evaluation measures how well a model performs on the tasks it was trained on, often by comparing its predictions to a reference or gold standard. Two of the most commonly used intrinsic metrics are perplexity and BLEU.

What is perplexity?

Perplexity measures how well a probability model (like a language model) predicts a sample. It tells us how surprised the model is when it sees the test data. A lower perplexity means the model is less surprised, generally indicating that it predicts the next word in a sentence better. Imagine you’re reading a mystery novel, and every time you turn the page, you’re trying to predict what will happen next. If you can almost always guess correctly, you’re not very surprised by what happens—you have low perplexity. But if the story throws unexpected twists at you, you’re constantly caught off guard—this is high perplexity.

In a language model, perplexity is calculated using the probabilities the model assigns to the words in a sentence. The overall perplexity is low if the model is confident in its predictions (assigning high probabilities to the correct words). Mathematically, if a model assigns a probability P(w_i) ...
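To make the calculation concrete, here is a minimal sketch of perplexity as the exponential of the average negative log-probability the model assigned to each observed token. The probability values are made up purely for illustration; in practice they would come from a language model's output distribution.

```python
import math

def perplexity(token_probs):
    """Perplexity = exp of the average negative log-probability
    assigned to each observed token."""
    n = len(token_probs)
    avg_neg_log_prob = -sum(math.log(p) for p in token_probs) / n
    return math.exp(avg_neg_log_prob)

# A confident model assigns high probabilities to the correct words...
confident = perplexity([0.9, 0.8, 0.95, 0.85])
# ...while a "surprised" model assigns low probabilities to them.
surprised = perplexity([0.1, 0.05, 0.2, 0.15])

print(confident)            # close to 1: the model is rarely surprised
print(surprised)            # much larger: the model is often caught off guard
print(confident < surprised)  # True: lower perplexity means better prediction
```

Note the sanity check this formula passes: a model that assigns probability 0.5 to every token has perplexity exactly 2, as if it were choosing between two equally likely options at each step.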