
Why Evaluating LLMs Is Hard

Explore the challenges in evaluating large language models, focusing on why traditional metrics fall short, the variability of outputs, and the necessity for multi-dimensional evaluation. Understand key quality dimensions like factual accuracy, coherence, and safety to design effective evaluation strategies for reliable AI systems.

Consider this scenario: a user asks an LLM to summarize a ten-page research paper on climate change. Three different runs of the same model produce three summaries. The first is a concise five-sentence paragraph. The second is a bulleted list of key findings. The third reads like an abstract written for a scientific journal. All three are factually accurate, relevant, and well written, yet they differ in structure, vocabulary, and emphasis. Which one is “correct”?

In traditional machine learning, evaluation is straightforward because there is a single correct label or numeric value to compare against. Generative models shatter that assumption. They produce open-ended outputs where many valid responses exist simultaneously, and no single reference answer can serve as the definitive benchmark. This tension sits at the center of LLM evaluation and is the reason practitioners cannot simply reuse the metrics they learned in a standard ML course. Understanding why evaluation is hard is the prerequisite to choosing the right metrics and frameworks, which is exactly what this course will equip you to do.

The following diagram contrasts the clean, deterministic evaluation path of traditional ML with the ambiguous, multi-dimensional challenge that generative models introduce.

Traditional ML evaluation follows a single deterministic path to a score, while generative LLM evaluation must navigate multiple valid outputs across several quality dimensions simultaneously.

Open-ended outputs defy single answers

Unlike a spam classifier that outputs one of two labels, an LLM asked to write an email, generate code, or answer a question can produce thousands of valid completions. The output space is enormous, and several factors make it even more variable.
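To make this variability concrete, here is a minimal sketch, using toy logits over five candidate tokens rather than a real model, of how temperature scaling and nucleus (top-p) sampling reshape a next-token distribution. The function names and the logit values are illustrative assumptions, not part of any specific LLM API.

```python
import math

def softmax(logits, temperature=1.0):
    # Scale logits by temperature before normalizing: higher temperature
    # flattens the distribution, giving unlikely tokens more probability.
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def nucleus_filter(probs, top_p=0.9):
    # Keep the smallest set of tokens whose cumulative probability
    # reaches top_p, then renormalize over that set (top-p sampling).
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}

# Toy next-token logits for five candidate tokens (illustrative values).
logits = [4.0, 3.5, 2.0, 1.0, 0.5]

low_t = softmax(logits, temperature=0.5)   # sharper: top token dominates
high_t = softmax(logits, temperature=1.5)  # flatter: tail tokens gain mass
nucleus = nucleus_filter(softmax(logits), top_p=0.9)

print("P(top token) at T=0.5:", round(low_t[0], 3))
print("P(top token) at T=1.5:", round(high_t[0], 3))
print("Tokens surviving top-p=0.9:", sorted(nucleus))
```

Even in this toy setting, the top token's probability shrinks as temperature rises, and nucleus sampling truncates the candidate set, so repeated draws from the same prompt can legitimately diverge.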

  • Temperature and sampling strategies: Higher temperature settings increase randomness in token selection, meaning the same prompt can yield noticeably different outputs across runs. Even with a fixed temperature, techniques like top-k and nucleus sampling introduce controlled variability.

  • Prompt phrasing sensitivity: Subtle changes in how a prompt is worded, such as asking “Summarize this article” vs. “What are the key takeaways from this article,” can shift the model’s output in tone, length, and focus.

  • Lack of a single ground truth: For tasks like summarization, creative writing, or open-domain ...