
Why Evaluating LLMs Is Hard

Explore the challenges in evaluating large language models, focusing on why traditional metrics fall short, the variability of outputs, and the necessity for multi-dimensional evaluation. Understand key quality dimensions like factual accuracy, coherence, and safety to design effective evaluation strategies for reliable AI systems.

Consider this scenario: a user asks an LLM to summarize a ten-page research paper on climate change. Three different runs of the same model produce three summaries. The first is a concise five-sentence paragraph. The second is a bulleted list of key findings. The third reads like an abstract written for a scientific journal. All three are factually accurate, relevant, and well written, yet they differ in structure, vocabulary, and emphasis. Which one is “correct”?

In traditional machine learning, evaluation is straightforward because there is a single correct label or numeric value to compare against. Generative models shatter that assumption. They produce open-ended outputs where many valid responses exist simultaneously, and no single reference answer can serve as the definitive benchmark. This tension sits at the center of LLM evaluation and is the reason practitioners cannot simply reuse the metrics they learned in a standard ML course. Understanding why evaluation is hard is the prerequisite to choosing the right metrics and frameworks, which is exactly what this course will equip you to do.

The following diagram contrasts the clean, deterministic evaluation path of traditional ML with the ambiguous, multi-dimensional challenge that generative models introduce.

Traditional ML evaluation follows a single deterministic path to a score, while generative LLM evaluation must navigate multiple valid outputs across several quality dimensions simultaneously.

Open-ended outputs defy single answers

Unlike a spam classifier that outputs one of two labels, an LLM asked to write an email, generate code, or answer a question can produce thousands of valid completions. The output space is enormous, and several factors make it even more variable.
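To make this variability concrete, here is a minimal sketch, using toy logits over five candidate tokens rather than a real model, of how temperature scaling and nucleus (top-p) sampling reshape a next-token distribution. The function names and the logit values are illustrative assumptions, not part of any specific LLM API.

```python
import math

def softmax(logits, temperature=1.0):
    # Scale logits by temperature before normalizing: higher temperature
    # flattens the distribution, giving unlikely tokens more probability.
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def nucleus_filter(probs, top_p=0.9):
    # Keep the smallest set of tokens whose cumulative probability
    # reaches top_p, then renormalize over that set (top-p sampling).
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}

# Toy next-token logits for five candidate tokens (illustrative values).
logits = [4.0, 3.5, 2.0, 1.0, 0.5]

low_t = softmax(logits, temperature=0.5)   # sharper: top token dominates
high_t = softmax(logits, temperature=1.5)  # flatter: tail tokens gain mass
nucleus = nucleus_filter(softmax(logits), top_p=0.9)

print("P(top token) at T=0.5:", round(low_t[0], 3))
print("P(top token) at T=1.5:", round(high_t[0], 3))
print("Tokens surviving top-p=0.9:", sorted(nucleus))
```

Even in this toy setting, the top token's probability shrinks as temperature rises, and nucleus sampling truncates the candidate set, so repeated draws from the same prompt can legitimately diverge.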

  • Temperature and sampling strategies: Higher temperature settings increase randomness in token selection, meaning the same prompt can yield noticeably different outputs across runs. Even with a fixed temperature, techniques like top-k and nucleus sampling introduce controlled variability.

  • Prompt phrasing sensitivity: Subtle changes in how a prompt is worded, such as asking “Summarize this article” vs. “What are the key takeaways from this article,” can shift the model’s output in tone, length, and focus.

  • Lack of a single ground truth: For tasks like summarization, creative writing, or open-domain ...