
Perplexity

Explore how to evaluate language models using perplexity, a key metric that reflects prediction accuracy by measuring how surprised a model is by test data. Understand how it is calculated, its role in model comparison, and its limitations, as well as how it fits alongside metrics like BLEU and ROUGE for comprehensive assessment.

Language model evaluation metrics are a common topic of discussion in interviews because measuring and comparing model performance is fundamental to the development of generative AI. Interviewers want to know that you understand what evaluation means for LLMs, how predictions are compared with ground-truth text or human references, and why held-out test sets and metrics such as accuracy and perplexity matter. Evaluation is the process of running a model on test data and computing numerical scores that reflect quality, for example, by measuring how likely the model finds true sentences or how closely generated outputs match references. In this lesson, we will define these concepts, introduce common metrics such as perplexity, and discuss both their importance and their limitations.

What does evaluation mean for language models?

Evaluation is the process of measuring model performance. In artificial intelligence, we think of an evaluation pipeline: we take a test dataset (unseen during training), feed it into the model, and calculate one or more scores that quantify how well the model did. For LLMs, evaluation typically relies on either the probabilities the model assigns to held-out text or the overlap between generated text and reference text.
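To make this pipeline concrete, here is a minimal sketch in Python. The tiny test set and the `predict_next_word` function are hypothetical placeholders used purely for illustration, not a real model or benchmark.

```python
# A toy evaluation pipeline: held-out examples go in, a single score comes out.
# predict_next_word is a hypothetical stand-in for a real model's prediction.

def predict_next_word(context: str) -> str:
    # Placeholder "model": always guesses a very common word.
    return "the"

# Held-out test set: (input context, ground-truth next word) pairs.
test_set = [
    ("The cat sat on", "the"),
    ("She opened the", "door"),
    ("Paris is the capital of", "France"),
]

correct = sum(
    1 for context, reference in test_set
    if predict_next_word(context) == reference
)
accuracy = correct / len(test_set)
print(f"Next-word accuracy on held-out data: {accuracy:.2f}")  # 0.33 for this toy model
```

The structure is the same no matter how sophisticated the model or metric becomes: unseen inputs, model outputs, a comparison against ground truth, and a number at the end.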

Think of the model as a student taking a test. The test questions are the input (like a prompt or a sentence with a blank), and the student’s answers are the model’s outputs (predicted words or labels). We grade the answers: if the model’s output matches the reference (the correct answer), it scores highly on the metric. The “grading rubric” is the evaluation metric. Just as students need exams and scoring rubrics to measure learning, models need test sets and metrics.

When evaluating an LLM, we often use probability-based metrics or text-overlap metrics. For example, if the model predicts the next word in a sentence, we can compute the probability that the model assigns to that word. Alternatively, if a model generates a full sentence (such as a translation or summary), we can compare the generated sentence to a reference sentence using overlap metrics such as BLEU or ROUGE. Either way, evaluation means computing numbers from the model’s outputs and known answers (ground truth or reference text).
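As a concrete illustration of the probability-based case, the sketch below reads off the probability a causal language model assigns to a candidate next word. It assumes the Hugging Face transformers library and the public gpt2 checkpoint, neither of which this lesson prescribes; any causal LM would work the same way.

```python
# Sketch: probability a causal LM assigns to a specific next word.
# Assumes the Hugging Face "transformers" library and the public "gpt2" model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

context = "The capital of France is"
target_word = " Paris"  # leading space matters for GPT-2's tokenizer

input_ids = tokenizer(context, return_tensors="pt").input_ids
target_id = tokenizer(target_word).input_ids[0]  # first token of the target word

with torch.no_grad():
    logits = model(input_ids).logits              # (1, seq_len, vocab_size)
    next_token_probs = torch.softmax(logits[0, -1], dim=-1)

print(f"P({target_word!r} | {context!r}) = {next_token_probs[target_id].item():.4f}")
```

A higher probability for the correct word means the model was less "surprised," which is exactly the intuition that perplexity formalizes.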

A diagram can illustrate this concept: the model takes in some context and produces probabilities or text. We then compare that output to the reference and compute a metric score.

Educative byte: The distinction between intrinsic and extrinsic evaluation is important in NLP research. Intrinsic metrics (like perplexity) measure model performance on the language modeling task itself. Extrinsic metrics measure how well the model performs on downstream tasks (like question answering or sentiment analysis). A model with excellent perplexity might still perform poorly on specific applications, which is why comprehensive evaluation often includes both types.

In practice, we split data into training and held-out test (or validation) sets: we train the model on the training data and evaluate it on the held-out data. In an interview, mention test/validation sets and the idea of ground truth, for example: “We measure accuracy or BLEU on a test set that the model hasn’t seen” or “we compute perplexity on held-out text.” The interviewer wants to hear terms like test set, held-out data, reference text, and evaluation metric. Clarifying that evaluation is about comparing model outputs to expected results using metrics signals that you understand the fundamentals before discussing any specific metric.
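To tie this back to the lesson’s headline metric, here is a hedged sketch of computing perplexity on a single held-out sentence, again assuming the transformers library and the gpt2 checkpoint. Perplexity is the exponential of the average per-token negative log-likelihood, so lower values mean the model was less surprised by the text.

```python
# Sketch: perplexity of a causal LM on held-out text.
# Assumes the Hugging Face "transformers" library and the public "gpt2" model.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

held_out_text = "Language models are evaluated on data they have never seen."
input_ids = tokenizer(held_out_text, return_tensors="pt").input_ids

with torch.no_grad():
    # With labels=input_ids, the model returns the average cross-entropy
    # (negative log-likelihood) over the predicted tokens.
    loss = model(input_ids, labels=input_ids).loss

perplexity = math.exp(loss.item())
print(f"Perplexity on held-out text: {perplexity:.2f}")
```

Comparing such scores across models on the same held-out set is how perplexity is typically used in practice.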