Perplexity
Learn how to evaluate language models using perplexity—what it measures, how to implement it, and when it falls short.
Language model evaluation metrics are a common interview topic at leading AI labs and tech companies. Interviewers ask about this because understanding how to measure and compare model performance is fundamental to developing generative AI. For example, if you were building a machine translation system, your interviewers expect you to know how to quantitatively evaluate it (e.g., comparing model translations to human translations).
They want to see that you understand what “evaluation” means for LLMs and that you can discuss relevant metrics. At a high level, evaluation means quantifying how well a model’s outputs satisfy the task requirements. In language modeling, that often means comparing the model’s predictions (the probabilities it assigns or the text it generates) to ground-truth text or human references. A good candidate should mention held-out test sets and metrics like accuracy and perplexity to show they know how model performance is measured. In practice, this means running the model on test data and computing numerical scores that reflect quality. For instance, you might compute how likely the model thinks the true test sentences are (this gives perplexity), or you might generate a translation or summary and compare it to a reference translation or summary. The key is that we need quantitative measures to compare models and track progress.
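To make the first idea concrete, here is a minimal sketch (in Python) of how per-token probabilities on a held-out sentence turn into a perplexity score. The probability values below are made up purely for illustration; we will define perplexity properly later in the lesson.

```python
import math

def perplexity(token_probs):
    """Perplexity = exp of the average negative log-probability per token."""
    avg_neg_log_prob = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log_prob)

# Hypothetical probabilities a model assigned to the tokens of one test sentence.
probs = [0.25, 0.10, 0.40, 0.05]
print(perplexity(probs))  # ≈ 6.7; lower is better
```

A model that assigns higher probabilities to the true test text gets a lower perplexity, which is why perplexity is often read as “how surprised the model is by the data.”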
In this lesson, we will cover these concepts one by one. We will explain what evaluation means conceptually, then define common metrics like perplexity. We will also discuss why perplexity is important in modern LLM work and examine its limitations.
What exactly is evaluation?
Evaluation is the process of measuring model performance. In artificial intelligence, we think of an evaluation pipeline: take a test dataset (unseen during training), feed it into the model, and calculate one or more scores that quantify how well the model did. For LLMs, evaluation often relies either on the probabilities the model assigns to held-out text or on the overlap between the generated text and a reference text.
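The pipeline itself is simple enough to sketch in a few lines. The snippet below assumes a hypothetical `model` object with a `predict(prompt)` method and a test set of (input, reference) pairs; the exact-match metric is just one possible “grading rubric.”

```python
def evaluate(model, test_set, metric):
    """Run the model on unseen examples and average a per-example score."""
    scores = []
    for prompt, reference in test_set:
        # `model.predict` is a placeholder for however your model produces output.
        prediction = model.predict(prompt)
        # Grade the output against the ground-truth reference.
        scores.append(metric(prediction, reference))
    return sum(scores) / len(scores)

# One possible metric: exact-match accuracy (1.0 only if the output equals the reference).
def exact_match(prediction, reference):
    return float(prediction.strip() == reference.strip())
```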
Think of the model as a student taking a test. The test questions are the input (like a prompt or a sentence with a blank), and the student’s answers are the model’s outputs (predicted words or labels). We grade the answers: if the model’s output matches the reference (the correct answer), it scores highly on the metric. The “grading rubric” is the evaluation metric. Just as students need exams and scoring rubrics to measure learning, models need test sets and metrics.
When evaluating an LLM, we often use probability-based metrics or text-overlap metrics. For example, if the model predicts the next word of a sentence, we can compute the probability it assigns to the correct next word. Alternatively, if the model generates a full sentence (like a translation or summary), we can compare that generated sentence to a reference sentence using text-overlap metrics such as BLEU or ROUGE. Either way, evaluation means computing numbers from the model’s outputs and known answers (ground truth or reference text).
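As a sketch of the probability-based side, the snippet below uses the Hugging Face transformers library with GPT-2 (any small causal language model would work the same way) to read off the probability the model assigns to a candidate next word. The context and candidate word are made up for illustration.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

context = "The cat sat on the"
candidate = " mat"  # note the leading space: GPT-2 tokens include it

inputs = tokenizer(context, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits                       # shape: (1, seq_len, vocab_size)
next_token_probs = torch.softmax(logits[0, -1], dim=-1)   # distribution over the next token

candidate_id = tokenizer.encode(candidate)[0]             # first sub-token of the candidate word
print(f"P({candidate!r} | {context!r}) = {next_token_probs[candidate_id].item():.4f}")
```

Averaging such log-probabilities over an entire held-out corpus is exactly what perplexity summarizes; overlap metrics like BLEU and ROUGE cover the second case, where we compare generated text to a reference.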
A diagram can ...