Elo Rating Systems for LLMs

Learn how the Elo rating system transforms pairwise human judgments into dynamic, scalable rankings for evaluating large language models.

The Elo rating system was originally developed for ranking players in competitive games like chess. It assigns each player a numerical rating that rises or falls based on game outcomes. In chess, for example, a newcomer might start around 1200 points and gain or lose points after each match, depending on whether they win or lose and how strong their opponent was. Over time, Elo ratings reflect players’ relative skill levels—a higher rating means the player is expected to win more often against lower-rated opponents.

Recently, this system has been adapted to evaluate LLMs by treating model comparisons like games, and companies are exploring it to see how their models perform in the real world compared to other benchmarks. Instead of chess players, we have language models; instead of a chess match, we have a pairwise comparison of their answers. Two models (say Model A and Model B) are given the same prompt, and a human judge decides which model’s response is better. We can think of the preferred model as having won a game. Just as Elo updates a chess player’s rating after a match, we update the models’ Elo scores based on the comparison outcome. Over many such comparisons, the Elo scores rank the models from strongest to weakest in performance. This approach has become popular for open-ended LLM evaluation, where direct automatic metrics fall short: crowdsourced pairwise voting with Elo provides an intuitive and scalable way to construct a leaderboard of models.
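To make this concrete, here is a minimal Python sketch of how a single pairwise battle could update two models’ ratings. The specific constants (a 1000-point starting rating, the 400-point logistic scale, and a K-factor of 32) are common chess-style defaults assumed here for illustration, not values mandated by any particular leaderboard. The step-by-step mechanics below explain where each of these quantities comes from.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected probability that Model A beats Model B under the logistic Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0) -> tuple[float, float]:
    """Update both ratings after one human-judged comparison.

    score_a is 1.0 if Model A's answer was preferred, 0.0 if Model B's was,
    and 0.5 for a tie.
    """
    expected_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b


# Both models start from the same baseline rating.
model_a, model_b = 1000.0, 1000.0

# A human judge prefers Model A's response for this prompt.
model_a, model_b = update_elo(model_a, model_b, score_a=1.0)
print(model_a, model_b)  # 1016.0 984.0

Repeating this update over many prompts and many judges is what turns individual votes into a stable ranking.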


How do Elo ratings work for LLM evaluation?

The Elo system in this context operates through pairwise battles and iterative updates. Here are the core mechanics step-by-step:

  • Initial rating: All models start with an initial Elo score (1000 points is a common baseline). Before any comparisons, we assume no model has proven superiority, so they are “tied” at the same starting rating.

  • Pairwise comparison: We evaluate models in pairs. Model A and Model B receive the same prompt or question, each producing a response. A human evaluator (sometimes multiple evaluators) compares the two responses side-by-side and decides which model performed better. This decision could be based on criteria like correctness, coherence, relevance, or overall quality of the answer. (If the judge cannot choose a clear winner, for example because both answers are equally good or bad, a tie can be declared.)

  • Expected score calculation: Before updating any ratings, the Elo system computes an expected outcome for the match from the models’ current ratings. If Model A’s rating is much higher than Model B’s, Model A is expected to win most of the time, and vice versa for a lower-rated model. A logistic formula gives the expected win probability. For Model A vs. Model B, one common formula is:

    $$E_A = \frac{1}{1 + 10^{(R_B - R_A)/400}}$$

Note: $R_A$ and $R_B$ are the current Elo ratings of A and B. This formula yields a number between 0 and 1. For example, if two models have equal ratings, each is expected to win about 50% of the time. A model 200 points higher ...