Reproducibility and Prompt Brittleness
Understand why large language models produce varying outputs for the same prompt, a consequence of probabilistic sampling and sensitivity to small prompt changes. Learn to manage this variability by controlling parameters such as temperature, applying prompt versioning, and validating outputs. This lesson equips you to build more reliable and consistent LLM-powered systems by addressing reproducibility challenges and prompt brittleness.
When you run the same prompt through an LLM twice and get two different answers, you are not witnessing a bug. You are observing a fundamental property of how these models generate text. In traditional software, identical inputs produce identical outputs every time. LLMs break that expectation, and this lesson explains why. Understanding non-determinism and the fragility of prompts is essential before deploying any LLM-powered system in a production environment, because variability that seems harmless in a demo can become a serious governance problem at enterprise scale.
The previous lesson noted that smaller models and aggressive caching can introduce output variability. This lesson explores that variability in depth, tracing it to its root cause and introducing the controls available to manage it.
How token sampling creates variability
LLMs generate text one token at a time. At each step, the model computes a probability distribution over its entire vocabulary. Instead of always picking the single most likely token, the model samples from this distribution, with the degree of randomness governed by decoding settings such as temperature. Each generation pass can therefore follow a different path through the space of possible continuations, even when the prompt is identical.
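To see how a single sampling step works, the sketch below draws a token from a temperature-scaled softmax over a toy four-entry vocabulary. The `vocab` list, the `logits` values, and the `sample_token` helper are illustrative assumptions, not any real model's API; a production model repeats this step over a vocabulary of tens of thousands of tokens.

```python
import numpy as np

# Toy stand-ins for one decoding step (hypothetical values, not real model output).
vocab = ["moderate risk", "notable concern", "elevated risk", "low risk"]
logits = np.array([2.1, 1.9, 1.2, 0.4])

def sample_token(logits, temperature=1.0, rng=None):
    """Sample one token from the softmax of temperature-scaled logits."""
    rng = rng if rng is not None else np.random.default_rng()
    scaled = logits / temperature            # lower temperature sharpens the distribution
    probs = np.exp(scaled - scaled.max())    # numerically stable softmax
    probs /= probs.sum()
    return vocab[rng.choice(len(vocab), p=probs)]

# Two runs with identical inputs can diverge at any step.
print(sample_token(logits))                    # e.g. "moderate risk"
print(sample_token(logits))                    # e.g. "notable concern"

# As temperature approaches zero, sampling approaches greedy (argmax)
# decoding, which is far more repeatable.
print(sample_token(logits, temperature=0.05))  # almost always "moderate risk"
```

Running the script a few times makes the variability tangible: at a temperature of 1.0 the two highest-scoring candidates are nearly tied, so either can win, which mirrors the divergence in the example that follows.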
Consider an automated compliance summarizer that a financial services firm runs nightly. On Monday, it describes a regulatory filing as presenting “moderate risk.” On Tuesday, with the same document and prompt, it describes the filing as presenting “notable concern.” Neither ...