Reproducibility and Prompt Brittleness
Understand why large language models produce varying outputs for the same prompt, a consequence of probabilistic sampling and sensitivity to small prompt changes. Learn to manage this variability by controlling parameters such as temperature, applying prompt versioning, and validating outputs. This lesson equips you to build more reliable and consistent LLM-powered systems by addressing reproducibility challenges and prompt brittleness.
When you run the same prompt through an LLM twice and get two different answers, you are not witnessing a bug. You are observing a fundamental property of how these models generate text. In traditional software, identical inputs produce identical outputs every time. LLMs break that expectation, and this lesson explains why. Understanding non-determinism and the fragility of prompts is essential before deploying any LLM-powered system in a production environment, because variability that seems harmless in a demo can become a serious governance problem at enterprise scale.
The previous lesson noted that smaller models and aggressive caching can introduce output variability. This lesson explores that variability in depth, tracing it to its root cause and introducing the controls available to manage it.
How token sampling creates variability
LLMs generate text one token at a time. At each step, the model computes a probability distribution over its entire vocabulary. Instead of always picking the single most likely token, the model samples from this distribution, with the degree of randomness governed by decoding settings such as temperature. Each generation pass can therefore follow a different path through the space of possible continuations, even when the prompt is identical.
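To see how a single sampling step works, the sketch below draws a token from a temperature-scaled softmax over a toy four-entry vocabulary. The `vocab` list, the `logits` values, and the `sample_token` helper are illustrative assumptions, not any real model's API; a production model repeats this step over a vocabulary of tens of thousands of tokens.

```python
import numpy as np

# Toy stand-ins for one decoding step (hypothetical values, not real model output).
vocab = ["moderate risk", "notable concern", "elevated risk", "low risk"]
logits = np.array([2.1, 1.9, 1.2, 0.4])

def sample_token(logits, temperature=1.0, rng=None):
    """Sample one token from the softmax of temperature-scaled logits."""
    rng = rng if rng is not None else np.random.default_rng()
    scaled = logits / temperature            # lower temperature sharpens the distribution
    probs = np.exp(scaled - scaled.max())    # numerically stable softmax
    probs /= probs.sum()
    return vocab[rng.choice(len(vocab), p=probs)]

# Two runs with identical inputs can diverge at any step.
print(sample_token(logits))                    # e.g. "moderate risk"
print(sample_token(logits))                    # e.g. "notable concern"

# As temperature approaches zero, sampling approaches greedy (argmax)
# decoding, which is far more repeatable.
print(sample_token(logits, temperature=0.05))  # almost always "moderate risk"
```

Running the script a few times makes the variability tangible: at a temperature of 1.0 the two highest-scoring candidates are nearly tied, so either can win, which mirrors the divergence in the example that follows.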
Consider an automated compliance summarizer that a financial services firm runs nightly. On Monday, it describes a regulatory filing as presenting “moderate risk.” On Tuesday, with the same document and prompt, it describes the filing as presenting “notable concern.” Neither ...