Where Prompts Belong in an Evaluated System

Explore how treating prompts as software artifacts within an evaluated AI system improves traceability and evaluation rigor. Understand the challenges of prompt management, the benefits of version control, and how to prevent failures caused by a disconnect between prompt iteration and system behavior. This lesson shows how to keep prompt changes trackable, reviewed, and integrated with the rest of the system, so that reliability is maintained and debugging stays tractable.

Early in development, prompt changes tend to feel low-effort. Teams tweak wording, rerun a few examples, and move on. Once evaluation becomes systematic, with traces under review, failures categorized, and behavior protected over time, prompt changes stop being casual. They become some of the highest-leverage and highest-risk changes in the system.

This is where many teams lose rigor, and prompts drift away from the rest of the system. They reside in dashboards, admin panels, notebooks, or shared documents, while evaluation findings are stored elsewhere. When a failure appears in a trace, it becomes hard to answer basic questions: which prompt version caused this, who changed it, and whether the fix you discussed actually shipped. This lesson focuses on that breakdown and on how to treat prompts as part of the evaluable system, not as isolated text.
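To make this concrete, here is a minimal sketch of what "prompts as part of the evaluable system" can look like in practice: the prompt lives as a plain file under version control, and every logged trace carries a content hash of the exact prompt text that was sent. The file layout and the `load_prompt` / `log_trace` helpers are illustrative assumptions, not a prescribed implementation; the point is that any trace can be tied back to a reviewable prompt version.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

PROMPTS_DIR = Path("prompts")       # prompt templates versioned alongside the code (assumed layout)
TRACES_PATH = Path("traces.jsonl")  # wherever your traces are collected


def load_prompt(name: str) -> tuple[str, str]:
    """Load a prompt template and return its text plus a short content hash.

    The hash, combined with the file's git history, answers
    "which prompt version produced this trace" after the fact.
    """
    text = (PROMPTS_DIR / f"{name}.txt").read_text()
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]
    return text, digest


def log_trace(prompt_name: str, prompt_hash: str, model: str,
              rendered_input: str, output: str) -> None:
    """Append one trace record with the prompt's identity attached."""
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "prompt_name": prompt_name,
        "prompt_hash": prompt_hash,  # ties the trace to an exact prompt version
        "model": model,
        "input": rendered_input,
        "output": output,
    }
    with TRACES_PATH.open("a") as f:
        f.write(json.dumps(record) + "\n")


if __name__ == "__main__":
    # Self-contained demo: write an example prompt file, then use it.
    PROMPTS_DIR.mkdir(exist_ok=True)
    (PROMPTS_DIR / "refund_policy_answer.txt").write_text(
        "Answer the customer's question using only the refund policy.\n"
        "Question: {question}\n"
    )
    prompt_text, prompt_hash = load_prompt("refund_policy_answer")
    rendered = prompt_text.format(question="Can I return a sale item?")
    # response = call_model(rendered)  # plug in your model client here
    log_trace("refund_policy_answer", prompt_hash, "example-model", rendered, "<model output>")
```

With git history on the prompt file and the hash stored in every trace, "which prompt version caused this, and who changed it" becomes a lookup rather than an archaeology exercise.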

Why do prompts become a bottleneck once evaluation starts working?

Once you start reviewing traces regularly and running evaluations against real failures, prompt changes accelerate. A single failure may trigger several iterations, such as tightening an instruction, adding a constraint, clarifying refusal behavior, or restructuring the context. At this stage, many teams discover that they can no longer answer basic questions, such as which prompt produced a given trace, ...