Evaluating Your AI Application’s Performance

Describe the importance of evaluating AI application performance and perform a basic evaluation of an application.

We’ve now built full agent workflows with tools, retrieval, and safety, and monitored their behavior with telemetry. Observation tells us what happened, but it doesn’t tell us how well our application performs against its intended outcomes.

This is where evaluation comes in.

Llama Stack provides an Evaluation API that allows us to quantify how well our AI applications perform using structured scoring functions, benchmarks, and evaluation datasets. Whether we’re validating a chatbot’s factual correctness or testing a document-answering system, evaluation helps us detect issues and drive targeted improvements.

Why evaluate?

Without evaluation, we rely on anecdotal testing or manual reviews to judge performance. This doesn’t scale, and it misses subtle issues. Evaluation allows us to:

  • Score generated outputs against expected answers.

  • Track accuracy over time or across prompt versions.

  • Test agents against benchmark question sets.

  • Detect regressions after changes to prompts or tools.

Quantitative evaluation helps us turn intuition into measurable feedback and, ultimately, more reliable applications.
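To make these goals concrete, here is a minimal sketch in plain Python that does not use Llama Stack at all. It scores two sets of answers against expected answers and flags a regression between prompt versions. The names `Example`, `exact_match_accuracy`, and `is_regression`, as well as the sample questions and answers, are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Example:
    """One hand-written test case: a question and its expected answer."""
    question: str
    expected: str


def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivial formatting
    # differences are not counted as wrong answers.
    return " ".join(text.lower().split())


def exact_match_accuracy(examples: list[Example], answers: list[str]) -> float:
    """Fraction of answers that match the expected answer after normalization."""
    if not examples:
        raise ValueError("need at least one example")
    if len(examples) != len(answers):
        raise ValueError("examples and answers must have the same length")
    hits = sum(
        normalize(answer) == normalize(example.expected)
        for example, answer in zip(examples, answers)
    )
    return hits / len(examples)


def is_regression(baseline: float, candidate: float, tolerance: float = 0.0) -> bool:
    """True if the candidate score dropped below the baseline by more than tolerance."""
    return candidate < baseline - tolerance


# Illustrative data only: answers we might collect from two prompt versions.
EXAMPLES = [
    Example("What is the capital of France?", "Paris"),
    Example("What is 2 + 2?", "4"),
    Example("What is the capital of Japan?", "Tokyo"),
]
answers_v1 = ["Paris", "4", "Tokyo"]
answers_v2 = ["paris", "four", "Tokyo"]

accuracy_v1 = exact_match_accuracy(EXAMPLES, answers_v1)
accuracy_v2 = exact_match_accuracy(EXAMPLES, answers_v2)
print(f"v1 accuracy: {accuracy_v1:.2f}, v2 accuracy: {accuracy_v2:.2f}")
print("regression detected:", is_regression(accuracy_v1, accuracy_v2))
```

Exact match is the simplest possible metric. It penalizes "four" versus "4", which is one reason real evaluation setups usually offer several scoring functions.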

The Llama Stack evaluation system

Llama Stack provides a set of APIs and scoring functions designed for evaluating generative applications. The key APIs are:

  • scoring: Use a specific scoring function to compare model output to expected answers.

  • eval: Run evaluation jobs against datasets.

  • benchmark: Manage predefined evaluation datasets and track progress.

In this lesson, we’ll focus on the scoring API to evaluate agent output against a small set of expected answers.
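Before calling the real API, it can help to see what a scoring function does conceptually. The sketch below is a local stand-in written in plain Python, not the Llama Stack client: it takes a list of rows, applies named scoring functions to each row, and returns per-row scores plus an aggregate. The row field names (`input_query`, `generated_answer`, `expected_answer`), the function names `equality` and `contains`, and the result layout are assumptions chosen for illustration.

```python
from typing import Callable

Row = dict[str, str]


def equality(row: Row) -> float:
    # 1.0 only when the generated answer is exactly the expected answer.
    return 1.0 if row["generated_answer"].strip() == row["expected_answer"].strip() else 0.0


def contains(row: Row) -> float:
    # 1.0 when the expected answer appears anywhere in the generated answer.
    expected = row["expected_answer"].strip().lower()
    return 1.0 if expected in row["generated_answer"].lower() else 0.0


# Hypothetical registry mapping a scoring function name to its implementation.
SCORING_FUNCTIONS: dict[str, Callable[[Row], float]] = {
    "equality": equality,
    "contains": contains,
}


def score(rows: list[Row], function_names: list[str]) -> dict[str, dict]:
    """Apply each named scoring function to every row and aggregate the results."""
    results: dict[str, dict] = {}
    for name in function_names:
        fn = SCORING_FUNCTIONS[name]
        per_row = [fn(row) for row in rows]
        accuracy = sum(per_row) / len(per_row) if per_row else 0.0
        results[name] = {"score_rows": per_row, "accuracy": accuracy}
    return results


# Illustrative rows: each pairs a query with the agent's answer and the expected answer.
rows = [
    {
        "input_query": "What is the capital of France?",
        "generated_answer": "The capital of France is Paris.",
        "expected_answer": "Paris",
    },
    {
        "input_query": "What is 2 + 2?",
        "generated_answer": "4",
        "expected_answer": "4",
    },
]

results = score(rows, ["equality", "contains"])
for name, result in results.items():
    print(name, result)
```

The two functions disagree on the first row: the full-sentence answer is correct but fails strict equality. Choosing a scoring function that matches how the agent phrases its answers matters as much as the dataset itself.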

Define the evaluation dataset

To evaluate our model’s outputs, we need a structured dataset that allows for comparison against known, expected responses. In Llama Stack, this is done using a list of evaluation rows, each one representing a single ...