Llama Stack: From Fundamentals to Deployment

Evaluating Your AI Application’s Performance

Describe the importance of evaluating AI application performance and perform a basic evaluation of an application.

We'll cover the following...

Why evaluate?
The Llama Stack evaluation system
Define the evaluation dataset
Choose a scoring function
Run the evaluation
Interpreting results
Using LLM-as-a-judge

Llama Stack provides an Evaluation API that allows us to quantify how well our AI applications perform using structured scoring functions, benchmarks, and evaluation datasets. Whether we’re validating a chatbot’s factual correctness or testing a document-answering system, evaluation helps us detect issues and drive targeted improvements.

Why evaluate?

Without evaluation, we rely on anecdotal testing or manual reviews to judge performance. This doesn’t scale, and it misses subtle issues. Evaluation allows us to:

Score generated outputs against expected answers.
Track accuracy over time or across prompt versions.
Test agents against benchmark question sets.
Detect regressions after changes to prompts or tools.

Quantitative evaluation turns intuition into measurable feedback and, ultimately, leads to more reliable applications.
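To make these goals concrete, here is a minimal plain-Python sketch of scoring answers against expected ones and comparing two prompt versions. It does not use Llama Stack; the names `Example`, `accuracy`, and `is_regression`, along with the sample questions and answers, are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Example:
    question: str
    expected: str


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting differences are not counted as errors."""
    return " ".join(text.lower().split())


def accuracy(examples: list[Example], answers: list[str]) -> float:
    """Fraction of answers that match the expected answer after normalization."""
    if len(examples) != len(answers):
        raise ValueError("examples and answers must have the same length")
    if not examples:
        return 0.0
    hits = sum(normalize(a) == normalize(e.expected) for e, a in zip(examples, answers))
    return hits / len(examples)


def is_regression(baseline: float, candidate: float, tolerance: float = 0.0) -> bool:
    """True when the candidate scores lower than the baseline by more than the tolerance."""
    return candidate < baseline - tolerance


# Illustrative data: the same questions answered by two hypothetical prompt versions.
examples = [
    Example("What is the capital of France?", "Paris"),
    Example("How many days are in a week?", "7"),
]
answers_v1 = ["Paris", "7"]
answers_v2 = ["paris", "seven"]

acc_v1 = accuracy(examples, answers_v1)
acc_v2 = accuracy(examples, answers_v2)
print(f"v1={acc_v1:.2f} v2={acc_v2:.2f} regression={is_regression(acc_v1, acc_v2)}")
```

Exact matching is deliberately strict here: "seven" does not match "7", which is the kind of gap that a more flexible scoring function or an LLM judge is meant to close.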

The Llama Stack evaluation system

Llama Stack provides a set of APIs and scoring functions designed for evaluating generative applications. The key APIs are:

scoring: Use a specific scoring function to compare model output to expected answers.
eval: Run evaluation jobs against datasets.
benchmark: Manage predefined evaluation datasets and track progress.

In this lesson, we’ll focus on the scoring API to evaluate agent output against a small set of expected answers.
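As a rough sketch of what a call to the scoring API can look like, the helper below forwards rows to `client.scoring.score`. Several things here are assumptions rather than facts from this lesson: that the installed `llama-stack-client` exposes `scoring.score(input_rows=..., scoring_functions=...)`, that rows use the keys `input_query`, `generated_answer`, and `expected_answer`, and that `basic::equality` is an available scoring function ID. `score_rows`, `StubScoring`, and `StubClient` are hypothetical names; the stub only lets the sketch run offline and does not reproduce the real response shape.

```python
from typing import Any


def score_rows(client: Any, rows: list[dict[str, str]], scoring_fn_id: str) -> Any:
    """Send rows to the scoring API and return whatever the client returns.

    Assumes `client.scoring.score(input_rows=..., scoring_functions=...)` exists;
    verify this against the client version you have installed.
    """
    return client.scoring.score(
        input_rows=rows,
        # Assumed convention: None requests the scoring function's default parameters.
        scoring_functions={scoring_fn_id: None},
    )


class StubScoring:
    """Offline stand-in for the real scoring resource (not the real response shape)."""

    def score(self, *, input_rows, scoring_functions):
        # Exact string comparison per row, reported under each requested function ID.
        matches = [
            float(row["generated_answer"].strip() == row["expected_answer"].strip())
            for row in input_rows
        ]
        return {fn_id: matches for fn_id in scoring_functions}


class StubClient:
    scoring = StubScoring()


# Illustrative rows; the key names are an assumption about the expected schema.
rows = [
    {"input_query": "What is 2 + 2?", "generated_answer": "4", "expected_answer": "4"},
    {"input_query": "Capital of Japan?", "generated_answer": "Kyoto", "expected_answer": "Tokyo"},
]

result = score_rows(StubClient(), rows, "basic::equality")
print(result)

# With a running server you would pass a real client instead, for example:
#   from llama_stack_client import LlamaStackClient
#   client = LlamaStackClient(base_url="http://localhost:8321")  # URL is an assumption
```

Passing the client in as a parameter keeps the helper easy to test: the same function works with the stub locally and with a real client against a live server.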

Define the evaluation dataset

To evaluate our model’s outputs, we need a structured dataset that allows for comparison against known, expected responses. In Llama Stack, this is done using a list of evaluation rows, each one representing a single ...
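One way to picture such a list of rows is sketched below. The field names `input_query`, `expected_answer`, and `generated_answer` are an assumption about the schema, and `EvalRow` and `to_input_rows` are hypothetical helpers, not part of Llama Stack.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EvalRow:
    input_query: str            # the question or prompt sent to the application
    expected_answer: str        # the known-good reference answer
    generated_answer: str = ""  # filled in once the application has responded


def to_input_rows(rows: list[EvalRow]) -> list[dict[str, str]]:
    """Validate rows and convert them to plain dictionaries."""
    for index, row in enumerate(rows):
        if not row.input_query.strip():
            raise ValueError(f"row {index}: input_query is empty")
        if not row.expected_answer.strip():
            raise ValueError(f"row {index}: expected_answer is empty")
    return [asdict(row) for row in rows]


# Illustrative dataset with made-up questions and answers.
dataset = [
    EvalRow("What is the boiling point of water in Celsius?", "100", "100"),
    EvalRow("Who wrote Hamlet?", "William Shakespeare", "Shakespeare"),
]

input_rows = to_input_rows(dataset)
print(input_rows[0])
```

Validating rows before scoring catches empty questions or missing reference answers early, so a low score reflects the application's output rather than a malformed dataset.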
