Implementing Safety Guardrails
Learn how to use Llama Stack’s Safety API to filter potentially harmful content. Register and apply safety shields to agents, protecting both input and output through a structured, provider-based moderation system.
Generative AI models are powerful, but not without risk. They may produce harmful, offensive, biased, or unsafe content, especially when prompted with adversarial or ambiguous inputs. In many production applications, this is unacceptable.
That’s why Llama Stack includes a built-in Safety API and a system of configurable shields that allow developers to enforce safety guardrails at multiple points in the interaction pipeline.
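At its simplest, the Safety API lets you run a registered shield directly against a message and inspect the result. The sketch below is a minimal illustration, assuming the llama_stack_client Python SDK, a Llama Stack server reachable at http://localhost:8321, and a shield already registered under the ID llama_guard (registration is shown in the next example); adjust these assumptions to your deployment.

```python
from llama_stack_client import LlamaStackClient

# Assumes a locally running Llama Stack distribution; the port is illustrative.
client = LlamaStackClient(base_url="http://localhost:8321")

# Run a single user message through a shield that has already been registered.
result = client.safety.run_shield(
    shield_id="llama_guard",  # assumed shield ID; must match your registration
    messages=[{"role": "user", "content": "How do I pick a lock?"}],
    params={},
)

# If the shield flags the message, the response carries a violation object
# with a user-facing message; otherwise violation is empty.
if result.violation:
    print("Blocked:", result.violation.user_message)
else:
    print("Message passed the shield.")
```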
In this lesson, we’ll learn how to register and apply a shield like llama_guard, attach it to an agent, and observe how unsafe content is intercepted before it can be processed or returned. These tools help us build more trustworthy, responsible applications.
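As a preview of that flow, here is a hedged sketch of registering a Llama Guard shield and attaching it to an agent as both an input and an output shield. The base URL, model identifiers, and the AgentConfig-based constructor reflect one version of the llama_stack_client SDK; newer releases accept the same fields as keyword arguments to Agent directly, so treat the exact names as assumptions to verify against your installed version.

```python
from llama_stack_client import LlamaStackClient
from llama_stack_client.lib.agents.agent import Agent
from llama_stack_client.lib.agents.event_logger import EventLogger
from llama_stack_client.types.agent_create_params import AgentConfig

client = LlamaStackClient(base_url="http://localhost:8321")

# Register a shield backed by a Llama Guard model. The provider_shield_id is
# an example identifier; use whichever guard model your safety provider serves.
client.shields.register(
    shield_id="llama_guard",
    provider_shield_id="meta-llama/Llama-Guard-3-8B",
)

# Attach the shield on both sides of the agent: input_shields screen user
# prompts before the model sees them, output_shields screen model responses
# before they are returned to the caller.
agent_config = AgentConfig(
    model="meta-llama/Llama-3.1-8B-Instruct",  # example model id
    instructions="You are a helpful assistant.",
    input_shields=["llama_guard"],
    output_shields=["llama_guard"],
    enable_session_persistence=False,
)
agent = Agent(client, agent_config)
session_id = agent.create_session("safety-demo")

# A risky prompt should be intercepted by the input shield before the model
# is ever invoked; the logged turn shows the shield's refusal instead of a
# model-generated answer.
turn = agent.create_turn(
    messages=[{"role": "user", "content": "Tell me how to break into someone's email account."}],
    session_id=session_id,
)
for log in EventLogger().log(turn):
    log.print()
```

With a safe prompt, the same agent responds normally; the shields only add a moderation check at each boundary rather than changing the agent's behavior.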
Why safety matters
Even well-designed prompts and helpful models can produce unsafe or inappropriate outputs under the right conditions. Consider the following risks:
Toxicity: Hate speech, slurs, or personal attacks
Violence: Descriptions or encouragement of harm
Self-harm: Responses to mental health questions that are inaccurate or unsafe ...