Using the Inference API for Text Generation
Learn how to construct structured requests to Llama Stack's Inference API using system and user prompts, and experiment with generation parameters to influence model output.
Now that we’ve created our first Llama Stack application using a preconfigured distribution, it’s time to learn how the most fundamental building block, inference, really works. While agents, tools, and RAG workflows offer advanced orchestration, nearly everything in Llama Stack starts with inference: the ability to generate text responses from a large language model.
In this lesson, we’ll explore the Inference API in detail, experiment with how prompts and parameters influence model behavior, and build a lightweight chatbot that supports memory via session-like message history. You’ll finish with a solid grasp of how to use Llama Stack to generate controlled, consistent, and useful outputs.
What is the Inference API?
The Inference API is the most direct way to interact with a language model in Llama Stack. It provides a consistent interface to send a list of messages (in the form of a chat history) and receive a generated response from a model.
The API is modeled after familiar chat interfaces like OpenAI's chat/completions, but with a more structured, provider-agnostic format. Whether you're using Together, Ollama, or another backend, the request format remains the same.
Calling the Inference API involves:
Selecting a registered model (model_id).
Sending a list of structured messages.
(Optionally) specifying generation parameters like temperature, max_tokens, and top_p.
Receiving a structured response with the model's output.
This makes it ideal for lightweight use cases such as:
Prompt experimentation
One-shot or few-shot generation
Code completion
Chatbot scaffolding
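Putting these pieces together, a minimal request might look like the sketch below. It assumes the llama-stack-client Python package, a Llama Stack server running locally on port 8321, and a registered model ID of meta-llama/Llama-3.2-3B-Instruct; the model ID and the exact layout of sampling_params may differ in your distribution.

```python
from llama_stack_client import LlamaStackClient

# Connect to a locally running Llama Stack server (adjust the URL for your setup).
client = LlamaStackClient(base_url="http://localhost:8321")

# Send a chat history plus optional generation parameters to the Inference API.
response = client.inference.chat_completion(
    model_id="meta-llama/Llama-3.2-3B-Instruct",  # assumed registered model
    messages=[
        {"role": "system", "content": "You are a concise technical assistant."},
        {"role": "user", "content": "Explain inference in one sentence."},
    ],
    sampling_params={
        "strategy": {"type": "top_p", "temperature": 0.7, "top_p": 0.9},
        "max_tokens": 128,
    },
)

# The structured response exposes the model's reply on completion_message.
print(response.completion_message.content)
```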
Anatomy of an inference request
The message list is the core of every inference request. It is a sequence of objects that represent dialogue turns between different roles.
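For example, a short dialogue could be represented by a message list like the following sketch (the content is invented for illustration):

```python
# Each entry is one dialogue turn, identified by its role and its text content.
messages = [
    {"role": "system", "content": "You are a helpful assistant for a cooking site."},
    {"role": "user", "content": "Suggest a quick weeknight pasta dish."},
    {"role": "assistant", "content": "Try aglio e olio: spaghetti, garlic, olive oil, and chili flakes."},
    {"role": "user", "content": "How long will that take?"},
]
```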
Each message has a role and content, where the role is one of: