...

Using the Inference API for Text Generation

Learn how to construct structured requests to Llama Stack's Inference API using system and user prompts, and experiment with generation parameters to influence model output.

Now that we’ve created our first Llama Stack application using a preconfigured distribution, it’s time to learn how the most fundamental building block, inference, really works. While agents, tools, and RAG workflows offer advanced orchestration, nearly everything in Llama Stack starts with inference: the ability to generate text responses from a large language model.

In this lesson, we’ll explore the Inference API in detail, experiment with how prompts and parameters influence model behavior, and build a lightweight chatbot that supports memory via session-like message history. You’ll finish with a solid grasp of how to use Llama Stack to generate controlled, consistent, and useful outputs.

What is the Inference API?

The inference API is the most direct way to interact with a language model in Llama Stack. It provides a consistent interface to send a list of messages (in the form of a chat history) and receive a generated response from a model.

The API is modeled after familiar chat interfaces like OpenAI’s chat/completions, but with a more structured, provider-agnostic format. Whether you’re using Together, Ollama, or another backend, the request format remains the same.

Calling the inference API involves:

  • Selecting a registered model (model_id).

  • Sending a list of structured messages.

  • (Optionally) specifying generation parameters like temperature, max_tokens, and top_p.

  • Receiving a structured response with the model’s output.
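The steps above can be sketched as a single request payload. A minimal sketch in Python follows; the `model_id`, message contents, and parameter values are illustrative placeholders, not defaults:

```python
# Sketch of the structure of an inference request. The model_id and
# sampling values below are placeholders chosen for illustration.
request = {
    "model_id": "meta-llama/Llama-3.2-3B-Instruct",  # a registered model
    "messages": [
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Explain inference in one sentence."},
    ],
    # Optional generation parameters such as temperature, top_p, max_tokens.
    "sampling_params": {
        "strategy": {"type": "top_p", "temperature": 0.7, "top_p": 0.9},
        "max_tokens": 128,
    },
}

# With the llama-stack-client SDK, this maps onto one call, roughly:
#   client.inference.chat_completion(**request)
# which returns a structured response containing the model's output.
print(request["model_id"])
```

Because the format is provider-agnostic, the same payload works whether the server routes to Together, Ollama, or another backend.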

[Image: The inference API]

This makes it ideal for lightweight use cases such as:

  • Prompt experimentation

  • One-shot or few-shot generation

  • Code completion

  • Chatbot scaffolding

Anatomy of an inference request

The message list is the core of every inference request. It is a sequence of objects that represent dialogue turns between different roles.

Each message has a role and content, where the role is one of:

    ...