...

Using the Inference API for Text Generation

Learn how to construct structured requests to Llama Stack's Inference API using system and user prompts, and experiment with generation parameters to influence model output.

Now that we’ve created our first Llama Stack application using a preconfigured distribution, it’s time to learn how the most fundamental building block, inference, really works. While agents, tools, and RAG workflows offer advanced orchestration, nearly everything in Llama Stack starts with inference: the ability to generate text responses from a large language model.

In this lesson, we’ll explore the Inference API in detail, experiment with how prompts and parameters influence model behavior, and build a lightweight chatbot that supports memory via session-like message history. You’ll finish with a solid grasp of how to use Llama Stack to generate controlled, consistent, and useful outputs.

What is the Inference API?

The inference API is the most direct way to interact with a language model in Llama Stack. It provides a consistent interface to send a list of messages (in the form of a chat history) and receive a generated response from a model.

The API is modeled after familiar chat interfaces like OpenAI’s chat/completions, but with a more structured, provider-agnostic format. Whether you’re using Together, Ollama, or another backend, the request format remains the same.

Calling the inference API involves:

  • Selecting a registered model (model_id).

  • Sending a list of structured messages.

  • (Optionally) specifying generation parameters like temperature, max_tokens, and top_p.

  • Receiving a structured response with the model’s output.
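The steps above can be sketched as a single request payload. A minimal sketch in Python follows; the `model_id`, message contents, and parameter values are illustrative placeholders, not defaults:

```python
# Sketch of the structure of an inference request. The model_id and
# sampling values below are placeholders chosen for illustration.
request = {
    "model_id": "meta-llama/Llama-3.2-3B-Instruct",  # a registered model
    "messages": [
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Explain inference in one sentence."},
    ],
    # Optional generation parameters such as temperature, top_p, max_tokens.
    "sampling_params": {
        "strategy": {"type": "top_p", "temperature": 0.7, "top_p": 0.9},
        "max_tokens": 128,
    },
}

# With the llama-stack-client SDK, this maps onto one call, roughly:
#   client.inference.chat_completion(**request)
# which returns a structured response containing the model's output.
print(request["model_id"])
```

Because the format is provider-agnostic, the same payload works whether the server routes to Together, Ollama, or another backend.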

[Image: The inference API]

This makes it ideal for lightweight use cases such as:

  • Prompt experimentation

  • One-shot or few-shot generation

  • Code completion

  • Chatbot scaffolding

Anatomy of an inference request

The message list is the core of every inference request. It is a sequence of objects that represent dialogue turns between different roles.

Each message has a role and content, where the role is one of:

    ...