Your First Llama Stack Application
Understand how pre-configured Llama Stack distributions work, how to interact with the system via Python code, and how to extend this setup for more complex tasks down the line.
We'll cover the following...
- The purpose of a quick-start distribution
- Step 1: Set up your Together AI account
- Step 2: Set up environment variables
- Step 3: Launch the Llama Stack server
- Step 4: Create your first application script
- Bonus step: Using the Together AI API
- Using LlamaStackAsLibraryClient
- Understanding the unified API interface
- Switching providers
- Complete code
- Final thoughts
After setting up our development environment locally, we’re now ready to run our first real application. We’ll build a simple script that connects to a remote model hosted by Together AI.
The purpose of a quick-start distribution
Before diving into code, it’s important to understand why we’re using a distribution and what makes this example different from your local Ollama setup.
Llama Stack distributions are curated, pre-built environments designed to simplify configuration. They provide sensible defaults, bundle providers together, and make it easy to launch an entire system with a single command. Think of a distribution as a runnable blueprint that defines which model backend you’re using, how your APIs are wired, and what defaults should be exposed. The `together` distribution specifically bundles:
- Model adapters for Meta Llama 4 variants
- Default inference timeouts and retry policies
- Built-in logging, metrics collection, and health checks
- Preconfigured provider middleware (e.g., rate-limit handling)
This saves you from manual configuration of endpoints, adapters, and observability layers.
Step 1: Set up your Together AI account
If you don’t already have a Together AI account:
- Go to https://www.together.ai
- Sign up using GitHub, Google, or email.
- Navigate to the API Keys dashboard.
- Generate a new key or copy an existing one.
Together AI offers free tier usage for many Llama 3 models (subject to rate limits). These hosted models are optimized for performance and reliability, making them ideal for testing and experimentation.
Copy your API key and store it safely. We’ll need to inject it into the Llama Stack runtime.
Step 2: Set up environment variables
Before starting the Llama Stack server, you’ll need to export some environment variables to make your API key and desired port available to the runtime.
You can add your Together AI API key in the widget below. This key will be saved for you, and you can access it in all the lessons of this course.
```bash
export TOGETHER_API_KEY={{{TOGETHER_API_KEY}}}
export LLAMA_STACK_PORT=8321
```
Why do we set `LLAMA_STACK_PORT`? Because distributions need to know which port the server should bind to. While `8321` is the default, it’s good practice to set it explicitly.
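If you want to confirm that the variables will be visible to whatever process you launch next, here is a minimal Python check (a sketch; the variable names simply mirror the exports above):

```python
import os

# These names mirror the shell exports above.
api_key = os.environ.get("TOGETHER_API_KEY")
port = os.environ.get("LLAMA_STACK_PORT", "8321")  # 8321 is the default port

assert api_key, "TOGETHER_API_KEY is not set"
print(f"Llama Stack will bind to port {port}")
```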
Step 3: Launch the Llama Stack server
Llama Stack includes a CLI tool that can launch servers based on built-in templates. Similar to how we used the `ollama` template earlier, the `together` distribution is another such template. Run:
```bash
uv run --with llama-stack llama stack build --template together --image-type venv --run
```
This command does several things:
- Uses the `together` template to configure providers and models.
- Binds the Together inference provider to the Llama Stack `inference` API.
- Starts the server on the specified port (default: 8321).
At this point, the stack will be live and accessible. We’ve got a running API gateway that can translate your app’s requests into Together API calls and return structured responses.
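If you want to verify this yourself, you can poll the server’s health endpoint, the same endpoint the complete script at the end of this lesson uses. A minimal sketch:

```python
import requests

# The running Llama Stack server exposes a health endpoint on its port.
response = requests.get("http://0.0.0.0:8321/v1/health")
print(response.status_code)  # 200 once the stack is ready
```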
Step 4: Create your first application script
Now that we have the server set up, it’s time to connect from a Python script using the Llama Stack SDK. This is where we start writing real application logic.
The widget below will take around 10 seconds to run. This is because we need to wait for the Llama Stack server to be ready before we can access it.
```python
import os
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(
    base_url="http://0.0.0.0:8321",
    provider_data={"together_api_key": os.environ['TOGETHER_API_KEY']}
)

models = client.models.list()

print("Available Models:")
for model in models:
    print(f"- {model.identifier} ({model.model_type})")

print("\nSending a chat request:")
model_id = "meta-llama/Llama-3.2-3B-Instruct-Turbo"

response = client.inference.chat_completion(
    model_id=model_id,
    messages=[
        {"role": "system", "content": "You are a friendly assistant."},
        {"role": "user", "content": "What is the chemical symbol for water?"},
    ],
)

print(response.completion_message.content)
```
Here’s what is happening in the code:
- Lines 1–2: We import the `LlamaStackClient` class from the `llama_stack_client` library. This client provides an interface for interacting with a locally or remotely hosted Llama Stack server and its associated model providers.
- Lines 4–7: We create an instance of `LlamaStackClient`. `base_url="http://0.0.0.0:8321"` tells the client where to reach the Llama Stack server; in this case, it’s running locally on port 8321. `provider_data` is a dictionary containing provider-specific authentication credentials. We read `TOGETHER_API_KEY` from the environment to authenticate with the Together AI provider, which keeps the API key private and out of the code.
- Line 9: We call the `.list()` method on the `models` resource of the client. This returns the list of models registered with the Llama Stack server, including those available via integrated providers (like Together AI).
- Lines 11–13: We iterate over the list of models and print each model’s identifier and its type. `model.identifier` is the model’s unique name (like `meta-llama/Llama-3.3-70B-Instruct-Turbo`), and `model.model_type` indicates what kind of task the model is designed for (e.g., `llm` or `embedding`).
- Line 16: We define the model that we want to use as `model_id`. For now, we have chosen `meta-llama/Llama-3.2-3B-Instruct-Turbo`.
- Lines 18–19: We send a chat-based prompt to the selected model (`model_id`), requesting a natural language response.
- Lines 20–23: The `messages` parameter follows the OpenAI-style chat format: the `"system"` message sets the assistant’s behavior, and the `"user"` message is the actual question or prompt.
- Line 26: The response includes the assistant’s reply, wrapped in a structured object.
And just like that, we have our Llama Stack server running and our client able to access it!
In case you were wondering, we use the `uv run` command followed by our file name, `main.py`, to execute the code above.
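Before moving on, here’s an optional extension you can try: a sketch of a follow-up turn that reuses the same `client` and `model_id` from the script above by appending the assistant’s reply to the message list.

```python
# Sketch: a second turn in the same conversation, reusing the client and
# model_id defined in the script above.
messages = [
    {"role": "system", "content": "You are a friendly assistant."},
    {"role": "user", "content": "What is the chemical symbol for water?"},
]
first = client.inference.chat_completion(model_id=model_id, messages=messages)

# Feed the assistant's reply back in, then ask a follow-up question.
messages.append({"role": "assistant", "content": first.completion_message.content})
messages.append({"role": "user", "content": "And what is its molar mass?"})

second = client.inference.chat_completion(model_id=model_id, messages=messages)
print(second.completion_message.content)
```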
Bonus step: Using the Together AI API
One of the benefits of using a provider like Together AI is the availability of a hosted API. This allows us to access a hosted Llama Stack server using just our API key. It also means that we no longer have to run and host our own Llama Stack server, which can be great for developers looking to get started quickly. Let’s see how we can use the API.
```python
import os
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(
    base_url="https://llama-stack.together.ai",
    provider_data={"together_api_key": os.environ['TOGETHER_API_KEY']}
)

models = client.models.list()

print("Available Models:")
for model in models:
    print(f"- {model.identifier} ({model.model_type})")

print("\nSending a chat request:")
model_id = "meta-llama/Llama-3.2-3B-Instruct-Turbo"

response = client.inference.chat_completion(
    model_id=model_id,
    messages=[
        {"role": "system", "content": "You are a friendly assistant."},
        {"role": "user", "content": "What is the chemical symbol for water?"},
    ],
)

print(response.completion_message.content)
```
The only thing that we have changed is our `base_url`. By swapping the `base_url` to `https://llama-stack.together.ai`, we instantly gain access to production-grade infrastructure (autoscaling, health checks, metrics) and optimized model endpoints.
This approach dramatically reduces our operational overhead, which makes it ideal for quick prototypes or small teams: we don’t need to worry about server uptime, library dependencies, or port conflicts. The trade-off is that we’re tied to the models and features that Together AI supports; if we later need custom plugins, on-prem data, or low-latency inference in a private network, we can switch back to running our own Llama Stack server without changing any of the application code.
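Because only the `base_url` changes, one convenient pattern is to pick the endpoint at runtime. The sketch below assumes a hypothetical `LLAMA_STACK_BASE_URL` environment variable (a name chosen for this example, not something Llama Stack defines):

```python
import os
from llama_stack_client import LlamaStackClient

# LLAMA_STACK_BASE_URL is a made-up variable name for this example: point it
# at your own server or at https://llama-stack.together.ai.
base_url = os.environ.get("LLAMA_STACK_BASE_URL", "http://0.0.0.0:8321")

client = LlamaStackClient(
    base_url=base_url,
    provider_data={"together_api_key": os.environ["TOGETHER_API_KEY"]},
)
```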
Using LlamaStackAsLibraryClient
So far, we have been running the Llama Stack server separately. However, if you’re building an app where you want Llama Stack to run in-process, without manually starting a separate server, you can use:
```python
from llama_stack.distribution.library_client import LlamaStackAsLibraryClient
```
This approach embeds the stack logic directly into the Python app. It’s useful for small apps, notebooks, or cloud functions. But for production workflows, the server model gives you more control, scalability, and monitoring.
```python
import os
from llama_stack.distribution.library_client import LlamaStackAsLibraryClient

client = LlamaStackAsLibraryClient(
    "together",
    provider_data={
        "together_api_key": os.environ['TOGETHER_API_KEY']
    }
)

print("Initializing...")
client.initialize()
print("Ready")

models = client.models.list()
print("Available Models:")
for model in models:
    print(f"- {model.identifier} ({model.model_type})")

print("\nSending a chat request:")
model_id = "meta-llama/Llama-3.2-3B-Instruct-Turbo"
response = client.inference.chat_completion(
    model_id=model_id,
    messages=[
        {"role": "system", "content": "You are a friendly assistant."},
        {"role": "user", "content": "What is the chemical symbol for water?"},
    ],
)
print(response.completion_message.content)
```
Here’s what we have changed in the code above:
- Line 2: Import `LlamaStackAsLibraryClient` to use Llama Stack directly as a Python library.
- Lines 4–8: Create a client instance targeting the `together` provider. The `together_api_key` is securely pulled from the environment.
- Line 12: Call `client.initialize()` to set up the Llama Stack runtime environment. This prepares the models, dependencies, and any internal backend setup.
Understanding the unified API interface
Whether you’re using Ollama locally or Together in the cloud, the application code doesn’t change. That’s the beauty of the unified API.
Here’s what stays constant:
- We call `client.inference.chat_completion()` with a list of messages.
- We identify models using their registered `identifier`.
- We get back a standard `completion` object with one or more choices.
Behind the scenes, the Llama Stack server decides which provider to use and how to call it. This means:
- We can test on local CPU, then deploy to cloud GPUs.
- We can switch from Together to any other provider by changing the provider config.
- We can write tools or agents without worrying about backend specifics.
This is what makes Llama Stack ideal for multi-environment applications.
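To make this concrete, here is a minimal sketch of a provider-agnostic helper. The `ask()` function and the client names in the comments (`together_client`, `ollama_client`) are ours, chosen for illustration; only the unified `chat_completion()` call comes from Llama Stack:

```python
def ask(client, model_id, prompt, system="You are a friendly assistant."):
    """Send one chat turn through the unified inference API (a sketch).

    The body is identical whether `client` points at Ollama, Together,
    or any other configured provider.
    """
    response = client.inference.chat_completion(
        model_id=model_id,
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": prompt},
        ],
    )
    return response.completion_message.content

# Same helper, different backends; only the client and model identifier change:
# ask(together_client, "meta-llama/Llama-3.2-3B-Instruct-Turbo", "What is the chemical symbol for water?")
# ask(ollama_client, "llama3.2:1b", "What is the chemical symbol for water?")
```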
Here's a quick challenge. Imagine you are working on three different projects:
- A quick prototype for a hackathon.
- An internal enterprise tool for sensitive data.
- A personal project in a Jupyter Notebook.
Which Llama Stack configuration would you choose for each, and why? Answer the question in the widget below.
Switching providers
Later, you can build and run the same app with a local Ollama model by changing the provider to `ollama`:
```python
client = LlamaStackAsLibraryClient("ollama")
```
Changing the model to the one we used earlier:
```python
model_id = "llama3.2:1b"
```
The script itself stays the same. Feel free to make these changes in the code widget above.
Complete code
As we mentioned earlier, running the code above requires the Llama Stack server to be running. On the Educative platform, we take care of that for you, but if you want a single script that can start the server and the client for you, you can refer to the code widget below.
Parts of the following code are adapted from the Llama Stack GitHub repository.
```python
import os
import subprocess
import time
import requests
from llama_stack_client import LlamaStackClient
from requests.exceptions import ConnectionError

def run_llama_stack_server_background():
    log_file = open("llama_stack_server.log", "w")
    process = subprocess.Popen(
        "uv run --with llama-stack llama stack run together --image-type venv",
        shell=True,
        stdout=log_file,
        stderr=log_file,
        text=True
    )

    print(f"Starting Llama Stack server with PID: {process.pid}")
    return process

def wait_for_server_to_start():
    url = "http://0.0.0.0:8321/v1/health"
    max_retries = 30
    retry_interval = 1

    print("Waiting for server to start", end="")
    for _ in range(max_retries):
        try:
            response = requests.get(url)
            if response.status_code == 200:
                print("\nServer is ready!")
                return True
        except ConnectionError:
            print(".", end="", flush=True)
            time.sleep(retry_interval)

    print("\nServer failed to start after", max_retries * retry_interval, "seconds")
    return False

def kill_llama_stack_server():
    os.system("ps aux | grep -v grep | grep llama_stack.distribution.server.server | awk '{print $2}' | xargs kill -9")

# Start the server
server_process = run_llama_stack_server_background()
# Wait for the server to be ready
assert wait_for_server_to_start()

client = LlamaStackClient(
    base_url="http://0.0.0.0:8321",
    provider_data={"together_api_key": os.environ['TOGETHER_API_KEY']}
)

models = client.models.list()
print("Available Models:")
for model in models:
    print(f"- {model.identifier} ({model.model_type})")

print("\nSending a chat request:")
model_id = "meta-llama/Llama-3.2-3B-Instruct-Turbo"
response = client.inference.chat_completion(
    model_id=model_id,
    messages=[
        {"role": "system", "content": "You are a friendly assistant."},
        {"role": "user", "content": "What is the chemical symbol for water?"},
    ],
)
print(response.completion_message.content)
```
Here’s what is happening in the code:
- Lines 8–19: The `run_llama_stack_server_background()` function launches the Llama Stack server in the background.
  - Line 9: It opens a log file (`llama_stack_server.log`) to capture the server’s output.
  - Lines 10–11: It uses `subprocess.Popen()` to run the command `uv run --with llama-stack llama stack run together --image-type venv`, which starts the Llama Stack server using the `uv` CLI.
  - Line 12: `shell=True` runs the command through the shell; because `Popen()` doesn’t wait for the command to finish, the Python script keeps running.
  - Lines 13–14: Output and errors are redirected to the log file.
  - Lines 18–19: The function prints the process ID and returns the `Popen` object representing the server process.
- Lines 21–38: The `wait_for_server_to_start()` function polls the Llama Stack server’s health endpoint to check if it’s up and running.
  - Lines 28–29: Inside a `try` block, it sends a GET request to `http://0.0.0.0:8321/v1/health`.
  - Lines 30–32: If the server responds with status code `200`, it prints a success message and returns `True`.
  - Line 34: While waiting, it prints dots (`.`) to show progress.
  - Line 35: It sleeps for 1 second between attempts, retrying up to 30 times.
  - Lines 37–38: If the server doesn’t respond in time, it prints a failure message and returns `False`.
- Lines 40–41: The `kill_llama_stack_server()` utility function forcefully kills any running Llama Stack server processes. It uses `os.system()` to run a shell pipeline that:
  - Finds processes matching `llama_stack.distribution.server.server`.
  - Filters out the `grep` command itself.
  - Extracts their process IDs and sends `SIGKILL` (`-9`) to terminate them immediately.
- Line 44: `server_process = run_llama_stack_server_background()` starts the server.
- Line 46: `assert wait_for_server_to_start()` checks that the server is ready to accept requests. If it isn’t, the assertion fails and stops execution.
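One thing to note: the script defines `kill_llama_stack_server()` but never calls it, so the background server keeps running after the script exits. If you want it torn down automatically, a minimal sketch (reusing the functions defined above) is to wrap the client work in `try`/`finally`:

```python
# Sketch: ensure the background server is killed even if a request fails.
server_process = run_llama_stack_server_background()
assert wait_for_server_to_start()

try:
    client = LlamaStackClient(
        base_url="http://0.0.0.0:8321",
        provider_data={"together_api_key": os.environ["TOGETHER_API_KEY"]},
    )
    # ... issue your requests here ...
finally:
    kill_llama_stack_server()
```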
Final thoughts
We’ve now built our first fully functioning AI application using Llama Stack. We connected to a hosted model, structured our prompt using the SDK, and returned a response through the unified inference API. More importantly, we’ve done so using the same interfaces and abstractions that we’ll use when adding agents, memory, safety, and tools. What we’ve learned here is foundational and reusable across every app you’ll build with Llama Stack.