Your First Llama Stack Application
Understand how pre-configured Llama Stack distributions work, how to interact with the system via Python code, and how to extend this setup for more complex tasks down the line.
We'll cover the following...
- The purpose of a quick-start distribution
- Step 1: Set up your Together AI account
- Step 2: Set up environment variables
- Step 3: Launch the Llama Stack server
- Step 4: Create your first application script
- Bonus step: Using the Together AI API
- Using LlamaStackAsLibraryClient
- Understanding the unified API interface
- Switching providers
- Complete code
- Final thoughts
After setting up our development environment locally, we’re now ready to run our first real application. We’ll build a simple script that connects to a remote model hosted by Together AI.
The purpose of a quick-start distribution
Before diving into code, it’s important to understand why we’re using a distribution and what makes this example different from your local Ollama setup.
Llama Stack distributions are curated, pre-built environments designed to simplify configuration. They provide sensible defaults, bundle providers together, and make it easy to launch an entire system with a single command. Think of a distribution as a runnable blueprint that defines which model backend you’re using, how your APIs are wired, and what defaults should be exposed. The `together` distribution specifically bundles:
- Model adapters for Meta Llama 4 variants
- Default inference timeouts and retry policies
- Built-in logging, metrics collection, and health checks
- Preconfigured provider middleware (e.g., rate-limit handling)
This saves you from manual configuration of endpoints, adapters, and observability layers.
Step 1: Set up your Together AI account
If you don’t already have a Together AI account:
- Go to https://www.together.ai
- Sign up using GitHub, Google, or email.
- Navigate to the API Keys dashboard.
- Generate a new key or copy an existing one.
Together AI offers free tier usage for many Llama 3 models (subject to rate limits). These hosted models are optimized for performance and reliability, making them ideal for testing and experimentation.
Copy your API key and store it safely. We’ll need to inject it into the Llama Stack runtime.
Step 2: Set up environment variables
Before starting the Llama Stack server, you’ll need to export some environment variables to make your API key and desired port available to the runtime.
You can add your Together AI API key in the widget below. This key will be saved for you, and you can access it in all the lessons of this course.
```bash
export TOGETHER_API_KEY={{{TOGETHER_API_KEY}}}
export LLAMA_STACK_PORT=8321
```
Why do we set `LLAMA_STACK_PORT`? Because distributions need to know which port the server should bind to. While `8321` is the default, it’s good practice to set it explicitly.
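If you want to confirm that the variables will be visible to whatever process you launch next, here is a minimal Python check (a sketch; the variable names simply mirror the exports above):

```python
import os

# These names mirror the shell exports above.
api_key = os.environ.get("TOGETHER_API_KEY")
port = os.environ.get("LLAMA_STACK_PORT", "8321")  # 8321 is the default port

assert api_key, "TOGETHER_API_KEY is not set"
print(f"Llama Stack will bind to port {port}")
```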
Step 3: Launch the Llama Stack server
Llama Stack includes a CLI tool that can launch servers based on built-in templates. Similar to how we used the `ollama` template earlier, the `together` distribution is another such template. Run:
```bash
uv run --with llama-stack llama stack build --template together --image-type venv --run
```
This command does several things:
- Uses the `together` template to configure providers and models.
- Binds the Together inference provider to the Llama Stack `inference` API.
- Starts the server on the specified port (default: 8321).
At this point, the stack will be live and accessible. We’ve got a running API gateway that can translate your app’s requests into Together API calls and return structured responses.
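If you want to verify this yourself, you can poll the server’s health endpoint, the same endpoint the complete script at the end of this lesson uses. A minimal sketch:

```python
import requests

# The running Llama Stack server exposes a health endpoint on its port.
response = requests.get("http://0.0.0.0:8321/v1/health")
print(response.status_code)  # 200 once the stack is ready
```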
Step 4: Create your first application script
Now that we have the server set up, it’s time to connect from a Python script using the Llama Stack SDK. This is where we start writing real application logic.
The widget below will take around 10 seconds to run. This is because we need to wait for the Llama Stack server to be ready before we can access it.
```python
import os
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(
    base_url="http://0.0.0.0:8321",
    provider_data={"together_api_key": os.environ['TOGETHER_API_KEY']}
)

models = client.models.list()

print("Available Models:")
for model in models:
    print(f"- {model.identifier} ({model.model_type})")

print("\nSending a chat request:")
model_id = "meta-llama/Llama-3.2-3B-Instruct-Turbo"

response = client.inference.chat_completion(
    model_id=model_id,
    messages=[
        {"role": "system", "content": "You are a friendly assistant."},
        {"role": "user", "content": "What is the chemical symbol for water?"},
    ],
)

print(response.completion_message.content)
```
Here’s what is happening in the code:
- Lines 1–2: We import the `LlamaStackClient` class from the `llama_stack_client` library. This client provides an interface for interacting with a locally or remotely hosted Llama Stack server and its associated model providers.
- Lines 4–7: We create an instance of `LlamaStackClient`. `base_url="http://0.0.0.0:8321"` tells the client where to reach the Llama Stack server; in this case, it’s running locally on port 8321. `provider_data` is a dictionary containing provider-specific authentication credentials. We read `TOGETHER_API_KEY` from the environment to authenticate with the Together AI provider, which keeps the API key private and out of the code.
- Line 9: We call the `.list()` method on the `models` resource of the client. This returns the list of models registered with the Llama Stack server, including those available via integrated providers (like Together AI).
- Lines 11–13: We iterate over the list of models and print each model’s identifier and its type. `model.identifier` is the model’s unique name (like `meta-llama/Llama-3.3-70B-Instruct-Turbo`), and `model.model_type` indicates what kind of task the model is designed for (e.g., `llm` or `embedding`).
- Line 16: We define the model that we want to use as `model_id`. For now, we have chosen `meta-llama/Llama-3.2-3B-Instruct-Turbo`.
- Lines 18–19: We send a chat-based prompt to the selected model (`model_id`), requesting a natural language response.
- Lines 20–23: The `messages` parameter follows the OpenAI-style chat format: the `"system"` message sets the assistant’s behavior, and the `"user"` message is the actual question or prompt.
- Line 26: The response includes the assistant’s reply, wrapped in a structured object.
And just like that, we have our Llama Stack server running and our client able to access it!
In case you were wondering, we use the `uv run` command followed by our file name, `main.py`, to execute the code above.
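Before moving on, here’s an optional extension you can try: a sketch of a follow-up turn that reuses the same `client` and `model_id` from the script above by appending the assistant’s reply to the message list.

```python
# Sketch: a second turn in the same conversation, reusing the client and
# model_id defined in the script above.
messages = [
    {"role": "system", "content": "You are a friendly assistant."},
    {"role": "user", "content": "What is the chemical symbol for water?"},
]
first = client.inference.chat_completion(model_id=model_id, messages=messages)

# Feed the assistant's reply back in, then ask a follow-up question.
messages.append({"role": "assistant", "content": first.completion_message.content})
messages.append({"role": "user", "content": "And what is its molar mass?"})

second = client.inference.chat_completion(model_id=model_id, messages=messages)
print(second.completion_message.content)
```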
Bonus step: Using the Together AI API
One of the benefits of using a provider like Together AI is the availability of a hosted API. This allows us to access a hosted Llama Stack server using just our API key. It also means that we no longer have to run and host our own Llama Stack server, which can be great for developers looking to get started quickly. Let’s see how we can use the API.
```python
import os
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(
    base_url="https://llama-stack.together.ai",
    provider_data={"together_api_key": os.environ['TOGETHER_API_KEY']}
)

models = client.models.list()

print("Available Models:")
for model in models:
    print(f"- {model.identifier} ({model.model_type})")

print("\nSending a chat request:")
model_id = "meta-llama/Llama-3.2-3B-Instruct-Turbo"

response = client.inference.chat_completion(
    model_id=model_id,
    messages=[
        {"role": "system", "content": "You are a friendly assistant."},
        {"role": "user", "content": "What is the chemical symbol for water?"},
    ],
)

print(response.completion_message.content)
```
The only thing that we have changed is our `base_url`. By swapping the `base_url` to `https://llama-stack.together.ai`, we instantly gain access to production-grade infrastructure (autoscaling, health checks, metrics) and optimized model endpoints.
This approach dramatically reduces our operational overhead, which makes it ideal for quick prototypes or small teams: we don’t need to worry about server uptime, library dependencies, or port conflicts. The trade-off is that we’re tied to the models and features that Together AI supports; if we later need custom plugins, on-prem data, or low-latency inference in a private network, we can switch back to running our own Llama Stack server without changing any of the application code.
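Because only the `base_url` changes, one convenient pattern is to pick the endpoint at runtime. The sketch below assumes a hypothetical `LLAMA_STACK_BASE_URL` environment variable (a name chosen for this example, not something Llama Stack defines):

```python
import os
from llama_stack_client import LlamaStackClient

# LLAMA_STACK_BASE_URL is a made-up variable name for this example: point it
# at your own server or at https://llama-stack.together.ai.
base_url = os.environ.get("LLAMA_STACK_BASE_URL", "http://0.0.0.0:8321")

client = LlamaStackClient(
    base_url=base_url,
    provider_data={"together_api_key": os.environ["TOGETHER_API_KEY"]},
)
```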
Using LlamaStackAsLibraryClient
So far, we have been running the Llama Stack server separately. However, if you’re building an app where you want Llama Stack to run in-process, without manually starting a separate server, you can use:
```python
from llama_stack.distribution.library_client import LlamaStackAsLibraryClient
```
This approach embeds the stack logic directly into the Python app. It’s useful for small apps, notebooks, or cloud functions. But for production workflows, the server model gives you more control, scalability, and monitoring.
```python
import os
from llama_stack.distribution.library_client import LlamaStackAsLibraryClient

client = LlamaStackAsLibraryClient(
    "together",
    provider_data={
        "together_api_key": os.environ['TOGETHER_API_KEY']
    }
)

print("Initializing...")
client.initialize()
print("Ready")

models = client.models.list()
print("Available Models:")
for model in models:
    print(f"- {model.identifier} ({model.model_type})")

print("\nSending a chat request:")
model_id = "meta-llama/Llama-3.2-3B-Instruct-Turbo"
response = client.inference.chat_completion(
    model_id=model_id,
    messages=[
        {"role": "system", "content": "You are a friendly assistant."},
        {"role": "user", "content": "What is the chemical symbol for water?"},
    ],
)
print(response.completion_message.content)
```
Here’s what we have changed in the code above:
- Line 2: Import `LlamaStackAsLibraryClient` to use Llama Stack directly as a Python library.
- Lines 4–8: Create a client instance targeting the `together` provider. The `together_api_key` is securely pulled from the environment.
- Line 12: Call `client.initialize()` to set up the Llama Stack runtime environment. This prepares the models, dependencies, and any internal backend setup.
Understanding the unified API interface
Whether you’re using Ollama locally or Together in the cloud, the application code doesn’t change. That’s the beauty of the unified API.
Here’s what stays constant:
- We call `client.inference.chat_completion()` with a list of messages.
- We identify models using their registered `identifier`.
- We get back a standard `completion` object with one or more choices.
Behind the scenes, the Llama Stack server decides which provider to use and how to call it. This means:
- We can test on local CPU, then deploy to cloud GPUs.
- We can switch from Together to any other provider by changing the provider config.
- We can write tools or agents without worrying about backend specifics.
This is what makes Llama Stack ideal for multi-environment applications.
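To make this concrete, here is a minimal sketch of a provider-agnostic helper. The `ask()` function and the client names in the comments (`together_client`, `ollama_client`) are ours, chosen for illustration; only the unified `chat_completion()` call comes from Llama Stack:

```python
def ask(client, model_id, prompt, system="You are a friendly assistant."):
    """Send one chat turn through the unified inference API (a sketch).

    The body is identical whether `client` points at Ollama, Together,
    or any other configured provider.
    """
    response = client.inference.chat_completion(
        model_id=model_id,
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": prompt},
        ],
    )
    return response.completion_message.content

# Same helper, different backends; only the client and model identifier change:
# ask(together_client, "meta-llama/Llama-3.2-3B-Instruct-Turbo", "What is the chemical symbol for water?")
# ask(ollama_client, "llama3.2:1b", "What is the chemical symbol for water?")
```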
Here's a quick challenge. Imagine you are working on three different projects:
- A quick prototype for a hackathon.
- An internal enterprise tool for sensitive data.
- A personal project in a Jupyter Notebook.
Which Llama Stack configuration would you choose for each, and why? Answer the question in the widget below.
Switching providers
Later, you can build and run the same app with a local Ollama model by changing the provider to `ollama`:
```python
client = LlamaStackAsLibraryClient("ollama")
```
Changing the model to the one we used earlier:
```python
model_id = "llama3.2:1b"
```
The script itself stays the same. Feel free to make these changes in the code widget above.
Complete code
As we mentioned earlier, running the code above requires the Llama Stack server to be running. On the Educative platform, we take care of that for you, but if you want a single script that can start the server and the client for you, you can refer to the code widget below.
Parts of the following code are adapted from the Llama Stack GitHub repository.
```python
import os
import subprocess
import time
import requests
from llama_stack_client import LlamaStackClient
from requests.exceptions import ConnectionError

def run_llama_stack_server_background():
    log_file = open("llama_stack_server.log", "w")
    process = subprocess.Popen(
        "uv run --with llama-stack llama stack run together --image-type venv",
        shell=True,
        stdout=log_file,
        stderr=log_file,
        text=True
    )

    print(f"Starting Llama Stack server with PID: {process.pid}")
    return process

def wait_for_server_to_start():
    url = "http://0.0.0.0:8321/v1/health"
    max_retries = 30
    retry_interval = 1

    print("Waiting for server to start", end="")
    for _ in range(max_retries):
        try:
            response = requests.get(url)
            if response.status_code == 200:
                print("\nServer is ready!")
                return True
        except ConnectionError:
            print(".", end="", flush=True)
            time.sleep(retry_interval)

    print("\nServer failed to start after", max_retries * retry_interval, "seconds")
    return False

def kill_llama_stack_server():
    os.system("ps aux | grep -v grep | grep llama_stack.distribution.server.server | awk '{print $2}' | xargs kill -9")

# Start the server
server_process = run_llama_stack_server_background()
# Wait for the server to be ready
assert wait_for_server_to_start()

client = LlamaStackClient(
    base_url="http://0.0.0.0:8321",
    provider_data={"together_api_key": os.environ['TOGETHER_API_KEY']}
)

models = client.models.list()
print("Available Models:")
for model in models:
    print(f"- {model.identifier} ({model.model_type})")

print("\nSending a chat request:")
model_id = "meta-llama/Llama-3.2-3B-Instruct-Turbo"
response = client.inference.chat_completion(
    model_id=model_id,
    messages=[
        {"role": "system", "content": "You are a friendly assistant."},
        {"role": "user", "content": "What is the chemical symbol for water?"},
    ],
)
print(response.completion_message.content)
```
Here’s what is happening in the code:
- Lines 8–19: The `run_llama_stack_server_background()` function launches the Llama Stack server in the background.
  - Line 9: It opens a log file (`llama_stack_server.log`) to capture the server’s output.
  - Lines 10–11: It uses `subprocess.Popen()` to run the command `uv run --with llama-stack llama stack run together --image-type venv`, which starts the Llama Stack server using the `uv` CLI.
  - Line 12: `shell=True` runs the command through the shell; because `Popen()` doesn’t wait for the command to finish, the Python script keeps running.
  - Lines 13–14: Output and errors are redirected to the log file.
  - Lines 18–19: The function prints the process ID and returns the `Popen` object representing the server process.
- Lines 21–38: The `wait_for_server_to_start()` function polls the Llama Stack server’s health endpoint to check if it’s up and running.
  - Lines 28–29: Inside a `try` block, it sends a GET request to `http://0.0.0.0:8321/v1/health`.
  - Lines 30–32: If the server responds with status code `200`, it prints a success message and returns `True`.
  - Line 34: While waiting, it prints dots (`.`) to show progress.
  - Line 35: It sleeps for 1 second between attempts, retrying up to 30 times.
  - Lines 37–38: If the server doesn’t respond in time, it prints a failure message and returns `False`.
- Lines 40–41: The `kill_llama_stack_server()` utility function forcefully kills any running Llama Stack server processes. It uses `os.system()` to run a shell pipeline that:
  - Finds processes matching `llama_stack.distribution.server.server`.
  - Filters out the `grep` command itself.
  - Extracts their process IDs and sends `SIGKILL` (`-9`) to terminate them immediately.
- Line 44: `server_process = run_llama_stack_server_background()` starts the server.
- Line 46: `assert wait_for_server_to_start()` checks that the server is ready to accept requests. If it isn’t, the assertion fails and stops execution.
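One thing to note: the script defines `kill_llama_stack_server()` but never calls it, so the background server keeps running after the script exits. If you want it torn down automatically, a minimal sketch (reusing the functions defined above) is to wrap the client work in `try`/`finally`:

```python
# Sketch: ensure the background server is killed even if a request fails.
server_process = run_llama_stack_server_background()
assert wait_for_server_to_start()

try:
    client = LlamaStackClient(
        base_url="http://0.0.0.0:8321",
        provider_data={"together_api_key": os.environ["TOGETHER_API_KEY"]},
    )
    # ... issue your requests here ...
finally:
    kill_llama_stack_server()
```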
Final thoughts
We’ve now built our first fully functioning AI application using Llama Stack. We connected to a hosted model, structured our prompt using the SDK, and returned a response through the unified inference API. More importantly, we’ve done so using the same interfaces and abstractions that we’ll use when adding agents, memory, safety, and tools. What we’ve learned here is foundational and reusable across every app you’ll build with Llama Stack.