Setting Up a Development Environment for Llama Stack
Understand the process of installing Llama Stack, setting up a local inference backend, and running your first test application using the Python SDK.
We'll cover the following...
- Why local-first development?
- Why use uv?
- Step 1: Install uv
- Step 2: Initialize your project environment
- Step 3: Install the Llama Stack Python SDK
- Step 4: Install and run Ollama for local inference
- Step 5: Run the Llama Stack server with Ollama configuration
- Step 6: Testing the Llama Stack server
- Closing thoughts
Getting started with Llama Stack doesn’t require a GPU cluster or managed cloud infrastructure. The stack’s design philosophy encourages starting small—on your laptop, using lightweight providers—and scaling up only when your application demands it. This local-first mindset is ideal for rapid prototyping, debugging, and experimentation.
In this lesson, you’ll use uv, a fast Python package and environment manager, to set up a clean development workspace. Then you’ll install the core components of Llama Stack, set up Ollama as your inference backend, and run your first inference call through the SDK. You’ll have a working dev environment ready for more advanced builds by the end.
Why local-first development?
Llama Stack was built with local development in mind. This differentiates it from frameworks that assume access to high-powered GPUs or cloud credits. Local setups are:
- Faster to iterate: You can try, break, and rerun without waiting on remote servers.
- More transparent: You can access logs, models, and configurations without abstraction layers.
- Easier to control: No external dependencies, rate limits, or vendor lock-in.
For this reason, our initial setup will use:
- Ollama for running inference locally via Llama 3 models
- Llama Stack Python SDK for interacting with the APIs
You’ll eventually be able to swap these out with remote providers, but the interface and logic will remain consistent.
Why use uv?
While traditional pip and venv workflows are common, uv provides a faster, more modern alternative with better dependency resolution and caching. It combines the functionality of a virtual environment manager and a Python package installer.
Benefits of using uv include:
- Fast dependency resolution and installation.
- Automatically manages virtual environments.
- Compatible with pip commands, but faster and cleaner.
- Officially used in Llama Stack’s development workflows.
You’ll use uv throughout this course to install, manage, and run Llama Stack apps and providers.
The installation instructions provided here are just for reference. The setup has already been done for you on Educative!
Step 1: Install uv
First, install uv globally. You only need to do this once:
curl -Ls https://astral.sh/uv/install.sh | sh
You can then verify the installation using:
uv --version
You should see an output like uv 0.x.x.
Step 2: Initialize your project environment
Create a folder for your Llama Stack project:
mkdir llama-stack-app
cd llama-stack-app
Initialize the environment using uv:
uv venv
source .venv/bin/activate
You now have a clean, isolated Python environment ready to go.
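If you want to confirm that the activated environment is the one uv just created, you can check which Python interpreter your shell resolves to (on macOS/Linux):
which python
The path should point to .venv/bin/python inside your project folder.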
Step 3: Install the Llama Stack Python SDK
With your environment activated, install the Llama Stack SDK using uv pip:
uv pip install -U llama-stack
This package includes:
- The core client classes to interact with the Llama Stack server.
- Data types and utilities for building agents, inserting documents, etc.
- Support for connecting to local or remote stacks.
If needed, you can verify your local installation by running:
echo "from llama_stack_client import LlamaStackClient; print('Llama Stack SDK installed.')" | uv run -
Step 4: Install and run Ollama for local inference
Ollama is a local LLM runtime that allows you to run Llama models on CPU or GPU. It supports models like llama4, mistral, and gemma. You can visit ollama.com/download and install the appropriate package for your system.
Ollama can also be installed using the following command:
curl -fsSL https://ollama.com/install.sh | sh
Once installed, we will need to start the Ollama server. This will allow us to pull and run models. We can leave Ollama running in the background as a service, or use the following command to keep it available for API usage:
ollama serve
Since we are working with a single terminal, we can append > /dev/null 2>&1 & to the ollama serve command to run it in the background, as shown below.
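Putting that together, the full command to start the Ollama server in the background is:
ollama serve > /dev/null 2>&1 &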
Now that Ollama is running, we can pull the model we want to use. For our testing, we have chosen the Llama 3.2-1B model, a small model that can run on very modest hardware. We can use the run command if we want to interact with the model; run will automatically pull (download) the model for us.
ollama run llama3.2:1b
This will download and start the Llama 3.2-1B model. You can test it interactively in the terminal below to confirm it’s working; we have already started Ollama in the background for you. To exit the chat, type /exit.
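If you prefer to download the model without opening an interactive chat session, you can pull it explicitly instead:
ollama pull llama3.2:1b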
Now that the model is ready, let’s set the INFERENCE_MODEL environment variable to llama3.2:1b. This environment variable is used by Llama Stack internally.
export INFERENCE_MODEL=llama3.2:1b
We have already exported the environment variable for you.
Step 5: Run the Llama Stack server with Ollama configuration
Now that Ollama is running, we’ll configure Llama Stack to use it for inference.
Llama Stack runs as a server exposing multiple APIs, and you connect to it using the client SDK. We can build and run the Llama Stack server using a YAML configuration file. The YAML configuration file will allow us to customize the server to our liking; however, we can use a provided template for a quick start with Ollama.
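For orientation, here is a rough, illustrative sketch of the kind of configuration such a template produces. The field names below are an assumption based on the general shape of Llama Stack run configurations and change between releases, so treat the template generated by the build command below as the source of truth:
# Illustrative sketch only: exact field names vary by Llama Stack release
apis:
  - inference
providers:
  inference:
    - provider_id: ollama
      provider_type: remote::ollama   # assumed provider type for the Ollama backend
      config:
        url: http://localhost:11434   # default local Ollama endpoint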
Run the following command within an activated virtual environment to build and run the server using the Ollama template:
uv run --with llama-stack llama stack build --template ollama --image-type venv --run
The command may seem long and complex, so let’s break it down:
- uv run: Executes a command or script inside the managed environment.
- --with llama-stack: Tells uv to make the llama-stack package available in the environment it uses to run the command.
- llama stack build --template ollama --image-type venv --run: This is the actual command being run by uv. It breaks down into:
  - llama: The llama CLI tool helps us set up and use Llama Stack. It was installed when we installed the llama-stack package.
  - stack build: Calls the build subcommand of the stack command group. This builds our application stack.
  - --template ollama: Specifies the template to use. Since we are using Ollama, we build from an existing template by setting the template parameter to ollama.
  - --image-type venv: Specifies the image type to use when running the stack; here we set it to venv. It can also be set to a Conda environment or a container (e.g., Docker).
  - --run: Once the stack is built, it immediately runs it.
Once you run this command, you should see logs confirming the server is running on http://0.0.0.0:8321.
Great! Now that our Llama Stack server is running, we can use the Llama Stack client to access its APIs.
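For example, as a quick sanity check, you can point the Python client at the local server and list the models it knows about. The snippet below is a minimal sketch; it assumes the client exposes a models.list() method, and exact method names can differ between llama-stack-client versions:
from llama_stack_client import LlamaStackClient

# Connect to the locally running Llama Stack server from Step 5
client = LlamaStackClient(base_url="http://localhost:8321")

# List the models the server has registered (method name assumed; see note above)
for model in client.models.list():
    print(model)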
Step 6: Testing the Llama Stack server
At this point, we’ve set up:
- A local LLM runtime (Ollama) serving a Llama 3.2 model
- A Llama Stack server configured to use that runtime for inference
Now, let’s test if everything is working as expected. We can use the llama-stack-client CLI from the terminal to test the server. Here’s a simple command that will generate a response from the model.
llama-stack-client inference chat-completion --message "tell me a joke"
Here, we use the inference command followed by chat-completion to send a message to the model. Don’t worry if you’re unfamiliar with inference or chat completion; we will discuss them later in the course. For now, just try it out in the terminal below.
Before running the command above, please wait a moment to ensure that the LLM runtime and Llama Stack server are fully set up in the background. This setup happens automatically but may take 10 to 20 seconds.
You should see a ChatCompletionResponse object as a result. This object will have a lot of parameters that you may or may not be familiar with. Do not worry! We will dive into these parameters soon.
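The same request can also be made from Python with the SDK. The snippet below is a minimal sketch of that call; the method and parameter names (inference.chat_completion, model_id, messages) reflect common llama-stack-client versions and may differ in newer releases, so check your SDK reference if the call fails:
from llama_stack_client import LlamaStackClient

# Connect to the local Llama Stack server started in Step 5
client = LlamaStackClient(base_url="http://localhost:8321")

# Mirror the CLI command above: ask the model for a joke
# (method/parameter names assumed; see the note before this snippet)
response = client.inference.chat_completion(
    model_id="llama3.2:1b",
    messages=[{"role": "user", "content": "tell me a joke"}],
)

# Print the full ChatCompletionResponse object
print(response)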
Closing thoughts
You now have a fully functional Llama Stack development environment. You’ve configured a local inference provider, connected through the SDK, and issued your first API call. Most importantly, this setup will continue to work as you introduce new components like retrieval, safety, tools, and evaluation. The goal moving forward is to build layer by layer, adding richer logic, memory, safety filters, and eventually deployment workflows. But the foundation you’ve just created will remain consistent, even as you scale up.