Setting Up a Development Environment for Llama Stack

Understand the process of installing Llama Stack, setting up a local inference backend, and running your first test application using the Python SDK.

Getting started with Llama Stack doesn’t require a GPU cluster or managed cloud infrastructure. The stack’s design philosophy encourages starting small—on your laptop, using lightweight providers—and scaling up only when your application demands it. This local-first mindset is ideal for rapid prototyping, debugging, and experimentation.

In this lesson, you’ll use uv, a fast Python package and environment manager, to set up a clean development workspace. Then you’ll install the core components of Llama Stack, set up Ollama as your inference backend, and run your first inference call through the SDK. You’ll have a working dev environment ready for more advanced builds by the end.

Why local-first development?

Llama Stack was built with local development in mind. This differentiates it from frameworks that assume access to high-powered GPUs or cloud credits. Local setups are:

  • Faster to iterate: You can try, break, and rerun without waiting on remote servers.

  • More transparent: You can access logs, models, and configurations without abstraction layers.

  • Easier to control: No external dependencies, rate limits, or vendor lock-in.

For this reason, our initial setup will use:

  • Ollama for running inference locally via Llama 3 models

  • Llama Stack Python SDK for interacting with the APIs

You’ll eventually be able to swap these out for remote providers, but the interface and your application logic will remain consistent, as the short sketch below shows.
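
To make that concrete, here is a minimal sketch using the Python SDK you’ll install in Step 3. The remote URL is a placeholder, not a real endpoint; the point is that only the address the client talks to changes when you swap providers.

from llama_stack_client import LlamaStackClient

# Local development: the client points at a Llama Stack server on your machine
local_client = LlamaStackClient(base_url="http://localhost:8321")

# A remote deployment only changes where the client points (placeholder URL)
remote_client = LlamaStackClient(base_url="https://your-remote-stack.example.com")

# Listing models, running inference, and building agents all use the same
# client interface in either case.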

Why use uv?

While traditional pip and venv workflows are common, uv provides a faster, more modern alternative with better dependency resolution and caching. It combines the functionality of a virtual environment manager and a Python package installer.


Benefits of using uv include:

  • Fast dependency resolution and installation.

  • Automatically manages virtual environments.

  • Compatible with pip commands, but faster and cleaner.

  • Officially used in Llama Stack’s development workflows.

You’ll use uv throughout this course to install, manage, and run Llama Stack apps and providers.

The installation instructions provided here are just for reference. The setup has already been done for you on Educative!

Step 1: Install uv

First, install uv globally. You only need to do this once:

curl -Ls https://astral.sh/uv/install.sh | sh

You can then verify the installation using:

uv --version

You should see an output like uv 0.x.x.

Step 2: Initialize your project environment

Create a folder for your Llama Stack project:

mkdir llama-stack-app
cd llama-stack-app

Initialize the environment using uv:

uv venv
source .venv/bin/activate

You now have a clean, isolated Python environment ready to go.

Step 3: Install the Llama Stack Python SDK

With your environment activated, install the Llama Stack SDK using uv pip:

uv pip install -U llama-stack

This package includes:

  • The core client classes to interact with the Llama Stack server.

  • Data types and utilities for building agents, inserting documents, etc.

  • Support for connecting to local or remote stacks.

If needed, you can verify your local installation by running:

echo "from llama_stack_client import LlamaStackClient; print('Llama Stack SDK installed.')" | uv run -

Step 4: Install and run Ollama for local inference

Ollama is a local LLM runtime that allows you to run open models on CPU or GPU. It supports models like llama4, mistral, and gemma. You can visit ollama.com/download and install the appropriate package for your system.


Ollama can also be installed using the following command:

curl -fsSL https://ollama.com/install.sh | sh

Once installed, we will need to start the Ollama server. This will allow us to pull and run models. We can leave Ollama running in the background as a service, or use the following command to keep it available for API usage:

ollama serve

Since we’re working in a single terminal, we can append > /dev/null 2>&1 & to the ollama serve command to run it in the background:

ollama serve > /dev/null 2>&1 &
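
If you want to confirm the background server is reachable before pulling a model, here is a quick check from Python, assuming Ollama’s default port of 11434; the root endpoint simply reports that Ollama is running.

import urllib.request

# Ollama listens on http://localhost:11434 by default
with urllib.request.urlopen("http://localhost:11434") as response:
    print(response.read().decode())  # Expected output: "Ollama is running"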

Now that Ollama is running, we can pull the model we want to use. For our testing, we have chosen the Llama 3.2-1B model, a small model that can run on very modest hardware. We can use the run command to interact with the model; it will automatically pull (download) the model for us if it isn’t already available.

ollama run llama3.2:1b

This will download and start the Llama 3.2-1B model. We have already started Ollama in the background for the terminal below, so you can test the model interactively to confirm it’s working. To exit the chat, type /exit.

Terminal 1

Now that the model is ready, let’s set the INFERENCE_MODEL environment variable to llama3.2:1b. This environment variable is used by Llama Stack internally.

export INFERENCE_MODEL=llama3.2:1b

We have already exported the environment variable for you.

Step 5: Run the Llama Stack server with Ollama configuration

Now that Ollama is running, we’ll configure Llama Stack to use it for inference.

Llama Stack runs as a server exposing multiple APIs, and you connect to it using the client SDK. We can build and run the Llama Stack server using a YAML configuration file. The YAML configuration file will allow us to customize the server to our liking; however, we can use a provided template for a quick start with Ollama.

Run the following command within an activated virtual environment to build and run the server using the Ollama template:

uv run --with llama-stack llama stack build --template ollama --image-type venv --run

The command may seem long and complex, so let’s break it down:

  • uv run: Executes a command or script inside the managed environment.

  • --with llama-stack: Tells uv to run the command in an environment that has the llama-stack package installed.

  • llama stack build --template ollama --image-type venv --run: This is the actual command being run by uv. It breaks down into:

    • llama: The llama CLI tool helps us set up and use Llama Stack. This was installed when we installed the llama-stack package.

    • stack build: Invokes the build subcommand of llama stack. This assembles the stack distribution we will run.

    • --template ollama: Specifies the template to build from. Since we are using Ollama for inference, we set the template parameter to ollama.

    • --image-type venv: Specifies the type of environment (image) the stack is built into. Here we use a Python virtual environment (venv); it can also be a Conda environment or a container (e.g., Docker).

    • --run: Starts the stack immediately after the build completes.

Once you run this command, you should see logs confirming the server is running on http://0.0.0.0:8321.

Terminal 1

Great! Now that our Llama Stack server is running, we can use the Llama Stack client to access its APIs.
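
For example, the following minimal sketch, assuming the server is reachable at the default address http://localhost:8321, connects with the Python SDK and lists the models the server has registered. You should see the model served by Ollama in the output.

from llama_stack_client import LlamaStackClient

# Connect to the locally running Llama Stack server
client = LlamaStackClient(base_url="http://localhost:8321")

# List the models registered with the server; the model served by Ollama
# (e.g., llama3.2:1b) should appear among them
for model in client.models.list():
    print(model.identifier)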

Step 6: Testing the Llama Stack server

At this point, we’ve set up:

  • A local LLM runtime (Ollama) serving a Llama 3.2 model

  • A Llama Stack server configured to use that runtime for inference

Now, let’s test if everything is working as expected. We can use the llama-stack-client library from the terminal to test the server. Here’s a simple command that will generate a response from the model.

llama-stack-client inference chat-completion --message "tell me a joke"

Here, we use the inference command followed by chat-completion to send a message to the model. Don’t worry if you’re unfamiliar with inference or chat completion; we will discuss them later in the course. For now, just try it out in the terminal below.

Before running the command above, please wait a moment to ensure that the LLM runtime and the Llama Stack server are fully set up in the background. This setup happens automatically but may take 10 to 20 seconds.

Terminal 1

You should see a ChatCompletionResponse object as a result. This object will have a lot of parameters that you may or may not be familiar with. Do not worry! We will dive into these parameters soon.
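
The same request can also be made from Python using the SDK we installed in Step 3. Below is a minimal sketch; it assumes the client’s inference.chat_completion method and the default server address, and it reads the model identifier from the INFERENCE_MODEL variable we exported earlier (you can confirm the exact identifier with client.models.list()).

import os

from llama_stack_client import LlamaStackClient

# Connect to the local Llama Stack server
client = LlamaStackClient(base_url="http://localhost:8321")

# Use the model we exported earlier; fall back to llama3.2:1b if unset
model_id = os.environ.get("INFERENCE_MODEL", "llama3.2:1b")

response = client.inference.chat_completion(
    model_id=model_id,
    messages=[{"role": "user", "content": "tell me a joke"}],
)

# The ChatCompletionResponse carries the generated text in completion_message
print(response.completion_message.content)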

Closing thoughts

You now have a fully functional Llama Stack development environment. You’ve configured a local inference provider, connected through the SDK, and issued your first API call. Most importantly, this setup will continue to work as you introduce new components like retrieval, safety, tools, and evaluation. The goal moving forward is to build layer by layer, adding richer logic, memory, safety filters, and eventually deployment workflows. But the foundation you’ve just created will remain consistent, even as you scale up.