Working with Different Providers and Models
Learn how Llama Stack works flexibly with different providers and models, and understand the benefits of provider abstraction.
Deploying AI applications effectively means navigating a diverse landscape of cost, latency, and hardware considerations. You might begin testing on a local CPU and then need the performance of hosted GPUs for production. Perhaps embeddings can run offline to save costs, while core inference tasks require cloud scalability, or you need to switch between these configurations seamlessly. This constant re-evaluation and adaptation can be a major development bottleneck.
Llama Stack simplifies this complexity. Instead of rewriting your application for each new setup, it allows you to abstract these infrastructure differences through a system of providers and distributions. You define your application's needs (like inference, retrieval, or safety), and Llama Stack manages which underlying system fulfills those needs, streamlining your path from development to deployment.
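To make this concrete, here is a minimal sketch of what the application side can look like: the code talks only to the Llama Stack client and never names a specific backend, so moving from a local CPU setup to hosted GPUs becomes a configuration change rather than a code change. The base URL, model ID, and exact client method names below are illustrative and may vary with your llama-stack-client version.

```python
# Sketch: application code depends on the Llama Stack API, not on any one provider.
# Swapping providers (e.g., local Ollama vs. a hosted GPU endpoint) is done in the
# distribution's configuration; this code stays the same.
# NOTE: base_url, model_id, and method names are placeholders / version-dependent.
from llama_stack_client import LlamaStackClient

# Point the client at whichever distribution is currently running (local or remote).
client = LlamaStackClient(base_url="http://localhost:8321")

# The same inference call works regardless of which provider fulfills it.
response = client.inference.chat_completion(
    model_id="meta-llama/Llama-3.2-3B-Instruct",
    messages=[{"role": "user", "content": "Summarize provider abstraction in one sentence."}],
)
print(response.completion_message.content)
```

Because the provider binding lives in the distribution rather than in the application, the same script can run against whatever configuration we generate next.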
Generating a configuration
We’ll start by initializing a fresh Llama Stack distribution using the CLI:
```bash
llama stack build
```
This will open an interactive session where we can pick and choose our providers for the various APIs. In the ...