Working with Different Providers and Models

Learn about the flexibility of Llama Stack in working with different providers and models, and understand the benefits of provider abstraction.

Deploying AI applications effectively means navigating a diverse landscape of cost, latency, and hardware considerations. You might begin testing on a local CPU and then need the performance of hosted GPUs for production. Perhaps embeddings can run offline to save costs, while core inference tasks require cloud scalability, or you need to switch between these configurations seamlessly. This constant re-evaluation and adaptation can be a major development bottleneck.
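Llama Stack addresses this by making the provider a configuration choice rather than a code choice. As an illustrative sketch (the provider IDs, types, and URL shown here are assumptions for the example, not taken from this article), a distribution's `run.yaml` might point inference at a local Ollama server during development and at a hosted provider for production:

```yaml
# Sketch of a run.yaml providers section (values are illustrative).
providers:
  inference:
    # Local development: inference served by an Ollama instance.
    - provider_id: ollama
      provider_type: remote::ollama
      config:
        url: http://localhost:11434
    # Production alternative: a hosted inference provider.
    # Switching between the two is a config edit, not a code change.
    - provider_id: fireworks
      provider_type: remote::fireworks
      config:
        api_key: ${env.FIREWORKS_API_KEY}
```

Because application code talks to the Llama Stack APIs rather than to a specific backend, swapping entries in a block like this leaves the client code untouched.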