Orchestration and Application Logic
Explore how to design and implement orchestration logic in LLM systems to enable conversational memory while keeping APIs stateless and scalable. Understand token budgeting, query rewriting, and state persistence using durable storage. Learn to turn your LLM app into a robust, observable system capable of handling multi-turn conversations without sacrificing performance or accuracy.
We have deployed a scalable API. The system ingests data, retrieves vectors, and returns query results. Functionally, this is a search engine rather than a conversational assistant. A search engine is stateless: each query is processed in isolation. If a user searches for "API keys" and then asks, "How do I rotate them?", the second query is still processed independently; the system has no awareness that "them" refers to API keys. To support conversational behavior, we need to add orchestration logic.
From an LLMOps perspective, adding memory is not a UX feature layered on top of generation; it is a systems problem that introduces new operational constraints:
Token budgeting: We cannot feed the entire conversation history into the model on every turn; we would exhaust both the context window and the budget (see the sketch after this list).
Latency: Every additional step, such as loading history, rewriting queries, and persisting state, adds I/O and compute overhead.
Ambiguity: Pronouns and references (it, that, they) cannot be indexed or embedded meaningfully without context.
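To make the token-budgeting constraint concrete, here is a minimal sketch of a history-trimming helper. It assumes OpenAI-style message dicts ({"role": ..., "content": ...}) and uses tiktoken for counting; the function names and the 3,000-token budget are illustrative assumptions, not the lesson's final implementation.

```python
import tiktoken

# Tokenizer used only for counting; the budget below is an arbitrary example value.
ENC = tiktoken.get_encoding("cl100k_base")


def count_tokens(text: str) -> int:
    return len(ENC.encode(text))


def trim_history(messages: list[dict], max_tokens: int = 3000) -> list[dict]:
    """Drop the oldest turns until the conversation fits the token budget.

    The newest messages are kept because they carry the context the model
    needs to resolve references in the current query.
    """
    kept: list[dict] = []
    used = 0
    # Walk the history from newest to oldest, keeping turns while the budget allows.
    for msg in reversed(messages):
        cost = count_tokens(msg["content"])
        if used + cost > max_tokens:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))
```

In practice you would also reserve part of the budget for the system prompt, retrieved context, and the model's answer, but the core discipline is the same: memory is an explicit, bounded input, not an ever-growing transcript.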
In this lesson, we will design and implement a stateful orchestration layer while keeping the API itself stateless and horizontally scalable. We will manage memory explicitly, enforce strict token discipline, and introduce query rewriting as a model-routing pattern to resolve ambiguity before it reaches the retrieval stage.
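To preview the query-rewriting pattern, here is a minimal sketch that routes an ambiguous follow-up through a small model before retrieval. The model name, prompt wording, and use of the OpenAI Python SDK are assumptions for illustration, not the lesson's final design.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

REWRITE_PROMPT = (
    "Rewrite the user's latest question so it is fully self-contained. "
    "Resolve pronouns and references using the conversation history. "
    "Return only the rewritten question."
)


def rewrite_query(history: list[dict], latest_question: str) -> str:
    """Turn an ambiguous follow-up into a standalone query before retrieval."""
    messages = [{"role": "system", "content": REWRITE_PROMPT}]
    messages += history  # e.g. the trimmed history from the earlier sketch
    messages.append({"role": "user", "content": latest_question})
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # a small, cheap model is enough for rewriting
        messages=messages,
        temperature=0,
    )
    return resp.choices[0].message.content.strip()


# With the earlier example history, "How do I rotate them?" should come back
# as something like "How do I rotate API keys?", which can now be embedded
# and matched against the index on its own.
```

The rewritten query, not the raw user message, is what reaches the embedding and retrieval stage, which is why this step sits in the orchestration layer rather than in the UI.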
Stateless API and stateful backend
A cardinal rule of scalable microservices is that the API layer must remain ...