Design of an LLM-powered Customer Support Bot
Design an LLM customer support bot using Retrieval-Augmented Generation (RAG) and cost-aware routing. Learn how RAG pipelines ground responses in current knowledge, while tiered model routing reduces operational costs by up to 60%. This architecture ensures accuracy, scalability, and graceful human escalation.
In the previous lesson, we explored the requirements and resource estimations for an LLM-powered customer support bot. Building on that, we will now move on to the high-level design of the system.
High-level design of an LLM-powered customer support bot
Production support bots quickly become outdated as product details and policies change, leading to inaccurate responses and declining user satisfaction. The system must therefore draw on up-to-date knowledge continuously while keeping LLM inference costs under control.
The following high-level design uses Retrieval-Augmented Generation (RAG) with cost-aware routing to deliver accurate, context-rich responses. The workflow is as follows:

1. A user submits a query through a web or mobile interface.
2. The API gateway handles authentication, rate limiting, and session tracking, then forwards the request to the backend.
3. The RAG pipeline retrieves relevant knowledge from a vector database and augments the prompt with it.
4. The augmented prompt is passed to the LLM to generate a response.
5. The response passes through content moderation before being returned to the user.
6. The conversation is logged, and user feedback is collected.
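The retrieve-augment-generate core of this flow can be sketched as follows. The retrieval and prompt-building functions below are simplified stand-ins (keyword overlap instead of vector similarity, and no real LLM call), just to show where each step fits; a production system would use a vector database client and an embedding model.

```python
# Minimal sketch of the RAG request flow. Retrieval is stubbed with naive
# keyword overlap; a real system would embed the query and search a vector DB.

def retrieve(query: str, store: dict[str, str], k: int = 2) -> list[str]:
    """Rank stored documents by keyword overlap with the query (stub)."""
    words = query.lower().split()
    scored = sorted(store.items(),
                    key=lambda kv: -sum(w in kv[1].lower() for w in words))
    return [doc for _, doc in scored[:k]]

def build_prompt(query: str, context: list[str]) -> str:
    """Augment the prompt with retrieved knowledge before the LLM call."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Answer using only this context:\n{joined}\n\nUser: {query}"

# Toy knowledge base (illustrative content only).
store = {
    "returns": "Returns: items may be returned within 30 days.",
    "shipping": "Shipping: standard delivery takes 3-5 business days.",
}
prompt = build_prompt("returns policy", retrieve("returns policy", store, k=1))
# `prompt` would now be sent to the LLM for generation.
```

Because the knowledge lives in the store rather than in model weights, updating a policy document immediately changes what the LLM sees at generation time.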
This architecture ensures the system always references up-to-date product knowledge rather than relying solely on what the model learned during training.
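The cost-aware routing mentioned above can be sketched as a tiered router: cheap models handle simple, high-confidence queries, while harder queries escalate to a larger model. The tier names, thresholds, and confidence signal below are illustrative assumptions; a production router would typically use a trained classifier.

```python
# Sketch of tiered, cost-aware model routing. Model names and thresholds
# are hypothetical; "confidence" would come from the NLU/intent classifier.

def route_model(query: str, confidence: float) -> str:
    """Pick the cheapest model tier likely able to handle the query."""
    if confidence >= 0.9 and len(query.split()) < 20:
        return "small-model"   # cheap tier for short, high-confidence queries
    if confidence >= 0.5:
        return "medium-model"  # mid tier for moderately complex queries
    return "large-model"       # expensive tier; also a human-escalation candidate
```

Routing the bulk of routine FAQ traffic to the small tier is what drives the cost savings, since only ambiguous or complex queries pay for the large model.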
Educative byte: Parametric memory refers to the knowledge baked into an LLM’s weights during training. It becomes stale the moment source documents are updated, which is why retrieval-augmented approaches are essential for production support bots.
With the high-level flow established, the next step is to design APIs for the system.
API design
To support the functional requirements, we define a set of APIs that enable communication between system components and handle key operations within the LLM-powered customer support bot.
sendMessage(): Handles incoming user queries, maintains session context, and returns a response by orchestrating the NLU, RAG, and LLM components. It supports multi-turn dialogue by attaching the conversation history to each request.
sendMessage(session_id: string, user_id: string, message: string)
| Parameter | Description |
| --- | --- |
| `session_id` | A unique identifier for the conversation session |
| `user_id` | A unique identifier for the user |
| `message` | User’s input text query |
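A minimal sketch of how `sendMessage()` might orchestrate these steps is shown below. The component calls are stubbed out with placeholder logic (marked in comments), since the real handler would invoke the NLU, retrieval, and LLM services; the in-memory session store is also an illustrative assumption.

```python
# Illustrative sendMessage orchestration. Intent detection, retrieval, and
# generation are stand-ins for real service calls.

SESSIONS: dict[str, list[dict]] = {}  # session_id -> conversation history

def send_message(session_id: str, user_id: str, message: str) -> dict:
    # Attach the message to the session history to support multi-turn dialogue.
    history = SESSIONS.setdefault(session_id, [])
    history.append({"role": "user", "content": message})

    intent = "faq" if "?" in message else "chitchat"      # stand-in for parseQuery()
    context = "Returns accepted within 30 days."          # stand-in for RAG retrieval
    reply = f"[{intent}] Based on our policy: {context}"  # stand-in for the LLM call

    history.append({"role": "assistant", "content": reply})
    return {"session_id": session_id, "user_id": user_id, "response": reply}
```

Keeping the history in a session store (rather than in the client) lets the backend rebuild the full conversational context on every request.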
parseQuery(): Extracts user intent and key entities (e.g., order ID, product name) from the input text. This API is used by the NLU service to structure unstructured queries for ...
parseQuery(text: string)
| Parameter | Description |
| --- | --- |
| `text` | Raw user query in natural language |
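To make the intent/entity extraction concrete, here is a toy rule-based `parseQuery()`. The patterns and intent labels are illustrative assumptions; a production NLU service would use a trained model rather than regular expressions.

```python
# Toy parseQuery: rule-based intent and entity extraction. Intent names and
# patterns are hypothetical examples, not a real NLU model.
import re

def parse_query(text: str) -> dict:
    """Turn a raw natural-language query into a structured intent + entities."""
    entities = {}
    order = re.search(r"\border\s*#?(\d+)", text, re.IGNORECASE)
    if order:
        entities["order_id"] = order.group(1)

    lowered = text.lower()
    if "refund" in lowered or "return" in lowered:
        intent = "return_request"
    elif order:
        intent = "order_status"
    else:
        intent = "general_inquiry"
    return {"intent": intent, "entities": entities}
```

The structured output lets downstream components branch on intent and pass extracted entities (such as an order ID) directly into retrieval or backend lookups.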