RAG System Design

Explore how to design a scalable Retrieval-Augmented Generation (RAG) system that integrates query embedding, knowledge management, and response generation subsystems. Understand resource estimation for compute, storage, and network needs, and learn detailed subsystem workflows to build reliable, efficient RAG pipelines for real-time user queries.

The previous lesson covered how RAG works and the underlying algorithms. It also established that a production-ready text-to-text LLM can serve as the generation backbone. In this lesson, we shift focus to system design: how the different services fit together to form a scalable, reliable, and efficient RAG pipeline that serves user queries in real time.

This section covers three areas: resource estimation, high-level architecture, and the detailed design of each subsystem. The first step is resource estimation.

Resource estimation

Before designing a system, it is important to understand how much storage, compute, and network capacity it will actually need. The estimates below are based on a system serving 100 million daily active users (DAUs), each submitting 10 queries per day.
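The workload assumption above fixes the total query volume that every later estimate builds on. A minimal Python sketch of that arithmetic follows. The variable names are illustrative, and the average rate assumes queries are spread evenly over 24 hours, which is a simplification; real traffic will have peaks above this average.

```python
# Workload assumptions taken from the text.
DAILY_ACTIVE_USERS = 100_000_000
QUERIES_PER_USER_PER_DAY = 10

# Assumption for illustration: load is uniform across the day.
SECONDS_PER_DAY = 24 * 60 * 60

queries_per_day = DAILY_ACTIVE_USERS * QUERIES_PER_USER_PER_DAY
avg_qps = queries_per_day / SECONDS_PER_DAY

print(f"Queries per day: {queries_per_day:,}")
print(f"Average queries per second: {avg_qps:,.0f}")
```

Under these assumptions, the system handles 1 billion queries per day, or roughly 11,574 queries per second on average.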

Storage estimation

Storage in a RAG system falls into four categories:

  • Model weights: A 3-billion-parameter LLM stored in FP16 precision occupies approximately 6 GB. This is a fixed, one-time cost.

  • Retrieval corpus: The document store that the retrieval system searches over is estimated at 10 TB. This grows only when new documents are added.

  • User profile data: At roughly 10 KB per user profile, 100 million users require 1 TB of storage.

  • Interaction data: This is the dominant and growing cost. Each query produces around 100 KB of data (query history, retrieved passages, generated response, logs, and metadata). With 10 queries per user per day:

    • Daily interaction data = 100 M users × 10 queries × 100 KB = 100 TB/day

    • Monthly interaction data = 100 TB/day × 30 days = 3 PB/month ...
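The storage figures listed above can be reproduced with a short calculation. The sketch below is only an illustration: the function name `estimate_storage` and its parameter names are hypothetical, and it assumes decimal units (1 KB = 1,000 bytes) and 2 bytes per FP16 parameter, which matches the round numbers used in this lesson.

```python
# Decimal storage units (assumption: 1 KB = 1,000 bytes, not 1,024).
KB = 10**3
GB = 10**9
TB = 10**12
PB = 10**15


def estimate_storage(
    users=100_000_000,             # daily active users
    queries_per_user_per_day=10,
    model_params=3_000_000_000,    # 3-billion-parameter LLM
    bytes_per_param=2,             # FP16 uses 2 bytes per parameter
    corpus_bytes=10 * TB,          # retrieval corpus estimate from the text
    profile_bytes=10 * KB,         # per-user profile size
    interaction_bytes=100 * KB,    # data produced per query
    days_per_month=30,
):
    """Return storage estimates in bytes for each category."""
    daily = users * queries_per_user_per_day * interaction_bytes
    return {
        "model_weights": model_params * bytes_per_param,
        "retrieval_corpus": corpus_bytes,
        "user_profiles": users * profile_bytes,
        "interaction_daily": daily,
        "interaction_monthly": daily * days_per_month,
    }


est = estimate_storage()
print(f"Model weights:       {est['model_weights'] / GB:.0f} GB")
print(f"Retrieval corpus:    {est['retrieval_corpus'] / TB:.0f} TB")
print(f"User profiles:       {est['user_profiles'] / TB:.0f} TB")
print(f"Interaction (daily): {est['interaction_daily'] / TB:.0f} TB/day")
print(f"Interaction (month): {est['interaction_monthly'] / PB:.0f} PB/month")
```

Keeping the inputs as parameters makes it easy to see how sensitive the total is to each one. For example, halving the per-query footprint to 50 KB halves the daily interaction volume, while the model weights and corpus stay fixed.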