RAG System Design

Explore how to design a scalable Retrieval-Augmented Generation (RAG) system that integrates query embedding, knowledge management, and response generation subsystems. Understand resource estimation for compute, storage, and network needs, and learn detailed subsystem workflows to build reliable, efficient RAG pipelines for real-time user queries.

We'll cover the following...

Resource estimation
High-level design
- Achieving functional requirements
Detailed System Design
- 1. Query understanding and embedding system
- 2. Knowledge management system
- 3. Response generation and quality control system
Putting it all together
Achieving nonfunctional requirements
Conclusion

The previous lesson covered how RAG works and the underlying algorithms. It also established that a production-ready text-to-text LLM can serve as the generation backbone. This lesson shifts focus to system design: how the different services fit together to form a scalable, reliable, and efficient RAG pipeline that serves real-time user queries.

This section covers three areas: resource estimation, high-level architecture, and the detailed design of each subsystem. The first step is resource estimation.

Resource estimation

Before designing a system, it is important to understand how much storage, compute, and network capacity it will actually need. The estimates below are based on a system serving 100 million daily active users (DAUs), each submitting 10 queries per day.

Storage estimation

Storage in a RAG system falls into four categories:

- Model weights: A 3-billion-parameter LLM stored in FP16 precision (2 bytes per parameter) occupies approximately 6 GB. This is a fixed, one-time cost.
- Retrieval corpus: The document store that the retrieval system searches over is estimated at 10 TB. This grows only when new documents are added.
- User profile data: At roughly 10 KB per user profile, 100 million users require 1 TB of storage.
- Interaction data: This is the dominant and growing cost. Each query produces around 100 KB of data (query history, retrieved passages, generated response, logs, and metadata). With 10 queries per user per day:
  - Daily interaction data = $100\ \text{M users} \times 10\ \text{queries} \times 100\ \text{KB} = 100\ \text{TB/day}$
  - Monthly interaction data = $100\ \text{TB} \times 30 = 3\ \text{PB/month}$ ...
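
The storage figures above can be reproduced with a short back-of-the-envelope script. This is a minimal sketch, assuming decimal units (1 KB = 1,000 bytes) and a 30-day month; the function and constant names are illustrative and do not come from the lesson.

```python
# Back-of-the-envelope storage estimate for the RAG system described above.
# Assumption: decimal units (1 KB = 1,000 bytes) and a 30-day month.
KB = 10**3
GB = 10**9
TB = 10**12
PB = 10**15

DAILY_ACTIVE_USERS = 100_000_000   # 100 million DAUs
QUERIES_PER_USER_PER_DAY = 10


def estimate_storage_bytes() -> dict[str, int]:
    """Return each storage category from the lesson, in bytes."""
    # 3 billion parameters at 2 bytes each (FP16).
    model_weights = 3_000_000_000 * 2
    # Retrieval corpus size is taken directly from the lesson.
    retrieval_corpus = 10 * TB
    # Roughly 10 KB per user profile.
    user_profiles = DAILY_ACTIVE_USERS * 10 * KB
    # Roughly 100 KB of interaction data per query.
    daily_interactions = DAILY_ACTIVE_USERS * QUERIES_PER_USER_PER_DAY * 100 * KB
    monthly_interactions = daily_interactions * 30
    return {
        "model_weights": model_weights,
        "retrieval_corpus": retrieval_corpus,
        "user_profiles": user_profiles,
        "daily_interactions": daily_interactions,
        "monthly_interactions": monthly_interactions,
    }


if __name__ == "__main__":
    for name, size in estimate_storage_bytes().items():
        print(f"{name}: {size / TB:,.3f} TB")
```

Keeping the inputs as named constants makes it easy to rerun the estimate when the user count or per-query payload changes.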
