LLM Inference Infrastructure

Explore the unique challenges of serving large language models and learn how innovations like KV caching, PagedAttention memory management, speculative decoding, and continuous batching optimize memory and latency. This lesson equips you to design scalable, cost-effective LLM inference systems for production scenarios.

We'll cover the following...

Why LLM serving is a different problem
KV caching and its role in latency
- What the KV cache stores
- Prefill and decode phases
PagedAttention for memory management
- The fragmentation problem
- How PagedAttention works
Speculative decoding
- The draft-then-verify mechanism
- Why verification is nearly free
Static vs. continuous batching
Conclusion

In the previous lesson, you learned how quantization, pruning, and compilation shrink models and speed up individual forward passes. Those techniques apply broadly across ML workloads. But when you move to serving a large language model in production, a fundamentally different bottleneck emerges, one that no amount of weight pruning alone can solve.

Why LLM serving is a different problem

A ResNet classifier takes a fixed-size image, runs a single forward pass, and returns a prediction. The compute cost is deterministic and bounded. A GPT-style language model, by contrast, generates output through autoregressive decoding, a sequential process where each new token is produced one at a time, conditioned on every token that came before it. A chatbot reply might require 10 forward passes or 500, and you cannot know in advance.
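The loop below is a minimal sketch of autoregressive decoding, included only to make the "unknown number of forward passes" point concrete. Everything in it is an illustrative assumption: `toy_next_token` is a hypothetical stand-in for a real model's forward pass, and `EOS` is an arbitrary end-of-sequence token id.

```python
from typing import Callable, List, Tuple

EOS = 0  # hypothetical end-of-sequence token id (assumption for this sketch)


def toy_next_token(tokens: List[int]) -> int:
    """Hypothetical stand-in for one forward pass.

    The result depends on every token seen so far, which is the property
    that forces decoding to proceed one step at a time.
    """
    return (sum(tokens) * 31 + len(tokens)) % 7


def generate(
    prompt: List[int],
    next_token: Callable[[List[int]], int],
    max_new_tokens: int = 50,
) -> Tuple[List[int], int]:
    """Generate tokens one at a time; return (new_tokens, forward_passes)."""
    tokens = list(prompt)
    passes = 0
    for _ in range(max_new_tokens):
        tok = next_token(tokens)  # one full forward pass per new token
        passes += 1
        if tok == EOS:            # the model decides when to stop
            break
        tokens.append(tok)        # the new token becomes part of the context
    return tokens[len(prompt):], passes


if __name__ == "__main__":
    for prompt in ([3, 5], [1, 2, 3], [4]):
        new_tokens, passes = generate(prompt, toy_next_token)
        print(prompt, "->", new_tokens, f"({passes} forward passes)")
```

The number of passes depends on when the stop token appears, so the serving system cannot size the work for a request before it starts running it.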

This sequential dependency changes the hardware bottleneck entirely. During each decoding step, the model must read its full set of weights from GPU memory, yet it performs only a single token’s worth of new computation. The ratio of arithmetic operations to bytes moved through memory drops dramatically, so the GPU’s floating-point units sit idle while the memory bus struggles to feed them data. Token-by-token LLM decoding is therefore memory-bandwidth-bound, not compute-bound.
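A back-of-the-envelope estimate can make the ratio of arithmetic to memory traffic concrete. The sketch below assumes roughly 2 FLOPs per parameter per generated token, 2 bytes per parameter (FP16), and that the weights are read once per decoding step; it ignores KV cache and activation traffic. The model size (7 billion parameters) and hardware figures (300 TFLOP/s peak, 2 TB/s bandwidth) are hypothetical round numbers, not measurements of any specific GPU.

```python
def decode_arithmetic_intensity(
    n_params: float, batch_size: int, bytes_per_param: float = 2.0
) -> float:
    """Approximate FLOPs per byte moved for one decoding step.

    Assumes ~2 FLOPs per parameter per token and that the weights are
    read from memory once per step, shared across the whole batch.
    """
    flops = 2.0 * n_params * batch_size
    bytes_moved = n_params * bytes_per_param
    return flops / bytes_moved


def ridge_point(peak_flops: float, mem_bandwidth: float) -> float:
    """FLOPs per byte needed to keep the compute units busy."""
    return peak_flops / mem_bandwidth


def is_memory_bound(intensity: float, ridge: float) -> bool:
    """Below the ridge point, memory bandwidth is the limiting resource."""
    return intensity < ridge


if __name__ == "__main__":
    N_PARAMS = 7e9        # hypothetical 7B-parameter model
    PEAK_FLOPS = 300e12   # hypothetical accelerator: 300 TFLOP/s
    BANDWIDTH = 2e12      # hypothetical accelerator: 2 TB/s

    ridge = ridge_point(PEAK_FLOPS, BANDWIDTH)
    for batch in (1, 8, 64, 256):
        ai = decode_arithmetic_intensity(N_PARAMS, batch)
        label = "memory-bound" if is_memory_bound(ai, ridge) else "compute-bound"
        print(f"batch={batch:>3}  intensity={ai:6.1f} FLOPs/byte  "
              f"ridge={ridge:.0f}  -> {label}")
```

Under these assumptions a single request performs about 1 FLOP per byte of weights read, far below the hardware's ridge point, which is why sharing each weight read across many requests matters so much for utilization.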

Now imagine designing the serving backend for a conversational AI product where thousands of users send variable-length messages simultaneously. Each concurrent request carries its own growing state, and GPU-hours dominate the operating budget. The infrastructure innovations covered in this lesson exist precisely to make that scenario economically viable.

The following diagram contrasts these two serving paradigms visually.
