LLM Inference Infrastructure

Explore the unique challenges of serving large language models and learn how innovations like KV caching, PagedAttention memory management, speculative decoding, and continuous batching optimize memory and latency. This lesson equips you to design scalable, cost-effective LLM inference systems for production scenarios.

In the previous lesson, you learned how quantization, pruning, and compilation shrink models and speed up individual forward passes. Those techniques apply broadly across ML workloads. But when you move to serving a large language model in production, a fundamentally different bottleneck emerges, one that weight pruning alone cannot solve.

Why LLM serving is a different problem

A ResNet classifier takes a fixed-size image, runs a single forward pass, and returns a prediction. The compute cost is deterministic and bounded. A GPT-style language model, by contrast, generates output through autoregressive decoding, a sequential process where each new token is produced one at a time, conditioned on every token that came before it. A chatbot reply might require 10 forward passes or 500, and you cannot know in advance which.
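The loop below is a minimal sketch of autoregressive decoding, not a real model. `toy_next_token`, the `EOS` token id, and the 10% stop probability are hypothetical stand-ins chosen only to show that each new token costs one forward pass and that the number of passes is unknown until the stop token appears.

```python
import random

EOS = 0  # hypothetical end-of-sequence token id


def toy_next_token(tokens, rng):
    """Stand-in for one full forward pass over the sequence so far.

    A real model would run every transformer layer here; this toy
    version just stops with 10% probability to mimic variable length.
    """
    return EOS if rng.random() < 0.1 else rng.randint(1, 99)


def generate(prompt, max_new_tokens=500, seed=0):
    """Generate tokens one at a time until EOS or the length cap."""
    rng = random.Random(seed)
    tokens = list(prompt)
    forward_passes = 0
    for _ in range(max_new_tokens):
        next_token = toy_next_token(tokens, rng)  # one pass per token
        forward_passes += 1
        if next_token == EOS:
            break
        tokens.append(next_token)  # the next step conditions on this token
    return tokens, forward_passes
```

Calling `generate([5, 6, 7])` with different seeds returns sequences of different lengths, which is the property that makes per-request cost hard to predict.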

This sequential dependency changes the hardware bottleneck entirely. During each decoding step, the model must reload its full set of weights from GPU memory, yet it performs only a single token's worth of computation with them. The ratio of arithmetic operations to bytes moved through memory drops dramatically, so the GPU's floating-point units sit idle while the memory bus struggles to feed them data. The decoding phase of LLM inference is therefore memory-bandwidth-bound, not compute-bound.
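A rough back-of-the-envelope calculation makes the bottleneck concrete. The sketch below assumes the common approximation of about 2 floating-point operations per parameter per generated token, counts only weight traffic (it ignores KV cache and activations), and uses hypothetical hardware figures that do not describe any specific GPU.

```python
def decode_step_intensity(n_params, bytes_per_param, batch_size=1):
    """Approximate FLOPs per byte of weight traffic for one decoding step.

    Assumes ~2 FLOPs per parameter per token and that the weights are
    read from memory once per step, shared across the batch.
    """
    flops = 2 * n_params * batch_size
    bytes_moved = n_params * bytes_per_param
    return flops / bytes_moved


def ridge_point(peak_flops_per_s, mem_bandwidth_bytes_per_s):
    """FLOPs per byte above which a workload becomes compute-bound."""
    return peak_flops_per_s / mem_bandwidth_bytes_per_s


# Hypothetical numbers: a 7B-parameter model stored in 16-bit weights,
# on an accelerator with 300 TFLOP/s peak compute and 2 TB/s bandwidth.
intensity = decode_step_intensity(n_params=7e9, bytes_per_param=2)
ridge = ridge_point(peak_flops_per_s=300e12, mem_bandwidth_bytes_per_s=2e12)
is_memory_bound = intensity < ridge

# Lower bound on step latency from weight reads alone, in seconds.
min_step_seconds = (7e9 * 2) / 2e12
```

Under these assumptions a single-request decoding step performs about 1 FLOP per byte moved, far below the roughly 150 FLOPs per byte the hypothetical hardware needs to stay busy, and reading the weights alone takes about 7 ms per token.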

Now imagine designing the serving backend for a conversational AI product where thousands of users send variable-length messages simultaneously. Each concurrent request carries its own growing state, and GPU-hours dominate the operating budget. The infrastructure innovations covered in this lesson exist precisely to make that scenario economically viable.

The following diagram contrasts these two serving paradigms visually.

CNN inference completes in one compute-bound pass while LLM autoregressive decoding loops N times with growing KV cache shifting bottleneck to memory bandwidth

With this fundamental distinction established, the next question becomes how to avoid redundant work across those hundreds of sequential forward passes.

KV caching and its role in latency

What the KV cache stores

In every layer, the attention computation projects each token's hidden state into key and value tensors. Without caching, generating token number 200 would require recomputing the key and value projections for all 199 preceding tokens, turning what should be O(n) ...