Scaling Trade-Offs: When to Cache, Queue, Replicate, or Shard
Learn how to identify bottlenecks and choose the right scaling technique (caching, queuing, replication, or sharding) to build scalable and resilient systems.
A system that works well for 1,000 users can struggle when the number grows to 1 million.
Meeting this challenge takes more than adding servers; it requires sound architectural planning. Building robust distributed systems depends on making deliberate design trade-offs, and doing well in a System Design interview often comes down to explaining those choices clearly.
Knowing when and why to use techniques like caching, queuing, replication, and sharding is what sets an experienced engineer apart from a beginner. This lesson provides a framework for understanding these four basic scalability patterns.
We will examine the problem each technique solves, the trade-offs it introduces, and the signals that indicate when to choose one over another.
Caching to accelerate read-heavy workloads
Caching is the practice of storing frequently accessed data in a temporary, high-speed storage layer, allowing future requests to be served more quickly.
Rather than retrieving information from a slower primary data source—such as a disk-based database—every time it’s needed, an application first checks the cache. If the data is found (a cache hit), it can be returned immediately, resulting in significantly lower latency and improved overall performance.
The primary goals of caching are to reduce latency for end-users and decrease the load on back-end systems.
Consider how a service like YouTube delivers video thumbnails. These images are requested millions of times but rarely change. By storing them in a content delivery network (CDN)—a geographically distributed caching layer—YouTube can serve the images from servers physically closer to users.
This drastically improves page load times and prevents origin servers from being overwhelmed by repetitive requests.
Let’s walk through how this CDN caching mechanism works in practice.
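The following is a minimal sketch of the edge-cache lookup, assuming an in-memory dictionary as the cache and a hypothetical fetch_from_origin() helper; a production CDN node is far more sophisticated, but the hit/miss flow is the same:

import time

EDGE_CACHE = {}          # thumbnail URL -> (image bytes, expiry timestamp)
TTL_SECONDS = 24 * 3600  # thumbnails rarely change, so a long TTL is reasonable

def fetch_from_origin(url):
    """Stand-in for the slow round trip back to the origin servers."""
    return b"<thumbnail bytes>"

def serve_thumbnail(url):
    entry = EDGE_CACHE.get(url)
    if entry is not None and time.time() < entry[1]:
        return entry[0]                  # cache hit: served from a nearby edge server
    image = fetch_from_origin(url)       # cache miss: fall back to the origin
    EDGE_CACHE[url] = (image, time.time() + TTL_SECONDS)
    return image

On the first request, the cache is empty, so the image comes from the origin and is stored at the edge; every subsequent request within the TTL is served locally, which is what keeps latency low and origin load flat.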
A key principle for deciding when to cache is the read-to-write ratio.
Caching is most effective for data that is read frequently but updated infrequently. Another factor is data volatility: how often the data changes. Highly volatile data that changes every few seconds is a poor candidate for caching because the cache would be invalidated constantly.
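To make the ratio concrete, here is an illustrative heuristic; the should_cache name and the 100:1 threshold are assumptions for this sketch, not universal rules:

def should_cache(reads_per_day, writes_per_day, min_ratio=100):
    """Cache only when reads dominate writes by a wide margin."""
    if writes_per_day == 0:
        return True  # effectively immutable data is an ideal candidate
    return reads_per_day / writes_per_day >= min_ratio

# A video thumbnail: millions of reads, almost never rewritten.
print(should_cache(5_000_000, 1))       # True
# A value updated every second (86,400 writes/day): reads no longer dominate.
print(should_cache(5_000_000, 86_400))  # False (ratio is roughly 58)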
However, caching introduces complexity.
We must decide on an invalidation strategy to handle stale data. For example, what happens when a user updates their profile picture? The old image must be removed or replaced in the cache to avoid showing outdated content.
Educative byte: A common caching strategy is the Cache-Aside pattern. Here, the application code is responsible for checking the cache first. On a cache miss, the application fetches the data from the database, loads it into the cache, and then returns it to the user.
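Here is a minimal Python sketch of Cache-Aside, including the invalidation-on-write step from the profile-picture example above. The dictionary stands in for a real cache such as Redis, and the db_* helpers are hypothetical placeholders for a data-access layer:

cache = {}  # user_id -> avatar URL

def db_fetch_avatar(user_id):
    """Hypothetical slow read from the primary database."""
    return "https://example.com/avatars/" + user_id + ".png"

def db_update_avatar(user_id, new_url):
    """Hypothetical write to the primary database."""
    pass

def get_avatar(user_id):
    url = cache.get(user_id)            # 1. check the cache first
    if url is None:                     # 2. cache miss
        url = db_fetch_avatar(user_id)  # 3. fetch from the database
        cache[user_id] = url            # 4. load it into the cache
    return url                          # 5. return it to the user

def update_avatar(user_id, new_url):
    db_update_avatar(user_id, new_url)  # write to the source of truth first
    cache.pop(user_id, None)            # then evict the stale entry

Evicting on write, rather than updating the cache in place, keeps the write path simple: the next read misses and repopulates the cache with fresh data from the database.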
To use caching effectively, it helps to understand the most common types of caches and their design patterns.
Types of caches and common patterns
Caching systems vary in how they store and share data across servers. Some are designed for speed and simplicity, while ...