What's the one thing standing between your system and 10x traffic?
Scalable System Design.
Picture this: Your e-commerce site hums along like a well-oiled machine—until a traffic surge hits. Suddenly, your system is overwhelmed, leading to outages, frustrated customers, and lost revenue.
It’s a harsh reality: systems that aren’t built to scale simply can’t handle the pressure, leading to real-world consequences.
Systems that can’t scale don’t just fail—they can hold a business back. But here’s the good news: with 8 proven techniques, you can build systems designed to flex seamlessly with growth.
In this newsletter, you’ll learn:
The core objectives of scalability: load and performance.
When to use vertical, horizontal, or diagonal scaling.
8 advanced techniques like caching, chaos engineering, and predictive scaling.
Lessons from Amazon and Google on mastering scalability.
Onward!
In System Design, scalability is the ability of a system to handle increasing demand—whether it’s more users, higher data volume, or spikes in processing needs—by efficiently adding resources.
Imagine your platform growing from 1 million to 10 million concurrent users. Without scalability, response times lag, servers crash, and users leave. But with scalable design, you can keep response times fast even as demand multiplies.
That’s why scalability is the backbone of modern application success.
Let’s get started by going over the two main objectives of scalability.
Scalability objectives are the goals a system should meet to handle many users, data, or traffic. They ensure the system’s performance while distributing load across the servers without overburdening them. These objectives can be met by making the system scalable in the following ways:
Load scalability: This involves increasing a system’s capacity to handle many simultaneous operations effectively. It also includes adding resources or optimizing the system’s infrastructure to avoid overburdening the servers or other resources. For example, processing 1 million database queries per second across resources while ensuring CPU utilization remains under 90% or 80%.
Performance scalability: This focuses on enhancing the system to maintain or improve its performance as the load grows. Performance scalability ensures that the response times and processing efficiency remain consistent despite higher demands. For example, handling half a million requests per second without latency exceeding 100 milliseconds even if the demand increases.
While scalability objectives focus on how to handle increasing load, another concept is the types of scalability, which refer to how the system is scaled to meet those objectives.
Now, let’s expand on the types of scalability.
There are two primary ways to improve a system’s scalability to meet these objectives: vertical and horizontal scaling. Let’s define both terms:
Vertical scaling (scaling up): Vertical scaling is increasing the capacity of each machine, such as adding more RAM, upscaling the processor, and adding more hard drive storage. However, vertical scaling becomes restricted at some point due to a machine’s hardware support limitations; for example, the RAM can’t be increased further. Therefore, the concept of horizontal scaling comes into play.
Horizontal scaling (scaling out): Horizontal scaling involves increasing the number of machines instead of their computational or storage resources. For example, if there were 10 servers for 2000 users, and now the number of users has increased to 4000, you could increase the number of servers to 20, possibly to ensure the performance and load of a single machine remain consistent.
Diagonal scalability is another type of scalability in which vertical and horizontal scaling are combined.
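The horizontal-scaling arithmetic above (10 servers for 2,000 users, 20 for 4,000) can be sketched as a quick capacity calculation. This is a minimal back-of-the-envelope helper, assuming a fixed per-server capacity of 200 users:

```python
def servers_required(users, users_per_server=200):
    """Horizontal scaling back-of-the-envelope: keep per-server load
    constant by growing the server count with the user count."""
    # Ceiling division so capacity always meets or exceeds demand.
    return -(-users // users_per_server)

assert servers_required(2000) == 10   # the example from the text
assert servers_required(4000) == 20   # double the users, double the servers
```

The same calculation underlies real autoscaling policies, where `users_per_server` comes from load testing rather than being assumed.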
The following are some of the common differences between horizontal and vertical scaling:
| | Horizontal Scaling | Vertical Scaling |
|---|---|---|
| Maintenance | Maintenance is complex because many machines must be managed. | Maintenance is cheaper and less complex because fewer nodes need to be managed. |
| Costs | Initial costs are high, but commodity machines with lower processing power are affordable to add. | Beefing up an existing machine’s resources increases costs exponentially. |
| Fault tolerance | If one machine fails, the others can still provide the service. | A failure leads to loss of service. |
| Communication complexity | Multiple machines require complex protocols for exchanging data. | Data exchange is relatively straightforward on a single machine. |
| Load balancing | The load can be distributed across different machines. | With a single machine, task distribution is a challenge. Multi-threading allows some parallel processing, but it’s constrained by the machine’s capacity. |
Can a system be both horizontally and vertically scalable?
For the rest of this article, we’ll consider horizontal scalability as the most efficient way to enhance the scalability of a system. This is because horizontal scaling avoids problems like a single point of failure (SPOF) and is a favorable option for dynamic scaling to meet nonfunctional requirements like scalability, availability, and reliability.
Let’s discuss this further in the following section.
Scaling individual components can improve a system’s scalability. For example, scaling databases, load balancers, and caches positively impacts the system’s scalability. Devising optimal strategies for handling failures and increasing fault tolerance can also improve it.
Below, we'll cover the 8 proven strategies that are essential to achieving scalability in a system:
Scale your databases
Distribute traffic with load balancers
Speed up access with caching
Optimize content delivery with content delivery networks (CDNs)
Plan for demand with predictive scaling
Simplify growth with a microservices architecture
Fortify systems with effective failure management
Stress-test with chaos engineering
Let's explore each strategy.
Scaling a database is one of the proven strategies that increase the availability and scalability of a system. Database scaling includes replicating instances across multiple servers and data centers, dividing data into shards, and storing them on different servers and data centers. Let’s discuss each of these strategies in more detail:
Database replication is the process of creating and maintaining multiple copies of the same data on multiple servers. Each copy of the data is called a replica, which increases the data’s durability and helps increase the system’s performance and fault tolerance. There are two types of database replication:
Primary-secondary replication: In this replication model, one database server, known as primary, handles write operations such as inserts, updates, and deletes, while one or more secondary servers handle read-only operations. The primary node continuously propagates changes to the secondary nodes to keep them updated and synchronized with the primary node. The primary-secondary replication model enhances the system’s read scalability.
Multi-leader replication: In this replication model, some of the database servers act as the leader (primary) nodes, handling both read and write operations, while the follower nodes serve only the read operations. This setup is designed to handle high write loads. In multi-leader replication, conflicts can arise because multiple nodes can make conflicting updates to the data. Therefore, various strategies are used to resolve conflicts, such as last-write-wins.
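To make the primary-secondary model concrete, here is a minimal sketch in Python. The class names and synchronous propagation are assumptions for illustration; real databases replicate asynchronously and must handle replication lag:

```python
import itertools

class ReplicatedStore:
    """Toy primary-secondary replication: writes go to the primary,
    which propagates them to replicas; reads are spread across replicas."""

    def __init__(self, num_replicas=2):
        self.primary = {}
        self.replicas = [{} for _ in range(num_replicas)]
        self._read_cycle = itertools.cycle(range(num_replicas))

    def write(self, key, value):
        self.primary[key] = value
        # Synchronous propagation keeps replicas consistent in this sketch;
        # real systems often replicate asynchronously and tolerate lag.
        for replica in self.replicas:
            replica[key] = value

    def read(self, key):
        # Round-robin across replicas to spread the read load.
        replica = self.replicas[next(self._read_cycle)]
        return replica.get(key)

store = ReplicatedStore()
store.write("user:1", "Alice")
print(store.read("user:1"))  # Alice
```

Because every read goes to a replica, adding replicas directly increases read throughput, which is exactly the read scalability the primary-secondary model provides.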
In database sharding, data is divided into smaller, more manageable pieces called shards, which are stored on different database servers. Each shard is a subset of the data and operates as an independent database instance, handling a portion of the overall workload. Sharding helps reduce performance bottlenecks and manage large volumes of data. By distributing the load across multiple servers, sharding reduces query handling time and improves the system’s reliability.
Primarily, there are two types of database sharding:
Horizontal sharding: When a table becomes very large, it starts affecting the read and write performance of queries. In such a case, horizontal sharding divides the table row-wise into smaller parts, called partitions, which are stored on different servers. There are usually two strategies for horizontal sharding:
Key-range based sharding
Hash-based sharding
Vertical sharding: In vertical sharding, large tables are split into smaller ones, separating columns with wide text or binary data. However, we should be cautious about joins between tables, as they may require data to be retrieved from multiple shards. By separating columns into a separate table, vertical sharding is an effective way to speed up data retrieval from tables with large text or binary data fields.
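The hash-based strategy mentioned above can be sketched in a few lines. This is a simplified illustration, assuming four shards held as in-memory dictionaries; production systems typically use consistent hashing so that adding a shard doesn’t remap most keys:

```python
import hashlib

NUM_SHARDS = 4
shards = [dict() for _ in range(NUM_SHARDS)]

def shard_for(key: str) -> int:
    """Hash-based sharding: a stable hash maps each key to one shard."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def put(key, value):
    # Route the write to the shard that owns this key.
    shards[shard_for(key)][key] = value

def get(key):
    # The same hash routes the read to the same shard.
    return shards[shard_for(key)].get(key)

put("user:42", {"name": "Alice"})
assert get("user:42") == {"name": "Alice"}
```

Key-range sharding would replace `shard_for` with a lookup against range boundaries (e.g., keys A–M on shard 0, N–Z on shard 1), which preserves range queries at the cost of potential hotspots.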
By employing database replication and sharding strategies to database servers in a system, query execution and data retrieval time can be decreased, which, in turn, increases the system’s performance and reliability.
Another crucial component in System Design is the load balancer, which enhances a system’s scalability. A load balancer distributes incoming traffic across multiple servers to avoid overburdening any single server. This helps the system handle many requests, improves performance, and makes it easy to add resources as demand increases. Beyond scalability, a load balancer also improves availability by redirecting traffic away from underperforming or failed servers.
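A minimal round-robin balancer with health awareness might look like the following sketch. The server names are placeholders, and real load balancers (NGINX, HAProxy, cloud load balancers) add health probes, connection draining, and weighting:

```python
import itertools

class RoundRobinBalancer:
    """Round-robin load balancing that skips servers marked unhealthy."""

    def __init__(self, servers):
        self.servers = servers
        self.healthy = set(servers)
        self._cycle = itertools.cycle(servers)

    def mark_down(self, server):
        # A failed health check would trigger this in a real system.
        self.healthy.discard(server)

    def next_server(self):
        # Skip unhealthy servers so traffic is rerouted on failure.
        for _ in range(len(self.servers)):
            s = next(self._cycle)
            if s in self.healthy:
                return s
        raise RuntimeError("no healthy servers")

lb = RoundRobinBalancer(["app-1", "app-2", "app-3"])
lb.mark_down("app-2")
print([lb.next_server() for _ in range(4)])  # ['app-1', 'app-3', 'app-1', 'app-3']
```

Note how the failed server is transparently removed from rotation, which is the availability benefit described above.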
Caching is a key technique for enhancing system scalability, particularly performance scalability. In-memory caching stores frequently accessed data in fast, volatile memory (RAM), allowing the system to retrieve data quickly without needing to query the database repeatedly. This reduces the load on back-end systems, speeds up response times, and helps the system handle higher traffic efficiently.
Note: You might want to explore different caching patterns.
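One common pattern is cache-aside: check the cache first, and on a miss, load from the database and populate the cache. Here is a minimal sketch, where `query_database` is a stand-in for a real database call and the 60-second TTL is an assumed policy:

```python
import time

cache = {}
TTL_SECONDS = 60

def query_database(key):
    # Stand-in for a real (slow) database query, assumed for this sketch.
    return f"value-for-{key}"

def get_with_cache(key):
    """Cache-aside: serve from memory when possible; on a miss, load
    from the database and cache the result with a TTL."""
    entry = cache.get(key)
    if entry and entry[1] > time.time():
        return entry[0]            # cache hit: no database round trip
    value = query_database(key)    # cache miss: fall through to the database
    cache[key] = (value, time.time() + TTL_SECONDS)
    return value

print(get_with_cache("user:1"))  # value-for-user:1
```

Every hit avoids a database round trip, which is how caching offloads the back end and improves performance scalability.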
Content delivery networks (CDNs) improve scalability by caching content, such as images or videos, on geographically distributed servers. This ensures that users can access (static) content from a nearby server, reducing latency and offloading traffic from the origin servers.
Note: You might be interested in the design of a content delivery network (CDN).
Both in-memory caching and CDNs play critical roles in improving system performance and scalability by reducing the burden on primary data sources and servers.
Predictive scaling is another strategy that can positively impact a system’s scalability. It uses machine learning and AI algorithms to predict future resource demands based on historical usage patterns, trends, and external factors (seasonal trends, marketing campaigns, economic factors, regulatory changes, etc.). By predicting when additional resources will be needed, the system can automatically scale up or down in advance to meet the estimated demand.
For example, an e-commerce platform can use predictive scaling to handle Black Friday traffic surges: by analyzing previous years’ traffic patterns, it can provision additional resources before the surge happens. This ensures the system handles the spike smoothly and maintains the user experience by preventing outages.
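At its simplest, the prediction can be a moving average over recent traffic. This sketch assumes hypothetical hourly request rates, a 20% safety margin, and a per-server capacity of 1,000 requests/sec; real predictive scalers (such as AWS predictive scaling) use far richer models:

```python
def forecast_demand(history, window=3):
    """Naive predictive scaling: forecast next-period requests/sec as the
    average of the last `window` observations, plus a safety margin."""
    recent = history[-window:]
    return sum(recent) / len(recent) * 1.2  # 20% headroom (assumed policy)

def servers_needed(predicted_rps, rps_per_server=1000):
    # Ceiling division: round up so capacity meets predicted demand.
    return -(-int(predicted_rps) // rps_per_server)

history = [2400, 3100, 3900]  # hypothetical requests/sec from past hours
predicted = forecast_demand(history)
print(servers_needed(predicted))  # 4
```

The key idea is that capacity is added *before* the forecasted demand arrives, rather than reactively after latency has already degraded.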
In what scenarios might predictive scaling fail to accurately estimate demand?
In a microservices architecture, an application is built from smaller, independent services that can be developed, deployed, and scaled separately. Each service is responsible for a specific function and communicates with other services via APIs. Individual services can be scaled based on demand without affecting the rest of the system. Separating services simplifies development, maintenance, and deployment, making it easier to handle the complexity of large-scale systems while improving scalability, availability, and fault tolerance.
The following illustration shows a microservices architecture for a payment system, highlighting some of its services.
What are some common metrics used to measure a system’s scalability?
All these techniques are pivotal to taking your system’s scalability to the next level. However, in a complex system, scalability alone is not enough. Effective failure management not only improves the system’s availability; it can also indirectly enable scalability. Let’s explore failure management in the following section:
Failures are inevitable in large-scale distributed systems, making it essential to have failover strategies in place to manage them effectively. The failover mechanism automatically switches traffic to backup systems when a failure is encountered in primary servers. These mechanisms ensure minimum downtime and disruption and help maintain the system’s availability and scalability. The existence of automated processes is crucial to diverting traffic. Such processes include redundancy, backup, monitoring, and auto-recovery mechanisms. Let’s explain each of these mechanisms in detail:
In System Design, redundancy plays a crucial role in achieving scalability and availability. Redundancy is ensuring that multiple secondary servers are available to handle requests in case of failure of primary servers. Mainly, there are two types of setups to achieve redundancy in System Design:
Active-active setup: In this setup, multiple servers are available and share the load of handling requests simultaneously. The failure of one server doesn’t affect the scalability and availability of the system.
Active-passive setup: In this setup, some servers actively handle the requests, and some are backup servers. When active servers fail, backup servers are triggered to maintain the system’s scalability and availability.
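The active-passive setup can be sketched as a simple failover pair. The server names are placeholders, and the health check is triggered manually here; in practice, automated monitoring (missed heartbeats, failed probes) drives the switch:

```python
class FailoverPair:
    """Active-passive failover sketch: requests go to the active server;
    if a health check fails, traffic switches to the standby."""

    def __init__(self, active, standby):
        self.active = active
        self.standby = standby
        self.active_healthy = True

    def handle(self, request):
        target = self.active if self.active_healthy else self.standby
        return f"{target} handled {request}"

    def health_check_failed(self):
        # Automated monitoring would call this on missed heartbeats.
        self.active_healthy = False

pair = FailoverPair("primary-db", "replica-db")
print(pair.handle("read"))    # primary-db handled read
pair.health_check_failed()
print(pair.handle("read"))    # replica-db handled read
```

The essential property is that the caller never changes: the failover logic reroutes traffic transparently, which is what keeps downtime minimal.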
On the other hand, backups are especially important during catastrophic events that can affect an entire geographic area. They involve having redundant databases and storage servers in multiple regions. A system must have regional backups to preserve user-generated content, metadata, and other data if the primary region experiences a disaster.
Monitoring and auto-recovery ensure optimal performance in evolving systems. Monitoring tools provide real-time insights, allowing administrators to identify bottlenecks and predict resource needs. Auto-recovery mechanisms respond to issues automatically, minimizing downtime and the need for manual intervention. Together, they create a more efficient, fault-tolerant, and flexible infrastructure capable of handling increased loads and complexities.
Note: You may want to read more about the monitoring system and its design.
One important practice is rigorously testing the system by simulating real-world failure scenarios, such as server crashes or network outages.
This proactive approach is called chaos engineering, in which faults and failures in a system are intentionally introduced to identify weaknesses and improve its reliability. Chaos engineering helps us understand how the system responds to unexpected disruptions and failures.
An example of chaos engineering is Chaos Monkey, a tool developed by Netflix to randomly shut down service instances in their production environment. This practice allows Netflix to ensure that its system can automatically recover without any human intervention. This mechanism tests the system’s self-healing process and verifies that redundant and standby services handle the load (traffic) in the event of failure.
Imagine the distributed system of an online marketplace. You might use chaos engineering to test how the system reroutes traffic to another region when one region’s data center goes down. Or you might want to assess whether, during a database failure, the system falls back on cached data and how it handles new transactions during the outage. These tests increase the system’s ability to scale and recover from failure.
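A tiny fault-injection wrapper illustrates the idea. This is inspired by, not taken from, Chaos Monkey; the function names, failure rate, and fallback value are all assumptions for this sketch:

```python
import random

FALLBACK_PRICE = {"item": "book", "price": 9.49}  # stale cached value (assumed)

def chaotic(failure_rate):
    """Decorator that randomly injects failures into a call, so the
    caller's retry/fallback paths get exercised under test."""
    def wrap(fn):
        def inner(*args, **kwargs):
            if random.random() < failure_rate:
                raise ConnectionError("injected failure")
            return fn(*args, **kwargs)
        return inner
    return wrap

@chaotic(failure_rate=0.3)
def fetch_price(item):
    return {"item": item, "price": 9.99}

def fetch_with_fallback(item):
    # The resilience path the chaos test is meant to exercise.
    try:
        return fetch_price(item)
    except ConnectionError:
        return FALLBACK_PRICE

result = fetch_with_fallback("book")
```

Running this repeatedly verifies that the fallback path actually works under failure, which is the whole point: weaknesses surface in testing rather than during a real outage.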
Let’s take the example of Amazon and Google and see how they implement scalability in different services.
Amazon employs a microservices architecture to enhance scalability across its fleet of services. For instance, Amazon experiences traffic surges exceeding typical levels by over 80% during Prime Day. To handle this, they utilize AWS auto-scaling capabilities, dynamically increasing the number of servers in response to real-time traffic demands. Additionally, Amazon’s use of Amazon DynamoDB for data storage allows them to efficiently partition user data, ensuring that each request is served swiftly, regardless of load. A notable instance is the 2018 Prime Day event, where these strategies enabled Amazon to process over 100 million products sold, demonstrating the power of their scalable infrastructure.
Note: You might be interested in a case study on how Amazon scales for Prime Day.
Google’s approach to scaling is heavily focused on distributed systems. For example, during the 2020 US presidential election, Google’s search system experienced a high spike in traffic, with millions searching for real-time election results. To maintain low latency, Google leveraged Kubernetes to orchestrate its containerized workloads across a global network of data centers, allowing seamless scaling of services like search systems and YouTube.
Furthermore, Google’s Bigtable database, used for storing search indexes, enables them to handle petabytes of data while providing real-time access to users. This distributed architecture ensured the reliability of Google’s services during peak times and allowed for quick adjustments to resource allocation based on user demand.
Let us now discuss some common challenges associated with scaling a system.
While scaling a system is good for preventing bottlenecks and ensuring a seamless user experience, it also introduces several challenges that need to be managed carefully. These challenges can include issues related to resource allocation, where improperly scaled resources may lead to over-provisioning or under-utilization, resulting in increased costs or performance degradation.
Following are some of the common challenges:
Data consistency issues: Maintaining consistent data across multiple nodes can be challenging in a distributed system. A write operation on the primary node may lead to stale or conflicting data on secondary nodes. For example, if two users update their profiles at the same time, one user may see outdated information due to delays in data synchronization. This inconsistency can negatively impact user experience.
Network latency and bandwidth constraints: As a system scales and more components communicate, network latency can become a significant issue. Increased traffic can lead to slower response times and bottlenecks. For instance, if a user requests data from multiple services, any delay in network communication can result in a frustrating user experience.
Complexity in fault tolerance: Detecting and managing failures in a large distributed system is complicated. When a component fails, it is crucial to identify the problem quickly and recover gracefully without affecting the overall system. For example, if a server crashes, the system should reroute requests to healthy servers, but managing this process adds overhead and complexity.
Increased operational overhead: Scaling a system introduces additional operational complexities. As the number of components grows, managing configurations, deployments, monitoring, and troubleshooting becomes increasingly difficult. For instance, coordinating updates across multiple services requires careful planning to avoid downtime.
Cost management: With scaling comes increased costs for infrastructure, data transfer, and storage. As more resources are utilized, managing and optimizing these expenses is essential. For example, a sudden spike in user activity may necessitate additional server resources, which can quickly increase operational costs if not monitored effectively.
Dependency management: As systems grow, managing dependencies between components becomes critical. If one service fails, it can create a domino effect, causing other services to fail. For instance, if a payment processing service goes down, it can halt transactions across multiple applications. Therefore, effective dependency management is essential to prevent cascading failures.
Scalability is the key to handling today’s demands and preparing for tomorrow’s growth. From predictive scaling to chaos engineering, these 8 strategies empower you to design systems that thrive under pressure.
Whether you're managing e-commerce traffic spikes or scaling data-intensive apps, adopting these techniques ensures your systems are reliable, efficient, and ready for the future.
Take the next step: Explore our course, Grokking the Modern System Design, and master real-world scaling strategies today.
System Design Interviews decide your level and compensation at top tech companies. To succeed, you must design scalable systems, justify trade-offs, and explain decisions under time pressure. Most candidates struggle because they lack a repeatable method. Built by FAANG engineers, this is the definitive System Design Interview course. You will master distributed systems building blocks: databases, caches, load balancers, messaging, microservices, sharding, replication, and consistency, and learn the patterns behind web-scale architectures. Using the RESHADED framework, you will translate open-ended system design problems into precise requirements, explicit constraints, and success metrics, then design modular, reliable solutions. Full Mock Interview practice builds fluency and timing. By the end, you will discuss architectures with Staff-level clarity, tackle unseen questions with confidence, and stand out in System Design Interviews at leading companies.