Examples of Resource Estimation
Explore practical resource estimation by applying BOTECs to a high-scale service. Quantify server count, storage needs, and bandwidth requirements based on traffic assumptions. Learn to adjust peak load estimates using principles such as the Pareto principle to ensure System Design feasibility and manage costs.
Introduction
We’ll now apply these resource estimation techniques to a Twitter-like service. Using baseline assumptions, we’ll estimate the required number of servers, storage capacity, and network bandwidth.
Number of servers required
We make the following assumptions for a Twitter-like service:
Assumptions:
500 million (M) daily active users (DAU)
Average of 20 requests per user per day
Single server capacity (64 cores): 64,000 requests per second (RPS)
Estimating the Number of Servers
| Metric | Value | Unit |
|---|---|---|
| Daily active users (DAU) | 500 | Million |
| Requests on average / user / day | 20 | |
| Total requests / day | 10 | Billion |
| Total requests / second | 115 | K |
| Total servers required | 2 | |
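As a sanity check, the table's arithmetic can be reproduced in a few lines of Python. The constants are the assumptions stated above; the variable names are our own:

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds

dau = 500_000_000        # daily active users
requests_per_user = 20   # average requests per user per day
server_rps = 64_000      # RPS a single 64-core server can handle

requests_per_day = dau * requests_per_user                # 10 billion
requests_per_second = requests_per_day / SECONDS_PER_DAY  # ~115,740 (~115K)
servers_required = math.ceil(requests_per_second / server_rps)  # 2

print(f"{servers_required} servers for an average load of "
      f"{requests_per_second:,.0f} RPS")
```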
Can you identify a hidden assumption in our calculations above?
Plausibility test: Always judge if numbers are reasonable. For example, estimating two servers for a service with millions of DAUs is likely incorrect.
Peak capacity
Large services must handle flash crowds. To estimate peak capacity, we assume a worst-case scenario where all daily requests arrive simultaneously. More accurate estimates would require real traffic measurements, which are unavailable at design time.

To simplify, we use the total daily request volume as a proxy for peak load in a specific second. This treats all 10 billion daily requests as arriving within one second. The number of servers at peak load is calculated as follows:

Servers at peak load = Total daily requests / RPS per server = 10 billion / 64,000 ≈ 157K

If all workloads arrive simultaneously and each server handles 64,000 RPS, we would need approximately 157,000 servers. As this is likely unfeasible, we have two options to address the issue.
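The peak-load arithmetic can be sketched quickly, using the 10 billion requests/day figure from the earlier estimate:

```python
requests_per_day = 10_000_000_000  # from the earlier estimate
server_rps = 64_000                # capacity of a single server

# Worst case: the entire day's workload arrives within a single second.
peak_servers = requests_per_day // server_rps  # 156,250, i.e., ~157K
print(f"~{peak_servers:,} servers needed at absolute peak")
```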
Improving the RPS of a server
If the peak load assumption holds, we must increase per-server capacity. For example, if we are limited to 100,000 servers, each server must sustain 10 billion requests / 100,000 servers = 100,000 RPS. Increasing a server's RPS from 64,000 to 100,000 requires significant engineering optimization.
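The target per-server RPS under a fixed server budget follows directly; the 100,000-server cap here is the illustrative limit from the text:

```python
requests_per_day = 10_000_000_000  # total daily requests, from earlier
server_budget = 100_000            # illustrative cap on server count

# If all requests land in one second, each server must absorb:
required_rps = requests_per_day / server_budget  # 100,000 RPS
print(f"Each server must handle {required_rps:,.0f} RPS")
```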
Organizations often rely on extensive engineering to improve server RPS:
Example 1: In 2012, WhatsApp handled 2 million concurrent TCP connections per server. By 2017, the entire service ran on roughly 700 servers (exact server specifications were not disclosed).
Example 2: A research system optimized for I/O sorted one trillion records in 172 minutes using only 25% of the resources of the previous record holder, roughly a 3x improvement in per-server throughput.
These examples show that improving RPS is possible but requires focused R&D and financial investment.
Improving over the peak load assumption
Alternatively, we can refine the peak load assumption. Using the Pareto principle (80/20 rule), we assume that 80% of the daily requests (8 billion) arrive during 20% of the day (4.8 hours).

We further assume requests are distributed evenly within this 4.8-hour window, giving 8 billion / 17,280 seconds ≈ 463K RPS, or 463,000 / 64,000 ≈ 8 servers. Whether requests arrive concurrently or spread out significantly impacts resource needs. Systems built on these assumptions require monitoring to ensure limits are not violated. If load exceeds predictions, we employ techniques like load shedding, circuit breakers, and throttling.
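A short sketch of this calculation, assuming 80% of the 10 billion daily requests arrive in 20% of the day:

```python
import math

SECONDS_PER_DAY = 86_400
requests_per_day = 10_000_000_000
server_rps = 64_000

# Pareto assumption: 80% of requests arrive in 20% of the day.
busy_requests = 0.8 * requests_per_day       # 8 billion requests
busy_window_s = 0.2 * SECONDS_PER_DAY        # 17,280 s (4.8 hours)
peak_rps = busy_requests / busy_window_s     # ~463,000 RPS
servers_80_20 = math.ceil(peak_rps / server_rps)  # 8 servers
print(f"{servers_80_20} servers under the 80/20 assumption")
```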
Consider a service hosting the dynamic, personalized website of a large news organization. During unexpected events, such as 9/11, flash crowds visit the website for updates, and it may be that nearly all the DAUs arrive simultaneously.
Such a situation will clearly break our usual load assumptions. Can you think of some way to gracefully degrade the service to meet such an unexpected load?
Cost of servers
To estimate costs, we select an AWS EC2 m7i.16xlarge instance (64-core processor, 256 GB RAM, 4th-Gen Intel Xeon). The hourly cost is $3.54816 with a 1-year contract.
The instance specifications are as follows:
EC2 Instance Specifications
| Instance Size | vCPU | Memory (GiB) | Instance Storage (GB) | Network Bandwidth (Gbps) | EBS Bandwidth (Gbps) |
|---|---|---|---|---|---|
| m7i.16xlarge | 64 | 256 | EBS-Only | 25 | 20 |
The table below details the cost for two, eight, and 157K servers. Costs escalate quickly in the peak load scenario. In real projects, budget constraints are strict requirements.
Cost of Servers
| Low Bound Server Cost per Hour | Cost Under 80/20 Assumptions per Hour | Peak Load Cost per Hour |
|---|---|---|
| 2 × $3.54816 ≈ $7.10 | 8 × $3.54816 ≈ $28.39 | 157K × $3.54816 ≈ $557,061 |
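The hourly costs follow from multiplying each scenario's server count by the m7i.16xlarge hourly rate from above; a minimal sketch:

```python
hourly_rate = 3.54816  # m7i.16xlarge, 1-year contract, USD per hour

scenarios = {"low bound": 2, "80/20": 8, "peak load": 157_000}
costs = {name: n * hourly_rate for name, n in scenarios.items()}

for name, cost in costs.items():
    print(f"{name}: ${cost:,.2f}/hour")
```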
Storage requirements
We will estimate the annual storage required for new tweets using the following assumptions:
500M daily active users
3 tweets per user per day
10% of tweets contain images; 5% contain video (mutually exclusive)
Average sizes: Image (200 KB), Video (3 MB)
Tweet text and metadata: 250 bytes (historically, a tweet was limited to 140 characters; the limit is now 280 characters. We assume 250 bytes per tweet, including metadata, for simplicity.)
Daily storage requirements are calculated as follows:
Estimating Storage Requirements
| Metric | Value | Unit |
|---|---|---|
| Daily active users (DAU) | 500 | M |
| Daily tweets / user | 3 | |
| Total tweets / day | 1,500 | M |
| Storage required per tweet | 250 | B |
| Storage required per image | 200 | KB |
| Storage required per video | 3 | MB |
| Storage for tweets | 375 | GB |
| Storage for images | 30 | TB |
| Storage for videos | 225 | TB |
| Total storage / day | 255 | TB |
Total daily storage = 375 GB + 30 TB + 225 TB ≈ 255 TB. Total annual storage = 255 TB × 365 ≈ 93 PB.
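The storage arithmetic above can be reproduced directly from the stated assumptions:

```python
dau = 500_000_000
tweets_per_user = 3
tweet_bytes = 250          # text + metadata
image_bytes = 200 * 1000   # 200 KB
video_bytes = 3 * 1000**2  # 3 MB

tweets_per_day = dau * tweets_per_user             # 1.5 billion
text_bytes = tweets_per_day * tweet_bytes          # 375 GB
img_bytes = 0.10 * tweets_per_day * image_bytes    # 30 TB (10% have images)
vid_bytes = 0.05 * tweets_per_day * video_bytes    # 225 TB (5% have videos)

daily_tb = (text_bytes + img_bytes + vid_bytes) / 1e12  # ~255 TB/day
annual_pb = daily_tb * 365 / 1000                       # ~93 PB/year
print(f"{daily_tb:.0f} TB/day, {annual_pb:.0f} PB/year")
```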
Bandwidth requirements
To estimate bandwidth requirements:
Estimate daily incoming data.
Estimate daily outgoing data.
Divide by the number of seconds in a day to determine Gbps (Gigabits per second).
Incoming traffic: Based on the 255 TB daily storage requirement calculated above, the incoming bandwidth is (255 × 10^12 B × 8) / 86,400 s ≈ 24 Gbps.
Note: We multiply by 8 to convert bytes (B) into bits (b).
Outgoing traffic: Assume each user views 50 tweets per day. Using the same content ratios (5% video, 10% image), 50 tweets contain 2.5 videos and 5 images. With 500M DAU:
Estimating Bandwidth Requirements
| Metric | Value | Unit |
|---|---|---|
| Daily active users (DAU) | 500 | M |
| Daily tweets viewed / user | 50 | |
| Tweets viewed / second | 289 | K |
| Bandwidth required for tweets | 0.58 | Gbps |
| Bandwidth required for images | 46.24 | Gbps |
| Bandwidth required for videos | 346.8 | Gbps |
| Total outgoing bandwidth | 393.62 | Gbps |
Twitter requires approximately 24 Gbps of incoming traffic and approximately 394 Gbps of outgoing traffic (assuming uncompressed uploads). Total bandwidth requirements = 24 + 394 ≈ 418 Gbps.
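Both directions of the bandwidth estimate can be checked together; the 255 TB/day figure comes from the storage section, and the content ratios are the ones assumed above:

```python
SECONDS_PER_DAY = 86_400
dau = 500_000_000
tweets_viewed_per_user = 50
daily_ingest_bytes = 255e12  # 255 TB/day, from the storage estimate

# Incoming: daily ingested bytes, converted to bits per second.
incoming_gbps = daily_ingest_bytes * 8 / SECONDS_PER_DAY / 1e9   # ~24

# Outgoing: tweets viewed per second, with 10% images and 5% videos.
viewed_per_s = dau * tweets_viewed_per_user / SECONDS_PER_DAY    # ~289K
text_gbps = viewed_per_s * 250 * 8 / 1e9             # ~0.58
image_gbps = 0.10 * viewed_per_s * 200e3 * 8 / 1e9   # ~46.3
video_gbps = 0.05 * viewed_per_s * 3e6 * 8 / 1e9     # ~347
outgoing_gbps = text_gbps + image_gbps + video_gbps  # ~394

print(f"in: {incoming_gbps:.1f} Gbps, out: {outgoing_gbps:.1f} Gbps, "
      f"total: {incoming_gbps + outgoing_gbps:.1f} Gbps")
```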
These calculations depend heavily on assumptions regarding traffic mix (text vs. media) and the read/write ratio.
We arrived at a figure of 93 PB of storage per year. Is this number plausible?
This lesson provides a reusable framework for resource estimation throughout the course. Back-of-the-envelope calculations (BOTECs) help validate whether a design is feasible at a high level. In interviews, they demonstrate how you reason under uncertainty and make defensible assumptions. Interviews rely on rough estimates, but production systems use real workload metrics to refine capacity planning.