
Examples of Resource Estimation

Explore practical resource estimation by applying BOTECs to a high-scale service. Quantify server count, storage needs, and bandwidth requirements based on traffic assumptions. Learn to adjust peak load estimates using principles such as the Pareto principle to ensure system design feasibility and manage costs.

Introduction

We’ll now apply these resource estimation techniques to a Twitter-like service. Using baseline assumptions, we’ll estimate the required number of servers, storage capacity, and network bandwidth.

Number of servers required

We make the following assumptions for a Twitter-like service:

Assumptions:

  • 500 million (M) daily active users (DAU)

  • Average of 20 requests per user per day

  • Single server capacity (64 cores): 64,000 requests per second (RPS)

Estimating the Number of Servers

  • Daily active users (DAU): 500 million
  • Requests on average / user / day: 20
  • Total requests / day: ~10 billion
  • Total requests / second: ~115K
  • Total servers required: ~2
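The steps above can be reproduced with a short back-of-the-envelope script (the 64,000 RPS per-server figure is the lesson's assumption, not a measured value):

```python
import math

DAU = 500_000_000        # daily active users
REQUESTS_PER_USER = 20   # average requests per user per day
SERVER_RPS = 64_000      # assumed capacity of one 64-core server
SECONDS_PER_DAY = 24 * 60 * 60

requests_per_day = DAU * REQUESTS_PER_USER                 # 10 billion
requests_per_second = requests_per_day / SECONDS_PER_DAY   # ~115.7K
servers = math.ceil(requests_per_second / SERVER_RPS)

print(f"{requests_per_second:,.0f} RPS -> {servers} servers")
```

Rounding up with `math.ceil` reflects that we cannot provision a fractional server.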
Question: Can you identify a hidden assumption in our calculations above?


Plausibility test: Always judge whether the numbers are reasonable. For example, estimating just two servers for a service with 500M daily active users is likely incorrect.

Peak capacity

Large services must handle flash crowds. To estimate peak capacity, we assume a worst-case scenario in which all daily requests arrive simultaneously. More accurate estimates require request and response distributions (statistical patterns describing the timing and frequency of requests made to a system and the responses it provides; for example, the number and types of incoming requests over 24 hours), which are often available at the prototyping level (the early stage of product development, where a basic version is built for testing and design validation before full development). Alternatively, we might assume a statistical model such as the Poisson distribution.

To simplify, we use the total daily request volume as a proxy for peak load: we treat all daily requests as if they arrive within a single second. The number of servers at peak load is calculated as follows:

Number of servers required for a Twitter-like service

If all daily requests arrived simultaneously and each server handles 64,000 RPS, we would need approximately 157,000 servers (10 billion / 64,000 ≈ 156,250). As this is almost certainly infeasible, we have two options to address the issue.
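As a sanity check, the 157K figure follows directly from dividing total daily requests by per-server capacity:

```python
import math

REQUESTS_PER_DAY = 10_000_000_000  # from the earlier estimate
SERVER_RPS = 64_000                # assumed per-server capacity

# Worst case: every daily request arrives within the same second.
peak_servers = math.ceil(REQUESTS_PER_DAY / SERVER_RPS)
print(f"{peak_servers:,} servers at peak")
```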

Improving the RPS of a server

If the peak load assumption holds, we must increase per-server capacity. For example, if we are limited to 100,000 servers, each server must handle 10 billion / 100,000 = 100,000 RPS. Increasing RPS from 64,000 to 100,000 requires significant engineering optimization.

Organizations often rely on extensive engineering to improve server RPS:

Example 1: In 2012, WhatsApp managed 2 million concurrent TCP connections per server. By 2017, they served their entire user base with roughly 700 servers (exact server specifications were not disclosed).

Example 2: A research system optimized for I/O sorted one trillion records in 172 minutes using only 25% of the resources of the previous record holder, roughly a 3x improvement in throughput.

These examples show that improving RPS is possible but requires focused R&D and financial investment.

Improving over the peak load assumption

Alternatively, we can adjust the peak load assumption. Using the Pareto principle (the 80/20 rule), we assume 80% of traffic occurs within 20% of the time. This concentrates 80% of daily traffic into a 4.8-hour window (20% of 24 hours).

We assume requests are distributed evenly within this 4.8-hour window. Whether requests arrive concurrently or spread out over time significantly changes resource needs. Systems built on these assumptions require monitoring to ensure the limits are not violated. If load exceeds predictions, we employ techniques like load shedding, circuit breakers, and throttling.
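Under the 80/20 assumption, the same inputs yield roughly eight servers; a minimal sketch:

```python
import math

REQUESTS_PER_DAY = 10_000_000_000
SERVER_RPS = 64_000
SECONDS_PER_DAY = 24 * 60 * 60

# Pareto assumption: 80% of traffic lands in 20% of the day (4.8 hours).
busy_window_s = 0.20 * SECONDS_PER_DAY               # 17,280 seconds
busy_rps = 0.80 * REQUESTS_PER_DAY / busy_window_s   # ~463K RPS
servers = math.ceil(busy_rps / SERVER_RPS)
print(f"{servers} servers under the 80/20 assumption")
```

Compare this with the ~157K servers required when all requests are assumed to arrive at once: the peak load assumption dominates the estimate.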

Question: Consider a service that hosts the dynamic, personalized website of a large news organization. During unexpected events, such as 9/11, flash crowds visit the website for updates; effectively, all DAUs may arrive simultaneously.

Such a situation clearly breaks our usual load assumptions. Can you think of a way to gracefully degrade the service under such an unexpected load?


Cost of servers

To estimate costs, we select an AWS EC2 m7i.16xlarge instance (64-core processor, 256 GB RAM, 4th-Gen Intel Xeon). The hourly cost is $3.54816 with a 1-year contract.

An approximate cost of an AWS instance

We use an EC2 instance from AWS with the following specifications:

EC2 Instance Specifications

  • Instance size: m7i.16xlarge
  • vCPU: 64
  • Memory: 256 GiB
  • Instance storage: EBS-only
  • Network bandwidth: 25 Gbps
  • EBS bandwidth: 20 Gbps

The table below details the cost for two, eight, and 157K servers. Costs escalate quickly in the peak load scenario. In real projects, budget constraints are strict requirements.

Cost of Servers

  • Low-bound cost per hour: 2 × $3.548 = $7.096
  • Cost under the 80/20 assumption per hour: 8 × $3.548 = $28.38
  • Peak load cost per hour: 157K × $3.548 = $557,061
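The per-hour figures are simply server count multiplied by the quoted hourly rate; a quick sketch using the lesson's m7i.16xlarge price:

```python
HOURLY_RATE = 3.54816  # m7i.16xlarge with a 1-year contract (lesson's figure)

scenarios = {"low bound": 2, "80/20 assumption": 8, "peak load": 157_000}
for label, n_servers in scenarios.items():
    cost = n_servers * HOURLY_RATE
    print(f"{label}: {n_servers:,} servers -> ${cost:,.2f}/hour")
```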

Storage requirements

We will estimate the annual storage required for new tweets using the following assumptions:

  • 500M daily active users

  • 3 tweets per user per day

  • 10% of tweets contain images; 5% contain video (mutually exclusive)

  • Average sizes: Image (200 KB), Video (3 MB)

  • Tweet text and metadata: 250 bytes (historically, a tweet was limited to 140 characters; the limit is now 280. We assume 250 bytes for simplicity.)

Daily storage requirements are calculated as follows:

Estimating Storage Requirements

  • Daily active users (DAU): 500M
  • Tweets per user per day: 3
  • Total tweets / day: 1,500M
  • Storage per tweet: 250 B
  • Storage per image: 200 KB
  • Storage per video: 3 MB
  • Storage for tweets: ~375 GB/day
  • Storage for images: ~30 TB/day
  • Storage for videos: ~225 TB/day
  • Total storage: ~255 TB/day
  • Total daily storage = 0.375 TB + 30 TB + 225 TB ≈ 255 TB

  • Total annual storage = 365 × 255 TB ≈ 93.08 PB
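These storage numbers can be checked with a few lines of Python (using decimal units, 1 TB = 10^12 bytes, as in the lesson):

```python
DAU = 500_000_000
TWEETS_PER_USER = 3
TWEET_BYTES = 250
IMAGE_BYTES = 200 * 1_000      # 200 KB
VIDEO_BYTES = 3 * 1_000_000    # 3 MB
IMAGE_RATIO, VIDEO_RATIO = 0.10, 0.05
TB = 1_000_000_000_000

tweets_per_day = DAU * TWEETS_PER_USER                          # 1.5 billion
text_tb  = tweets_per_day * TWEET_BYTES / TB                    # 0.375 TB
image_tb = tweets_per_day * IMAGE_RATIO * IMAGE_BYTES / TB      # 30 TB
video_tb = tweets_per_day * VIDEO_RATIO * VIDEO_BYTES / TB      # 225 TB

daily_tb = text_tb + image_tb + video_tb   # ~255 TB/day
annual_pb = 365 * daily_tb / 1_000         # ~93 PB/year
print(f"{daily_tb:.1f} TB/day, {annual_pb:.1f} PB/year")
```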

The total storage required by Twitter in a year

Bandwidth requirements

To estimate bandwidth requirements:

  1. Estimate daily incoming data.

  2. Estimate daily outgoing data.

  3. Divide by the number of seconds in a day to determine Gbps (Gigabits per second).

Incoming traffic: Based on the 255 TB daily storage requirement calculated above, the incoming bandwidth is 255 TB × 8 / 86,400 s ≈ 24 Gbps.

Note: We multiply by 8 to convert bytes (B) into bits (b).

Outgoing traffic: Assume each user views 50 tweets per day. Using the same content ratios (5% video, 10% image), 50 tweets contain 2.5 videos and 5 images. With 500M DAU:

Estimating Bandwidth Requirements

  • Daily active users (DAU): 500M
  • Tweets viewed per user per day: 50
  • Tweets viewed / second: ~289K
  • Bandwidth required for tweets: ~0.58 Gbps
  • Bandwidth required for images: ~46.24 Gbps
  • Bandwidth required for videos: ~346.8 Gbps
  • Total outgoing bandwidth: ~393.62 Gbps

Twitter requires 24 Gbps of incoming traffic and 393.62 Gbps of outgoing traffic (assuming uncompressed uploads). Total bandwidth requirements = 24 + 393.62 = 417.62 Gbps.
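A sketch reproducing the bandwidth figures; note that the lesson rounds tweets viewed per second to 289K before computing per-content-type bandwidth, so the code does the same:

```python
DAU = 500_000_000
VIEWS_PER_USER = 50
SECONDS_PER_DAY = 24 * 60 * 60
GBIT = 1_000_000_000

views_per_second = DAU * VIEWS_PER_USER / SECONDS_PER_DAY  # ~289,352
views_per_second = 289_000  # rounded to 289K, as in the lesson's table

tweet_gbps = views_per_second * 250 * 8 / GBIT               # ~0.58
image_gbps = views_per_second * 0.10 * 200_000 * 8 / GBIT    # 46.24
video_gbps = views_per_second * 0.05 * 3_000_000 * 8 / GBIT  # 346.8

outgoing_gbps = tweet_gbps + image_gbps + video_gbps         # ~393.6
incoming_gbps = 255e12 * 8 / SECONDS_PER_DAY / GBIT          # ~24
print(f"in: {incoming_gbps:.1f} Gbps, out: {outgoing_gbps:.2f} Gbps")
```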

The total bandwidth required by Twitter

These calculations depend heavily on assumptions regarding traffic mix (text vs. media) and the read/write ratio.

Question: We estimated 93 PB of storage per year. Is this number plausible?


This lesson provides a reusable framework for resource estimation throughout the course. Back-of-the-envelope calculations (BOTECs) help validate whether a design is feasible at a high level. In interviews, they demonstrate how you reason under uncertainty and make defensible assumptions. Interviews rely on rough estimates, but production systems use real workload metrics to refine capacity planning.