
Examples of Resource Estimation

Explore practical resource estimation by applying BOTECs to a high-scale service. Quantify server count, storage needs, and bandwidth requirements based on traffic assumptions. Learn to adjust peak load estimates using principles such as the Pareto principle to ensure system design feasibility and manage costs.

Introduction

We’ll now apply these resource estimation techniques to a Twitter-like service. Using baseline assumptions, we’ll estimate the required number of servers, storage capacity, and network bandwidth.

Number of servers required

We make the following assumptions for a Twitter-like service:

Assumptions:

  • 500 million (M) daily active users (DAU)

  • Average of 20 requests per user per day

  • Single server capacity (64 cores): 64,000 requests per second (RPS)

Estimating the Number of Servers

  • Daily active users (DAU): 500 million
  • Requests on average / user / day: 20
  • Total requests / day: ~10 billion
  • Total requests / second: ~115K
  • Total servers required: ~2
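The steps above can be reproduced with a short back-of-the-envelope script (the 64,000 RPS per-server figure is the lesson's assumption, not a measured value):

```python
import math

DAU = 500_000_000        # daily active users
REQUESTS_PER_USER = 20   # average requests per user per day
SERVER_RPS = 64_000      # assumed capacity of one 64-core server
SECONDS_PER_DAY = 24 * 60 * 60

requests_per_day = DAU * REQUESTS_PER_USER                 # 10 billion
requests_per_second = requests_per_day / SECONDS_PER_DAY   # ~115.7K
servers = math.ceil(requests_per_second / SERVER_RPS)

print(f"{requests_per_second:,.0f} RPS -> {servers} servers")
```

Rounding up with `math.ceil` reflects that we cannot provision a fractional server.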
Question: Can you identify a hidden assumption in our calculations above?


Plausibility test: Always judge whether the numbers are reasonable. For example, estimating just two servers for a service with 500M daily active users is likely incorrect.

Peak capacity

Large services must handle flash crowds. To estimate peak capacity, we assume a worst-case scenario in which all daily requests arrive simultaneously. More accurate estimates require request and response distributions (statistical patterns describing the timing and frequency of requests made to a system and the responses it provides; for example, the number and types of incoming requests over 24 hours), which are often available at the prototyping level (the early stage of product development, where a basic version is built for testing and design validation before full development). Alternatively, we might assume a statistical model such as the Poisson distribution.

To simplify, we use the total daily request volume as a proxy for peak load: we treat all daily requests as if they arrive within a single second. The number of servers at peak load is calculated as follows:

Number of servers required for a Twitter-like service

If all daily requests arrived simultaneously and each server handles 64,000 RPS, we would need approximately 157,000 servers (10 billion / 64,000 ≈ 156,250). As this is almost certainly infeasible, we have two options to address the issue.
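As a sanity check, the 157K figure follows directly from dividing total daily requests by per-server capacity:

```python
import math

REQUESTS_PER_DAY = 10_000_000_000  # from the earlier estimate
SERVER_RPS = 64_000                # assumed per-server capacity

# Worst case: every daily request arrives within the same second.
peak_servers = math.ceil(REQUESTS_PER_DAY / SERVER_RPS)
print(f"{peak_servers:,} servers at peak")
```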

Improving the RPS of a server

If the peak load assumption holds, we must increase per-server capacity. For example, if we are limited to 100,000 servers, each server must handle 10 billion / 100,000 = 100,000 RPS. Increasing RPS from 64,000 to 100,000 requires significant engineering optimization.

Organizations often rely on extensive engineering to improve server RPS:

Example 1: In 2012, WhatsApp managed 2 million concurrent TCP connections per server. By 2017, they served their entire user base with roughly 700 servers (exact server specifications were not disclosed).

Example 2: A research system optimized for I/O sorted one trillion records in 172 minutes using only 25% of the resources of the previous record holder, roughly a 3x improvement in throughput.

These examples show that improving RPS is possible but requires focused R&D and financial investment.

Improving over the peak load assumption

Alternatively, we can adjust the peak load assumption. Using the Pareto principle (the 80/20 rule), we assume 80% of traffic occurs within 20% of the time. This concentrates 80% of daily traffic into a 4.8-hour window (20% of 24 hours).

We assume requests are distributed evenly within this 4.8-hour window. Whether requests arrive concurrently or spread out over time significantly changes resource needs. Systems built on these assumptions require monitoring to ensure the limits are not violated. If load exceeds predictions, we employ techniques like load shedding, circuit breakers, and throttling.
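Under the 80/20 assumption, the same inputs yield roughly eight servers; a minimal sketch:

```python
import math

REQUESTS_PER_DAY = 10_000_000_000
SERVER_RPS = 64_000
SECONDS_PER_DAY = 24 * 60 * 60

# Pareto assumption: 80% of traffic lands in 20% of the day (4.8 hours).
busy_window_s = 0.20 * SECONDS_PER_DAY               # 17,280 seconds
busy_rps = 0.80 * REQUESTS_PER_DAY / busy_window_s   # ~463K RPS
servers = math.ceil(busy_rps / SERVER_RPS)
print(f"{servers} servers under the 80/20 assumption")
```

Compare this with the ~157K servers required when all requests are assumed to arrive at once: the peak load assumption dominates the estimate.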

Question: Consider a service that hosts the dynamic, personalized website of a large news organization. During unexpected events, such as 9/11, flash crowds visit the website for updates; effectively, all DAUs may arrive simultaneously.

Such a situation clearly breaks our usual load assumptions. Can you think of a way to gracefully degrade the service under such an unexpected load?


Cost of servers

To estimate costs, we select an AWS EC2 m7i.16xlarge instance (64-core processor, 256 GB RAM, 4th-Gen Intel Xeon). The hourly cost is $3.54816 with a 1-year contract.

An approximate cost of an AWS instance

We use an EC2 instance from AWS with the following specifications:

EC2 Instance Specifications

  • Instance size: m7i.16xlarge
  • vCPU: 64
  • Memory: 256 GiB
  • Instance storage: EBS-only
  • Network bandwidth: 25 Gbps
  • EBS bandwidth: 20 Gbps

The table below details the cost for two, eight, and 157K servers. Costs escalate quickly in the peak load scenario. In real projects, budget constraints are strict requirements.

Cost of Servers

  • Low-bound cost per hour: 2 × $3.548 = $7.096
  • Cost under the 80/20 assumption per hour: 8 × $3.548 = $28.38
  • Peak load cost per hour: 157K × $3.548 = $557,061
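The per-hour figures are simply server count multiplied by the quoted hourly rate; a quick sketch using the lesson's m7i.16xlarge price:

```python
HOURLY_RATE = 3.54816  # m7i.16xlarge with a 1-year contract (lesson's figure)

scenarios = {"low bound": 2, "80/20 assumption": 8, "peak load": 157_000}
for label, n_servers in scenarios.items():
    cost = n_servers * HOURLY_RATE
    print(f"{label}: {n_servers:,} servers -> ${cost:,.2f}/hour")
```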

Storage requirements

We will estimate the annual storage required for new tweets using the following assumptions:

  • 500M daily active users

  • 3 tweets per user per day

  • 10% of tweets contain images; 5% contain video (mutually exclusive)

  • Average sizes: Image (200 KB), Video (3 MB)

  • Tweet text and metadata: 250 bytes (historically, a tweet was limited to 140 characters; the limit is now 280. We assume 250 bytes for simplicity.)

Daily storage requirements are calculated as follows:

Estimating Storage Requirements

  • Daily active users (DAU): 500M
  • Tweets per user per day: 3
  • Total tweets / day: 1,500M
  • Storage per tweet: 250 B
  • Storage per image: 200 KB
  • Storage per video: 3 MB
  • Storage for tweets: ~375 GB/day
  • Storage for images: ~30 TB/day
  • Storage for videos: ~225 TB/day
  • Total storage: ~255 TB/day
  • Total daily storage = 0.375 TB + 30 TB + 225 TB ≈ 255 TB

  • Total annual storage = 365 × 255 TB ≈ 93.08 PB
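These storage numbers can be checked with a few lines of Python (using decimal units, 1 TB = 10^12 bytes, as in the lesson):

```python
DAU = 500_000_000
TWEETS_PER_USER = 3
TWEET_BYTES = 250
IMAGE_BYTES = 200 * 1_000      # 200 KB
VIDEO_BYTES = 3 * 1_000_000    # 3 MB
IMAGE_RATIO, VIDEO_RATIO = 0.10, 0.05
TB = 1_000_000_000_000

tweets_per_day = DAU * TWEETS_PER_USER                          # 1.5 billion
text_tb  = tweets_per_day * TWEET_BYTES / TB                    # 0.375 TB
image_tb = tweets_per_day * IMAGE_RATIO * IMAGE_BYTES / TB      # 30 TB
video_tb = tweets_per_day * VIDEO_RATIO * VIDEO_BYTES / TB      # 225 TB

daily_tb = text_tb + image_tb + video_tb   # ~255 TB/day
annual_pb = 365 * daily_tb / 1_000         # ~93 PB/year
print(f"{daily_tb:.1f} TB/day, {annual_pb:.1f} PB/year")
```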

The total storage required by Twitter in a year

Bandwidth requirements

To estimate bandwidth requirements:

  1. Estimate daily incoming data.

  2. Estimate daily outgoing data.

  3. Divide by the number of seconds in a day to determine Gbps (Gigabits per second).

Incoming traffic: Based on the 255 TB daily storage requirement calculated above, the incoming bandwidth is 255 TB × 8 / 86,400 s ≈ 24 Gbps.

Note: We multiply by 8 to convert bytes (B) into bits (b).

Outgoing traffic: Assume each user views 50 tweets per day. Using the same content ratios (5% video, 10% image), 50 tweets contain 2.5 videos and 5 images. With 500M DAU:

Estimating Bandwidth Requirements

  • Daily active users (DAU): 500M
  • Tweets viewed per user per day: 50
  • Tweets viewed / second: ~289K
  • Bandwidth required for tweets: ~0.58 Gbps
  • Bandwidth required for images: ~46.24 Gbps
  • Bandwidth required for videos: ~346.8 Gbps
  • Total outgoing bandwidth: ~393.62 Gbps

Twitter requires 24 Gbps of incoming traffic and 393.62 Gbps of outgoing traffic (assuming uncompressed uploads). Total bandwidth requirements = 24 + 393.62 = 417.62 Gbps.
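A sketch reproducing the bandwidth figures; note that the lesson rounds tweets viewed per second to 289K before computing per-content-type bandwidth, so the code does the same:

```python
DAU = 500_000_000
VIEWS_PER_USER = 50
SECONDS_PER_DAY = 24 * 60 * 60
GBIT = 1_000_000_000

views_per_second = DAU * VIEWS_PER_USER / SECONDS_PER_DAY  # ~289,352
views_per_second = 289_000  # rounded to 289K, as in the lesson's table

tweet_gbps = views_per_second * 250 * 8 / GBIT               # ~0.58
image_gbps = views_per_second * 0.10 * 200_000 * 8 / GBIT    # 46.24
video_gbps = views_per_second * 0.05 * 3_000_000 * 8 / GBIT  # 346.8

outgoing_gbps = tweet_gbps + image_gbps + video_gbps         # ~393.6
incoming_gbps = 255e12 * 8 / SECONDS_PER_DAY / GBIT          # ~24
print(f"in: {incoming_gbps:.1f} Gbps, out: {outgoing_gbps:.2f} Gbps")
```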

The total bandwidth required by Twitter

These calculations depend heavily on assumptions regarding traffic mix (text vs. media) and the read/write ratio.

Question: We estimated 93 PB of storage per year. Is this number plausible?


This lesson provides a reusable framework for resource estimation throughout the course. Back-of-the-envelope calculations (BOTECs) help validate whether a design is feasible at a high level. In interviews, they demonstrate how you reason under uncertainty and make defensible assumptions. Interviews rely on rough estimates, but production systems use real workload metrics to refine capacity planning.