Requirements of a Rate Limiter

Our focus in this lesson is to design a rate limiter with the following functional and non-functional requirements.

Functional requirements

  • Limit the number of requests a client can send to an API within a time window.

  • The limit of requests per window must be configurable.

  • The client should receive an error message (or notification) whenever the defined threshold is crossed, whether the limit is enforced by a single server or by a combination of servers.
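To make the limit configurable, a rule store can map each API or operation to its allowed number of requests per window. The sketch below is purely illustrative; the endpoint names, fields, and default values are assumptions, not part of any specific system.

```python
# Illustrative rate-limiting rules; the endpoint names, fields, and
# default values here are hypothetical.
rate_limit_rules = {
    "login": {"max_requests": 5, "window_seconds": 60},
    "send_message": {"max_requests": 500, "window_seconds": 60},
}

def get_rule(endpoint: str) -> dict:
    """Return the configured rule for an endpoint, or a permissive default."""
    return rate_limit_rules.get(endpoint, {"max_requests": 100, "window_seconds": 60})
```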

Non-functional requirements

  • Availability: Essentially, the rate limiter protects our system; therefore, it should be highly available.

  • Low latency: As all API requests pass through the rate limiter, it should work with minimal latency without affecting the user experience.

  • Scalability: Our design should be highly scalable. It should be able to rate-limit an increasing number of clients’ requests over time.

Types of throttling

There are three types of throttling a rate limiter can perform.

  1. Hard throttling: This type of throttling puts a hard limit on the number of API requests. So, whenever a request exceeds the limit, it is discarded.

  2. Soft throttling: Under soft throttling, the number of requests can exceed the predefined limit by a certain percentage. For example, if our system has a predefined limit of 500 messages per minute with a 5% allowance, we can let the client send up to 525 requests per minute (see the sketch after this list).

  3. Elastic or dynamic throttling: Under this type of throttling, the number of requests can cross the predefined limit if the system has excess resources available. However, there is no specific percentage defined for the upper limit. For example, if our system allows 500 requests per minute, it can let the user send more than 500 requests when free resources are available.
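The difference between hard and soft throttling can be captured in a single allowance check. The sketch below assumes a simple per-window request count and reuses the 500-requests, 5% example from above; the function name and parameters are illustrative.

```python
def is_allowed(request_count: int, limit: int = 500, soft_margin: float = 0.0) -> bool:
    """Decide whether the next request in the current window is allowed.

    Hard throttling: soft_margin = 0.0, so the 501st request is rejected.
    Soft throttling: soft_margin = 0.05 allows up to 525 requests (500 * 1.05).
    """
    effective_limit = int(limit * (1 + soft_margin))
    return request_count < effective_limit

# Hard throttling: the 501st request in a window is discarded.
print(is_allowed(500))                    # False
# Soft throttling with a 5% margin: still allowed until 525 requests.
print(is_allowed(500, soft_margin=0.05))  # True
```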

Where to put the rate limiter?

There are three different ways to place the rate limiter.

  1. At the client side: It is easy to place the rate limiter on the client side; however, this strategy is not safe because the client can easily be tampered with by malicious actors. Moreover, it is difficult to apply and enforce rate-limiting configuration on the client side.

  2. At the server side: As shown in the following figure, the rate limiter is placed on the server side. In this approach, a server receives a request, which is passed through the rate limiter that resides on the server.

  3. As middleware: In this strategy, the rate limiter acts as middleware, throttling requests to the API servers, as shown in the following figure.

A question might arise: where should the rate limiter be placed, on the server side or as middleware? The answer is subjective. It depends on the organization’s technology stack, engineering resources, priorities, plans, and goals.

Many modern services use APIs to provide their functionality to clients. API endpoints are a good vantage point for rate-limiting incoming client traffic because all of the traffic passes through them.
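As an illustration of the middleware placement, the sketch below wraps an API handler so that every request is checked before it reaches the server. It assumes an in-memory fixed-window counter per client, which a real deployment would replace with a shared store; all names are illustrative.

```python
import time
from collections import defaultdict

# In-memory fixed-window counters keyed by client ID; purely illustrative.
_counters = defaultdict(lambda: {"window_start": 0.0, "count": 0})

def rate_limit_middleware(handler, limit=100, window=60):
    """Wrap an API handler so every request passes through the rate limiter first."""
    def wrapped(client_id, request):
        entry = _counters[client_id]
        now = time.time()
        if now - entry["window_start"] >= window:
            entry["window_start"], entry["count"] = now, 0  # start a new window
        entry["count"] += 1
        if entry["count"] > limit:
            return {"status": 429, "body": "Too Many Requests"}
        return handler(request)
    return wrapped
```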

Two models for implementing a rate limiter

One rate limiter might not be enough to handle the enormous traffic generated by millions of users. Therefore, a better option is to use multiple rate limiters as a cluster of independent nodes. Since there will be numerous rate limiters, each with its corresponding counters (rate limits), there are two ways to use databases to store, retrieve, and update the counters along with the user information.

  1. Rate limiter with a centralized database: In this approach, the rate limiters interact with a centralized database, preferably Redis or Cassandra. The advantage of this model is that because the counters are stored in a centralized database, a client can’t exceed the predefined limit. However, there are a few drawbacks. Latency increases when an enormous number of requests hit the centralized database. Another significant problem is the potential for race conditions under highly concurrent requests (and the associated lock contention). A minimal sketch of this model follows this list.

  2. Rate limiter with a distributed database: Another approach is to use an independent cluster of nodes, where the rate-limiting state lives in a distributed database and each node has to track the rate limit. The problem with this approach is that a client could exceed the rate limit (at least momentarily, while state is being collected from all nodes) by sending requests to different nodes (rate limiters). To enforce the limit strictly, we must set up sticky sessions in the load balancer so that each consumer is sent to exactly one node. However, this approach lacks fault tolerance and has scaling problems when nodes get overloaded.
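The centralized model can be sketched with a fixed-window counter kept in Redis. The code below assumes the redis-py client and a locally running Redis instance; the key format and limits are illustrative. Because INCR is atomic, concurrent rate limiters do not lose updates, but every check still pays a round trip to the central store, which is the latency cost noted above.

```python
import redis  # assumes the redis-py client is installed

r = redis.Redis(host="localhost", port=6379)

def is_allowed(client_id: str, limit: int = 500, window_seconds: int = 60) -> bool:
    """Fixed-window counter kept in a centralized Redis instance."""
    key = f"rate:{client_id}"
    count = r.incr(key)                # atomically increment the client's counter
    if count == 1:
        r.expire(key, window_seconds)  # start the window on the first request
    return count <= limit
```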

Apart from the above two concepts, another question is whether to use a global counter shared by all the incoming requests or individual counters per user. For example, the token bucket algorithm can be implemented in two ways. In the first method, all requests can share the total number of tokens in a single bucket, while in the second method, individual buckets are assigned to users. The choice of using shared or separate counters (buckets) depends on the use case and the rate-limiting rules.
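A minimal token bucket sketch is shown below with per-user buckets; a shared counter would simply reuse one bucket instance for every client. The capacity and refill rate here are illustrative.

```python
import time

class TokenBucket:
    """A simple token bucket: refill_rate tokens per second, up to capacity."""
    def __init__(self, capacity: int, refill_rate: float):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = float(capacity)
        self.last_refill = time.time()

    def allow(self) -> bool:
        now = time.time()
        # Refill tokens accumulated since the last request, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Per-user buckets: each client gets its own bucket (illustrative limits).
user_buckets = {}

def allow_request(client_id: str) -> bool:
    bucket = user_buckets.setdefault(client_id, TokenBucket(capacity=500, refill_rate=500 / 60))
    return bucket.allow()

# A shared bucket would instead use a single TokenBucket instance for all clients.
```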

Food for thought!

Question 1

Can we use a load balancer as a rate limiter?


Building blocks we will use

The design of the rate limiter utilizes the following building blocks that we discussed in the initial chapters.

  • Database(s) will be used to store the rules defined by the service provider and the metadata of users of the service.
  • Cache(s) are used to cache the rules and users’ data for frequent access.
  • Queue(s) are essential for holding the incoming requests that are allowed by the rate limiter.

In the next lesson, we will focus on the high-level and detailed design of a rate limiter based on the requirements discussed in this lesson.