
System Design: Web Crawler

Learn to design a highly scalable web crawler by quantifying resource needs for massive data volumes. Architect a distributed system using key components like the HTML fetcher and scheduler. Implement robust mechanisms to handle challenges like crawler traps and rate limiting, ensuring high throughput and data consistency.

Introduction

A web crawler is a bot that systematically scours (i.e., moves swiftly through, in search of content) the World Wide Web (WWW), starting from a pool of seed URLs. It saves the content in data stores for later use. Efficient storage and retrieval are critical for a robust system.

Crawlers fetch web pages, parse content, and extract URLs for further crawling. This is the foundation of search engines. The crawler’s output feeds into subsequent stages:

  • Data cleaning

  • Indexing

  • Relevance scoring (e.g., PageRank)

  • URL frontier management

  • Analytics

This lesson focuses on the crawler’s system design, excluding downstream stages like indexing or ranking. For those, refer to the chapter on distributed search.

An overview of the web crawler system

Benefits of a web crawler

Beyond data collection, web crawlers provide:

  • Web page testing: Validating links and HTML structures.

  • Web page monitoring: Tracking content or structure updates.

  • Site mirroring: Creating mirrors of popular websites. (Mirroring is like making a dynamic carbon copy of a website: mirror sites are served over protocols such as HTTP or FTP, and their URLs differ from the originals, but their content is similar or almost identical.)

  • Copyright infringement checks: Detecting unauthorized content usage.

Challenges of a web crawler system design

Designing a crawler involves several challenges:

  • Crawler traps: Infinite loops caused by dynamic links or calendar pages.

  • Duplicate content: Repeatedly crawling the same pages wastes resources.

  • Rate limiting: Fetching too many pages from a single domain overloads servers. We need per-domain throttling to manage this.

  • DNS lookup latency: Frequent DNS lookups slow down the process.

  • Scalability: The system must handle millions of seed URLs and distribute the load across multiple servers.

A web crawler is a common system design interview topic to assess how candidates reason about components like the HTML fetcher, extractor, and scheduler. Interviewers often ask questions such as:

  • How would you design a scalable crawler using Redis for caching and AWS for infrastructure?

  • How would you handle request timeouts and website rate limits?

  • What optimization strategies would you use for parsers and fetchers at a FAANG scale?

  • How do metrics like response time and cache hit rate help evaluate performance?

Let’s begin by defining the requirements.

Requirements

We will highlight the functional and non-functional requirements.

Functional requirements

The system must perform the following:

  • Crawling: Scour the web starting from a queue of seed URLs.

  • Storing: Extract and store content in a blob store for indexing and ranking.

  • Scheduling: Regularly schedule crawling to update records.

Non-functional requirements

  • Scalability: The system must be distributed and multithreaded to fetch billions of documents.

  • Extensibility: Support new protocols (beyond HTTP) and file formats via modular extensions.

  • Consistency: Ensure data consistency across all crawling workers.

  • Performance: Use self-throttling to limit crawling per domain (by time or count) to avoid overloading hosts and optimize throughput.

  • Improved UI: Support customized, on-demand crawling beyond routine schedules.

The non-functional requirements of the web crawler system

With requirements established, we can estimate the scale.

Resource estimation

We will estimate storage, time, and server requirements.

Assumptions

We assume the following:

  • Total web pages: 5 billion

  • Text content per page: 2070 KB (a study of 892 processed websites suggests the average size of webpage content is 2070 KB, or about 2.07 MB)

  • Metadata per page: 500 bytes (metadata consists of the web page title and a description of the page’s purpose)

Storage estimation

The total storage required for 5 billion pages is:
\text{Total storage per crawl} = 5\ \text{Billion} \times (2070\ \text{KB} + 500\ \text{B}) = 10.35\ \text{PB}

The total storage required by the web crawler system

Traversal time

Assuming an average HTTP traversal time of 60 ms (traversal time is a function of page size: smaller pages take less than 60 ms and larger pages take longer, but 60 ms is a reasonable average), the time to traverse 5 billion pages is:

\text{Total traversal time} = 5\ \text{Billion} \times 60\ \text{ms} = 0.3\ \text{Billion seconds} \approx 9.5\ \text{years}

A single instance would take 9.5 years. To complete the task in one day, we need a multi-worker architecture.

Server estimation

Assuming one worker per server, we calculate the number of servers needed to finish in one day:

\text{Days required by one server to complete the task} = 9.5\ \text{years} \times 365\ \text{days/year} \approx 3468\ \text{days}

Since one server takes 3,468 days, we need 3,468 servers to complete the task in a single day.

The number of servers required for the web crawler system

Bandwidth estimation

Processing 10.35 PB of data per day requires the following total bandwidth:

\frac{10.35\ \text{PB}}{86400\ \text{sec}} \approx 120\ \text{GB/sec} \approx 960\ \text{Gb/sec}

Distributing this load among 3468 servers, the bandwidth per server is:

\frac{960\ \text{Gb/sec}}{3468\ \text{servers}} \approx 277\ \text{Mb/sec per server}

The total bandwidth required for the web crawler system
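These back-of-the-envelope figures can be reproduced with a short script. This is a sketch using the assumptions stated above (decimal units, 1 KB = 1000 B); note the lesson rounds 9.5 years × 365 to get 3,468 servers, while direct division gives roughly 3,472.

```python
# Back-of-the-envelope resource estimates for the crawler.
PAGES = 5_000_000_000          # total web pages
PAGE_BYTES = 2070 * 1000       # 2070 KB of text content per page
META_BYTES = 500               # metadata per page
FETCH_SECONDS = 0.060          # 60 ms average traversal time per page

storage_bytes = PAGES * (PAGE_BYTES + META_BYTES)
storage_pb = storage_bytes / 1e15                        # ~10.35 PB per crawl

traversal_seconds = PAGES * FETCH_SECONDS                # 0.3 billion seconds
traversal_years = traversal_seconds / (365 * 24 * 3600)  # ~9.5 years on one server

servers_for_one_day = traversal_seconds / 86_400         # ~3,472 servers

bandwidth_gbps = storage_bytes * 8 / 86_400 / 1e9        # ~960 Gb/sec total

print(f"{storage_pb:.2f} PB, {traversal_years:.1f} years, "
      f"{servers_for_one_day:.0f} servers, {bandwidth_gbps:.0f} Gb/s")
```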


These estimates confirm the need for high-scale tools. We will use the following building blocks:

Building blocks

The main components of our design are:

Building blocks in high-level design
  • Scheduler: Schedules crawling events for URLs.

  • DNS: Resolves IP addresses.

  • Cache: Stores fetched documents for quick access.

  • Blob store: Stores crawled content.

We also include these specific components:

  • HTML fetcher: Connects to web hosts to download content.

  • Service host: Manages worker instances.

  • Extractor: Parses URLs and documents from web pages.

  • Duplicate eliminator: Performs deduplication testing on URLs and documents.

The components in a high-level design

We arrange these components to meet our requirements.

Design

This section details the workflow and component interactions.

Components

Key building blocks include:

  • Scheduler: This is one of the key building blocks that schedules URLs for crawling. It’s composed of the following two units: 

    • Priority queue (URL frontier): The queue hosts URLs that are ready for crawling, based on two properties associated with each entry: priority (each URL is assigned a precedence in the URL frontier depending on its content) and update frequency (each URL’s recrawl frequency, which determines how often it is re-placed in the URL frontier).

    • Relational database: It stores all the URLs along with the two associated parameters mentioned above. The database is populated by new requests from the following two input streams:

      • The user’s added URLs, which include seed and runtime added URLs.

      • The crawler’s extracted URLs.
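A minimal sketch of the URL frontier follows: a heap ordered by the two attributes described above. The class name, priority values, and URLs are illustrative assumptions, not part of the lesson’s design.

```python
import heapq
import itertools
import time

class URLFrontier:
    """Priority queue of URLs keyed by (priority, enqueue time).

    Lower `priority` numbers are served first; `recrawl_every` (seconds)
    mirrors the scheduler's update-frequency attribute and drives
    re-enqueueing after a crawl completes.
    """
    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker for stable ordering

    def add(self, url, priority, recrawl_every, now=None):
        now = time.time() if now is None else now
        heapq.heappush(self._heap,
                       (priority, now, next(self._counter), url, recrawl_every))

    def next_url(self):
        priority, _, _, url, recrawl_every = heapq.heappop(self._heap)
        return url, priority, recrawl_every

frontier = URLFrontier()
frontier.add("https://example.com/news", priority=0, recrawl_every=3600)
frontier.add("https://example.com/archive", priority=5, recrawl_every=86400)
url, _, _ = frontier.next_url()   # the higher-priority (lower number) URL comes out first
```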

  • DNS resolver: Maps hostnames to IP addresses. To reduce latency, we use a custom DNS resolver that caches frequently used IPs.
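The caching behavior can be sketched as follows. The TTL value and the injected `resolve_fn` are assumptions for illustration; a real resolver would honor each DNS record’s own TTL.

```python
import socket
import time

class CachingDNSResolver:
    """Memoizes hostname -> IP lookups to cut repeated-resolution latency."""
    def __init__(self, ttl_seconds=300, resolve_fn=socket.gethostbyname):
        self._ttl = ttl_seconds
        self._resolve = resolve_fn
        self._cache = {}          # hostname -> (ip, expiry timestamp)

    def lookup(self, hostname):
        entry = self._cache.get(hostname)
        if entry and entry[1] > time.time():
            return entry[0]       # cache hit: skip the network round trip
        ip = self._resolve(hostname)
        self._cache[hostname] = (ip, time.time() + self._ttl)
        return ip

# Usage with a stub resolver (avoids real network traffic):
calls = []
resolver = CachingDNSResolver(
    resolve_fn=lambda host: calls.append(host) or "93.184.216.34")
resolver.lookup("example.com")
resolver.lookup("example.com")    # second lookup is served from the cache
```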

  • HTML fetcher: Initiates communication with the host server to download content. While primarily focused on HTTP, it is extendable to other protocols.

  • Service host: The “brain” of the crawler, managing worker instances. It performs three main tasks:

    1. Manages the multi-worker architecture. Workers request the next URL from the URL frontier.

    2. Resolves IP addresses via the DNS resolver.

    3. Triggers the HTML fetcher using the resolved IP.

Service host interaction with other components
  • Extractor: Parses the web page to extract URLs and content. It sends URLs to the scheduler and content to the Document Input Stream (DIS) (e.g., Redis) for processing. Once verified as unique, content is stored in the blob store.
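URL extraction from a fetched page can be sketched with the standard-library HTML parser. A real extractor must also filter URL schemes and handle malformed markup; this minimal version only collects `href` attributes and resolves them against the page’s base URL.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects absolute URLs from <a href=...> tags in an HTML document."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Relative links are resolved against the page's own URL.
                    self.links.append(urljoin(self.base_url, value))

extractor = LinkExtractor("https://example.com/index.html")
extractor.feed('<a href="/about">About</a> <a href="https://other.org/">x</a>')
# extractor.links == ["https://example.com/about", "https://other.org/"]
```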

  • Duplicate eliminator: Prevents processing the same content twice. It calculates the checksum of each extracted URL and compares it against the URL checksum data store. If a match is found, the URL is discarded; otherwise, it is added to the store.

Dedup testing in action

The duplicate eliminator repeats this process for document content, storing checksums in the document checksum data store.
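The checksum comparison for both stages can be sketched like this. SHA-256 is an assumption for illustration; production crawlers often use cheaper fingerprints, or schemes such as SimHash for near-duplicate detection.

```python
import hashlib

class DuplicateEliminator:
    """Checksum-based dedup for URLs and document bodies."""
    def __init__(self):
        self._url_checksums = set()   # stands in for the URL checksum store
        self._doc_checksums = set()   # stands in for the document checksum store

    @staticmethod
    def _checksum(data: bytes) -> str:
        return hashlib.sha256(data).hexdigest()

    def is_new_url(self, url: str) -> bool:
        digest = self._checksum(url.encode())
        if digest in self._url_checksums:
            return False              # already seen: discard the URL
        self._url_checksums.add(digest)
        return True

    def is_new_document(self, body: bytes) -> bool:
        digest = self._checksum(body)
        if digest in self._doc_checksums:
            return False              # duplicate content: skip blob storage
        self._doc_checksums.add(digest)
        return True

dedup = DuplicateEliminator()
dedup.is_new_url("https://example.com/a")    # first sighting: True
dedup.is_new_url("https://example.com/a")    # checksum matches: False
```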

The proposed design for the duplicate eliminator is robust against these two issues:

  1. URL redirection: A redirected URL may pass the URL dedup test, but the second-stage document dedup still keeps duplicated content out of the blob store.

  2. Near-identical documents: Changing even one byte of a document produces a checksum different from the original’s, so modified copies are treated as new content.

  • Blob store: Because a web crawler is a core component of a search engine, storing and indexing fetched content and related metadata is essential. The system requires distributed storage, such as a blob store, to store large volumes of unstructured data.

The following illustration shows the overall design:

The overall web crawler design

Workflow

  1. Assignment to a worker: The crawler (service host) initiates the process by loading a URL from the URL frontier’s priority queue and assigns it to the available worker.

  2. DNS resolution: The worker sends the incoming URL for DNS resolution. Before resolving the URL, the DNS resolver checks the cache and returns the requested IP address if it’s found. Otherwise, it determines the IP address, sends it back to the worker instance of the crawler, and stores the result in the cache.

  3. Communication initiation by the HTML fetcher: The worker forwards the URL and the associated IP address to the HTML fetcher, which initiates the communication between the crawler and the host server.

  4. Content extraction: Once the worker establishes the communication, it extracts the URLs and the HTML document from the web page and places the document in a cache for other components to process.

  5. Dedup testing: The worker sends the extracted URLs and the document to the duplicate eliminator for dedup testing. The duplicate eliminator calculates the checksums of both the URL and the document and compares them with the checksum values already stored. It discards the incoming request in case of a match; if there’s no match, it places the newly calculated checksum values in the respective data stores and gives the extractor the go-ahead to store the content.

  6. Content storing: The extractor sends the newly discovered URLs to the scheduler, which stores them in the database and sets their priority and recrawl-frequency values. The extractor also writes the required portions of the newly discovered document, currently in the DIS, into the blob store.

  7. Recrawling: Once a cycle is complete, the crawler goes back to the first step and repeats the process until the URL frontier is empty. The URLs stored in the scheduler’s database have a priority and a periodicity assigned to them; enqueuing new URLs into the URL frontier depends on these two factors.

Note: Given the microservices architecture, the design can utilize client-side load balancing.
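The workflow above can be summarized as a worker loop. Every component here is a stand-in stub, and the names and call signatures are assumptions for illustration only.

```python
from collections import deque

class StubFrontier:
    """Tiny stand-in for the URL frontier's priority queue."""
    def __init__(self, urls): self._q = deque(urls)
    def empty(self): return not self._q
    def next_url(self): return self._q.popleft()

def crawl_worker(frontier, resolve, fetch, parse, seen, store, enqueue):
    """One crawl cycle per iteration, mirroring the numbered workflow steps."""
    while not frontier.empty():
        url = frontier.next_url()                # 1. assignment to a worker
        ip = resolve(url)                        # 2. DNS resolution (cached)
        document = fetch(url, ip)                # 3. HTML fetcher downloads content
        links, content = parse(document)         # 4. content extraction
        if content not in seen:                  # 5. dedup testing (checksum stand-in)
            seen.add(content)
            store[url] = content                 # 6. content storing in the blob store
        for link in links:
            enqueue(link)                        # 7. new URLs feed future crawl cycles

# Demo with stub components: a single page yielding one new link.
store, seen, discovered = {}, set(), []
crawl_worker(
    StubFrontier(["https://example.com/"]),
    resolve=lambda url: "93.184.216.34",
    fetch=lambda url, ip: "<a href='/about'>hi</a>",
    parse=lambda doc: (["https://example.com/about"], doc),
    seen=seen, store=store, enqueue=discovered.append,
)
```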


Why and when do we need to recrawl?

Recrawling is the process of revisiting and updating indexed pages. Keeping this in mind, try to answer the following questions:


  1. Why do we need to recrawl?
  2. How do we decide when to recrawl?

We will now refine the design to enhance functionality, performance, and security.

Design improvements

We address specific shortcomings in the initial design:

  • Shortcoming: The design currently supports only HTTP and text extraction.

  • Adjustment:

    • HTML fetcher: Add modules for other protocols (e.g., FTP). The crawler invokes the correct module based on the URL scheme. The subsequent steps will remain the same.

    • Extractor: Add modules to process non-text media (images, videos) from the Document Input Stream (DIS). These are stored in the blob store alongside text.

The extensibility of the HTML fetcher and extractor

  • Shortcoming: The design lacks details on distributing work among multiple workers.
    Adjustment: Workers dequeue URLs as they become available. We can distribute tasks using:

    • Domain-level assignment: Assign an entire domain to a specific worker (by hashing the hostname). This prevents redundant crawling and supports reverse URL indexing.

    • Range division: Assign ranges of URLs to workers.

    • Per URL crawling: Workers take individual URLs. This requires coordination to avoid collisions.
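Domain-level assignment can be sketched with a stable hash of the hostname. A cryptographic hash is used instead of Python’s built-in `hash`, which is randomized per process; the worker count and URLs are illustrative assumptions.

```python
import hashlib
from urllib.parse import urlsplit

def worker_for_url(url: str, num_workers: int) -> int:
    """Maps a URL's hostname to a worker ID, so one worker owns a whole domain."""
    hostname = urlsplit(url).hostname or ""
    digest = hashlib.md5(hostname.encode()).hexdigest()
    return int(digest, 16) % num_workers

# All URLs from the same domain land on the same worker:
a = worker_for_url("https://example.com/page1", num_workers=8)
b = worker_for_url("https://example.com/page2", num_workers=8)
# a == b
```

Simple modulo hashing works until workers are added or removed; the consistent-hashing variant discussed later under scalability avoids remapping most domains in that case.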

Crawler traps

A crawler trap is a URL structure that causes indefinite crawling, exhausting resources.

Classification

Traps often result from poor website structure:

  • Query parameters: Useless variations of a page (e.g., http://www.abc.com?query).

  • Internal links: Infinite redirection loops within a domain.

  • Calendar pages: Infinite combinations of dates.

  • Dynamic content: Infinite pages generated from queries.

  • Cyclic directories: Loops like http://www.abc.com/first/second/first/second/....

Classification of web crawler traps

Traps can be accidental or intentional (malicious). They waste throughput (useful content crawled in a time period) and negatively impact the website’s SEO.

Identification

We identify traps by analyzing:

  1. URL scheme: Detecting patterns like cyclic directories.

  2. Page count: Flagging domains with an implausible number of pages.
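Both heuristics can be sketched as a quick URL check. The depth and repetition thresholds here are illustrative assumptions; real crawlers tune them per domain and combine them with per-domain page-count tracking.

```python
from urllib.parse import urlsplit

def looks_like_trap(url: str, max_depth: int = 10, max_repeats: int = 2) -> bool:
    """Flags URLs whose path is suspiciously deep or cyclic."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    if len(segments) > max_depth:
        return True                      # implausibly deep path
    for segment in set(segments):
        if segments.count(segment) > max_repeats:
            return True                  # cyclic directory, e.g. /first/second/first/...
    return False

looks_like_trap("http://www.abc.com/first/second/first/second/first/second/first")
# -> True: the repeated segments suggest a cyclic directory
looks_like_trap("http://www.abc.com/blog/2024/post")
# -> False: shallow, non-repeating path
```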

The identification process of web crawler traps

Solution

To avoid traps and optimize resources, we implement the following:

  1. Application logic: Limit crawling on a domain based on page count or depth. The crawler should identify traps and mark those areas as “no-go.”

  2. Robots exclusion protocol: Fetch and adhere to the robots.txt file. This file specifies allowed pages and revisit frequency, preventing unnecessary crawling.

Note: robots.txt does not prevent malicious traps. Other mechanisms must handle those.

  3. Politeness: Adjust crawl speed based on the domain’s time to first byte (TTFB). Slower servers receive slower crawls to avoid timeouts and overload.
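The robots exclusion check is available in Python’s standard library via `urllib.robotparser`. The sketch below parses an in-memory robots.txt instead of fetching one over the network; the rules and user-agent name are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Rules normally fetched from https://<host>/robots.txt.
rules = """
User-agent: *
Disallow: /calendar/
Crawl-delay: 2
"""
parser = RobotFileParser()
parser.parse(rules.splitlines())

parser.can_fetch("mybot", "https://example.com/calendar/2030/01")  # -> False
parser.can_fetch("mybot", "https://example.com/articles/1")        # -> True
parser.crawl_delay("mybot")                                        # -> 2
```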

These modifications ensure the crawler is robust and efficient. Next, we validate the architecture against our requirements.

Requirements compliance

We evaluate the design against non-functional requirements.

Scalability

The design supports horizontal scaling:

  • Components (schedulers, workers, fetchers, stores) can be added or removed on demand.

  • Consistent hashing distributes hostnames across workers, allowing seamless addition/removal of servers.
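A minimal consistent-hash ring for assigning hostnames to workers is sketched below. The virtual-node count is an illustrative assumption; production rings add replication and per-node weighting.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Maps hostnames to workers; adding or removing a worker remaps only ~1/N keys."""
    def __init__(self, workers, vnodes=100):
        self._ring = []                  # sorted list of (hash point, worker)
        for worker in workers:
            for i in range(vnodes):      # virtual nodes smooth the distribution
                self._ring.append((self._hash(f"{worker}#{i}"), worker))
        self._ring.sort()

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def worker_for(self, hostname: str) -> str:
        point = self._hash(hostname)
        # First ring entry clockwise from the hostname's hash point.
        index = bisect.bisect(self._ring, (point,)) % len(self._ring)
        return self._ring[index][1]

ring = ConsistentHashRing(["worker-1", "worker-2", "worker-3"])
owner = ring.worker_for("example.com")   # stable assignment across calls
```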

Extensibility and modularity

So far, our design has focused only on a particular communication protocol: HTTP. But according to our non-functional requirements, the system should accommodate other network communication protocols, such as FTP.

To achieve this extensibility, we only need to add additional modules for the newly required communication protocols in the HTML fetcher. The respective modules will then be responsible for making and maintaining the required communications with the host servers.

Along the same lines, we expect our design to extend its functionality to other MIME types as well (a Multipurpose Internet Mail Extensions type, or MIME type, is an Internet standard that describes a file’s contents based on its nature and format). The modular approach for different MIME schemes facilitates this requirement: the worker calls the associated MIME type’s processing module to extract the content from the document stored in the DIS.

The requirements met by our web crawler

Consistency

The system consists of several crawling workers, so data consistency among the crawled content is crucial. To avoid inconsistency and duplicate crawling, the system computes checksums of URLs and documents and compares them against the existing checksums in the URL and document checksum data stores, respectively.

Apart from deduplication, to ensure data consistency under fault conditions, all servers can regularly checkpoint their state to a backup service, such as Amazon S3, or to offline disks.

Performance

The proposed web crawler’s performance depends on the following factors:

  • URLs crawled per second: We can improve this factor by adding new workers to the system.

  • Utilizing blob storage for content storing: This ensures higher throughput for the massive amount of unstructured data and fast retrieval of the stored content, because a single blob can support up to 500 requests per second.

  • Efficient implementation of the robots.txt guidelines: We can implement this by giving robots.txt guidelines the highest precedence in the application-layer crawling logic.

  • Self-throttling: We can have various application-level checks to ensure that our web crawler doesn’t hamper the performance of the website host servers by exhausting their resources.
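Self-throttling can be sketched as a per-domain minimum delay between fetches. The one-second default is an assumption; a politeness-aware crawler would derive the delay from the host’s Crawl-delay directive or its observed TTFB.

```python
import time
from collections import defaultdict

class DomainThrottle:
    """Enforces a minimum delay between consecutive fetches to the same domain."""
    def __init__(self, min_delay_seconds=1.0):
        self._min_delay = min_delay_seconds
        self._last_fetch = defaultdict(float)   # domain -> last-fetch timestamp

    def wait_if_needed(self, domain):
        elapsed = time.monotonic() - self._last_fetch[domain]
        if elapsed < self._min_delay:
            time.sleep(self._min_delay - elapsed)   # back off before refetching
        self._last_fetch[domain] = time.monotonic()

# Usage: call throttle.wait_if_needed("example.com") before each fetch to that host.
throttle = DomainThrottle(min_delay_seconds=1.0)
```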

Scheduling

As established previously, we may need to recrawl URLs at various frequencies, determined by each URL’s application. We can determine the recrawl frequency in two ways:

  1. We can assign a default or a specific recrawl frequency to each URL. This assignment depends on the application of the URL, which defines its priority: standard-priority URLs receive a default frequency, and higher-priority URLs receive a higher recrawl frequency. Based on each URL’s associated recrawl frequency, we decide when to enqueue it in the priority queue from the scheduler’s database; the priority defines the URL’s place in the queue.

  2. The second method is to have separate queues for various priority URLs, use URLs from high-priority queues first, and subsequently move to the lower-priority URLs.
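The first approach, a per-URL recrawl frequency, can be sketched as follows. The interval values and URLs are illustrative assumptions.

```python
import heapq

class RecrawlScheduler:
    """Re-enqueues URLs when their recrawl interval elapses."""
    DEFAULT_FREQ = 7 * 24 * 3600      # standard-priority: weekly (assumed)
    HIGH_FREQ = 3600                  # high-priority: hourly (assumed)

    def __init__(self):
        self._due = []                # heap of (next crawl time, url, interval)

    def register(self, url, high_priority=False, now=0.0):
        interval = self.HIGH_FREQ if high_priority else self.DEFAULT_FREQ
        heapq.heappush(self._due, (now + interval, url, interval))

    def pop_due(self, now):
        """Returns URLs whose recrawl time has arrived, rescheduling each one."""
        ready = []
        while self._due and self._due[0][0] <= now:
            due_at, url, interval = heapq.heappop(self._due)
            ready.append(url)
            heapq.heappush(self._due, (due_at + interval, url, interval))
        return ready

scheduler = RecrawlScheduler()
scheduler.register("https://news.example.com/", high_priority=True)
scheduler.register("https://archive.example.com/")
scheduler.pop_due(now=3600)   # only the hourly high-priority URL is due
```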

The following table summarizes the techniques for achieving non-functional requirements in a web crawler system:

Non-Functional Requirements Compliance

Scalability

  • Addition/removal of different servers based on the increase/decrease in load
  • Consistent hashing to manage servers’ addition and removal
  • Regular backups of server state to a service such as Amazon S3 to achieve fault tolerance

Extensibility and Modularity

  • Addition of a newer communication protocol module in the HTML fetcher
  • Addition of new MIME schemes while processing the downloaded document in DIS

Consistency

  • Calculation and comparison of checksums of URLs and Documents in the respective data stores

Performance

  • Increasing the number of workers performing the crawl
  • Blob stores for storing the content
  • High priority to robots.txt file guidelines while crawling
  • Self-throttle at a domain while crawling

Scheduling

  • Pre-defined default recrawl frequency, or
  • Separate queues and their associated frequencies for various priority URLs

Share strategies for handling IP blocks
Consider a real-time news aggregation platform, where content changes rapidly, and frequent recrawls are crucial for up-to-date information. What immediate steps would you take if the IP faces blocking issues during frequent crawls? Share strategies for handling IP blocks to ensure continuous access to the latest content.

Conclusion

The web crawler system entails a multi-worker design built on a microservices architecture. Besides achieving the basic crawling functionality, the design identifies potential shortcomings and challenges and rectifies them with appropriate modifications. Its noteworthy features are as follows:

  1. Identification and design modification for crawler traps.

  2. Extensibility of HTML fetching and content extraction modules.