
System Design: Web Crawler

Learn to design a highly scalable web crawler by quantifying resource needs for massive data volumes. Architect a distributed system using key components like the HTML fetcher and scheduler. Implement robust mechanisms to handle challenges like crawler traps and rate limiting, ensuring high throughput and data consistency.

Introduction

A web crawler is a bot that systematically scours (i.e., moves swiftly through, in search of content) the World Wide Web (WWW), starting from a pool of seed URLs. It saves the content in data stores for later use. Efficient storage and retrieval are critical for a robust system.

Crawlers fetch web pages, parse content, and extract URLs for further crawling. This is the foundation of search engines. The crawler’s output feeds into subsequent stages:

  • Data cleaning

  • Indexing

  • Relevance scoring (e.g., PageRank)

  • URL frontier management

  • Analytics

This lesson focuses on the crawler’s system design, excluding downstream stages like indexing or ranking. For those, refer to the chapter on distributed search.

An overview of the web crawler system

Benefits of a web crawler

Beyond data collection, web crawlers provide:

  • Web page testing: Validating links and HTML structures.

  • Web page monitoring: Tracking content or structure updates.

  • Site mirroring: Creating mirrors of popular websites. (Mirroring is like making a dynamic carbon copy of a website: mirror sites are served over protocols such as HTTP or FTP, and their URLs differ from the originals, but their content is similar or almost identical.)

  • Copyright infringement checks: Detecting unauthorized content usage.

Challenges of a web crawler system design

Designing a crawler involves several challenges:

  • Crawler traps: Infinite loops caused by dynamic links or calendar pages.

  • Duplicate content: Repeatedly crawling the same pages wastes resources.

  • Rate limiting: Fetching too many pages from a single domain overloads servers. We need per-domain throttling to manage this.

  • DNS lookup latency: Frequent DNS lookups slow down the process.

  • Scalability: The system must handle millions of seed URLs and distribute the load across multiple servers.

A web crawler is a common system design interview topic to assess how candidates reason about components like the HTML fetcher, extractor, and scheduler. Interviewers often ask questions such as:

  • How would you design a scalable crawler using Redis for caching and AWS for infrastructure?

  • How would you handle request timeouts and website rate limits?

  • What optimization strategies would you use for parsers and fetchers at a FAANG scale?

  • How do metrics like response time and cache hit rate help evaluate performance?

Let’s begin by defining the requirements.

Requirements

We will highlight the functional and non-functional requirements.

Functional requirements

The system must perform the following:

  • Crawling: Scour the web starting from a queue of seed URLs.

  • Storing: Extract and store content in a blob store for indexing and ranking.

  • Scheduling: Regularly schedule crawling to update records.

Non-functional requirements

  • Scalability: The system must be distributed and multithreaded to fetch billions of documents.

  • Extensibility: Support new protocols (beyond HTTP) and file formats via modular extensions.

  • Consistency: Ensure data consistency across all crawling workers.

  • Performance: Use self-throttling to limit crawling per domain (by time or count) to avoid overloading hosts and optimize throughput.

  • Improved UI: Support customized, on-demand crawling beyond routine schedules.

The non-functional requirements of the web crawler system

With requirements established, we can estimate the scale.

Resource estimation

We will estimate storage, time, and server requirements.

Assumptions

We assume the following:

  • Total web pages: 5 billion

  • Text content per page: 2070 KB (a study of 892 processed websites suggests the average size of webpage content is 2070 KB, or about 2.07 MB)

  • Metadata per page: 500 bytes (metadata consists of the web page title and a description of the page’s purpose)

Storage estimation

The total storage required for 5 billion pages is:
\text{Total storage per crawl} = 5\ \text{Billion} \times (2070\ \text{KB} + 500\ \text{B}) = 10.35\ \text{PB}

The total storage required by the web crawler system

Traversal time

Assuming an average HTTP traversal time of 60 ms (traversal time is a function of page size: smaller pages take less than 60 ms and larger pages take longer, but 60 ms is a reasonable average), the time to traverse 5 billion pages is:

\text{Total traversal time} = 5\ \text{Billion} \times 60\ \text{ms} = 0.3\ \text{Billion seconds} \approx 9.5\ \text{years}

A single instance would take 9.5 years. To complete the task in one day, we need a multi-worker architecture.

Server estimation

Assuming one worker per server, we calculate the number of servers needed to finish in one day:

\text{Days required by one server to complete the task} = 9.5\ \text{years} \times 365\ \text{days/year} \approx 3468\ \text{days}

Since one server takes 3,468 days, we need 3,468 servers to complete the task in a single day.

The number of servers required for the web crawler system

Bandwidth estimation

Processing 10.35 PB of data per day requires the following total bandwidth:

\frac{10.35\ \text{PB}}{86400\ \text{sec}} \approx 120\ \text{GB/sec} \approx 960\ \text{Gb/sec}

Distributing this load among 3468 servers, the bandwidth per server is:

\frac{960\ \text{Gb/sec}}{3468\ \text{servers}} \approx 277\ \text{Mb/sec per server}

The total bandwidth required for the web crawler system
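These back-of-the-envelope figures can be reproduced with a short script. This is a sketch using the assumptions stated above (decimal units, 1 KB = 1000 B); note the lesson rounds 9.5 years × 365 to get 3,468 servers, while direct division gives roughly 3,472.

```python
# Back-of-the-envelope resource estimates for the crawler.
PAGES = 5_000_000_000          # total web pages
PAGE_BYTES = 2070 * 1000       # 2070 KB of text content per page
META_BYTES = 500               # metadata per page
FETCH_SECONDS = 0.060          # 60 ms average traversal time per page

storage_bytes = PAGES * (PAGE_BYTES + META_BYTES)
storage_pb = storage_bytes / 1e15                        # ~10.35 PB per crawl

traversal_seconds = PAGES * FETCH_SECONDS                # 0.3 billion seconds
traversal_years = traversal_seconds / (365 * 24 * 3600)  # ~9.5 years on one server

servers_for_one_day = traversal_seconds / 86_400         # ~3,472 servers

bandwidth_gbps = storage_bytes * 8 / 86_400 / 1e9        # ~960 Gb/sec total

print(f"{storage_pb:.2f} PB, {traversal_years:.1f} years, "
      f"{servers_for_one_day:.0f} servers, {bandwidth_gbps:.0f} Gb/s")
```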


These estimates confirm the need for high-scale tools. We will use the following building blocks:

Building blocks

The main components of our design are:

Building blocks in high-level design
  • Scheduler: Schedules crawling events for URLs.

  • DNS: Resolves IP addresses.

  • Cache: Stores fetched documents for quick access.

  • Blob store: Stores crawled content.

We also include these specific components:

  • HTML fetcher: Connects to web hosts to download content.

  • Service host: Manages worker instances.

  • Extractor: Parses URLs and documents from web pages.

  • Duplicate eliminator: Performs deduplication testing on URLs and documents.

The components in a high-level design

We arrange these components to meet our requirements.

Design

This section details the workflow and component interactions.

Components

Key building blocks include:

  • Scheduler: This is one of the key building blocks that schedules URLs for crawling. It’s composed of the following two units: 

    • Priority queue (URL frontier): The queue hosts URLs that are ready for crawling, based on two properties associated with each entry: priority (each URL is assigned a precedence in the URL frontier depending on its content) and update frequency (each URL’s recrawl frequency, which determines how often it is re-placed in the URL frontier).

    • Relational database: It stores all the URLs along with the two associated parameters mentioned above. The database is populated by new requests from the following two input streams:

      • The user’s added URLs, which include seed and runtime added URLs.

      • The crawler’s extracted URLs.
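A minimal sketch of the URL frontier follows: a heap ordered by the two attributes described above. The class name, priority values, and URLs are illustrative assumptions, not part of the lesson’s design.

```python
import heapq
import itertools
import time

class URLFrontier:
    """Priority queue of URLs keyed by (priority, enqueue time).

    Lower `priority` numbers are served first; `recrawl_every` (seconds)
    mirrors the scheduler's update-frequency attribute and drives
    re-enqueueing after a crawl completes.
    """
    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker for stable ordering

    def add(self, url, priority, recrawl_every, now=None):
        now = time.time() if now is None else now
        heapq.heappush(self._heap,
                       (priority, now, next(self._counter), url, recrawl_every))

    def next_url(self):
        priority, _, _, url, recrawl_every = heapq.heappop(self._heap)
        return url, priority, recrawl_every

frontier = URLFrontier()
frontier.add("https://example.com/news", priority=0, recrawl_every=3600)
frontier.add("https://example.com/archive", priority=5, recrawl_every=86400)
url, _, _ = frontier.next_url()   # the higher-priority (lower number) URL comes out first
```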

  • DNS resolver: Maps hostnames to IP addresses. To reduce latency, we use a custom DNS resolver that caches frequently used IPs.
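The caching behavior can be sketched as follows. The TTL value and the injected `resolve_fn` are assumptions for illustration; a real resolver would honor each DNS record’s own TTL.

```python
import socket
import time

class CachingDNSResolver:
    """Memoizes hostname -> IP lookups to cut repeated-resolution latency."""
    def __init__(self, ttl_seconds=300, resolve_fn=socket.gethostbyname):
        self._ttl = ttl_seconds
        self._resolve = resolve_fn
        self._cache = {}          # hostname -> (ip, expiry timestamp)

    def lookup(self, hostname):
        entry = self._cache.get(hostname)
        if entry and entry[1] > time.time():
            return entry[0]       # cache hit: skip the network round trip
        ip = self._resolve(hostname)
        self._cache[hostname] = (ip, time.time() + self._ttl)
        return ip

# Usage with a stub resolver (avoids real network traffic):
calls = []
resolver = CachingDNSResolver(
    resolve_fn=lambda host: calls.append(host) or "93.184.216.34")
resolver.lookup("example.com")
resolver.lookup("example.com")    # second lookup is served from the cache
```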

  • HTML fetcher: Initiates communication with the host server to download content. While primarily focused on HTTP, it is extendable to other protocols.

  • Service host: The “brain” of the crawler, managing worker instances. It performs three main tasks:

    1. Manages the multi-worker architecture. Workers request the next URL from the URL frontier.

    2. Resolves IP addresses via the DNS resolver.

    3. Triggers the HTML fetcher using the resolved IP.

Service host interaction with other components
  • Extractor: Parses the web page to extract URLs and content. It sends URLs to the scheduler and content to the Document Input Stream (DIS) (e.g., Redis) for processing. Once verified as unique, content is stored in the blob store.
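URL extraction from a fetched page can be sketched with the standard-library HTML parser. A real extractor must also filter URL schemes and handle malformed markup; this minimal version only collects `href` attributes and resolves them against the page’s base URL.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects absolute URLs from <a href=...> tags in an HTML document."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Relative links are resolved against the page's own URL.
                    self.links.append(urljoin(self.base_url, value))

extractor = LinkExtractor("https://example.com/index.html")
extractor.feed('<a href="/about">About</a> <a href="https://other.org/">x</a>')
# extractor.links == ["https://example.com/about", "https://other.org/"]
```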

  • Duplicate eliminator: Prevents processing the same content twice. It calculates the checksum of each extracted URL and compares it against the URL checksum data store. If a match is found, the URL is discarded; otherwise, it is added to the store.

Dedup testing in action

The duplicate eliminator repeats this process for document content, storing checksums in the document checksum data store.
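The checksum comparison for both stages can be sketched like this. SHA-256 is an assumption for illustration; production crawlers often use cheaper fingerprints, or schemes such as SimHash for near-duplicate detection.

```python
import hashlib

class DuplicateEliminator:
    """Checksum-based dedup for URLs and document bodies."""
    def __init__(self):
        self._url_checksums = set()   # stands in for the URL checksum store
        self._doc_checksums = set()   # stands in for the document checksum store

    @staticmethod
    def _checksum(data: bytes) -> str:
        return hashlib.sha256(data).hexdigest()

    def is_new_url(self, url: str) -> bool:
        digest = self._checksum(url.encode())
        if digest in self._url_checksums:
            return False              # already seen: discard the URL
        self._url_checksums.add(digest)
        return True

    def is_new_document(self, body: bytes) -> bool:
        digest = self._checksum(body)
        if digest in self._doc_checksums:
            return False              # duplicate content: skip blob storage
        self._doc_checksums.add(digest)
        return True

dedup = DuplicateEliminator()
dedup.is_new_url("https://example.com/a")    # first sighting: True
dedup.is_new_url("https://example.com/a")    # checksum matches: False
```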

The proposed design for the duplicate eliminator is robust against these two issues:

  1. URL redirection: A redirected URL may pass the URL dedup test, but the second-stage document dedup still keeps duplicated content out of the blob store.

  2. Near-identical documents: Changing even one byte of a document produces a checksum different from the original’s, so modified copies are treated as new content.

  • Blob store: Because a web crawler is a core component of a search engine, storing and indexing fetched content and related metadata is essential. The system requires distributed storage, such as a blob store, to store large volumes of unstructured data.

The following illustration shows the overall design:

The overall web crawler design

Workflow

  1. Assignment to a worker: The crawler (service host) initiates the process by loading a URL from the URL frontier’s priority queue and assigns it to the available worker.

  2. DNS resolution: The worker sends the incoming URL for DNS resolution. Before resolving the URL, the DNS resolver checks the cache and returns the requested IP address if it’s found. Otherwise, it determines the IP address, sends it back to the worker instance of the crawler, and stores the result in the cache.

  3. Communication initiation by the HTML fetcher: The worker forwards the URL and the associated IP address to the HTML fetcher, which initiates the communication between the crawler and the host server.

  4. Content extraction: Once the worker establishes the communication, it extracts the URLs and the HTML document from the web page and places the document in a cache for other components to process.

  5. Dedup testing: The worker sends the extracted URLs and the document to the duplicate eliminator for dedup testing. The duplicate eliminator calculates the checksums of both the URL and the document and compares them with the checksum values already stored. It discards the incoming request in case of a match; if there’s no match, it places the newly calculated checksum values in the respective data stores and gives the extractor the go-ahead to store the content.

  6. Content storing: The extractor sends the newly discovered URLs to the scheduler, which stores them in the database and sets their priority and recrawl-frequency values. The extractor also writes the required portions of the newly discovered document, currently in the DIS, into the blob store.

  7. Recrawling: Once a cycle is complete, the crawler goes back to the first step and repeats the process until the URL frontier is empty. The URLs stored in the scheduler’s database have a priority and a periodicity assigned to them; enqueuing new URLs into the URL frontier depends on these two factors.

Note: Given the microservices architecture, the design can utilize client-side load balancing.
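The workflow above can be summarized as a worker loop. Every component here is a stand-in stub, and the names and call signatures are assumptions for illustration only.

```python
from collections import deque

class StubFrontier:
    """Tiny stand-in for the URL frontier's priority queue."""
    def __init__(self, urls): self._q = deque(urls)
    def empty(self): return not self._q
    def next_url(self): return self._q.popleft()

def crawl_worker(frontier, resolve, fetch, parse, seen, store, enqueue):
    """One crawl cycle per iteration, mirroring the numbered workflow steps."""
    while not frontier.empty():
        url = frontier.next_url()                # 1. assignment to a worker
        ip = resolve(url)                        # 2. DNS resolution (cached)
        document = fetch(url, ip)                # 3. HTML fetcher downloads content
        links, content = parse(document)         # 4. content extraction
        if content not in seen:                  # 5. dedup testing (checksum stand-in)
            seen.add(content)
            store[url] = content                 # 6. content storing in the blob store
        for link in links:
            enqueue(link)                        # 7. new URLs feed future crawl cycles

# Demo with stub components: a single page yielding one new link.
store, seen, discovered = {}, set(), []
crawl_worker(
    StubFrontier(["https://example.com/"]),
    resolve=lambda url: "93.184.216.34",
    fetch=lambda url, ip: "<a href='/about'>hi</a>",
    parse=lambda doc: (["https://example.com/about"], doc),
    seen=seen, store=store, enqueue=discovered.append,
)
```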


Why and when do we need to recrawl?

Recrawling is the process of revisiting and updating indexed pages. Keeping this in mind, try to answer the following questions:


  1. Why do we need to recrawl?
  2. How do we decide when to recrawl?

We will now refine the design to enhance functionality, performance, and security.

Design improvements

We address specific shortcomings in the initial design:

  • Shortcoming: The design currently supports only HTTP and text extraction.

  • Adjustment:

    • HTML fetcher: Add modules for other protocols (e.g., FTP). The crawler invokes the correct module based on the URL scheme. The subsequent steps will remain the same.

    • Extractor: Add modules to process non-text media (images, videos) from the Document Input Stream (DIS). These are stored in the blob store alongside text.

The extensibility of the HTML fetcher and extractor

  • Shortcoming: The design lacks details on distributing work among multiple workers.
    Adjustment: Workers dequeue URLs as they become available. We can distribute tasks using:

    • Domain-level assignment: Assign an entire domain to a specific worker (by hashing the hostname). This prevents redundant crawling and supports reverse URL indexing.

    • Range division: Assign ranges of URLs to workers.

    • Per URL crawling: Workers take individual URLs. This requires coordination to avoid collisions.
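Domain-level assignment can be sketched with a stable hash of the hostname. A cryptographic hash is used instead of Python’s built-in `hash`, which is randomized per process; the worker count and URLs are illustrative assumptions.

```python
import hashlib
from urllib.parse import urlsplit

def worker_for_url(url: str, num_workers: int) -> int:
    """Maps a URL's hostname to a worker ID, so one worker owns a whole domain."""
    hostname = urlsplit(url).hostname or ""
    digest = hashlib.md5(hostname.encode()).hexdigest()
    return int(digest, 16) % num_workers

# All URLs from the same domain land on the same worker:
a = worker_for_url("https://example.com/page1", num_workers=8)
b = worker_for_url("https://example.com/page2", num_workers=8)
# a == b
```

Simple modulo hashing works until workers are added or removed; the consistent-hashing variant discussed later under scalability avoids remapping most domains in that case.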

Crawler traps

A crawler trap is a URL structure that causes indefinite crawling, exhausting resources.

Classification

Traps often result from poor website structure:

  • Query parameters: Useless variations of a page (e.g., http://www.abc.com?query).

  • Internal links: Infinite redirection loops within a domain.

  • Calendar pages: Infinite combinations of dates.

  • Dynamic content: Infinite pages generated from queries.

  • Cyclic directories: Loops like http://www.abc.com/first/second/first/second/....

Classification of web crawler traps

Traps can be accidental or intentional (malicious). They waste throughput (useful content crawled in a time period) and negatively impact the website’s SEO.

Identification

We identify traps by analyzing:

  1. URL scheme: Detecting patterns like cyclic directories.

  2. Page count: Flagging domains with an implausible number of pages.
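Both heuristics can be sketched as a quick URL check. The depth and repetition thresholds here are illustrative assumptions; real crawlers tune them per domain and combine them with per-domain page-count tracking.

```python
from urllib.parse import urlsplit

def looks_like_trap(url: str, max_depth: int = 10, max_repeats: int = 2) -> bool:
    """Flags URLs whose path is suspiciously deep or cyclic."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    if len(segments) > max_depth:
        return True                      # implausibly deep path
    for segment in set(segments):
        if segments.count(segment) > max_repeats:
            return True                  # cyclic directory, e.g. /first/second/first/...
    return False

looks_like_trap("http://www.abc.com/first/second/first/second/first/second/first")
# -> True: the repeated segments suggest a cyclic directory
looks_like_trap("http://www.abc.com/blog/2024/post")
# -> False: shallow, non-repeating path
```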

The identification process of web crawler traps

Solution

To avoid traps and optimize resources, we implement the following:

  1. Application logic: Limit crawling on a domain based on page count or depth. The crawler should identify traps and mark those areas as “no-go.”

  2. Robots exclusion protocol: Fetch and adhere to the robots.txt file. This file specifies allowed pages and revisit frequency, preventing unnecessary crawling.

Note: robots.txt does not prevent malicious traps. Other mechanisms must handle those.

  3. Politeness: Adjust crawl speed based on the domain’s time to first byte (TTFB). Slower servers receive slower crawls to avoid timeouts and overload.
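The robots exclusion check is available in Python’s standard library via `urllib.robotparser`. The sketch below parses an in-memory robots.txt instead of fetching one over the network; the rules and user-agent name are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Rules normally fetched from https://<host>/robots.txt.
rules = """
User-agent: *
Disallow: /calendar/
Crawl-delay: 2
"""
parser = RobotFileParser()
parser.parse(rules.splitlines())

parser.can_fetch("mybot", "https://example.com/calendar/2030/01")  # -> False
parser.can_fetch("mybot", "https://example.com/articles/1")        # -> True
parser.crawl_delay("mybot")                                        # -> 2
```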

These modifications ensure the crawler is robust and efficient. Next, we validate the architecture against our requirements.

Requirements compliance

We evaluate the design against non-functional requirements.

Scalability

The design supports horizontal scaling:

  • Components (schedulers, workers, fetchers, stores) can be added or removed on demand.

  • Consistent hashing distributes hostnames across workers, allowing seamless addition/removal of servers.
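A minimal consistent-hash ring for assigning hostnames to workers is sketched below. The virtual-node count is an illustrative assumption; production rings add replication and per-node weighting.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Maps hostnames to workers; adding or removing a worker remaps only ~1/N keys."""
    def __init__(self, workers, vnodes=100):
        self._ring = []                  # sorted list of (hash point, worker)
        for worker in workers:
            for i in range(vnodes):      # virtual nodes smooth the distribution
                self._ring.append((self._hash(f"{worker}#{i}"), worker))
        self._ring.sort()

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def worker_for(self, hostname: str) -> str:
        point = self._hash(hostname)
        # First ring entry clockwise from the hostname's hash point.
        index = bisect.bisect(self._ring, (point,)) % len(self._ring)
        return self._ring[index][1]

ring = ConsistentHashRing(["worker-1", "worker-2", "worker-3"])
owner = ring.worker_for("example.com")   # stable assignment across calls
```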

Extensibility and modularity

So far, our design has focused only on a particular communication protocol: HTTP. But according to our non-functional requirements, the system should accommodate other network communication protocols, such as FTP.

To achieve this extensibility, we only need to add additional modules for the newly required communication protocols in the HTML fetcher. The respective modules will then be responsible for making and maintaining the required communications with the host servers.

Along the same lines, we expect our design to extend its functionality to other MIME types as well (a Multipurpose Internet Mail Extensions type, or MIME type, is an Internet standard that describes a file’s contents based on its nature and format). The modular approach for different MIME schemes facilitates this requirement: the worker calls the associated MIME type’s processing module to extract the content from the document stored in the DIS.

The requirements met by our web crawler

Consistency

The system consists of several crawling workers, so data consistency among the crawled content is crucial. To avoid inconsistency and duplicate crawling, the system computes checksums of URLs and documents and compares them against the existing checksums in the URL and document checksum data stores, respectively.

Apart from deduplication, to ensure data consistency under fault conditions, all servers can regularly checkpoint their state to a backup service, such as Amazon S3, or to offline disks.

Performance

The proposed web crawler’s performance depends on the following factors:

  • URLs crawled per second: We can improve this factor by adding new workers to the system.

  • Utilizing blob storage for content storing: This ensures higher throughput for the massive amount of unstructured data and fast retrieval of the stored content, because a single blob can support up to 500 requests per second.

  • Efficient implementation of the robots.txt guidelines: We can implement this by giving robots.txt guidelines the highest precedence in the application-layer crawling logic.

  • Self-throttling: We can have various application-level checks to ensure that our web crawler doesn’t hamper the performance of the website host servers by exhausting their resources.
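Self-throttling can be sketched as a per-domain minimum delay between fetches. The one-second default is an assumption; a politeness-aware crawler would derive the delay from the host’s Crawl-delay directive or its observed TTFB.

```python
import time
from collections import defaultdict

class DomainThrottle:
    """Enforces a minimum delay between consecutive fetches to the same domain."""
    def __init__(self, min_delay_seconds=1.0):
        self._min_delay = min_delay_seconds
        self._last_fetch = defaultdict(float)   # domain -> last-fetch timestamp

    def wait_if_needed(self, domain):
        elapsed = time.monotonic() - self._last_fetch[domain]
        if elapsed < self._min_delay:
            time.sleep(self._min_delay - elapsed)   # back off before refetching
        self._last_fetch[domain] = time.monotonic()

# Usage: call throttle.wait_if_needed("example.com") before each fetch to that host.
throttle = DomainThrottle(min_delay_seconds=1.0)
```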

Scheduling

As established previously, we may need to recrawl URLs at various frequencies, determined by each URL’s application. We can determine the recrawl frequency in two ways:

  1. We can assign a default or a specific recrawl frequency to each URL. This assignment depends on the application of the URL, which defines its priority: standard-priority URLs receive a default frequency, and higher-priority URLs receive a higher recrawl frequency. Based on each URL’s associated recrawl frequency, we decide when to enqueue it in the priority queue from the scheduler’s database; the priority defines the URL’s place in the queue.

  2. The second method is to have separate queues for various priority URLs, use URLs from high-priority queues first, and subsequently move to the lower-priority URLs.
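The first approach, a per-URL recrawl frequency, can be sketched as follows. The interval values and URLs are illustrative assumptions.

```python
import heapq

class RecrawlScheduler:
    """Re-enqueues URLs when their recrawl interval elapses."""
    DEFAULT_FREQ = 7 * 24 * 3600      # standard-priority: weekly (assumed)
    HIGH_FREQ = 3600                  # high-priority: hourly (assumed)

    def __init__(self):
        self._due = []                # heap of (next crawl time, url, interval)

    def register(self, url, high_priority=False, now=0.0):
        interval = self.HIGH_FREQ if high_priority else self.DEFAULT_FREQ
        heapq.heappush(self._due, (now + interval, url, interval))

    def pop_due(self, now):
        """Returns URLs whose recrawl time has arrived, rescheduling each one."""
        ready = []
        while self._due and self._due[0][0] <= now:
            due_at, url, interval = heapq.heappop(self._due)
            ready.append(url)
            heapq.heappush(self._due, (due_at + interval, url, interval))
        return ready

scheduler = RecrawlScheduler()
scheduler.register("https://news.example.com/", high_priority=True)
scheduler.register("https://archive.example.com/")
scheduler.pop_due(now=3600)   # only the hourly high-priority URL is due
```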

The following table summarizes the techniques for achieving non-functional requirements in a web crawler system:

Non-Functional Requirements Compliance

Scalability

  • Addition/removal of different servers based on the increase/decrease in load
  • Consistent hashing to manage servers’ addition and removal
  • Regular backups of server state to a service such as Amazon S3 to achieve fault tolerance

Extensibility and Modularity

  • Addition of a newer communication protocol module in the HTML fetcher
  • Addition of new MIME schemes while processing the downloaded document in DIS

Consistency

  • Calculation and comparison of checksums of URLs and Documents in the respective data stores

Performance

  • Increasing the number of workers performing the crawl
  • Blob stores for storing the content
  • High priority to robots.txt file guidelines while crawling
  • Self-throttle at a domain while crawling

Scheduling

  • Pre-defined default recrawl frequency, or
  • Separate queues and their associated frequencies for various priority URLs

Share strategies for handling IP blocks
Consider a real-time news aggregation platform, where content changes rapidly, and frequent recrawls are crucial for up-to-date information. What immediate steps would you take if the IP faces blocking issues during frequent crawls? Share strategies for handling IP blocks to ensure continuous access to the latest content.

Conclusion

The web crawler system entails a multi-worker design built on a microservices architecture. Besides achieving the basic crawling functionality, the design identifies potential shortcomings and challenges and rectifies them with appropriate modifications. Its noteworthy features are as follows:

  1. Identification and design modification for crawler traps.

  2. Extensibility of HTML fetching and content extraction modules.