System Design: Web Crawler
Learn to design a highly scalable web crawler by quantifying resource needs for massive data volumes. Architect a distributed system using key components like the HTML fetcher and scheduler. Implement robust mechanisms to handle challenges like crawler traps and rate limiting, ensuring high throughput and data consistency.
Introduction
A web crawler is a bot that systematically browses the web to discover and download content.
Crawlers fetch web pages, parse content, and extract URLs for further crawling. This is the foundation of search engines. The crawler’s output feeds into subsequent stages:
Data cleaning
Indexing
Relevance scoring (e.g., PageRank)
URL frontier management
Analytics
This lesson focuses on the crawler’s System Design, excluding downstream stages like indexing or ranking. For those, refer to the chapter on distributed search.
Benefits of a web crawler
Beyond data collection, web crawlers provide:
Web page testing: Validating links and HTML structures.
Web page monitoring: Tracking content or structure updates.
Site mirroring: Creating mirrors of popular websites. Mirroring is like making a dynamic carbon copy of a website, served over network protocols such as HTTP or FTP. The mirror's URLs differ from the original site's, but the content is similar or almost identical.
Copyright infringement checks: Detecting unauthorized content usage.
Challenges of a web crawler System Design
Designing a crawler involves several challenges:
Crawler traps: Infinite loops caused by dynamic links or calendar pages.
Duplicate content: Repeatedly crawling the same pages wastes resources.
Rate limiting: Fetching too many pages from a single domain can overload its servers, so the crawler must throttle its per-domain request rate.
DNS lookup latency: Frequent DNS lookups slow down the process.
Scalability: The system must handle millions of seed URLs and distribute the load across multiple servers.
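To make the duplicate-content challenge above concrete, here is a minimal in-memory sketch of a deduplication filter. The class and method names are hypothetical; a production crawler would typically replace the plain sets with a Bloom filter or a shared store such as Redis.

```python
import hashlib

class DedupFilter:
    """Track URLs and page-content fingerprints already seen.

    A minimal in-memory sketch: real crawlers use probabilistic
    structures (Bloom filters) or a distributed cache instead of sets.
    """

    def __init__(self):
        self.seen_urls = set()
        self.seen_content = set()

    def is_new_url(self, url: str) -> bool:
        """Return True the first time a URL is offered, False afterwards."""
        if url in self.seen_urls:
            return False
        self.seen_urls.add(url)
        return True

    def is_new_content(self, page_bytes: bytes) -> bool:
        """Hash the page body so identical pages reached via different
        URLs (a common duplicate-content case) are crawled only once."""
        digest = hashlib.sha256(page_bytes).hexdigest()
        if digest in self.seen_content:
            return False
        self.seen_content.add(digest)
        return True
```

Hashing the fetched body, not just the URL, also helps with crawler traps: dynamically generated pages with distinct URLs but identical content are detected and skipped.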
A web crawler is a common system design interview topic to assess how candidates reason about components like the HTML fetcher, extractor, and scheduler. Interviewers often ask questions such as:
How would you design a scalable crawler using Redis for caching and AWS for infrastructure?
How would you handle request timeouts and website rate limits?
What optimization strategies would you use for parsers and fetchers at a FAANG scale?
How do metrics like response time and cache hit rate help evaluate performance?
Let’s begin by defining the requirements.
Requirements
We will highlight the functional and non-functional requirements.
Functional requirements
The system must perform the following:
Crawling: Scour the web starting from a queue of seed URLs.
Storing: Extract and store content in a blob store for indexing and ranking.
Scheduling: Regularly schedule crawling to update records.
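The crawling and storing requirements above can be sketched as a breadth-first loop over a URL frontier. This is a simplified single-threaded sketch; the `fetch`, `extract_links`, and `store` callables and their signatures are assumptions for illustration, standing in for the HTML fetcher, extractor, and blob store.

```python
from collections import deque

def crawl(seed_urls, fetch, extract_links, store, max_pages=100):
    """Breadth-first crawl starting from a queue of seed URLs.

    Hypothetical callable signatures for this sketch:
      fetch(url) -> page, extract_links(page) -> list of URLs,
      store(url, page) -> None (persists to the blob store).
    """
    frontier = deque(seed_urls)          # the URL frontier
    visited = set()
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        page = fetch(url)                # download the document
        store(url, page)                 # persist for indexing/ranking
        for link in extract_links(page): # enqueue newly found URLs
            if link not in visited:
                frontier.append(link)
    return visited
```

A real system distributes this loop across many workers and partitions the frontier, but the fetch–store–extract–enqueue cycle stays the same.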
Non-functional requirements
Scalability: The system must be distributed and multithreaded to fetch billions of documents.
Extensibility: Support new protocols (beyond HTTP) and file formats via modular extensions.
Consistency: Ensure data consistency across all crawling workers.
Performance: Use self-throttling to limit crawling per domain (by time or count) to avoid overloading hosts and optimize throughput.
Improved UI: Support customized, on-demand crawling beyond routine schedules.
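The self-throttling requirement can be illustrated with a per-domain delay tracker. This is a minimal single-process sketch; the class name and the one-second default delay are illustrative assumptions, not recommended production values.

```python
import time
from urllib.parse import urlparse

class DomainThrottle:
    """Enforce a minimum delay between requests to the same domain."""

    def __init__(self, min_delay_seconds: float = 1.0):
        self.min_delay = min_delay_seconds
        self.last_fetch = {}  # domain -> timestamp of the reserved slot

    def wait_time(self, url: str, now: float = None) -> float:
        """Return how many seconds the caller should sleep before
        fetching `url`, and reserve the next slot for its domain."""
        domain = urlparse(url).netloc
        if now is None:
            now = time.monotonic()
        last = self.last_fetch.get(domain)
        if last is None:
            remaining = 0.0
        else:
            remaining = max(0.0, self.min_delay - (now - last))
        self.last_fetch[domain] = now + remaining  # reserve the slot
        return remaining
```

In a distributed crawler, the per-domain timestamps would live in a shared store so that all workers respect the same limit, which also ties into the consistency requirement above.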
With requirements established, we can estimate the scale.
Resource estimation
We will estimate storage, time, and server requirements.
Assumptions
We assume the following:
Total web pages: 5 billion
Text content per page: 2070 KB (2.07 MB), the average webpage content size suggested by a study of 892 processed websites
Metadata per page: 500 bytes, consisting of the web page's title and a description of its purpose
Storage estimation
The total storage required for 5 billion pages is:
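Based on the assumptions above, the arithmetic can be sketched as follows (using decimal units, 1 KB = 1,000 bytes and 1 PB = 10^15 bytes):

```python
pages = 5_000_000_000          # 5 billion web pages
content_bytes = 2070 * 1000    # 2070 KB of text content per page
metadata_bytes = 500           # title and description per page

total_bytes = pages * (content_bytes + metadata_bytes)
total_pb = total_bytes / 1e15  # convert bytes to petabytes
print(round(total_pb, 2))      # prints 10.35
```

So storing the text content and metadata of 5 billion pages requires roughly 10.35 PB, before accounting for replication or indexing overhead.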