System Design: Web Crawler
Learn how to design a scalable web crawler by defining detailed functional and non-functional requirements, estimating storage, bandwidth, and server needs, architecting core modules like fetchers, parsers, and schedulers, addressing challenges such as crawler traps, and evaluating overall design against the requirements.
Introduction
A web crawler is an internet bot that systematically scours the World Wide Web (WWW) for content, starting from a pool of seed URLs.
The core functionality of a web crawler involves fetching web pages, parsing their content and metadata, and extracting new URLs or lists of URLs for further crawling. This is the first step performed by search engines. The output of the crawling process serves as input for subsequent stages such as:
Data cleaning
Indexing
Relevance scoring using algorithms like PageRank
URL frontier management
Analytics
This design problem focuses on the web crawler's System Design and excludes the later stages, such as indexing and ranking in search engines. To learn about some of these subsequent stages, refer to our chapter on distributed search.
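Before looking at benefits and challenges, here is a minimal, single-threaded sketch of the fetch-parse-extract loop described above. It assumes Python with the `requests` and `beautifulsoup4` libraries and a handful of seed URLs; the function name and page limit are illustrative, and a production crawler would add politeness, deduplication, scheduling, and distributed workers, as discussed later in this design.

```python
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl(seed_urls, max_pages=100):
    """Breadth-first fetch-parse-extract loop over a frontier of URLs."""
    frontier = deque(seed_urls)   # URL frontier (FIFO queue)
    visited = set(seed_urls)      # avoid re-fetching the same URL
    pages = {}                    # URL -> raw HTML, handed to later stages

    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        try:
            response = requests.get(url, timeout=5)
        except requests.RequestException:
            continue  # skip unreachable pages

        pages[url] = response.text

        # Parse the HTML and extract outgoing links for further crawling.
        soup = BeautifulSoup(response.text, "html.parser")
        for anchor in soup.find_all("a", href=True):
            link = urljoin(url, anchor["href"])
            if link.startswith("http") and link not in visited:
                visited.add(link)
                frontier.append(link)

    return pages

if __name__ == "__main__":
    crawled = crawl(["https://example.com"])
    print(f"Crawled {len(crawled)} pages")
```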
Benefits of a Web Crawler
Web crawlers offer various utilities beyond data collection:
Web page testing: Web crawlers test the validity of the links and structures of HTML pages.
Web page monitoring: We use web crawlers to monitor the content or structure updates on web pages.
Site mirroring: Web crawlers are an effective way to mirror popular websites. Mirroring is like making a dynamic carbon copy of a website. Mirroring refers to network services available by any protocol, such as HTTP or FTP. The URLs of these sites differ from the original sites, but the content is similar or almost identical.
Copyright infringement check: Web crawlers fetch and parse page content and check for copyright infringement issues.
Challenges of a Web Crawler System Design
While designing a web crawler, several challenges arise:
Crawler traps: Infinite loops caused by dynamic links or calendar pages.
Duplicate content: Crawling the same web pages repeatedly wastes resources (see the URL filtering sketch after this list).
Rate limiting: Fetching too many pages from a single domain can overload its servers. We need load balancing to spread requests across web servers or application servers.
DNS lookup latency: Frequent domain name system (DNS) lookups increase latency.
Scalability: Handling large-scale crawling is challenging and demands a distributed system that can process millions of seed URLs and distribute load across multiple web servers.
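Crawler traps and duplicate content are commonly mitigated with URL normalization, a seen-set, content hashing, and simple heuristics such as maximum URL length and path depth. The sketch below shows one possible filter applied before URLs enter the frontier; the function names and thresholds are assumptions for illustration, not part of the original design.

```python
import hashlib
from urllib.parse import urlparse, urlunparse

MAX_URL_LENGTH = 200   # assumed heuristic against trap URLs that keep growing
MAX_PATH_DEPTH = 10    # assumed heuristic against infinitely nested paths

def normalize(url: str) -> str:
    """Canonicalize a URL so trivially different forms dedupe to one entry."""
    parts = urlparse(url)
    return urlunparse((
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path.rstrip("/") or "/",
        "", "", "",            # drop params, query, and fragment
    ))

def looks_like_trap(url: str) -> bool:
    """Cheap heuristics for calendar pages, session IDs, and endless paths."""
    path_depth = urlparse(url).path.count("/")
    return len(url) > MAX_URL_LENGTH or path_depth > MAX_PATH_DEPTH

def should_enqueue(url: str, seen_urls: set, seen_content_hashes: set,
                   content: bytes | None = None) -> bool:
    """Return True if the URL is new, not a likely trap, and not duplicate content."""
    canonical = normalize(url)
    if canonical in seen_urls or looks_like_trap(canonical):
        return False
    if content is not None:
        digest = hashlib.sha256(content).hexdigest()
        if digest in seen_content_hashes:
            return False      # same page served under a different URL
        seen_content_hashes.add(digest)
    seen_urls.add(canonical)
    return True
```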
Designing a web crawler is a common System Design interview question to test candidates’ understanding of components like HTML fetcher, extractor, scheduler, etc. The interviewer can ask the following interesting questions:
How would you design a web crawler system that can handle large datasets, and how would you incorporate Redis for caching and Amazon Web Services (AWS) for scalability?
How would you handle request timeouts and manage rate limits set by websites? (See the retry sketch after this list.)
What optimization strategies would you use for components like parser, fetcher, etc., for large-scale use cases like those at FAANG?
How do metrics like response time and cache hit rate help evaluate a web crawler's performance when crawling large datasets for aggregation?
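For the timeout and rate-limit question above, a common answer is bounded retries with exponential backoff while honoring server signals such as HTTP 429 and the Retry-After header. The sketch below uses the `requests` library; the retry count and delays are illustrative assumptions.

```python
import time

import requests

def fetch_with_retries(url: str, max_retries: int = 3, timeout_s: float = 5.0) -> str | None:
    """Fetch a URL with a per-request timeout, exponential backoff, and
    basic handling of HTTP 429 (Too Many Requests) rate limiting."""
    delay = 1.0
    for attempt in range(max_retries):
        try:
            response = requests.get(url, timeout=timeout_s)
        except requests.Timeout:
            pass                      # fall through to backoff and retry
        except requests.RequestException:
            return None               # non-retryable network error
        else:
            if response.status_code == 429:
                # Respect the server's rate limit if it says how long to wait.
                retry_after = response.headers.get("Retry-After")
                if retry_after and retry_after.isdigit():
                    delay = float(retry_after)
            elif response.ok:
                return response.text
            else:
                return None           # other 4xx/5xx: give up in this sketch
        time.sleep(delay)
        delay *= 2                    # exponential backoff between attempts
    return None
```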
Let’s begin by defining the requirements of a web crawler system.
Requirements
Let’s highlight the functional and non-functional requirements of a web crawler.
Functional requirements
These are the functionalities a user must be able to perform:
Crawling: The system should scour the WWW, starting from a queue of seed URLs provided initially by the system administrator.
Storing: The system should be able to extract and store the content of a URL in a blob store. This makes that URL and its content processable by the search engines for indexing and ranking purposes.
Scheduling: Since crawling is a process that's repeated, the system should schedule regular recrawls to keep its blob store's records up to date (a minimal scheduler sketch follows this list).
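One way to realize the scheduling requirement is a priority queue keyed by each URL's next due time, so workers always pick the most overdue URL. The sketch below is a minimal in-memory version; the class name and the weekly recrawl interval are assumptions, and a real system would persist this state and share it across workers.

```python
import heapq
import time

class RecrawlScheduler:
    """Min-heap of (next_due_time, url) entries; the earliest-due URL comes first."""

    def __init__(self, default_interval_s: float = 7 * 24 * 3600):
        self._heap: list[tuple[float, str]] = []
        self._default_interval_s = default_interval_s   # assumed weekly recrawl

    def schedule(self, url: str, delay_s: float | None = None) -> None:
        due = time.time() + (delay_s if delay_s is not None else 0.0)
        heapq.heappush(self._heap, (due, url))

    def next_due(self) -> str | None:
        """Return a URL whose recrawl time has arrived, and reschedule it."""
        if not self._heap or self._heap[0][0] > time.time():
            return None
        _, url = heapq.heappop(self._heap)
        # Put the URL back with its next due time so it is crawled periodically.
        heapq.heappush(self._heap, (time.time() + self._default_interval_s, url))
        return url

# Usage: seed URLs are due immediately; workers poll next_due() in their loop.
scheduler = RecrawlScheduler()
scheduler.schedule("https://example.com")
print(scheduler.next_due())   # -> "https://example.com"
```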
Non-functional requirements
Scalability: The system should inherently be distributed and multithreaded, because it has to fetch hundreds of millions of web documents.
Extensibility: Currently, our design supports the HTTP(S) communication protocol and stores text files. For augmented functionality, it should also be extensible to other network communication protocols and able to add modules that process and store other file formats.
Consistency: Since our system involves multiple crawling workers, having data consistency among all of them is necessary.
In the general context, data consistency means the reliability and accuracy of data across a system or dataset. In the web crawler's context, it refers to the adherence of all the workers to a specific set of rules in their attempt to generate consistent crawled data.
Performance: The system should be smart enough to limit its crawling to a domain, either by time spent or by the count of the visited URLs of that domain. This process is called self-throttling (a minimal sketch follows this list). The URLs crawled per second and the throughput of the content crawled should be optimal.
Improved user interface—customized scheduling: Besides the default recrawling, which is a functional requirement, the system should also support the functionality to perform non-routine customized crawling on the system administrator’s demands.
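Self-throttling, as described under the performance requirement, can be as simple as tracking how many URLs have been fetched from each domain and how long the crawler has been working on it, then deferring further fetches once a budget is exceeded. The sketch below is illustrative; the class name and per-domain budgets are assumptions.

```python
import time
from collections import defaultdict
from urllib.parse import urlparse

class DomainThrottle:
    """Limits crawling per domain by visited-URL count and by elapsed time
    since the domain was first crawled (a rough proxy for time spent)."""

    def __init__(self, max_urls_per_domain: int = 500,
                 max_seconds_per_domain: float = 600.0):
        self._max_urls = max_urls_per_domain          # assumed budget
        self._max_seconds = max_seconds_per_domain    # assumed budget
        self._url_counts = defaultdict(int)
        self._first_seen = {}

    def allow(self, url: str) -> bool:
        """Return True if the crawler may still fetch from this URL's domain."""
        domain = urlparse(url).netloc
        now = time.time()
        self._first_seen.setdefault(domain, now)
        if self._url_counts[domain] >= self._max_urls:
            return False
        if now - self._first_seen[domain] > self._max_seconds:
            return False
        self._url_counts[domain] += 1
        return True
```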
With our requirements established, we can now quantify the immense scale our system must handle.
Resource estimation
We need to estimate various resource requirements for our design.
Assumptions
These are the assumptions we’ll use when estimating our resource requirements:
There are a total of 5 billion web pages.
The text content per webpage is 2070 KB. A study suggests that the average size of webpage content is 2070 KB (2.07 MB), based on 892 processed websites.
The metadata for one web page is 500 Bytes. It consists of a webpage title and a description of the web page showing its purpose.
Storage estimation
The collective storage required to store the textual content of 5 billion web pages is:
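As a rough back-of-the-envelope figure under the assumptions above (using decimal units, and noting that the 500-Byte metadata contributes only about 2.5 TB in total):

$$
\text{Total storage} = 5 \times 10^{9} \times (2070\ \text{KB} + 500\ \text{B}) \approx 5 \times 10^{9} \times 2.07\ \text{MB} \approx 10.35\ \text{PB}
$$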