Requirements of Web Crawler

Understand the requirements to design a web crawler.

Requirements

Let’s highlight the functional and non-functional requirements for a web crawler.

Functional

Below are the functionalities the system should provide:

  • Crawling: The system should scour the World Wide Web, starting from a queue of seed URLs provided initially by the system administrator.

  • Storing: The system should be able to extract and store the content of a URL in a blob store, making that URL, along with its content, available to search engines for indexing and ranking purposes.

  • Scheduling: Since crawling is a repeated process, the system should schedule regular recrawls to keep the blob store's records up to date. A minimal sketch tying crawling, storing, and scheduling together follows this list.
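
Taken together, crawling, storing, and scheduling amount to a loop around a URL frontier. The following is a minimal, single-threaded sketch of that loop under assumed names; `BlobStore`, `RECRAWL_INTERVAL`, and the commented-out `extract_links` are illustrative placeholders, not part of the actual design.

```python
import time
from collections import deque
from urllib.request import urlopen

RECRAWL_INTERVAL = 24 * 3600  # assumed recrawl period (seconds)

class BlobStore:
    """Toy in-memory stand-in for the blob store holding page content."""
    def __init__(self):
        self._blobs = {}

    def put(self, url, content):
        self._blobs[url] = (content, time.time())

    def last_crawled(self, url):
        entry = self._blobs.get(url)
        return entry[1] if entry else None

def crawl(seed_urls, store):
    frontier = deque(seed_urls)              # queue seeded by the administrator
    while frontier:
        url = frontier.popleft()
        last = store.last_crawled(url)
        # Scheduling: re-fetch only if the stored copy is stale.
        if last is not None and time.time() - last < RECRAWL_INTERVAL:
            continue
        try:
            content = urlopen(url, timeout=10).read()   # Crawling
        except OSError:
            continue                                    # skip unreachable URLs
        store.put(url, content)                         # Storing
        # Newly discovered links would be appended here, e.g.:
        # frontier.extend(extract_links(content))

# Example usage:
# crawl(["https://example.com"], BlobStore())
```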

Non-functional

  • Scalability: The system should inherently be distributed and multithreaded as it has to fetch hundreds of millions of web documents.

  • Extensibility: Currently, our design supports the HTTP(S) communication protocol and stores text files. For augmented functionality, it should also be extensible to other network communication protocols, and it should allow new modules to be added for processing and storing various file formats.

  • Consistency: Since our system will involve multiple crawling workers, data consistency among all of them is required.

  • Performance: The system should be smart enough to limit its crawling of a domain, either by the time spent or by the number of URLs visited on that domain (self-throttling). The number of URLs crawled per second and the throughput of the crawled content should be optimal. A sketch of such per-domain throttling, alongside the pluggable fetchers mentioned under extensibility, follows this list.

  • Improved user interface - customized scheduling: Besides the default recrawling, which is a functional requirement, the system should also support non-routine, customized crawls on the system administrator's demand.
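
To make the extensibility and self-throttling requirements concrete, here is a small illustrative sketch under assumed names (`FETCHERS`, `DomainThrottle`, and the request limits are hypothetical choices, not prescribed by the design): a registry that maps URL schemes to fetcher modules, plus a per-domain throttle that caps requests to a host within a time window.

```python
import time
from collections import defaultdict
from urllib.parse import urlparse
from urllib.request import urlopen

# Extensibility: supporting a new protocol means registering another fetcher.
FETCHERS = {}

def register_fetcher(scheme):
    def wrap(fn):
        FETCHERS[scheme] = fn
        return fn
    return wrap

@register_fetcher("http")
@register_fetcher("https")
def fetch_http(url):
    return urlopen(url, timeout=10).read()

# An FTP fetcher, a PDF processor, etc. would plug in the same way.

class DomainThrottle:
    """Performance: cap the number of fetches per domain in a sliding window."""
    def __init__(self, max_requests=10, window_seconds=60):
        self.max_requests = max_requests
        self.window = window_seconds
        self.history = defaultdict(list)   # domain -> recent fetch timestamps

    def allow(self, url):
        domain = urlparse(url).netloc
        now = time.time()
        recent = [t for t in self.history[domain] if now - t < self.window]
        self.history[domain] = recent
        if len(recent) >= self.max_requests:
            return False                   # self-throttle: back off this domain
        recent.append(now)
        return True

# Usage sketch:
# throttle = DomainThrottle()
# if throttle.allow(url):
#     content = FETCHERS[urlparse(url).scheme](url)
```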

Estimations

We need to estimate various resource requirements for our design.

Assumptions: The following are the assumptions for our resource estimations:

  • There are a total of 5 billion webpages.
  • The text content per webpage is 2,070 KB. (A study suggests that the average size of a webpage's content is 2,070 KB (2.07 MB), based on 892 processed websites.)
  • The metadata (the webpage's title and a short description of its purpose) for one webpage is 500 bytes.

Storage requirements

The collective storage required to store the textual content of 5 billion webpages is:

$$Total\ storage = 5\ billion \times (2070\ KB + 500\ B) \approx 10.35\ PB$$
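
A quick, purely illustrative sanity check of this arithmetic:

```python
# Per-page footprint in bytes: 2070 KB of content plus 500 B of metadata.
page_bytes = 2070 * 1000 + 500
total_bytes = 5_000_000_000 * page_bytes
print(total_bytes / 1e15, "PB")   # -> ~10.35 PB
```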
