Requirements of Web Crawler

Understand the requirements to design a web crawler.

Requirements

Let’s highlight the functional and non-functional requirements for a web crawler.

Functional

Below are the functionalities the system should provide:

  • Crawling: The system should scour the World Wide Web, starting from a queue of seed URLs provided initially by the system administrator.

  • Storing: The system should be able to extract and store the content of a URL in a blob store, making that URL, along with its content, available to search engines for indexing and ranking purposes.

  • Scheduling: Since crawling is a repeated process, the system should schedule regular recrawls to keep the blob store's records up to date. A minimal sketch tying crawling, storing, and scheduling together follows this list.
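
Taken together, crawling, storing, and scheduling amount to a loop around a URL frontier. The following is a minimal, single-threaded sketch of that loop under assumed names; `BlobStore`, `RECRAWL_INTERVAL`, and the commented-out `extract_links` are illustrative placeholders, not part of the actual design.

```python
import time
from collections import deque
from urllib.request import urlopen

RECRAWL_INTERVAL = 24 * 3600  # assumed recrawl period (seconds)

class BlobStore:
    """Toy in-memory stand-in for the blob store holding page content."""
    def __init__(self):
        self._blobs = {}

    def put(self, url, content):
        self._blobs[url] = (content, time.time())

    def last_crawled(self, url):
        entry = self._blobs.get(url)
        return entry[1] if entry else None

def crawl(seed_urls, store):
    frontier = deque(seed_urls)              # queue seeded by the administrator
    while frontier:
        url = frontier.popleft()
        last = store.last_crawled(url)
        # Scheduling: re-fetch only if the stored copy is stale.
        if last is not None and time.time() - last < RECRAWL_INTERVAL:
            continue
        try:
            content = urlopen(url, timeout=10).read()   # Crawling
        except OSError:
            continue                                    # skip unreachable URLs
        store.put(url, content)                         # Storing
        # Newly discovered links would be appended here, e.g.:
        # frontier.extend(extract_links(content))

# Example usage:
# crawl(["https://example.com"], BlobStore())
```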

Non-functional

  • Scalability: The system should inherently be distributed and multithreaded as it has to fetch hundreds of millions of web documents.

  • Extensibility: Currently, our design supports the HTTP(S) communication protocol and stores text files. For augmented functionality, it should also be extensible to other network communication protocols, and it should allow new modules to be added for processing and storing various file formats.

  • Consistency: Since our system will involve multiple crawling workers, data consistency among all of them is required.

  • Performance: The system should be smart enough to limit its crawling of a domain, either by the time spent or by the number of URLs visited on that domain (self-throttling). The number of URLs crawled per second and the throughput of the crawled content should be optimal. A sketch of such per-domain throttling, alongside the pluggable fetchers mentioned under extensibility, follows this list.

  • Improved user interface - customized scheduling: Besides the default recrawling, which is a functional requirement, the system should also support non-routine, customized crawls on the system administrator's demand.
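
To make the extensibility and self-throttling requirements concrete, here is a small illustrative sketch under assumed names (`FETCHERS`, `DomainThrottle`, and the request limits are hypothetical choices, not prescribed by the design): a registry that maps URL schemes to fetcher modules, plus a per-domain throttle that caps requests to a host within a time window.

```python
import time
from collections import defaultdict
from urllib.parse import urlparse
from urllib.request import urlopen

# Extensibility: supporting a new protocol means registering another fetcher.
FETCHERS = {}

def register_fetcher(scheme):
    def wrap(fn):
        FETCHERS[scheme] = fn
        return fn
    return wrap

@register_fetcher("http")
@register_fetcher("https")
def fetch_http(url):
    return urlopen(url, timeout=10).read()

# An FTP fetcher, a PDF processor, etc. would plug in the same way.

class DomainThrottle:
    """Performance: cap the number of fetches per domain in a sliding window."""
    def __init__(self, max_requests=10, window_seconds=60):
        self.max_requests = max_requests
        self.window = window_seconds
        self.history = defaultdict(list)   # domain -> recent fetch timestamps

    def allow(self, url):
        domain = urlparse(url).netloc
        now = time.time()
        recent = [t for t in self.history[domain] if now - t < self.window]
        self.history[domain] = recent
        if len(recent) >= self.max_requests:
            return False                   # self-throttle: back off this domain
        recent.append(now)
        return True

# Usage sketch:
# throttle = DomainThrottle()
# if throttle.allow(url):
#     content = FETCHERS[urlparse(url).scheme](url)
```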

Estimations

We need to estimate various resource requirements for our design.

Assumptions: The following are the assumptions for our resource estimations:

  • There are a total of 5 billion webpages.
  • The text content per webpage is 2,070 KB. (A study suggests that the average size of a webpage's content is 2,070 KB (2.07 MB), based on 892 processed websites.)
  • The metadata (the webpage's title and a short description of its purpose) for one webpage is 500 bytes.

Storage requirements

The collective storage required to store the textual content of 5 billion webpages is:

$$Total\ storage = 5\ billion \times (2070\ KB + 500\ B) \approx 10.35\ PB$$
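
A quick, purely illustrative sanity check of this arithmetic:

```python
# Per-page footprint in bytes: 2070 KB of content plus 500 B of metadata.
page_bytes = 2070 * 1000 + 500
total_bytes = 5_000_000_000 * page_bytes
print(total_bytes / 1e15, "PB")   # -> ~10.35 PB
```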
