Evaluation of Web Crawler

Evaluate the proposed web crawler design based on the non-functional requirements.

Fulfilling requirements

Let’s evaluate how our design meets the non-functional requirements of the proposed system.

Scalability

Horizontal scalability is vital to our design. The proposed design therefore incorporates the following choices to meet the scalability requirements:

  • The system scales to handle the ever-increasing number of URLs. The required resources, including schedulers, web crawler workers, HTML fetchers, extractors, and blob stores, are added or removed on demand.

  • In the case of a distributed URL frontier, the system uses consistent hashing to distribute hostnames among the crawling workers, where each worker runs on its own server. As a result, adding or removing a crawler server remaps only the hostnames next to that server on the hash ring instead of reshuffling all of them.
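The consistent-hashing idea in the last point can be sketched as follows. This is a minimal illustration, not the course's implementation: the class name `ConsistentHashRing`, the worker names, the choice of MD5 as the ring hash, and the figure of 100 virtual nodes per worker are all assumptions made for the example.

```python
import bisect
import hashlib


class ConsistentHashRing:
    """Maps hostnames to crawler workers on a hash ring (illustrative sketch)."""

    def __init__(self, workers=(), replicas=100):
        self.replicas = replicas  # virtual nodes per worker (assumed value)
        self._keys = []           # sorted hash positions on the ring
        self._owner = {}          # hash position -> worker name
        for worker in workers:
            self.add_worker(worker)

    @staticmethod
    def _hash(value):
        # MD5 is used only to spread keys evenly, not for security.
        return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

    def add_worker(self, worker):
        # Place several virtual nodes per worker to even out the load.
        for i in range(self.replicas):
            pos = self._hash(f"{worker}#{i}")
            if pos in self._owner:
                continue  # skip the (extremely unlikely) position collision
            bisect.insort(self._keys, pos)
            self._owner[pos] = worker

    def remove_worker(self, worker):
        # Drop every virtual node that belongs to the departing worker.
        self._keys = [k for k in self._keys if self._owner[k] != worker]
        self._owner = {k: w for k, w in self._owner.items() if w != worker}

    def worker_for(self, hostname):
        if not self._keys:
            raise LookupError("no crawler workers registered")
        # Walk clockwise to the first virtual node after the hostname's hash.
        idx = bisect.bisect_right(self._keys, self._hash(hostname)) % len(self._keys)
        return self._owner[self._keys[idx]]


ring = ConsistentHashRing(["worker-1", "worker-2", "worker-3"])
print(ring.worker_for("www.example.com"))
```

When a fourth worker joins with `ring.add_worker("worker-4")`, the only hostnames that change owner are the ones that now belong to `worker-4`; every other hostname stays with its previous worker.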

Extensibility and modularity

So far, our design focuses on only one communication protocol: HTTP. But as per our non-functional requirements, our system’s design should facilitate the inclusion of other network communication protocols like FTP (File Transfer Protocol).

To achieve this extensibility, we only need to add modules for the newly required communication protocols to the HTML Fetcher. Each module is then responsible for establishing and maintaining communication with the host servers over its protocol.
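One way to picture this is a fetcher that looks up a protocol module by URL scheme. The sketch below is an assumption-laden illustration: the `Fetcher` class, its `register` and `fetch` methods, and the `http_module` and `fake_ftp_module` functions are hypothetical names, and the FTP module is a placeholder that makes no network call.

```python
from typing import Callable, Dict
from urllib.parse import urlsplit
from urllib.request import urlopen

# A protocol module is any callable that takes a URL and returns raw bytes.
ProtocolModule = Callable[[str], bytes]


class Fetcher:
    """Dispatches each URL to the module registered for its scheme."""

    def __init__(self) -> None:
        self._modules: Dict[str, ProtocolModule] = {}

    def register(self, scheme: str, module: ProtocolModule) -> None:
        # Adding a protocol means registering one more module here.
        self._modules[scheme.lower()] = module

    def fetch(self, url: str) -> bytes:
        scheme = urlsplit(url).scheme.lower()
        module = self._modules.get(scheme)
        if module is None:
            raise ValueError(f"no protocol module registered for scheme {scheme!r}")
        return module(url)


def http_module(url: str) -> bytes:
    # Real HTTP(S) download using the standard library.
    with urlopen(url, timeout=10.0) as response:
        return response.read()


def fake_ftp_module(url: str) -> bytes:
    # Placeholder for an FTP module; a real one would open an FTP session.
    return b"placeholder content for " + url.encode("utf-8")


fetcher = Fetcher()
fetcher.register("http", http_module)
fetcher.register("https", http_module)
fetcher.register("ftp", fake_ftp_module)  # new protocol, no change to Fetcher
```

The dispatch code in `fetch` never changes when a protocol is added, which is the property the extensibility requirement asks for.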

Along the same lines, we expect our design to extend its functionality for other MIME types as well. (A multipurpose Internet mail extension, or MIME type, is an Internet standard that describes the contents of Internet files based on their natures and formats.) The modular approach for different MIME schemes will facilitate this requirement. The worker will call the associated MIME’s processing module to extract the content from the document stored in Document Input Stream (DIS).
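The same registry pattern can illustrate MIME-specific processing. In this sketch, an `io.BytesIO` object stands in for the DIS, and the names `MIME_MODULES`, `extract`, `process_html`, and `process_plain_text` are hypothetical. The HTML module is deliberately minimal and only collects anchor links and visible text.

```python
import io
from html.parser import HTMLParser


class _LinkAndTextParser(HTMLParser):
    """Collects href values of <a> tags and non-empty text nodes."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())


def process_html(dis):
    parser = _LinkAndTextParser()
    parser.feed(dis.read().decode("utf-8", errors="replace"))
    parser.close()
    return {"text": " ".join(parser.text), "links": parser.links}


def process_plain_text(dis):
    return {"text": dis.read().decode("utf-8", errors="replace"), "links": []}


# One processing module per MIME type; supporting a new type adds one entry.
MIME_MODULES = {
    "text/html": process_html,
    "text/plain": process_plain_text,
}


def extract(mime_type, dis):
    # Ignore parameters such as "; charset=utf-8" when choosing the module.
    base_type = mime_type.split(";", 1)[0].strip().lower()
    module = MIME_MODULES.get(base_type)
    if module is None:
        raise ValueError(f"no processing module for MIME type {base_type!r}")
    dis.seek(0)  # always read the stored document from the beginning
    return module(dis)


dis = io.BytesIO(b'<p>Hello</p><a href="http://example.com/x">link</a>')
print(extract("text/html; charset=utf-8", dis))
```

Supporting another format, such as PDF, would mean writing one more processing function and adding it to `MIME_MODULES`, without touching the worker's call to `extract`.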
