Design of a Web Crawler

Understand the design of a web crawler and the interaction of components involved in web crawling.

Design

This lesson describes the components involved in the design and the workflow among them, so that we can understand the web crawling process with respect to the requirements.

Components

The following components are needed for our design:

  • Scheduler: This is one of the key components; it schedules URLs for crawling. It consists of two parts (sketched in code after this list): a priority queue and a relational database.

    1. A priority queue (URL frontier): The queue hosts URLs that are ready for crawling, based on two properties associated with each entry: priority and updates frequency. Priority defines the precedence of a URL while it is in the URL frontier; as a requirement, we assign variable priorities to URLs depending on their content. Updates frequency defines how often a URL is recrawled; this attribute ensures a defined number of placements in the URL frontier for each URL.

    2. Relational database: It stores all the URLs along with the two parameters mentioned above. The database is populated by new requests from the following two input streams:

      • URLs added by the user, including seed URLs and URLs added at runtime.
      • URLs extracted by the crawler.
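
To make the scheduler concrete, here is a minimal, single-machine sketch in Python. It is an illustration only, not the course's implementation: the Scheduler class, the urls table schema, and the use of heapq and sqlite3 are assumptions made for this example.

```python
import heapq
import sqlite3
import time


class Scheduler:
    """Sketch of the scheduler: a relational store of all known URLs plus
    an in-memory priority queue (URL frontier) of URLs ready to crawl."""

    def __init__(self, db_path: str = ":memory:"):
        # Autocommit mode keeps the sketch short; a real system would
        # manage transactions explicitly.
        self.db = sqlite3.connect(db_path, isolation_level=None)
        self.db.execute(
            """CREATE TABLE IF NOT EXISTS urls (
                   url                TEXT PRIMARY KEY,
                   priority           INTEGER NOT NULL,  -- precedence in the frontier
                   recrawl_interval_s INTEGER NOT NULL,  -- updates frequency
                   next_crawl_at      REAL NOT NULL      -- when the URL is next due
               )"""
        )
        self.frontier: list[tuple[int, float, str]] = []  # min-heap

    def add_url(self, url: str, priority: int, recrawl_interval_s: int) -> None:
        """Covers both input streams: user-added (seed or runtime) URLs
        and URLs extracted by the crawler."""
        self.db.execute(
            "INSERT OR IGNORE INTO urls VALUES (?, ?, ?, ?)",
            (url, priority, recrawl_interval_s, time.time()),  # due immediately
        )

    def fill_frontier(self, batch_size: int = 100) -> None:
        """Move URLs that are due for (re)crawling into the frontier.
        (A real scheduler would mark rows as enqueued to avoid duplicates.)"""
        rows = self.db.execute(
            "SELECT url, priority FROM urls WHERE next_crawl_at <= ?"
            " ORDER BY priority DESC LIMIT ?",
            (time.time(), batch_size),
        ).fetchall()
        for url, priority in rows:
            # Negate the priority so the min-heap pops the highest first.
            heapq.heappush(self.frontier, (-priority, time.time(), url))

    def next_url(self) -> str | None:
        """Pop the highest-priority due URL and schedule its next recrawl."""
        if not self.frontier:
            return None
        _, _, url = heapq.heappop(self.frontier)
        self.db.execute(
            "UPDATE urls SET next_crawl_at = ? + recrawl_interval_s WHERE url = ?",
            (time.time(), url),
        )
        return url
```

A production scheduler would layer per-host politeness, deduplication, and a distributed frontier on top of this, but the sketch shows how the two per-URL attributes, priority and updates frequency, drive both first-time crawling and recrawling.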
