Web Spider Version 4

Explore how to refactor a web spider application using a TaskQueue to manage asynchronous crawling tasks with controlled concurrency. Understand how to schedule and limit parallel downloads, pass the queue through crawling functions, and handle URL bookkeeping for efficient exploration. This lesson provides practical insights into applying callback-based asynchronous control flow patterns in Node.js.

Now that we have our generic queue for executing tasks with a limited degree of parallelism, let’s use it straightaway to refactor our web spider application.

We’re going to use an instance of TaskQueue as a work backlog; every URL that we want to crawl needs to be appended to the queue as a task. The starting URL will be added as the first task, then every other URL discovered during the crawling process will be added as well. The queue will manage all the scheduling for us, making sure that the number of tasks in progress (that is, the number of pages being downloaded or read from the filesystem) at any given time is never greater than the concurrency limit configured for the given TaskQueue instance.
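As a concrete reference, here is a minimal sketch of a callback-based TaskQueue along the lines of the generic queue built in the previous lesson, together with the backlog usage just described. The exact class may differ from the one in the lesson; in particular, the pushTask() name and the done-callback convention are assumptions here.

```javascript
// A minimal callback-based TaskQueue, in the spirit of the generic
// queue from the previous lesson (the real class may differ).
class TaskQueue {
  constructor (concurrency) {
    this.concurrency = concurrency // max number of tasks in progress
    this.running = 0
    this.queue = []
  }

  pushTask (task) {
    this.queue.push(task)
    process.nextTick(this.next.bind(this))
    return this
  }

  next () {
    // Start queued tasks until the concurrency limit is reached
    while (this.running < this.concurrency && this.queue.length > 0) {
      const task = this.queue.shift()
      task(() => {
        this.running--
        process.nextTick(this.next.bind(this))
      })
      this.running++
    }
  }
}

// Usage as a work backlog: every URL to crawl becomes a task.
const downloadQueue = new TaskQueue(2) // never more than 2 pages in flight
downloadQueue.pushTask(done => {
  console.log('downloading the starting URL...')
  setTimeout(done, 100) // stand-in for the real download/save work
})
```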

Adding queue as a new parameter

We’ve already ...
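For illustration, here is a sketch of how the queue might be threaded through the crawling functions as an extra parameter, reusing the TaskQueue sketched above. The spider() and spiderLinks() names follow earlier versions of the web spider, while spiderTask() and getPageLinks() are hypothetical stand-ins for the application’s real helpers; the bodies below are assumptions, not the lesson’s final code.

```javascript
// Sketch: passing the queue through the crawling functions
// (reuses the TaskQueue sketched above; spiderTask() and
// getPageLinks() are hypothetical stand-ins).

const spidering = new Set() // URL bookkeeping: pages already scheduled

function spider (url, nesting, queue) {
  if (spidering.has(url)) {
    return // never schedule the same page twice
  }
  spidering.add(url)
  // The download/save work becomes a queued task, so the queue
  // enforces the concurrency limit for us.
  queue.pushTask(done => spiderTask(url, nesting, queue, done))
}

function spiderLinks (currentUrl, body, nesting, queue) {
  if (nesting === 0) {
    return // reached the maximum crawl depth
  }
  getPageLinks(currentUrl, body) // extract links (stand-in below)
    .forEach(link => spider(link, nesting - 1, queue))
}

// Hypothetical stand-ins so the sketch is complete:
function spiderTask (url, nesting, queue, done) {
  console.log(`downloading ${url}`) // the real task downloads and saves
  setTimeout(() => {
    spiderLinks(url, '', nesting, queue) // then crawls the page links
    done()
  }, 100)
}

function getPageLinks (currentUrl, body) {
  return [] // the real helper parses the links out of the page body
}

spider('http://example.com', 1, new TaskQueue(2))
```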