Web Spider Version 4

Explore how to refactor a web spider application using a TaskQueue to manage asynchronous crawling tasks with controlled concurrency. Understand how to schedule and limit parallel downloads, pass the queue through crawling functions, and handle URL bookkeeping for efficient exploration. This lesson provides practical insights into applying callback-based asynchronous control flow patterns in Node.js.

Now that we have our generic queue for executing tasks with a limited degree of parallelism, let’s use it straightaway to refactor our web spider application.

We’re going to use an instance of TaskQueue as a work backlog; every URL that we want to crawl needs to be appended to the queue as a task. The starting URL will be added as the first task, then every other URL discovered during the crawling process will be added as well. The queue will manage all the scheduling for us, making sure that the number of tasks in progress (that is, the number of pages being downloaded or read from the filesystem) at any given time is never greater than the concurrency limit configured for the given TaskQueue instance.
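As a concrete reference, here is a minimal sketch of a callback-based TaskQueue along the lines of the generic queue built in the previous lesson, together with the backlog usage just described. The exact class may differ from the one in the lesson; in particular, the pushTask() name and the done-callback convention are assumptions here.

```javascript
// A minimal callback-based TaskQueue, in the spirit of the generic
// queue from the previous lesson (the real class may differ).
class TaskQueue {
  constructor (concurrency) {
    this.concurrency = concurrency // max number of tasks in progress
    this.running = 0
    this.queue = []
  }

  pushTask (task) {
    this.queue.push(task)
    process.nextTick(this.next.bind(this))
    return this
  }

  next () {
    // Start queued tasks until the concurrency limit is reached
    while (this.running < this.concurrency && this.queue.length > 0) {
      const task = this.queue.shift()
      task(() => {
        this.running--
        process.nextTick(this.next.bind(this))
      })
      this.running++
    }
  }
}

// Usage as a work backlog: every URL to crawl becomes a task.
const downloadQueue = new TaskQueue(2) // never more than 2 pages in flight
downloadQueue.pushTask(done => {
  console.log('downloading the starting URL...')
  setTimeout(done, 100) // stand-in for the real download/save work
})
```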

Adding queue as a new parameter

We’ve already ...
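For illustration, here is a sketch of how the queue might be threaded through the crawling functions as an extra parameter, reusing the TaskQueue sketched above. The spider() and spiderLinks() names follow earlier versions of the web spider, while spiderTask() and getPageLinks() are hypothetical stand-ins for the application’s real helpers; the bodies below are assumptions, not the lesson’s final code.

```javascript
// Sketch: passing the queue through the crawling functions
// (reuses the TaskQueue sketched above; spiderTask() and
// getPageLinks() are hypothetical stand-ins).

const spidering = new Set() // URL bookkeeping: pages already scheduled

function spider (url, nesting, queue) {
  if (spidering.has(url)) {
    return // never schedule the same page twice
  }
  spidering.add(url)
  // The download/save work becomes a queued task, so the queue
  // enforces the concurrency limit for us.
  queue.pushTask(done => spiderTask(url, nesting, queue, done))
}

function spiderLinks (currentUrl, body, nesting, queue) {
  if (nesting === 0) {
    return // reached the maximum crawl depth
  }
  getPageLinks(currentUrl, body) // extract links (stand-in below)
    .forEach(link => spider(link, nesting - 1, queue))
}

// Hypothetical stand-ins so the sketch is complete:
function spiderTask (url, nesting, queue, done) {
  console.log(`downloading ${url}`) // the real task downloads and saves
  setTimeout(() => {
    spiderLinks(url, '', nesting, queue) // then crawls the page links
    done()
  }, 100)
}

function getPageLinks (currentUrl, body) {
  return [] // the real helper parses the links out of the page body
}

spider('http://example.com', 1, new TaskQueue(2))
```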