Web Crawling in JavaScript Using Cheerio

PROJECT


In this project, we will crawl a real-world website using the Cheerio library in Node.js. We will learn to automatically extract URLs from HTML link elements across an entire site. Finally, we will export the collected URLs to CSV.

You will learn to:

Understand the fundamentals of crawling a site.

Build an automated software tool that can crawl an entire site.

Populate a set of URLs discovered in the target site.

Export the discovered URLs to CSV.

Skills

Web Scraping

HTML Elements

Data Collection

Prerequisites

Basic understanding of HTTP and the client/server architecture

Basic understanding of JavaScript

Basic understanding of Node.js

Technologies

HTML

Node.js

Cheerio

JavaScript

Project Description

The Cheerio library in Node.js provides a jQuery-style API for parsing HTML documents and for traversing and manipulating the resulting structure. This makes it well suited to data collection and web crawling.

In this project, we will build a Node.js script that crawls an entire site with Cheerio. We will first download one page of the target site with the Node.js Fetch API. Next, we will use Cheerio to select the HTML link elements with CSS selectors and extract their URLs. We will then repeat this procedure for each newly discovered URL on the site until all pages have been found.
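The repeat-until-done procedure can be sketched as a queue of pages to visit plus a set of URLs already discovered. This is one possible shape for the loop, not necessarily the one the project builds. The names crawl, getHrefs, maxPages, and fakeSite are hypothetical. To keep the sketch runnable without network access, the download-and-extract step is passed in as a function and replaced here by an in-memory lookup.

```javascript
// Crawl a site breadth-first, starting from startUrl.
// getHrefs(url) is a caller-supplied async function (an assumption of this
// sketch) that returns the raw href values found on the page at `url`.
async function crawl(startUrl, getHrefs, maxPages = 100) {
  const start = new URL(startUrl);
  const discovered = new Set([start.href]); // every URL seen so far
  const queue = [start.href];               // URLs still to visit
  let visited = 0;

  while (queue.length > 0 && visited < maxPages) {
    const pageUrl = queue.shift();
    visited += 1;

    for (const href of await getHrefs(pageUrl)) {
      let url;
      try {
        url = new URL(href, pageUrl); // resolve relative links against the page
      } catch {
        continue; // skip malformed hrefs
      }
      url.hash = ''; // treat #fragment variants as the same page

      // Filter out external sites and non-web schemes such as mailto:
      if (url.origin !== start.origin) continue;

      if (!discovered.has(url.href)) {
        discovered.add(url.href);
        queue.push(url.href);
      }
    }
  }
  return [...discovered];
}

// Hypothetical in-memory site: page URL -> hrefs found on that page.
const fakeSite = {
  'https://example.com/': ['/about', '/blog#latest', 'https://other.example/x', 'mailto:hi@example.com'],
  'https://example.com/about': ['/', '/team'],
  'https://example.com/blog': ['/about'],
  'https://example.com/team': [],
};
const getHrefsFromFakeSite = async (url) => fakeSite[url] ?? [];

crawl('https://example.com/', getHrefsFromFakeSite).then((pages) => {
  console.log(pages);
});
```

In a real run, getHrefs would download the page with fetch and extract the href values with Cheerio. The Set guarantees that each page is queued at most once, and maxPages is a safety limit so that a very large site cannot keep the loop running indefinitely.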

Finally, we will use Node.js file I/O to export the discovered URLs to a human-readable CSV file.
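A minimal sketch of such an export, using only built-in Node.js modules, might look like the following. The helper names escapeCsvField and toCsv, the sample URLs, the single url column, and the links.csv file name are assumptions made for illustration. The file is written to the system temporary directory so that the example has no side effects on a project folder.

```javascript
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// Quote a field when it contains a comma, a double quote, or a line break,
// doubling any embedded double quotes (the RFC 4180 convention).
function escapeCsvField(value) {
  const text = String(value);
  return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

// Build CSV text from a header row and an array of data rows.
function toCsv(header, rows) {
  return [header, ...rows]
    .map((row) => row.map(escapeCsvField).join(','))
    .join('\n') + '\n';
}

// Hypothetical crawl result.
const urls = [
  'https://example.com/',
  'https://example.com/search?q=a,b', // contains a comma, so it gets quoted
];

const csv = toCsv(['url'], urls.map((url) => [url]));

// Write the file to the temporary directory.
const outputPath = path.join(os.tmpdir(), 'links.csv');
fs.writeFileSync(outputPath, csv, 'utf8');
console.log(`Wrote ${urls.length} URLs to ${outputPath}`);
```

Quoting matters even for a single-column file, because URLs may legitimately contain commas in their query strings.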

Project Tasks

1. Initial Setup

Task 0: Get Started

2. Implement Link Discovery Logic

Task 1: Navigate to a Web Page

Task 2: Select All Link HTML Elements

Task 3: Extract URLs from the Links

Task 4: Filter Out Undesired URLs

Task 5: Encapsulate the Link Discovery Logic in a Function

3. Crawl the Entire Site

Task 6: Initialize Data Structures for Web Crawling

Task 7: Loop through the Pages to Crawl

Task 8: Create a CSV File from the Pages Discovered

Congratulations!