

Web Crawling in JavaScript Using Cheerio

In this project, we will crawl a real-world website using the features provided by the Cheerio library in Node.js. We will learn to automatically extract URLs from HTML link elements across an entire site. Lastly, we will export the collected data to CSV.


You will learn to:

Understand the fundamentals of crawling a site.

Build an automated software tool that can crawl an entire site.

Populate a set of URLs discovered on the target site.

Export the discovered URLs to CSV.


Skills

Web Scraping

HTML Elements

Data Collection

Prerequisites

Basic understanding of HTTP and the client/server architecture

Basic understanding of JavaScript

Basic understanding of Node.js







Project Description

The Cheerio library in Node.js provides a powerful API for parsing HTML documents. It can easily traverse and manipulate HTML structures, making it an ideal choice for data collection and web crawling.

In this project, we will build a Node.js script that uses Cheerio to crawl an entire site. We will download one page of the target site with the Node.js Fetch API. Next, we will use Cheerio’s functions to select HTML link elements using CSS selectors, extract their URLs, and repeat this procedure for the newly found URLs until all pages have been discovered.

Finally, we will take advantage of the Node.js I/O capabilities to export the scraped data in human-readable CSV format.

Project Tasks


Initial Setup

Task 0: Get Started


Implement Link Discovery Logic

Task 1: Navigate to a Web Page

Task 2: Select All Link HTML Elements

Task 3: Extract URLs from the Links

Task 4: Filter Out Undesired URLs

Task 5: Encapsulate the Link Discovery Logic in a Function


Crawl the Entire Site

Task 6: Initialize Data Structures for Web Crawling

Task 7: Loop through the Pages to Crawl

Task 8: Create a CSV File from the Pages Discovered
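The crawl logic in Tasks 6 and 7 can be sketched as a breadth-first loop over a queue of pages to visit, with a `Set` tracking every URL discovered so far. To keep the example self-contained and runnable offline, the link-discovery step is injected as a function over an in-memory "site"; in the project it would instead download each page and extract its URLs with Cheerio:

```javascript
// Breadth-first sketch of the crawl loop (Tasks 6-7).
function crawlSite(startUrl, getLinks) {
  const discovered = new Set([startUrl]); // every URL seen so far
  const queue = [startUrl];               // URLs still waiting to be visited

  while (queue.length > 0) {
    const current = queue.shift();
    for (const link of getLinks(current)) {
      if (!discovered.has(link)) {
        discovered.add(link);
        queue.push(link); // schedule newly found pages for crawling
      }
    }
  }
  return [...discovered];
}

// Tiny in-memory "site" standing in for real fetch + Cheerio extraction.
const fakeSite = {
  'https://example.com/': ['https://example.com/a', 'https://example.com/b'],
  'https://example.com/a': ['https://example.com/', 'https://example.com/c'],
  'https://example.com/b': [],
  'https://example.com/c': [],
};

const pages = crawlSite('https://example.com/', (url) => fakeSite[url] ?? []);
console.log(pages.length); // → 4
```

Using a `Set` makes the duplicate check constant-time and guarantees the loop terminates once every reachable page has been visited exactly once.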