PROJECT


Headless Web Scraping Using Puppeteer

In this project, we’ll learn to scrape text, images, and URLs from a web page. We’ll also use several Puppeteer commands to fetch data as HTML elements. Lastly, we’ll automate the scraping using schedulers.

You will learn to:

Scrape text data from web pages.

Scrape HTML data to create PDFs.

Scrape images from web pages.

Schedule the scraping.

Skills

Web Scraping

Data Collection

Task Automation

Prerequisites

Intermediate understanding of JavaScript

Basic understanding of Node.js

Basic understanding of cron

Technologies

Node.js

Puppeteer

JavaScript

Project Description

Puppeteer is a Node library that controls browsers through an API. It was originally designed to work only with Chromium-based browsers, but it now supports other browsers as well, such as Firefox. It runs in headless mode by default, but it can also be configured to run in non-headless mode with a visible browser window.
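
As a rough illustration of the two modes, the sketch below launches a browser and opens a page. The `showBrowser` flag, the helper names `launchOptions` and `openPage`, and the address in the usage note are assumptions made for this example, not part of the project code.

```javascript
// Build the options object passed to puppeteer.launch().
// `showBrowser` is a hypothetical flag introduced for this sketch.
function launchOptions(showBrowser = false) {
  // headless: true matches Puppeteer's default; headless: false opens a visible window.
  return { headless: !showBrowser };
}

// Launch a browser, open a new tab, and navigate to `url`.
async function openPage(url, showBrowser = false) {
  // Loaded lazily so launchOptions() can be used without Puppeteer installed.
  const puppeteer = require('puppeteer');
  const browser = await puppeteer.launch(launchOptions(showBrowser));
  const page = await browser.newPage();
  // Wait until network activity has mostly settled before returning.
  await page.goto(url, { waitUntil: 'networkidle2' });
  return { browser, page };
}
```

For instance, `const { browser, page } = await openPage('http://localhost:3000');` would open a locally running application in headless mode (the address is hypothetical). Call `await browser.close()` when finished so the browser process does not linger.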

In this project, we’ll build a Node application that scrapes data from a web-based e-library application using Puppeteer and a headless Chromium browser. Throughout the project, we’ll use several Puppeteer functions to fetch HTML elements by their CSS class names and HTML tags.
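
Fetching elements by class name and tag could look roughly like the sketch below. It assumes a Puppeteer `page` that has already navigated to the target URL. The selectors, in particular `.book-description`, are hypothetical placeholders and would need to match the markup of the actual e-library pages.

```javascript
// Hypothetical selectors: one CSS class name and two HTML tags.
const SELECTORS = { description: '.book-description', links: 'a', images: 'img' };

// Collect the description text, link URLs, and image URLs from an open page.
async function scrapePage(page, selectors = SELECTORS) {
  // $eval runs the callback in the browser on the first matching element
  // and throws if nothing matches the selector.
  const description = await page.$eval(selectors.description, (el) =>
    el.textContent.trim()
  );
  // $$eval runs the callback on an array of every matching element.
  const links = await page.$$eval(selectors.links, (els) => els.map((a) => a.href));
  const images = await page.$$eval(selectors.images, (els) =>
    els.map((img) => img.src)
  );
  return { description, links, images };
}
```

Because the function only relies on `$eval` and `$$eval`, it can be exercised with a small stub object in place of a real page, which is convenient when adjusting selectors.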

Finally, we’ll use Node.js functions to automate the scraping so that it runs on a schedule.
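
One possible shape for the scheduling step, using the node-cron package, is sketched below. The helper names `nonOverlapping` and `scheduleScrape`, the hourly cron expression, and the overlap guard are assumptions for illustration. The project may structure this differently.

```javascript
// Wrap a job so that a new run is skipped while the previous one is still in progress.
function nonOverlapping(job) {
  let running = false;
  return async () => {
    if (running) return false; // previous run not finished: skip this tick
    running = true;
    try {
      await job();
      return true; // this tick actually ran the job
    } finally {
      running = false;
    }
  };
}

// Schedule `job` with a cron expression; '0 * * * *' means minute 0 of every hour.
function scheduleScrape(job, expression = '0 * * * *') {
  // Loaded lazily; install with `npm install node-cron`.
  const cron = require('node-cron');
  return cron.schedule(expression, nonOverlapping(job));
}
```

The guard matters when a scrape can take longer than the interval between ticks, since two browser sessions writing the same output files at once could corrupt the results.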

Project Tasks

1

Introduction

Task 0: Run the Next.js Application

Task 1: Access the Web Page

Task 2: Take a Web Page Screenshot

2

Extract Data

Task 3: Extract the Description from the Text

Task 4: Extract the Links from the Screen

Task 5: Extract Images from the Web Page

Task 6: Save the Extracted Images

Task 7: Create a PDF File from the Collected Data

3

Schedule

Task 8: Automate the Scraping

Task 9: Use node-cron to Automate Scraping

Congratulations!