Web page parsing, also known as web scraping or web crawling, is a technique for extracting structured data from an HTML document. In modern web development, extracting specific information by parsing a web page is a valuable skill and a common requirement. Web scraping involves navigating the Document Object Model (DOM), the hierarchical structure of an HTML document. We usually scrape websites that do not offer data retrieval through an API or in a downloadable form.
Web page parsing has hundreds of use cases. Some of the most common are listed below:
Lead generation: Companies use web crawling to extract contact information from forums or social media websites to generate leads and find potential customers.
Content/news aggregation: News aggregator services such as Google News use web crawling to collect articles and posts from around the world and show them to their users.
Price monitoring: E-commerce websites use web crawling to parse their competitors’ websites and extract the pricing of different products. They then use this information to offer more competitive pricing to their customers.
Search engine indexing: Search engines like Google and Bing parse billions of web pages daily to index new pages and retrieve information against search queries.
Weather forecasting: Climate researchers parse the web pages of different weather forecasting providers to monitor climate patterns.
A typical flow to scrape a web page is as follows:
Making an HTTP request: The first step is to send an HTTP request to retrieve the target web page as HTML.
Parsing HTML: The next step is to parse the DOM and navigate through different elements to reach the specific area of the website containing the required data.
Extracting data: After reaching the specific area of the website, it’s time to extract the data. The data is typically text content, or values retrieved from the attributes of the elements.
Cleaning data (optional): Once we have the required data, we might need to clean it. For example, we might need to split the text with a particular delimiter.
Displaying or saving data: When we have the required data in the desired format, we can display it somewhere or save it permanently.
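The five steps above can be sketched in PHP. The sketch below uses only PHP’s built-in DOM extension and a hard-coded HTML fragment so that it is self-contained; in practice, step 1 would fetch the page over HTTP (for example, with cURL). The quote text and author are illustrative values, not taken from any real site.

```php
<?php
// Step 1 (simulated): in a real scraper, this HTML would come from an
// HTTP request; here it is hard-coded to keep the example self-contained.
$html = '<html><body>
  <div class="quote"><span class="text">"Be yourself."</span>
  <small class="author">Oscar Wilde</small></div>
</body></html>';

// Step 2: parse the HTML into a DOM tree we can navigate.
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Step 3: extract the data from the target elements.
$text   = $xpath->query('//span[@class="text"]')->item(0)->textContent;
$author = $xpath->query('//small[@class="author"]')->item(0)->textContent;

// Step 4: clean the data (here, strip the surrounding quotation marks).
$text = trim($text, '"');

// Step 5: display (or save) the result.
echo "$text by $author\n"; // prints: Be yourself. by Oscar Wilde
```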
While many packages are available to parse a web page in PHP, Symfony DomCrawler provides an easy and convenient way to traverse the DOM. To start with web scraping, let’s extract the quotes and their authors listed on the website quotes.toscrape.com using Symfony DomCrawler.
Let’s install Symfony DomCrawler with Composer, a dependency management tool for PHP:
composer require symfony/dom-crawler
Before we can parse a web page, we must analyze its DOM to see which page elements we need to parse to extract our required data. Let’s explore the DOM of the quotes website:
```html
<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
  <span class="text" itemprop="text">“Contents of the quote...”</span>
  <span>by
    <small class="author" itemprop="author">Author of the quote...</small>
    <a href="/author/author-name">(about)</a>
  </span>
  <div class="tags">Tags:
    <meta class="keywords" itemprop="keywords" content="change,deep-thoughts,thinking,world">
    <a class="tag" href="/tag/change/page/1/">change</a>
    <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
    <a class="tag" href="/tag/thinking/page/1/">thinking</a>
    <a class="tag" href="/tag/world/page/1/">world</a>
  </div>
</div>
```
Since the web page has structured data, all the quotes will follow the same HTML structure.
We can see that the quote text is wrapped in a `span` tag with the `text` class, and the author’s name is wrapped in a `small` tag with the `author` class.
The XML Path Language (XPath) is a query language for navigating the elements of an HTML document. It offers a simple way to target the required elements in the DOM. For example, to select all the `div` elements with the `quote` class, we can use the following expression:

```
//div[@class='quote']
```
In the example above:
`//` selects elements from any location in the DOM.
`div` selects all `div` elements in the DOM.
`[@class='quote']` selects only the elements whose `class` attribute equals `quote`.
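As a quick check of this expression, the sketch below runs it with PHP’s built-in `DOMXPath` against a small hard-coded fragment modeled on the quotes markup (Symfony DomCrawler accepts the same expressions through its `filterXPath()` method):

```php
<?php
// A fragment with two matching divs and one non-matching div, modeled on
// the quote markup shown earlier. The text contents are illustrative.
$html = '<div class="quote"><span class="text">"Quote one"</span></div>
         <div class="quote"><span class="text">"Quote two"</span></div>
         <div class="other"><span class="text">"Not a quote"</span></div>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// //div[@class='quote'] matches both quote divs, wherever they appear,
// but not the div whose class attribute is "other".
$quotes = $xpath->query("//div[@class='quote']");
echo $quotes->length; // prints: 2
```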
Let’s extract all the quotes from the quotes website and display the quote text and author name.
```php
<?php

require 'vendor/autoload.php';
use Symfony\Component\DomCrawler\Crawler;

function fetchHTML($url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $html = curl_exec($ch);
    curl_close($ch);
    return $html;
}

$url = "https://quotes.toscrape.com/";
$html = fetchHTML($url);

$crawler = new Crawler($html);

$quotes = $crawler->filterXPath('//div[@class="quote"]')->each(function (Crawler $node, $i) {
    $text = $node->filterXPath('.//span[@class="text"]')->text();
    $author = $node->filterXPath('.//small[@class="author"]')->text();
    return compact('text', 'author');
});
foreach ($quotes as $quote) {
    echo "Quote: {$quote['text']}<br>";
    echo "Author: {$quote['author']}<br>";
    echo "<br><br>";
}
```
Lines 3–4: We include Composer’s autoloader and import the Symfony DomCrawler `Crawler` class.
Lines 6–12: We implement a function to take a URL as input and return its HTML content as output.
Lines 14–15: We pass the URL to `fetchHTML` and get its HTML content.
Line 17: We create a new crawler instance by passing the HTML.
Line 19: We get and loop over all the `div` elements with the attribute `class="quote"`.
Line 20: We get the quote text by extracting the content of the `span` element with the attribute `class="text"`.
Line 21: We get the quote author by extracting the content of the `small` element with the attribute `class="author"`.
Line 22: We return each quote’s text and author as an associative array; `each()` collects these into the `$quotes` array.
Lines 24–28: We display all the quotes and their authors on the screen.
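The quote markup shown earlier also stores each quote’s tags in a `meta` element’s `content` attribute, as a comma-separated string. The sketch below uses PHP’s built-in `DOMXPath` on a hard-coded fragment (modeled on that markup) to show the extract-then-clean pattern: read an attribute value, then split it on a delimiter. With DomCrawler, the attribute would be read via the `attr('content')` method instead.

```php
<?php
// A fragment modeled on the tags portion of the quote markup shown earlier.
$html = '<div class="quote">
           <meta class="keywords" itemprop="keywords"
                 content="change,deep-thoughts,thinking,world">
         </div>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Extract the raw attribute value...
$raw = $xpath->query('//meta[@class="keywords"]')->item(0)->getAttribute('content');

// ...then clean it by splitting on the comma delimiter.
$tags = explode(',', $raw);
echo implode(' | ', $tags); // prints: change | deep-thoughts | thinking | world
```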
Parsing web pages with PHP is an effective way to extract meaningful data, even when the website provides no API or other official means of access. However, we should consider a few factors while scraping a web page, such as respecting the site’s terms of service and implementing request throttling.
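As an illustration of request throttling, here is a minimal sketch that pauses between successive requests so the target site is not flooded. The `fetchPolitely` helper name and the one-second default delay are illustrative choices, not part of any library.

```php
<?php
// Fetch a list of URLs, sleeping between requests to throttle the scraper.
// Assumes the cURL extension is available, as in the main example above.
function fetchPolitely(array $urls, int $delaySeconds = 1): array {
    $pages = [];
    foreach ($urls as $i => $url) {
        if ($i > 0) {
            sleep($delaySeconds); // wait before every request after the first
        }
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        $pages[$url] = curl_exec($ch);
        curl_close($ch);
    }
    return $pages;
}
```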