Web page parsing, also known as web scraping or web crawling, is a technique for extracting structured data from an HTML document. In modern web development, extracting specific information by parsing a web page is a valuable skill and a common requirement. Web scraping involves navigating the Document Object Model (DOM), the hierarchical structure of an HTML document. We usually scrape websites that do not offer data retrieval through an API or in a downloadable form.
Web page parsing has hundreds of use cases. Some of the most common are listed below:
Lead generation: Companies use web crawling to extract contact information from forums or social media websites to generate leads and find potential customers.
Content/news aggregation: News aggregator services such as Google News use web crawling to collect articles and posts from around the world and show them to their users.
Price monitoring: E-commerce websites use web crawling to parse their competitors’ websites and extract the pricing of different products. They then use this information to offer more competitive pricing to their customers.
Search engine indexing: Search engines like Google and Bing parse billions of web pages daily to index new pages and retrieve information against search queries.
Weather forecasting: Climate researchers parse the web pages of different weather forecasting providers to monitor climate patterns.
A typical flow to scrape a web page is as follows:
Making an HTTP request: The first step is to send an HTTP request to retrieve the target web page as HTML.
Parsing HTML: The next step is to parse the DOM and navigate through different elements to reach the specific area of the website containing the required data.
Extracting data: After reaching the specific area of the website, it’s time to extract the data. The data is typically text content, or values retrieved from the attributes of the elements.
Cleaning data (optional): Once we have the required data, we might need to clean it. For example, we might need to split the text with a particular delimiter.
Displaying or saving data: When we have the required data in the desired format, we can display it somewhere or save it permanently.
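The five steps above can be sketched in PHP. The sketch below uses only PHP’s built-in DOM extension and a hard-coded HTML fragment so that it is self-contained; in practice, step 1 would fetch the page over HTTP (for example, with cURL). The quote text and author are illustrative values, not taken from any real site.

```php
<?php
// Step 1 (simulated): in a real scraper, this HTML would come from an
// HTTP request; here it is hard-coded to keep the example self-contained.
$html = '<html><body>
  <div class="quote"><span class="text">"Be yourself."</span>
  <small class="author">Oscar Wilde</small></div>
</body></html>';

// Step 2: parse the HTML into a DOM tree we can navigate.
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Step 3: extract the data from the target elements.
$text   = $xpath->query('//span[@class="text"]')->item(0)->textContent;
$author = $xpath->query('//small[@class="author"]')->item(0)->textContent;

// Step 4: clean the data (here, strip the surrounding quotation marks).
$text = trim($text, '"');

// Step 5: display (or save) the result.
echo "$text by $author\n"; // prints: Be yourself. by Oscar Wilde
```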
While many packages are available to parse a web page in PHP, Symfony DomCrawler provides an easy and convenient way to traverse the DOM. To start with web scraping, let’s extract the quotes and their authors listed on the website quotes.toscrape.com using Symfony DomCrawler.
Let’s install Symfony DomCrawler with Composer, a dependency management tool for PHP:
composer require symfony/dom-crawler
Before we can parse a web page, we must analyze its DOM to see which page elements we need to parse to extract our required data. Let’s explore the DOM of the quotes website:
```html
<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
  <span class="text" itemprop="text">“Contents of the quote...”</span>
  <span>by
    <small class="author" itemprop="author">Author of the quote...</small>
    <a href="/author/author-name">(about)</a>
  </span>
  <div class="tags">Tags:
    <meta class="keywords" itemprop="keywords" content="change,deep-thoughts,thinking,world">
    <a class="tag" href="/tag/change/page/1/">change</a>
    <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
    <a class="tag" href="/tag/thinking/page/1/">thinking</a>
    <a class="tag" href="/tag/world/page/1/">world</a>
  </div>
</div>
```
Since the web page has structured data, all the quotes will follow the same HTML structure.
We can see that the quote text is wrapped in a `span` tag with the `text` class, and the author’s name is wrapped in a `small` tag with the `author` class.
The XML Path Language (XPath) is a query language for navigating the elements of an HTML document. It offers a simple way to target the required elements in the DOM. For example, to select all the `div` elements with the `quote` class, we can use the following expression:

```
//div[@class='quote']
```
In the example above:
`//` selects elements from any location in the DOM.
`div` selects all `div` elements in the DOM.
`[@class='quote']` selects only the elements whose `class` attribute equals `quote`.
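As a quick check of this expression, the sketch below runs it with PHP’s built-in `DOMXPath` against a small hard-coded fragment modeled on the quotes markup (Symfony DomCrawler accepts the same expressions through its `filterXPath()` method):

```php
<?php
// A fragment with two matching divs and one non-matching div, modeled on
// the quote markup shown earlier. The text contents are illustrative.
$html = '<div class="quote"><span class="text">"Quote one"</span></div>
         <div class="quote"><span class="text">"Quote two"</span></div>
         <div class="other"><span class="text">"Not a quote"</span></div>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// //div[@class='quote'] matches both quote divs, wherever they appear,
// but not the div whose class attribute is "other".
$quotes = $xpath->query("//div[@class='quote']");
echo $quotes->length; // prints: 2
```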
Let’s extract all the quotes from the quotes website and display the quote text and author name.
```php
<?php

require 'vendor/autoload.php';
use Symfony\Component\DomCrawler\Crawler;

function fetchHTML($url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $html = curl_exec($ch);
    curl_close($ch);
    return $html;
}

$url = "https://quotes.toscrape.com/";
$html = fetchHTML($url);

$crawler = new Crawler($html);

$quotes = $crawler->filterXPath('//div[@class="quote"]')->each(function (Crawler $node, $i) {
    $text = $node->filterXPath('.//span[@class="text"]')->text();
    $author = $node->filterXPath('.//small[@class="author"]')->text();
    return compact('text', 'author');
});
foreach ($quotes as $quote) {
    echo "Quote: {$quote['text']}<br>";
    echo "Author: {$quote['author']}<br>";
    echo "<br><br>";
}
```
Lines 3–4: We include Composer’s autoloader and import the Symfony DomCrawler `Crawler` class.
Lines 6–12: We implement a function to take a URL as input and return its HTML content as output.
Lines 14–15: We pass the URL to `fetchHTML` and get its HTML content.
Line 17: We create a new crawler instance by passing the HTML.
Line 19: We get and loop over all the `div` elements with the attribute `class="quote"`.
Line 20: We get the quote text by extracting the content of the `span` element with the attribute `class="text"`.
Line 21: We get the quote author by extracting the content of the `small` element with the attribute `class="author"`.
Line 22: We return each quote’s text and author as an associative array; `each()` collects these into the `$quotes` array.
Lines 24–28: We display all the quotes and their authors on the screen.
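The quote markup shown earlier also stores each quote’s tags in a `meta` element’s `content` attribute, as a comma-separated string. The sketch below uses PHP’s built-in `DOMXPath` on a hard-coded fragment (modeled on that markup) to show the extract-then-clean pattern: read an attribute value, then split it on a delimiter. With DomCrawler, the attribute would be read via the `attr('content')` method instead.

```php
<?php
// A fragment modeled on the tags portion of the quote markup shown earlier.
$html = '<div class="quote">
           <meta class="keywords" itemprop="keywords"
                 content="change,deep-thoughts,thinking,world">
         </div>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Extract the raw attribute value...
$raw = $xpath->query('//meta[@class="keywords"]')->item(0)->getAttribute('content');

// ...then clean it by splitting on the comma delimiter.
$tags = explode(',', $raw);
echo implode(' | ', $tags); // prints: change | deep-thoughts | thinking | world
```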
Parsing web pages with PHP is an effective way to extract meaningful data, even when the website provides no API or other official means of access. However, we should consider a few factors while scraping a web page, such as respecting the site’s terms of service and implementing request throttling.
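As an illustration of request throttling, here is a minimal sketch that pauses between successive requests so the target site is not flooded. The `fetchPolitely` helper name and the one-second default delay are illustrative choices, not part of any library.

```php
<?php
// Fetch a list of URLs, sleeping between requests to throttle the scraper.
// Assumes the cURL extension is available, as in the main example above.
function fetchPolitely(array $urls, int $delaySeconds = 1): array {
    $pages = [];
    foreach ($urls as $i => $url) {
        if ($i > 0) {
            sleep($delaySeconds); // wait before every request after the first
        }
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        $pages[$url] = curl_exec($ch);
        curl_close($ch);
    }
    return $pages;
}
```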