Web Scraping with Beautiful Soup
Discover the key features and applications of the Beautiful Soup library.
Up to this point, we have acquired the necessary skills to make HTTP requests and retrieve the HTML document from a website. It's time to delve deeper and extract the relevant information from the DOM.
Introduction
Beautiful Soup is a widely used Python library for web scraping and parsing HTML and XML documents. It offers a straightforward and flexible way to navigate and extract data from web pages, making it an indispensable tool for anyone who needs to gather and analyze data from the web. Beautiful Soup can handle various parsing tasks, such as searching for and manipulating tags, attributes, and text within HTML documents. Thanks to its user-friendly syntax and robust functionality, it has become a preferred choice for developers and data scientists who want to extract and process web data efficiently. In this lesson, we will explore the key features and applications of the Beautiful Soup library.
Note: It is recommended to inspect the URLs we will use in this lesson in a separate tab to gain a better understanding of the code paths.
Installation
We can install the Beautiful Soup library in any Python environment by running the command pip install beautifulsoup4.
Usage
Let’s briefly look at using it. The prettify() method produces a nicely formatted Unicode string of the parse tree, with each tag and each string on its own line.
Note: To handle the decoding process effectively, it is always better to use .content instead of .text while using Beautiful Soup.
Once the document is parsed, the output can be handled as a data structure (tree), and we can access its elements like any other Python object attribute.
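As a minimal sketch of this workflow (assuming the Quotes to Scrape site at https://quotes.toscrape.com and the requests library from the previous lessons; the parser choice here is ours, not mandated by the lesson):

```python
import requests
from bs4 import BeautifulSoup

# Fetch the page and parse the raw bytes (.content) with Python's built-in HTML parser.
response = requests.get("https://quotes.toscrape.com")
soup = BeautifulSoup(response.content, "html.parser")

# prettify() returns a neatly indented string of the whole parse tree.
print(soup.prettify()[:300])

# Once parsed, elements are reachable as attributes of the tree.
print(soup.title)         # <title>Quotes to Scrape</title>
print(soup.title.string)  # Quotes to Scrape
```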
Attributes
Since the parsed document behaves like a tree of objects, it is worth reviewing several significant attributes of that tree and what they return.
.tag
Returns the HTML object with the selected tag. It can be used consecutively to reach a specific tag by following its children.

.contents vs. .children
Children of a tag can be found in the .contents list. Instead of retrieving the list, we may use the .children generator to iterate through a tag’s children.

.descendants
Recursively returns all the children and their children (all the sub-HTML trees) of the tag.

.strings vs. .stripped_strings
.strings returns all strings in the HTML document, including whitespace characters and strings nested within tags, while .stripped_strings returns only non-empty strings that contain visible text and strips leading and trailing whitespace from each string.

.parent vs. .parents
.parent returns the immediate parent of the current tag, while .parents returns an iterator that allows iterating over all the parents of the current tag.

.next_sibling vs. .previous_sibling
.next_sibling returns the following sibling tag of the current tag, while .previous_sibling returns the previous sibling tag of the current tag.

.next_element vs. .previous_element
.next_element returns the next element in the parse tree after the current element, while .previous_element returns the previous element in the parse tree before the current element.
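To make these concrete, here is a short, hedged sketch exercising a few of them on the same soup object as above (the exact output depends on the live page's markup):

```python
import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get("https://quotes.toscrape.com").content, "html.parser")

# .tag chaining: follow children to reach a specific tag.
print(soup.head.title)

# .contents is a list; .children is a generator over the same nodes.
print(len(soup.head.contents))
for child in soup.head.children:
    print(child.name)  # strings between tags (e.g., whitespace) have name None

# .stripped_strings skips whitespace-only strings and trims the rest.
print(list(soup.title.stripped_strings))  # ['Quotes to Scrape']

# .parent climbs one level; .parents iterates up to the document root.
print(soup.title.parent.name)                # head
print([p.name for p in soup.title.parents])  # ['head', 'html', '[document]']

# Siblings sit at the same level; whitespace between tags counts as a sibling.
print(repr(soup.title.next_sibling))
```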
Try it yourself
Explore some of the above attributes yourself on the Quotes to Scrape website:
Searching the DOM
In Beautiful Soup, find_all() is a method that searches the entire parse tree of an HTML or XML document and returns a list of all the matching elements. It is a powerful method that can match any element in the document based on its tag name, attributes, values, and other criteria. The find() method returns only the first matching element, while find_all() returns all of them.
Let’s scrape the data from the Quotes to Scrape website using the find_all method:
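The lesson's embedded code isn't reproduced here, so below is a hedged reconstruction of the logic the following walkthrough describes (it assumes the Quotes to Scrape markup, where each quote sits in a <div class="quote"> and its text in a <span class="text">):

```python
import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get("https://quotes.toscrape.com").content, "html.parser")

# Search for all the <div> elements holding each quote's information.
quotes = soup.find_all("div", class_="quote")

# For each quote, find the <span> with the text and extract it via .string.
for quote in quotes:
    text_span = quote.find("span", class_="text")
    print(text_span.string)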
We first search for all the <div> elements that hold each quote's information. Then we iterate through all of them, search for the <span> tag that holds the quote's text, and extract it using the .string attribute.
Try it yourself
Try doing the same in the code below. Scrape all the authors' names from the first page.
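One possible solution sketch (assuming, as on the live site, that each author's name sits in a <small class="author"> element):

```python
import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get("https://quotes.toscrape.com").content, "html.parser")

# Each author's name sits in a <small class="author"> element.
for author in soup.find_all("small", class_="author"):
    print(author.string)
```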
We have successfully retrieved information from the first page, but our goal is to scrape the entire site. To accomplish this, we need to iterate through all the page URLs and retrieve the quotes from each one.
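Since the lesson's embedded code isn't shown here, the sketch below reconstructs the approach the following walkthrough describes; the names get_quotes() and scrape() come from that walkthrough, while the <li class="next"> path is an assumption based on the live site's markup:

```python
import requests
from bs4 import BeautifulSoup

def get_quotes(soup):
    # Scrape all the quotes' text using the code we built above.
    for quote in soup.find_all("div", class_="quote"):
        print(quote.find("span", class_="text").string)

def scrape(url):
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    get_quotes(soup)

    # The "Next" button sits inside <li class="next"> on every page but the last.
    next_page = soup.find("li", class_="next")
    if next_page is None:
        return  # last page reached

    # The href is relative, so join it with the current URL to make it absolute.
    next_url = requests.compat.urljoin(url, next_page.a["href"])
    scrape(next_url)

scrape("https://quotes.toscrape.com")
```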
We define a function get_quotes() that takes the soup object and scrapes all the quotes' text using the code we built above. We then inspect the next page element and get it by specifying its path in the DOM. The last page won't have a next page element, so we check whether the next_page variable holds something or is None. We extract the next page URL from the element; however, the URL doesn't contain the domain name, so we use the requests.compat.urljoin() function, which joins two URLs together. Lastly, we call the scrape() function with the next page URL and repeat the whole process until we reach the last page of the site.
There is an easier way to do the task above. Using a simple for-loop, we can get a list of all the page URLs and then request each one. However, implementing the earlier method helps us understand different approaches that can be useful in more complex scenarios.
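For illustration, that simpler variant might look like the sketch below (the /page/<n>/ URL pattern and the ten-page count are properties of the live site, not something the lesson states):

```python
import requests
from bs4 import BeautifulSoup

BASE = "https://quotes.toscrape.com"

# Build the list of page URLs up front, then simply request each one in turn.
urls = [f"{BASE}/page/{n}/" for n in range(1, 11)]

for url in urls:
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    for quote in soup.find_all("div", class_="quote"):
        print(quote.find("span", class_="text").string)
```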
Try it yourself
The Quotes to Scrape website displays the top ten tags on the right side. Can you scrape all the URLs for these tags?
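A possible sketch (assuming the sidebar lives in a <div class="tags-box"> and each tag link is an <a class="tag">, which matches the live page; verify against the DOM yourself):

```python
import requests
from bs4 import BeautifulSoup

BASE = "https://quotes.toscrape.com"
soup = BeautifulSoup(requests.get(BASE).content, "html.parser")

# The top ten tags sit in the sidebar's <div class="tags-box">.
tags_box = soup.find("div", class_="tags-box")
for tag in tags_box.find_all("a", class_="tag"):
    # The hrefs are relative, so join them with the base URL.
    print(requests.compat.urljoin(BASE, tag["href"]))
```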
Other useful functions
Some other functions can be used in more complex scenarios as follows:
find_parent() / find_parents()
find_next_sibling() / find_next_siblings()
find_previous_sibling() / find_previous_siblings()
find_next() / find_all_next()
find_previous() / find_all_previous()
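The walkthrough below refers to the lesson's embedded example; a hedged reconstruction of it, under the same markup assumptions as before, might be:

```python
import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get("https://quotes.toscrape.com").content, "html.parser")

# Extract the "Top ten tags" heading: start from the first tag item and walk
# back to its previous sibling, the <h2> holding the text.
first_tag = soup.find("span", class_="tag-item")
print(first_tag.find_previous_sibling("h2").string)

# Extract the author's name together with the "by" word.
quote = soup.find("div", class_="quote")
text_span = quote.find("span", class_="text")

# The <span> that contains "by ..." is the next sibling of the text span.
author_span = text_span.find_next_sibling("span")

# With string=True, find_next() returns the next string in the tree,
# which is the "by" text at the start of that <span>.
by_word = author_span.find_next(string=True)
print(by_word.strip(), author_span.find("small", class_="author").string)
```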
We want to extract the "Top ten tags" text. One way to do it is to get the first tag item, love, and then find its previous sibling using the function find_previous_sibling(), which returns the <h2> tag that holds the text.
We also want to extract the author's name together with the "by" word. First, we get the quote's <span class='text'> element by following its path starting from the <div class='quote'>. The "by" word is the string that immediately follows inside the next <span> element, which is the next sibling of <span class='text'>. Thus, we get that sibling with find_next_sibling() and then call find_next(), passing string=True to include strings as the following elements.
The above example may be more than necessary for the specific use case, but it demonstrates the utilization of these functions to extract any desired information from the page.
Conclusion
This lesson covered searching and navigating the DOM structure and scraping website information. With this knowledge, it is possible to retrieve the desired data from any website by making appropriate requests and utilizing the functions provided by the Beautiful Soup library.