How to find elements by class using Beautiful Soup

Beautiful Soup is a Python library used for web scraping and parsing HTML and XML documents. When working with HTML documents, we often use CSS classes to style and structure elements on a webpage. These CSS classes are essential for applying specific styles or grouping elements with similar characteristics. Sometimes, during web scraping or data extraction tasks, we need to target and retrieve elements based on their class attribute.

Installing Beautiful Soup

Before proceeding, ensure that you have Beautiful Soup installed. If not, you can install it using pip:

pip install beautifulsoup4

Importing Beautiful Soup

First, we import the BeautifulSoup class from the bs4 package:

from bs4 import BeautifulSoup

Parsing the HTML

To start, we need to parse the HTML document using Beautiful Soup. We can obtain the HTML content from a URL or from a local file. For example, if we have the HTML content in a string called html_content, we can parse it like this:

soup = BeautifulSoup(html_content, 'html.parser')
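To make the parsing step concrete, here is a minimal sketch that defines html_content inline. The markup is a made-up sample for illustration, not the sample.html file used by this article's examples.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, invented for this illustration.
html_content = """
<html>
  <body>
    <div class="header">Site title</div>
    <p class="intro">Welcome!</p>
    <div class="header">Section title</div>
  </body>
</html>
"""

# 'html.parser' is the parser bundled with Python, so nothing extra is installed.
soup = BeautifulSoup(html_content, 'html.parser')

# Accessing a tag name as an attribute returns the first matching tag.
print(soup.p)
```

The resulting soup object is what the find(), find_all(), and select() calls in the following sections operate on.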

Finding elements by class name

Here are the three methods of Beautiful Soup that allow selecting elements by their class name:

  • find()

  • find_all()

  • select()

Using the find() method

The find() method allows us to locate the first element in the HTML document that matches the specified class name. It returns a single element, or None if no match is found. We can use find() to search by class name in two ways:

  • Using attrs

  • Using class_

Using attrs

We can find elements by class name by using the attrs parameter provided by the find() method. We will pass a dictionary that contains the 'class' key and the target class name as the value. Here is an example:

from bs4 import BeautifulSoup
# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(html_content, 'html.parser')
header_element = soup.find(attrs={'class':'header'})
print("Element with class: header: \n",header_element)

Using class_

We can also pass the class name directly through the class_ parameter. The parameter name ends with an underscore because class is a reserved keyword in Python and cannot be used as an argument name. Here's an example of how to use it:

from bs4 import BeautifulSoup
# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(html_content, 'html.parser')
header_element = soup.find(class_='header')
print("Element with class: header: \n",header_element)
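The two forms can be compared side by side. The sketch below uses made-up markup with two header elements and also shows the None result for a class that is absent; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two elements sharing the same class.
html_content = """
<div class="header">First header</div>
<div class="header">Second header</div>
"""
soup = BeautifulSoup(html_content, 'html.parser')

# Both calls return the first element whose class is 'header'.
by_attrs = soup.find(attrs={'class': 'header'})
by_class = soup.find(class_='header')

# No element has the class 'footer', so find() returns None.
missing = soup.find(class_='footer')

print(by_attrs)
print(by_class)

# Check for None before using the result to avoid an AttributeError.
if missing is None:
    print("No element with class 'footer' was found")
```

Because find() can return None, it is worth checking the result before reading its text or attributes.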

The official Beautiful Soup documentation covers the find() method in more detail.

Using the find_all() method

The find_all() method allows us to locate all the elements in the HTML document that match the specified class name. It returns a list of elements, or an empty list if no match is found. find_all() accepts the same two parameters for searching by class name:

  • Using attrs

  • Using class_

Using attrs

We can find elements by class name by using the attrs parameter provided by the find_all() method. We will pass a dictionary that contains the 'class' key and the target class name as the value. Here is an example:

from bs4 import BeautifulSoup
# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(html_content, 'html.parser')
header_elements = soup.find_all(attrs={'class':'header'})
print("Elements with class: header:")
for element in header_elements:
    print(element)

Using class_

We can also directly use the class_ parameter to find elements with that class name. Here's an example of how to use it:

from bs4 import BeautifulSoup
# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(html_content, 'html.parser')
header_elements = soup.find_all(class_='header')
print("Elements with class: header:")
for element in header_elements:
    print(element)
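As a self-contained sketch of find_all(), the example below uses made-up markup in which one element carries two classes. It suggests that class_ matches an element when any one of its classes equals the given name, and that a tag name can be combined with class_ to narrow the search.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the <h2> element has two classes.
html_content = """
<div class="header">Top</div>
<h2 class="header main">Middle</h2>
<p class="body">Text</p>
"""
soup = BeautifulSoup(html_content, 'html.parser')

# Matches every element that has 'header' among its classes.
all_headers = soup.find_all(class_='header')

# Restricts the search to <div> tags with the class 'header'.
div_headers = soup.find_all('div', class_='header')

# No element has the class 'sidebar', so the result is empty.
none_found = soup.find_all(class_='sidebar')

for element in all_headers:
    print(element.name, element['class'])

print(len(div_headers), len(none_found))
```

An empty result is falsy, so a plain `if not none_found:` check is enough to detect that nothing matched.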

The official Beautiful Soup documentation covers the find_all() method in more detail.

Using the select() method

The select() method allows us to use CSS selectors to find elements, including those with specific class names. The class selector is represented by a dot (.) followed by the class name. For example:

from bs4 import BeautifulSoup
# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(html_content, 'html.parser')
header_elements = soup.select('.header')
print("Elements with class: header:")
for element in header_elements:
    print(element)

Like find_all(), select() returns a list of all the elements that have the specified class.
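CSS selectors can also combine a class with a tag name or with another class. The sketch below, on made-up markup, illustrates a few such selectors along with select_one(), which returns only the first match.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the <h2> element has two classes.
html_content = """
<div class="header">Top</div>
<h2 class="header main">Middle</h2>
<p class="body">Text</p>
"""
soup = BeautifulSoup(html_content, 'html.parser')

# Every element with the class 'header'.
any_header = soup.select('.header')

# Only <div> elements with the class 'header'.
div_header = soup.select('div.header')

# Only elements that have both 'header' and 'main'.
both_classes = soup.select('.header.main')

# select_one() returns the first match, or None if nothing matches.
first_header = soup.select_one('.header')

print(any_header)
print(div_header)
print(both_classes)
print(first_header)
```

Chaining class selectors, as in '.header.main', is a compact way to require several classes at once.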

The official Beautiful Soup documentation covers the select() method in more detail.

Accessing the element data

Once we have found the desired elements, we can access their data (e.g., text content, attributes) using various Beautiful Soup methods and attributes. For example:

from bs4 import BeautifulSoup
# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(html_content, 'html.parser')
header_elements = soup.select('.header')
print("Elements with class: header:")
for element in header_elements:
    print("Class: ", element["class"])
    print("Text: \n", element.text)
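Other attributes can be read in the same way as the class. The following sketch uses a made-up link element to show subscript access, the get() method for attributes that may be missing, and get_text() with whitespace stripping.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a link with two classes and surrounding whitespace.
html_content = '<a class="header link" href="/home" id="top"> Home </a>'
soup = BeautifulSoup(html_content, 'html.parser')

element = soup.select_one('.header')

# class is a multi-valued attribute, so it comes back as a list.
classes = element['class']

# Subscript access raises KeyError if the attribute is missing.
href = element['href']

# get() returns None instead of raising when the attribute is absent.
title = element.get('title')

# get_text(strip=True) removes leading and trailing whitespace.
text = element.get_text(strip=True)

print(classes, href, title, text)
print(element.attrs)  # all attributes as a dictionary
```

Using get() is the safer choice when scraping pages where an attribute may not be present on every element.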

Note: To learn more about the attributes and methods of Beautiful Soup, see its official documentation.

Conclusion

Beautiful Soup is an excellent tool for extracting data from HTML and XML documents. Using its class name search feature, we can easily locate specific elements within the document based on the assigned class names. This ability makes it a powerful choice for web scraping tasks, data extraction, and analysis.

Copyright ©2024 Educative, Inc. All rights reserved