How to read HTML and find tags with BeautifulSoup4
BeautifulSoup is a third-party Python library for parsing HTML and XML files and extracting information from them.
It is not part of Python's standard library, so it must be installed first with the following command:
pip install BeautifulSoup4
After installing beautifulsoup4, we can import the package in our Python script and use its methods.
First, we read the HTML file before parsing it for information. To perform this task, we pass the file's content to the BeautifulSoup constructor. Here we pass it two arguments, and the syntax is the following:
Variable = BeautifulSoup(contentVariable, 'html.parser')
contentVariable is the variable that stores the content of the file.
html.parser is the name of the parser, passed as the second argument. It tells BeautifulSoup to parse contentVariable as HTML using the parser that ships with Python.
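As a minimal sketch of the constructor call, the snippet below parses a short HTML string instead of a file so that it runs on its own. The sample_html string and its contents are made up for illustration and are not part of the original example.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML content; in practice this would come from reading a file.
sample_html = "<html><head><title>Demo page</title></head><body><p>Hello</p></body></html>"

# The second argument names the parser; 'html.parser' ships with Python.
soup = BeautifulSoup(sample_html, "html.parser")

print(soup.title.string)  # Demo page
print(soup.p.text)        # Hello
```

The same call works unchanged when the first argument is the text read from an HTML file.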
The find() method
After parsing the HTML document, we can search the resulting object for the tags and attributes we need. To perform this task, we can use bs4's find() and find_all() methods.
Syntax
The syntax of the find() method is the following:
Variable.find(nameOfTag, attrs={"attributeName": "value"})
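To make the attrs argument concrete, here is a small sketch that searches with and without it. The markup in sample_html is an assumption invented for this illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for this illustration.
sample_html = """
<head>
  <meta charset="utf-8">
  <meta name="description" content="A demo page">
  <meta name="author" content="Jane Doe">
</head>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Without attrs, find() returns the first <meta> tag it encounters.
first_meta = soup.find("meta")

# With attrs, only a tag whose attributes match the dictionary is returned.
description = soup.find("meta", attrs={"name": "description"})

print(first_meta)
print(description["content"])  # A demo page
```

Passing the dictionary through attrs is useful for an attribute called name, because the first parameter of find() is itself called name and so cannot be reused as a keyword filter.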
Example
The following example demonstrates how the find() method works:
from bs4 import BeautifulSoup

with open('index.html') as f:
    content = f.read()
soup = BeautifulSoup(content, 'html.parser')

# find() returns the first <meta> tag in the document
print(soup.find('meta'))

# uncomment the line below to print the whole document indented
# print(soup.prettify())

# uncomment the line below to print the text inside the meta tag
# (empty for <meta>, which has attributes but no text)
# print(soup.find('meta').text)
Explanation
The following is a brief explanation of the code above:
Line 1: We import the BeautifulSoup class, which is used to parse the HTML document.
Lines 3–5: We use Python's built-in open() function to open and read the index.html document, and create a BeautifulSoup object by passing the document's content to the constructor for parsing.
Line 8: We find the first instance of the meta tag. find() returns a Tag object (or None if nothing matches), which print() displays as an HTML string on the console.
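The sketch below looks more closely at what find() gives back, including the case where nothing matches. It assumes a made-up sample_html string rather than the index.html file, whose contents are not shown here.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration.
sample_html = '<head><meta name="author" content="Jane Doe"><title>Demo</title></head>'
soup = BeautifulSoup(sample_html, "html.parser")

meta = soup.find("meta")
print(type(meta).__name__)  # Tag
print(meta.get("content"))  # Jane Doe
print(repr(meta.text))      # '' because <meta> holds no text

# find() returns None when no tag matches, so check before using the result.
missing = soup.find("table")
if missing is None:
    print("no <table> tag found")
```

Checking for None first avoids an AttributeError when the code goes on to read .text or an attribute from a tag that is not in the document.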
The find_all() method
The find() method returns only the first tag that matches its arguments, whereas find_all() returns all the matching tags. Its first argument can be a single tag name or a list of tag names.
Syntax
The following is the syntax of the find_all() method:
Variable.find_all(listOfNames, attrs={"attributeName": "value"})
By default, a plain string passed as the first argument is treated as a tag name. To search by attributes instead, we pass a dictionary of attribute names and values through the attrs keyword argument.
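The three calling styles described above can be sketched as follows. The markup, tag names, and class values are assumptions chosen only for this illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration.
sample_html = """
<body>
  <h1 class="title">Heading</h1>
  <p class="intro">First paragraph</p>
  <p>Second paragraph</p>
  <a href="https://example.com" class="intro">A link</a>
</body>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# A plain string is treated as a tag name.
paragraphs = soup.find_all("p")

# A list matches any of the given tag names, in document order.
headings_and_links = soup.find_all(["h1", "a"])

# attrs filters by attribute value, regardless of tag name.
intro_tags = soup.find_all(attrs={"class": "intro"})

print(len(paragraphs))                           # 2
print([tag.name for tag in headings_and_links])  # ['h1', 'a']
print([tag.name for tag in intro_tags])          # ['p', 'a']
```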
Example
In the following example, we read the index.html file and print all the instances of the meta tag:
from bs4 import BeautifulSoup

with open('index.html') as f:
    content = f.read()
soup = BeautifulSoup(content, 'html.parser')

# find_all() returns every <meta> tag in the document
all_instances = soup.find_all('meta')

# print list using loop
for i in all_instances:
    print(i)
Explanation
The following is a brief explanation of the code above:
Line 1: We import the BeautifulSoup class, which is used to parse the HTML document.
Lines 3–5: We use Python's built-in open() function to open and read the index.html document, and create a BeautifulSoup object by passing the document's content to the constructor for parsing.
Line 8: We find all the instances of the meta tag. find_all() returns a list-like collection of Tag objects, which is stored in all_instances.
Lines 10–12: We loop over the list so that each tag prints on a new line on the console.
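Beyond printing each tag, the result of find_all() can be processed like any other list. The sketch below collects attribute values from it, using a made-up sample_html string in place of index.html.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration.
sample_html = """
<head>
  <meta charset="utf-8">
  <meta name="description" content="A demo page">
  <meta name="author" content="Jane Doe">
</head>
"""
soup = BeautifulSoup(sample_html, "html.parser")

all_instances = soup.find_all("meta")

# .get() returns None for tags that lack the attribute, so filter those out.
contents = [tag.get("content") for tag in all_instances if tag.get("content") is not None]
print(contents)  # ['A demo page', 'Jane Doe']

# Unlike find(), find_all() gives an empty result rather than None on no match.
print(len(soup.find_all("table")))  # 0
```

Because an empty result is still iterable, a loop over find_all() needs no separate None check.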