BeautifulSoup is a third-party Python library for parsing HTML and XML files and extracting information from them.
It is not part of Python's standard library, so it must first be installed with the following command:
pip install BeautifulSoup4
After installing beautifulsoup4, we can import the package in our Python script and use its methods.
First, we read the HTML file, and then we parse it for information. To do this, we pass the file's content to the BeautifulSoup constructor. In its basic form, the constructor takes two arguments, and the syntax is the following:
Variable = BeautifulSoup(contentVariable, 'html.parser')
contentVariable is the variable that stores the content of the file.
'html.parser' is the second positional argument. It tells BeautifulSoup to parse contentVariable as HTML using Python's built-in HTML parser.
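As a minimal sketch of the constructor call, the snippet below parses an HTML string defined inline instead of reading a file. The sample markup and the variable names are assumptions made purely for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used here instead of reading a file.
content = "<html><head><title>Demo page</title></head><body><p>Hello</p></body></html>"

# The second argument selects Python's built-in HTML parser.
soup = BeautifulSoup(content, "html.parser")

# Access the <title> tag and read the text inside it.
title_text = soup.title.string
print(title_text)
```

The resulting soup object is what the search methods described next are called on.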
find() method
After parsing the HTML document, we can use BeautifulSoup's methods to find the tags, attributes, or other content we want. To perform this task, we can use bs4's find() and find_all() methods.
The syntax of the find() method is the following:
Variable.find(nameOfTag, attrs={"attributeName": "value"})
The following example demonstrates how the find() method works:
from bs4 import BeautifulSoup

with open('index.html') as f:
    content = f.read()
soup = BeautifulSoup(content, 'html.parser')

# print the first <meta> tag found in the document
print(soup.find('meta'))

# uncomment the line below to print the whole document indented
# print(soup.prettify())

# uncomment the line below to print the text inside the meta tag
# print(soup.find('meta').text)
The following is a brief explanation of the code above:
Line 1: We import the BeautifulSoup class, which is used to parse the HTML document.
Lines 3–5: We use Python's built-in open() function to open and read the index.html document, and then create a BeautifulSoup object by passing the document's content to the constructor for parsing.
Line 8: This line finds the first instance of the meta tag. find() returns a Tag object (or None if nothing matches), which print() displays on the console as the tag's HTML.
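To illustrate the attrs parameter and the no-match case, here is a small self-contained sketch. It assumes a hypothetical inline HTML snippet rather than index.html, and the attribute values are invented for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the contents of an HTML file.
html = """
<html><head>
<meta charset="utf-8">
<meta name="description" content="A sample page">
</head><body><p id="intro">Welcome</p></body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# First <meta> tag in document order.
first_meta = soup.find("meta")

# First <meta> tag whose name attribute equals "description".
described = soup.find("meta", attrs={"name": "description"})

# find() returns None when no tag matches.
missing = soup.find("table")

print(first_meta)
print(described["content"])
print(missing)
```

Because find() can return None, it is usually worth checking the result before reading attributes from it.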
find_all() method
The find() method returns only the first instance of the tag or attribute given as its parameter, whereas find_all() returns all instances that match the tag names or attributes given.
The following is the syntax of the find_all() method:
Variable.find_all(listOfNames, attrs={"attributeName": "value"})
By default, find_all() treats a plain string argument as a tag name. To search by attribute instead, we pass the attrs keyword argument with a dictionary of attribute names and values.
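The different ways of calling find_all() described above can be sketched as follows. The markup, class names, and variable names are hypothetical and chosen only to make each call return something small.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration.
html = """
<body>
<h1 class="title">Heading</h1>
<p class="note">First note</p>
<p>Plain paragraph</p>
<a class="note" href="#top">Back to top</a>
</body>
"""
soup = BeautifulSoup(html, "html.parser")

# A plain string is treated as a tag name.
paragraphs = soup.find_all("p")

# A list of names matches any of the listed tags.
headings_and_links = soup.find_all(["h1", "a"])

# attrs alone matches any tag carrying that attribute value.
notes = soup.find_all(attrs={"class": "note"})

# A tag name and attrs together narrow the search to both conditions.
note_paragraphs = soup.find_all("p", attrs={"class": "note"})

print([tag.name for tag in notes])
print(note_paragraphs[0].get_text())
```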
In the following example, we read the index.html file and print all the instances of the meta tag:
from bs4 import BeautifulSoup

with open('index.html') as f:
    content = f.read()
soup = BeautifulSoup(content, 'html.parser')

# collect every <meta> tag in the document
all_instances = soup.find_all('meta')

# print list using loop
for i in all_instances:
    print(i)
The following is a brief explanation of the code above:
Line 1: We import the BeautifulSoup class, which is used to parse the HTML document.
Lines 3–5: We use Python's built-in open() function to open and read the index.html document, and then create a BeautifulSoup object by passing the document's content to the constructor for parsing.
Line 8: This line finds all the instances of the meta tag and returns them as a list-like ResultSet, which is stored in all_instances.
Lines 10–12: We use a for loop to print the list so that each tag prints on a new line on the console.
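In practice, the loop often pulls out attribute values rather than printing whole tags. The sketch below shows one way to do that, assuming a hypothetical inline snippet; the author name is invented for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the author value is made up.
html = '<head><meta charset="utf-8"><meta name="author" content="Jane Doe"></head>'
soup = BeautifulSoup(html, "html.parser")

# Collect (name, content) pairs; tag.get() returns None for a missing attribute.
meta_info = []
for tag in soup.find_all("meta"):
    meta_info.append((tag.get("name"), tag.get("content")))

# Unlike find(), find_all() returns an empty result rather than None when nothing matches.
no_tables = soup.find_all("table")

print(meta_info)
print(len(no_tables))
```

Using tag.get() instead of tag["name"] avoids a KeyError for tags that lack the attribute, such as the charset meta tag here.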