What is the HTML parser in Python?

Python's html.parser module is a tool for processing structured markup. It defines a class called HTMLParser, which is used to parse HTML text. It comes in handy for web crawling.


Methods

  • HTMLParser.feed(data): used to input data to the HTML parser.

  • HTMLParser.handle_starttag(tag, attrs): used to handle the start tags in the HTML. The tag parameter contains the name of the opening tag, converted to lowercase, and the attrs parameter is a list of (name, value) pairs holding the attributes of that tag.

  • HTMLParser.handle_endtag(tag): used to handle the end tags in the HTML. The tag parameter contains the name of the closing tag, converted to lowercase. Unlike handle_starttag, this method takes no attrs parameter.

  • HTMLParser.handle_data(data): used to handle the data contained between the HTML tags.

  • HTMLParser.handle_comment(data): used to handle HTML comments.
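To make the shape of the attrs parameter concrete, here is a minimal sketch that collects link targets. The class name LinkCollector and the sample markup are assumptions made up for this illustration; only HTMLParser, feed, close, and handle_starttag come from the standard library.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: gathers the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) tuples,
        # for example [("href", "/menu"), ("class", "nav")].
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


collector = LinkCollector()
collector.feed('<p>See the <a href="/menu" class="nav">menu</a> and '
               '<a href="https://example.com/">our site</a>.</p>')
collector.close()
print(collector.links)  # ['/menu', 'https://example.com/']
```

Because attrs is a list rather than a dictionary, an attribute that is repeated on a tag shows up more than once; wrapping it in dict(attrs) is a common shortcut when that does not matter.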


Example

The handler methods of HTMLParser are overridden to provide the desired functionality. Note that the Parser class inherits from the HTMLParser class.

from html.parser import HTMLParser


class Parser(HTMLParser):
    # Method to append the start tag to the list start_tags.
    def handle_starttag(self, tag, attrs):
        global start_tags
        start_tags.append(tag)

    # Method to append the end tag to the list end_tags.
    def handle_endtag(self, tag):
        global end_tags
        end_tags.append(tag)

    # Method to append the data between the tags to the list all_data.
    def handle_data(self, data):
        global all_data
        all_data.append(data)

    # Method to append the comment to the list comments.
    def handle_comment(self, data):
        global comments
        comments.append(data)


# Module-level lists that the handler methods fill in.
start_tags = []
end_tags = []
all_data = []
comments = []

# Creating an instance of our class.
parser = Parser()

# Providing the input.
parser.feed('<html><title>Desserts</title><body><p>'
            'I am a fan of frozen yoghurt.</p><'
            '/body><!--My first webpage--></html>')

print("start tags:", start_tags)
print("end tags:", end_tags)
print("data:", all_data)
print("comments:", comments)
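The example above stores its results in module-level lists. One possible variation, sketched here, keeps the lists on the parser instance so that several parsers can run independently. It also calls feed() several times with partial chunks, as might happen when a page arrives over a network. The class name TagRecorder and the way the input is split are assumptions chosen for illustration.

```python
from html.parser import HTMLParser


class TagRecorder(HTMLParser):
    """Hypothetical variant of Parser that stores results on the instance."""

    def __init__(self):
        super().__init__()
        self.start_tags = []
        self.end_tags = []
        self.text = []
        self.comments = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)

    def handle_endtag(self, tag):
        self.end_tags.append(tag)

    def handle_data(self, data):
        # May be called more than once for a single run of text.
        self.text.append(data)

    def handle_comment(self, data):
        self.comments.append(data)


recorder = TagRecorder()
# The input is deliberately split in the middle of the text and of a tag.
for chunk in ('<p>I am a fan of fro', 'zen yoghurt.</', 'p><!--My first webpage-->'):
    recorder.feed(chunk)
recorder.close()  # flush anything still buffered

print("start tags:", recorder.start_tags)
print("end tags:", recorder.end_tags)
print("text:", "".join(recorder.text))
print("comments:", recorder.comments)
```

The text is joined before printing because handle_data can receive one run of text in several pieces, so code that needs the full string should not rely on a single call.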

To learn more, refer to the official documentation.

Copyright ©2024 Educative, Inc. All rights reserved