Python's html.parser module is a tool for processing structured markup. It defines a class called HTMLParser, which is used to parse HTML text. It comes in handy for tasks such as web crawling.
HTMLParser.feed(data)
: used to input data to the HTML parser.
HTMLParser.handle_starttag(tag, attrs)
: used to handle the start tags in the HTML. The parameter tag contains the name of the opening tag, and the attrs parameter contains the attributes of that tag as a list of (name, value) pairs.
HTMLParser.handle_endtag(tag)
: used to handle the end tags in the HTML. The parameter tag contains the name of the closing tag. Unlike handle_starttag, this method takes no attrs parameter.
HTMLParser.handle_data(data)
: used to handle the data contained between the HTML tags.
HTMLParser.handle_comment(data)
: used to handle HTML comments.
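To make the shape of the attrs parameter concrete, here is a minimal sketch. The class name AttrRecorder, the list seen, and the sample markup are made up for illustration; only HTMLParser, feed, and handle_starttag come from the module itself.

```python
from html.parser import HTMLParser


class AttrRecorder(HTMLParser):
    """Hypothetical helper that records each start tag with its attributes."""

    def __init__(self):
        super().__init__()
        self.seen = []  # illustrative name, not part of HTMLParser

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) tuples;
        # tag names and attribute names are converted to lowercase.
        self.seen.append((tag, attrs))


recorder = AttrRecorder()
recorder.feed('<A HREF="https://example.com" class="link">home</A>')
print(recorder.seen)
# [('a', [('href', 'https://example.com'), ('class', 'link')])]
```

Note that the quotes around attribute values are stripped before the values reach the handler.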
In the example below, the handler methods of HTMLParser are overridden to provide the desired functionality. Note that the class Parser inherits from the HTMLParser class.
from html.parser import HTMLParser


class Parser(HTMLParser):
    # method to append the start tag to the list start_tags.
    def handle_starttag(self, tag, attrs):
        global start_tags
        start_tags.append(tag)

    # method to append the end tag to the list end_tags.
    def handle_endtag(self, tag):
        global end_tags
        end_tags.append(tag)

    # method to append the data between the tags to the list all_data.
    def handle_data(self, data):
        global all_data
        all_data.append(data)

    # method to append the comment to the list comments.
    def handle_comment(self, data):
        global comments
        comments.append(data)


start_tags = []
end_tags = []
all_data = []
comments = []

# Creating an instance of our class.
parser = Parser()

# Providing the input.
parser.feed('<html><title>Desserts</title><body><p>'
            'I am a fan of frozen yoghurt.</p><'
            '/body><!--My first webpage--></html>')

print("start tags:", start_tags)
print("end tags:", end_tags)
print("data:", all_data)
print("comments:", comments)
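The example above collects its results in module-level lists. One possible variation, sketched here as an alternative rather than as the article's method, keeps the lists on the parser instance so that several parsers can run without sharing state. The class name CollectingParser and its attribute names are assumptions chosen for illustration.

```python
from html.parser import HTMLParser


class CollectingParser(HTMLParser):
    """Hypothetical variant that stores parsed pieces on the instance."""

    def __init__(self):
        super().__init__()
        self.start_tags = []
        self.end_tags = []
        self.all_data = []
        self.comments = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)

    def handle_endtag(self, tag):
        self.end_tags.append(tag)

    def handle_data(self, data):
        self.all_data.append(data)

    def handle_comment(self, data):
        self.comments.append(data)


collector = CollectingParser()
collector.feed('<html><title>Desserts</title><body><p>'
               'I am a fan of frozen yoghurt.</p>'
               '</body><!--My first webpage--></html>')
collector.close()  # tells the parser that no more input is coming

print("start tags:", collector.start_tags)
print("end tags:", collector.end_tags)
print("data:", collector.all_data)
print("comments:", collector.comments)
```

Because each instance owns its lists, creating a second CollectingParser starts from empty results instead of appending to earlier output.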
To learn more, refer to the official documentation.