Trusted answers to developer questions

Related Tags

python

What is HTMLParser.reset() in Python?

Sarvech Qadir

Python's html.parser module is a structured markup processing tool. It defines a class called HTMLParser, which we can subclass to parse HTML files. It comes in handy for web crawling (a web crawler is an internet bot that systematically browses the World Wide Web).

HTMLParser.reset() is one of the methods of HTMLParser. We can use it to reset any instance of the HTMLParser class. This method discards any unprocessed data held in the parser's buffer.
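As a minimal sketch of what "unprocessed data" means, the snippet below feeds an input that ends in a dangling `<` and inspects the parser's pending text before and after reset(). It assumes CPython's implementation, where the buffer is stored in the `rawdata` attribute; that attribute is an undocumented implementation detail and is used here only for illustration.

```python
from html.parser import HTMLParser

# The base class can be used directly; its handler methods do nothing.
parser = HTMLParser()

# The trailing '<' cannot be parsed yet, so it stays in the buffer.
parser.feed('<p>Hello</p><')

# NOTE: `rawdata` is an undocumented CPython implementation detail,
# read here only to make the buffer visible.
pending_before = parser.rawdata
parser.reset()
pending_after = parser.rawdata

print(repr(pending_before))  # '<'
print(repr(pending_after))   # ''
```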

HTMLParser.reset() is called implicitly at instantiation time, so we only need to call it ourselves when we want to reuse an existing parser instance.
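Because the constructor calls reset(), one possible pattern is to override reset() in a subclass and set up per-document state there. The sketch below is only an illustration: the class name `TagCollector` and its `tags` attribute are hypothetical and not part of the library.

```python
from html.parser import HTMLParser

class TagCollector(HTMLParser):  # hypothetical example class
    def reset(self):
        # Let the base class clear its buffer, then clear our own state.
        super().reset()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

collector = TagCollector()
# `tags` already exists because the constructor called reset().
tags_at_start = list(collector.tags)

collector.feed('<ul><li>one</li></ul>')
first_doc = list(collector.tags)

# Reuse the same instance for an unrelated document.
collector.reset()
collector.feed('<p>two</p>')
second_doc = list(collector.tags)

print(tags_at_start, first_doc, second_doc)  # [] ['ul', 'li'] ['p']
```

With this design, a single call to reset() returns both the parser and the collected results to a clean state.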

Syntax

HTMLParser.reset()

Code

The code below shows how we use HTMLParser to separate start tags, end tags, comments, and data from an HTML string. We run it twice to see how the results differ with and without a call to HTMLParser.reset().

First, let's see how the code behaves without HTMLParser.reset().

from html.parser import HTMLParser
class Parser(HTMLParser):
  # method to append the start tag to the list start_tags.
  def handle_starttag(self, tag, attrs):
    global start_tags
    start_tags.append(tag)
  # method to append the end tag to the list end_tags.
  def handle_endtag(self, tag):
    global end_tags
    end_tags.append(tag)
  # method to append the data between the tags to the list all_data.
  def handle_data(self, data):
    global all_data
    all_data.append(data)
  # method to append the comment to the list comments.
  def handle_comment(self, data):
    global comments
    comments.append(data)
start_tags = []
end_tags = []
all_data = []
comments = []
# Creating an instance of our class.
parser = Parser()
# Providing the input.
parser.feed('<html><title>Desserts</title><body><p>'
            'I am a fan of frozen yoghurt.</p><')
# The input ends with an incomplete tag ('<'). The parser keeps the
# unprocessed text in its buffer and waits for the next input.

print("start tags:", start_tags)
print("end tags:", end_tags)
print("data:", all_data)
print("comments:", comments)

# Now we feed more data. It is appended to the buffered data,
# and the two inputs are parsed as one document.
parser.feed('/body><!--My first webpage--></html>')

print("")
print("After next input:")

print("start tags:", start_tags)
print("end tags:", end_tags)
print("data:", all_data)
print("comments:", comments)

Now, let’s use HTMLParser.reset().

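Since reset() clears the buffer, the data passed to the next parser.feed() call is treated as a separate, new input. The dangling "<" from the first input is discarded, so "/body>" is reported as text data rather than as an end tag.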

from html.parser import HTMLParser
class Parser(HTMLParser):
  # method to append the start tag to the list start_tags.
  def handle_starttag(self, tag, attrs):
    global start_tags
    start_tags.append(tag)
  # method to append the end tag to the list end_tags.
  def handle_endtag(self, tag):
    global end_tags
    end_tags.append(tag)
  # method to append the data between the tags to the list all_data.
  def handle_data(self, data):
    global all_data
    all_data.append(data)
  # method to append the comment to the list comments.
  def handle_comment(self, data):
    global comments
    comments.append(data)
start_tags = []
end_tags = []
all_data = []
comments = []
# Creating an instance of our class.
parser = Parser()
# Providing the input.
parser.feed('<html><title>Desserts</title><body><p>'
            'I am a fan of frozen yoghurt.</p><')
# The input ends with an incomplete tag ('<'). The parser keeps the
# unprocessed text in its buffer and waits for the next input.

print("start tags:", start_tags)
print("end tags:", end_tags)
print("data:", all_data)
print("comments:", comments)

# Now we call reset(). This discards any unprocessed data and
# clears the buffer, so the next input is treated independently
# of the previous feed() call.
parser.reset()

print("")
print("After use of reset:")
# feed more input
parser.feed('/body><!--My first webpage--></html>')

print("start tags:", start_tags)
print("end tags:", end_tags)
print("data:", all_data)
print("comments:", comments)
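The two runs above can also be compared side by side. The following is a compact sketch, not the original listing: it uses the same two input strings but stores results on the instance instead of in global lists, and the names `Collector`, `run`, `FIRST`, and `SECOND` are hypothetical helpers introduced for illustration.

```python
from html.parser import HTMLParser

class Collector(HTMLParser):  # hypothetical example class
    def __init__(self):
        super().__init__()
        # Set after super().__init__(), so reset() does not clear these lists.
        self.end_tags = []
        self.data = []

    def handle_endtag(self, tag):
        self.end_tags.append(tag)

    def handle_data(self, data):
        self.data.append(data)

FIRST = '<html><title>Desserts</title><body><p>I am a fan of frozen yoghurt.</p><'
SECOND = '/body><!--My first webpage--></html>'

def run(call_reset):
    """Feed both chunks, optionally calling reset() in between."""
    parser = Collector()
    parser.feed(FIRST)
    if call_reset:
        parser.reset()
    parser.feed(SECOND)
    return parser.end_tags, parser.data

tags_joined, data_joined = run(call_reset=False)
tags_reset, data_reset = run(call_reset=True)

print(tags_joined)  # ['title', 'p', 'body', 'html']
print(tags_reset)   # ['title', 'p', 'html']
print(data_reset)   # ['Desserts', 'I am a fan of frozen yoghurt.', '/body>']
```

Without reset(), the buffered "<" joins the next chunk to form "</body>". With reset(), that "<" is lost, so "body" never appears among the end tags and "/body>" shows up as plain text.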

Copyright ©2022 Educative, Inc. All rights reserved
