How to read HTML tables using pandas

Web pages are built with HyperText Markup Language (HTML), the foundation of web development. HTML defines many elements, each with its own purpose. One of these is the HTML table, which presents data in rows and columns.

Data presented in tabular form is often needed for analysis, and extracting it from web pages is called web scraping. pandas is a Python library for analyzing and manipulating data, and it provides several helpful functions for this task.

In this Answer, we will use one of these functions, read_html(), to import data from HTML tables. We can also export a DataFrame back to HTML using another method called to_html().
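
As a minimal sketch of the export direction, assuming a small made-up DataFrame (the column names and values are illustrative only), to_html() returns the table markup as a string:

```python
import pandas as pd

# Illustrative data only; any DataFrame works the same way.
df = pd.DataFrame({"item": ["pen", "notebook"], "price": [1.5, 3.0]})

# With no output buffer given, to_html() returns the markup as a string.
# index=False leaves the row index out of the generated table.
html = df.to_html(index=False)
print(html)
```

The resulting string can be written to a file or embedded in a larger page.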

The `read_html()` method

When reading HTML tables into a pandas DataFrame, the read_html() method is very helpful. Under the hood, it parses the HTML source code to extract the table elements using a parser library such as lxml or BeautifulSoup (a Python library used in web scraping to extract data from HTML and XML files). It returns a list of DataFrames, one for each table it finds.
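
To illustrate the return value, here is a small sketch that parses an inline HTML snippet instead of a real page. The table contents are made up, and the example assumes a parser such as lxml is installed alongside pandas:

```python
from io import StringIO

import pandas as pd

# A made-up page fragment containing a single table.
html = """
<table>
  <tr><th>name</th><th>score</th></tr>
  <tr><td>Ada</td><td>90</td></tr>
  <tr><td>Linus</td><td>85</td></tr>
</table>
"""

# Wrapping the markup in StringIO passes it as a file-like object.
tables = pd.read_html(StringIO(html))

# read_html() always returns a list, even when only one table is found.
print(type(tables), len(tables))

scores = tables[0]
print(scores)
```

Because the result is a list, a table is always selected by index, even when the page contains only one.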

Read HTML tables from a URL

We can read HTML tables either from a file on our local machine or from an online resource. Let's look at an example where we read the tables from a web page.

Coding example

Explanation

Line 2: We import the pandas library.
Line 5: We create a variable url with a link pointing to a web page containing HTML tables.
Line 8: We use the read_html() method to read every table on the page at the given URL. It returns a list of DataFrames, one per table.
Line 11: We want the first table on the page, so we select index 0 from the list.
Line 12: We print the first 5 rows of the table using the .head() method.
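
The steps above can be sketched roughly as follows. The helper name first_table_preview and the URL in the comment are hypothetical placeholders, not the page used in the original example:

```python
import pandas as pd


def first_table_preview(source, rows=5):
    """Return the first `rows` rows of the first table found at `source`.

    `source` can be a URL, a file path, or a file-like object.
    """
    # One DataFrame per <table> element found in the document.
    tables = pd.read_html(source)
    # Select the first table and keep only its leading rows.
    return tables[0].head(rows)


# Example call (hypothetical URL, replace with a real page that has tables):
# print(first_table_preview("https://example.com/page-with-tables.html"))
```

If the page contains no table elements, read_html() raises a ValueError, so it can be worth wrapping the call in a try/except block when scraping unfamiliar pages.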

Read HTML tables from a file

We can also read tables from a local HTML file on our computer. The process is the same as reading from a URL, except that we pass the file path to read_html().

Coding example

Explanation

Line 1: We import the pandas library with the alias pd.
Line 4: We read the HTML tables within the table.html file.
Line 6: We print the total number of tables found in the file.
Line 9: We get the first table, which has the index 0.
Line 11: We also get the second table in the file, which has the index 1.
Line 14: We print the first table.
Line 15: We print the second table.
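
A self-contained sketch of these steps is shown below. Since the original table.html is not available here, the example first writes a made-up file with two small tables to a temporary directory, then reads it back:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up page with two tables, standing in for the original table.html.
PAGE = """<html><body>
<table>
  <tr><th>fruit</th><th>count</th></tr>
  <tr><td>apple</td><td>3</td></tr>
  <tr><td>pear</td><td>7</td></tr>
</table>
<table>
  <tr><th>color</th></tr>
  <tr><td>red</td></tr>
</table>
</body></html>"""

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "table.html"
    path.write_text(PAGE, encoding="utf-8")

    # Read every table in the local file.
    tables = pd.read_html(str(path))

# Total number of tables found in the file.
print(len(tables))

# Tables appear in the list in document order.
first = tables[0]
second = tables[1]

print(first)
print(second)
```

With a real file, only the read_html() call and the lines after it are needed; the temporary file exists just to keep the sketch runnable on its own.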

Conclusion

The read_html() method returns each extracted table as a DataFrame, which we can easily read, clean, and analyze. For pages whose data is already in HTML tables, this makes web scraping easier and faster.
