How to read HTML tables using pandas
Web pages are built with HyperText Markup Language (HTML), the foundation of web development. HTML defines many elements, each with its own job. One of these is the table, which presents data in rows and columns.
Data presented in tabular format is often needed for analysis, and extracting it from web pages is called web scraping. Python's pandas library, built for analyzing and manipulating data, includes several methods that help with this.
In this Answer, we will use one of these methods, read_html(), to import data from HTML tables. The imported data can later be exported with another method, to_html().
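As a quick illustration of the export direction, here is a minimal sketch of to_html(). The DataFrame contents are made-up sample values, not data from this Answer.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
df = pd.DataFrame({"country": ["India", "China"], "continent": ["Asia", "Asia"]})

# With no file path argument, to_html() returns the markup as a string.
# index=False leaves the row index out of the generated table.
html = df.to_html(index=False)

print(html)
```

Passing a file path as the first argument should write the markup to that file instead of returning it.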
The read_html() method
When reading HTML tables into a pandas DataFrame, the read_html() method is very helpful. Under the hood, it parses the HTML source code, extracts the table elements, and returns them as a list of DataFrames, one per table.
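To see the parsing step without any network access, the sketch below feeds read_html() a small in-memory document. The markup and its values are hypothetical, and the example assumes an HTML parser that pandas supports (such as lxml) is installed. The string is wrapped in StringIO because recent pandas versions deprecate passing literal HTML text directly.

```python
from io import StringIO

import pandas as pd

# Hypothetical markup standing in for a downloaded page.
html = """
<table>
  <tr><th>name</th><th>score</th></tr>
  <tr><td>Ada</td><td>90</td></tr>
  <tr><td>Lin</td><td>85</td></tr>
</table>
<table>
  <tr><th>city</th></tr>
  <tr><td>Oslo</td></tr>
</table>
"""

# read_html() returns one DataFrame per <table> element it finds.
tables = pd.read_html(StringIO(html))

print("Number of tables:", len(tables))
print(tables[0])
```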
Read HTML tables from a URL
We can read data from an HTML file on our local machine or from an online resource. Let's look at an example where we read the HTML tables from a Wikipedia page.
Coding example
#import the library
import pandas as pd

#pass the url to read
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations)'

#the read_html method receives the url as a parameter
web_tables = pd.read_html(url)

#let's read the first table on the webpage with index 0
first_table = web_tables[0]
print(first_table.head())
Explanation
Line 2: We import the pandas library.
Line 5: We create a variable url with a link pointing to a web page containing HTML tables.
Line 8: We use the read_html() method to read the tables from the given URL. It returns a list of DataFrames.
Line 11: We want to read the first table on the page, so we specify the index 0 in the list.
Line 12: We print the first 5 rows of the table using the .head() method.
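Picking a table by position can be fragile when a page contains many tables. One alternative is the match parameter of read_html(), which keeps only the tables containing the given text. The sketch below uses hypothetical markup with invented country names and numbers so that it runs offline. It assumes a supported parser such as lxml is installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical page with two tables; all names and numbers are invented.
html = """
<table>
  <tr><th>Country</th><th>Population</th></tr>
  <tr><td>Examplia</td><td>100</td></tr>
  <tr><td>Samplestan</td><td>250</td></tr>
</table>
<table>
  <tr><th>Country</th><th>Capital</th></tr>
  <tr><td>Examplia</td><td>Port Example</td></tr>
</table>
"""

# Keep only the tables whose text matches "Population".
matches = pd.read_html(StringIO(html), match="Population")

population_table = matches[0]
print(population_table.head())
```

If no table contains the text, read_html() raises a ValueError, so real scraping code may want to catch that case.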
Read HTML tables from a file
Just as we read HTML tables from a URL, we can read them from a local HTML file on our computer. The process is the same, except that we pass a file path instead of a URL.
Coding example
import pandas as pd

#read the tables in the html file
html_tables = pd.read_html('table.html')
#print the number of tables in the file
print('Number of tables', len(html_tables))

#get the first table with index 0
first_table = html_tables[0]
#get the second table with index 1
second_table = html_tables[1]

#print the two tables
print(first_table)
print(second_table)
Explanation
Line 1: We import the pandas library with the alias pd.
Line 4: We read the HTML tables within the table.html file.
Line 6: We print the total number of tables found in the file.
Line 9: We get the first table, which has the index 0.
Line 11: We also get the second table in the file, which has the index 1.
Line 14: We print the first table.
Line 15: We print the second table.
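The example above depends on a table.html file that already exists and holds at least two tables. Indexing html_tables[1] raises an IndexError if the file has only one. The following self-contained sketch first writes a hypothetical table.html into a temporary directory, then reads it back and loops over however many tables are found. The file contents are invented for illustration, and a supported parser such as lxml is assumed to be installed.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical file contents; the real table.html may look different.
html = """
<html><body>
<table>
  <tr><th>item</th><th>price</th></tr>
  <tr><td>pen</td><td>2</td></tr>
</table>
<table>
  <tr><th>item</th><th>stock</th></tr>
  <tr><td>pen</td><td>40</td></tr>
  <tr><td>ink</td><td>12</td></tr>
</table>
</body></html>
"""

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "table.html"
    path.write_text(html, encoding="utf-8")

    # Read every table from the local file.
    html_tables = pd.read_html(str(path))

print("Number of tables", len(html_tables))

# Loop instead of hard-coding indices, so any table count works.
for position, table in enumerate(html_tables):
    print(f"Table {position}:")
    print(table)
```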
Conclusion
The read_html() method converts each extracted table into a DataFrame, which we can then easily read, clean, and analyze. This makes scraping tabular data from web pages easier and faster.