How to use the read_html() function to read HTML to a DataFrame
Overview
The read_html() function of the pandas module reads the tables in an HTML document into a list of DataFrame objects. pandas.read_html() can be used for data wrangling or data scraping. Let's take a closer look at the syntax, parameters, and return value.
Syntax
pandas.read_html(io, match='.+', flavor=None, header=None, index_col=None, skiprows=None, attrs=None, parse_dates=False, thousands=',', encoding=None, decimal='.', converters=None, na_values=None, keep_default_na=True, displayed_only=True)
Parameters
Here are some of the parameters:
- io: This is a string, path-like object, or file-like object. It can be a URL, a path to an HTML file, or the HTML content itself.
- match: This can be a string or a regular expression. Only tables containing text that matches it are returned. The default value is .+, which matches any non-empty string.
- header: An integer or list-like object giving the row(s) to use as the column headers. The default value is None.
- index_col: An integer or list-like object giving the column(s) to use as the index. The default value is None.
- skiprows: An integer number of rows to skip, or a list-like object of the row indexes to skip. The default value is None.
- attrs: A Python dictionary of HTML attributes used to select which tables to parse. The default value is None.
- na_values: Additional values to treat as null, empty, or NaN.
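As a rough illustration of how a few of these parameters interact, here is a minimal sketch. The HTML, the table ids (staff, offices), and the values are made up for this example, and it assumes an HTML parser that pandas can use (such as lxml) is installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical HTML with two tables; the ids and values are invented for illustration.
html = """
<table id="staff">
  <tr><th>Name</th><th>Age</th></tr>
  <tr><td>Alice</td><td>30</td></tr>
  <tr><td>Bob</td><td>unknown</td></tr>
</table>
<table id="offices">
  <tr><th>City</th><th>Desks</th></tr>
  <tr><td>Oslo</td><td>12</td></tr>
</table>
"""

# match keeps only the tables whose text matches the string or regex.
by_match = pd.read_html(StringIO(html), match="City")

# attrs keeps only the tables whose HTML attributes match the dictionary.
by_attrs = pd.read_html(
    StringIO(html),
    attrs={"id": "staff"},
    index_col=0,            # use the first column (Name) as the index
    na_values=["unknown"],  # treat this extra string as NaN
)

offices = by_match[0]
staff = by_attrs[0]
print(offices)
print(staff)
```

Wrapping the markup in StringIO is a deliberate choice here: recent pandas versions warn when a literal HTML string is passed directly as io.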
Return value
dfs: This returns a list of DataFrames.
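Because the return value is a list even when the document holds a single table, the DataFrame usually has to be picked out by position. The following sketch shows this with a made-up one-table document; it assumes an HTML parser such as lxml is available.

```python
from io import StringIO

import pandas as pd

# Hypothetical single-table document, invented for illustration.
html = """
<table>
  <tr><th>Item</th><th>Price</th></tr>
  <tr><td>Pen</td><td>2</td></tr>
</table>
"""

dfs = pd.read_html(StringIO(html))

# read_html() returns a list, so the single table is at position 0.
print(type(dfs).__name__, len(dfs))
df = dfs[0]
print(df)
```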
Explanation
In the code snippet below, we use the pd.read_html() function to parse an HTML file into a list of pandas DataFrames.
main.py
import pandas as pd
# invoking read_html() to load employee.html file
df_list = pd.read_html("employee.html")
# print out parsed html file data as data frames
print(df_list)
Explanation for main.py
- Line 3: The pd.read_html("employee.html") call loads the employee.html file as a list of DataFrames. Each table tag is parsed into a separate DataFrame.
- Line 5: The print(df_list) call prints the list of DataFrames.
Explanation for employee.html
This file contains records of three employees as an HTML document.
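To try the example end to end without the original file, the sketch below writes a stand-in employee.html to a temporary directory and reads it back. The column names and the three records are hypothetical, since the actual contents of employee.html are not shown here, and an HTML parser such as lxml is assumed to be installed.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-in for employee.html; columns and records are invented.
html = """<html><body>
<table>
  <thead>
    <tr><th>ID</th><th>Name</th><th>Department</th></tr>
  </thead>
  <tbody>
    <tr><td>1</td><td>Alice</td><td>Sales</td></tr>
    <tr><td>2</td><td>Bob</td><td>Finance</td></tr>
    <tr><td>3</td><td>Carol</td><td>Support</td></tr>
  </tbody>
</table>
</body></html>"""

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "employee.html"
    path.write_text(html, encoding="utf-8")
    # Parse every table in the file; each one becomes a DataFrame.
    df_list = pd.read_html(str(path))

employees = df_list[0]
print(employees)
```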