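The read_html() function of the pandas module reads the tables in an HTML document into a list of DataFrame objects. Note that it is a top-level function, pandas.read_html(), not a method of DataFrame. It can be used for data wrangling or data scraping. Let's take a closer look at the syntax, parameters, and return values.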
pandas.read_html(io, match='.+', flavor=None, header=None, index_col=None, skiprows=None, attrs=None, parse_dates=False, thousands=',', encoding=None, decimal='.', converters=None, na_values=None, keep_default_na=True, displayed_only=True)
Here are some of the parameters:
io: This is a string, path-like object, or file-like object. It can be a URL, a path to an HTML file, or the HTML content itself.
match: This can be a string or a regular expression. Only tables containing text that matches it are returned. The default value is .+, which matches any non-empty string.
header: An integer or list-like object giving the row(s) to use as the column headers. The default value is None.
index_col: An integer or list-like object giving the column(s) to use as the index. The default value is None.
skiprows: An integer giving the number of rows to skip, or a list-like object giving the indexes of the rows to skip. The default value is None.
attrs: A Python dictionary of HTML attributes used to identify the table to parse, for example {"id": "table"}. The default value is None.
na_values: Custom values to treat as null, empty, or NaN. The default value is None.
Return value:
dfs: A list of DataFrames, one for each table that was parsed.
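To make the filtering parameters more concrete, here is a minimal sketch that parses an in-memory HTML document. The HTML content, the table ids (staff, offices), and the "unknown" placeholder are all made up for illustration and do not come from the original article. It assumes an HTML parser backend such as lxml (or bs4 with html5lib) is installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical HTML with two tables; ids and contents are invented for this sketch.
html = """
<table id="staff">
  <tr><th>Name</th><th>Salary</th></tr>
  <tr><td>Alice</td><td>1,200</td></tr>
  <tr><td>Bob</td><td>unknown</td></tr>
</table>
<table id="offices">
  <tr><th>City</th><th>Desks</th></tr>
  <tr><td>Oslo</td><td>12</td></tr>
</table>
"""

# No filters: every <table> becomes its own DataFrame.
all_tables = pd.read_html(StringIO(html))

# attrs: keep only the table whose id attribute is "staff";
# na_values: treat the string "unknown" as a missing value.
staff_only = pd.read_html(
    StringIO(html), attrs={"id": "staff"}, na_values=["unknown"]
)

# match: keep only tables that contain text matching "Oslo".
by_match = pd.read_html(StringIO(html), match="Oslo")

print(len(all_tables), len(staff_only), len(by_match))
print(staff_only[0])
```

Wrapping the HTML in StringIO passes it as a file-like object, which avoids relying on read_html() interpreting a raw string as markup.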
In the code snippet below, we use the pd.read_html() function to parse an HTML file into a list of pandas DataFrames.
import pandas as pd

# invoking read_html() to load employee.html file
df_list = pd.read_html("employee.html")

# print out parsed html file data as data frames
print(df_list)
main.py
pd.read_html("employee.html"): This call loads the employee.html file as a list of DataFrames. Each table tag in the file is parsed into a separate DataFrame.
print(df_list): This call prints the list of DataFrames.
employee.html
This file contains records of three employees as an HTML document.
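Since the contents of employee.html are not shown here, the following is a self-contained sketch that first writes a hypothetical stand-in file and then parses it. The column names and the three employee records are assumptions made for illustration only. An HTML parser backend such as lxml is assumed to be installed.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-in for employee.html; the real file's columns are not known.
html = """<html><body>
<table>
  <thead><tr><th>Name</th><th>Department</th><th>Age</th></tr></thead>
  <tbody>
    <tr><td>Alice</td><td>Engineering</td><td>30</td></tr>
    <tr><td>Bob</td><td>Sales</td><td>41</td></tr>
    <tr><td>Carol</td><td>Finance</td><td>27</td></tr>
  </tbody>
</table>
</body></html>"""

with tempfile.TemporaryDirectory() as tmp:
    # Write the sample document to disk so read_html() can load it by path.
    path = Path(tmp) / "employee.html"
    path.write_text(html, encoding="utf-8")
    df_list = pd.read_html(str(path))

# read_html() always returns a list, even when the file holds a single table.
employees = df_list[0]
print(len(df_list))
print(employees)
```

Because the result is a list, a single table still has to be selected by index (df_list[0]) before it can be used as a DataFrame.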