Trusted answers to developer questions

Related Tags

html
pandas

How to use the read_html() function to read HTML to a DataFrame

Salman Yousaf

Overview

The read_html() function of the pandas library reads the tables in an HTML (HyperText Markup Language) document into a list of pandas DataFrames. Each table it finds becomes one DataFrame, so pandas.read_html() is useful for data wrangling and for scraping tabular data from web pages. Let's take a closer look at the syntax, parameters, and return value.

Syntax

pandas.read_html(io,
                match='.+',
                flavor=None,
                header=None,
                index_col=None,
                skiprows=None,
                attrs=None,
                parse_dates=False,
                thousands=',',
                encoding=None,
                decimal='.',
                converters=None,
                na_values=None,
                keep_default_na=True,
                displayed_only=True)

Parameters

Here are the most commonly used parameters:

  • io: A string, path object, or file-like object. It can be a URL or the path to an HTML file. pandas 2.1 and later deprecate passing a raw HTML string directly; wrap it in io.StringIO instead.
  • match: A string or a regular expression. Only tables that contain text matching it are returned. The default value is .+, which matches any non-empty string, so every table with content is returned.
  • header: An integer or list-like object giving the row (or rows) to use as the column headers. The default value is None.
  • index_col: An integer or list-like object giving the column (or columns) to use as the row index. The default value is None.
  • skiprows: An integer giving the number of rows to skip after parsing, or a list-like object giving the 0-based indexes of the rows to skip. The default value is None.
  • attrs: A Python dictionary of HTML attributes (for example, an id) used to identify the table(s) to parse. The default value is None.
  • na_values: Custom values to treat as missing (NaN), in addition to the defaults. The default value is None.
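The parameters above can be combined as in the following minimal sketch. The HTML string, the table ids, and the "unknown" placeholder are made up for illustration, and the example assumes an HTML parser that pandas supports (such as lxml) is installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical HTML with two tables (not from the original article).
html = """
<table id="products">
  <tr><th>Item</th><th>Price</th></tr>
  <tr><td>Pen</td><td>1,200</td></tr>
</table>
<table id="staff">
  <tr><th>Name</th><th>Dept</th><th>Salary</th></tr>
  <tr><td>Ann</td><td>HR</td><td>unknown</td></tr>
  <tr><td>Bob</td><td>IT</td><td>5,000</td></tr>
</table>
"""

# attrs keeps only the table whose id is "staff";
# index_col=0 uses the first column (Name) as the row index;
# na_values turns the placeholder "unknown" into NaN.
staff_tables = pd.read_html(
    StringIO(html),
    attrs={"id": "staff"},
    index_col=0,
    na_values=["unknown"],
)
staff = staff_tables[0]
print(staff)

# match keeps only the tables that contain text matching "Price".
product_tables = pd.read_html(StringIO(html), match="Price")
print(product_tables[0])
```

Because thousands defaults to ',', a cell such as 5,000 is parsed as a number rather than kept as text.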

Return value

dfs: A list of DataFrames, one for each table that was parsed.
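As a short illustration of the return value, the sketch below shows that the result is a list even when the document holds a single table, and that a document without any table raises a ValueError. The HTML snippets are made up for this example, and an HTML parser such as lxml is assumed to be installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical one-table document.
html = (
    "<table>"
    "<tr><th>Name</th><th>Age</th></tr>"
    "<tr><td>Ann</td><td>30</td></tr>"
    "</table>"
)

dfs = pd.read_html(StringIO(html))
print(type(dfs), len(dfs))  # a list with one DataFrame

# Pick the DataFrame out of the list before working with it.
first = dfs[0]
print(first)

# A document with no <table> element makes read_html raise ValueError.
try:
    pd.read_html(StringIO("<p>no tables here</p>"))
    found_tables = True
except ValueError:
    found_tables = False
print(found_tables)
```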

Explanation

In the code snippet below, we use the pd.read_html() function to parse an HTML file into a list of pandas DataFrames.

main.py
import pandas as pd
# invoking read_html() to load employee.html file
df_list = pd.read_html("employee.html")
# print out parsed html file data as data frames
print(df_list)

Explanation for main.py

  • Line 3: The pd.read_html("employee.html") call loads the employee.html file as a list of DataFrames. Each table tag in the file is parsed into a separate DataFrame.
  • Line 5: The print(df_list) call prints the list of DataFrames.

Explanation for employee.html

This file contains records of three employees as an HTML document.
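The contents of employee.html are not reproduced here, so the following is only a sketch of what an end-to-end run might look like. The table layout, the column names, and the three records are assumptions made for illustration; the file is written to a temporary directory so the example is self-contained. An HTML parser such as lxml is assumed to be installed.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-in for employee.html (columns and rows are invented).
EMPLOYEE_HTML = """<!DOCTYPE html>
<html>
<body>
<table>
  <thead>
    <tr><th>Name</th><th>Department</th><th>Age</th></tr>
  </thead>
  <tbody>
    <tr><td>Alice</td><td>Engineering</td><td>30</td></tr>
    <tr><td>Bilal</td><td>Finance</td><td>45</td></tr>
    <tr><td>Chen</td><td>Marketing</td><td>28</td></tr>
  </tbody>
</table>
</body>
</html>
"""

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "employee.html"
    path.write_text(EMPLOYEE_HTML, encoding="utf-8")
    # Same call as in main.py, pointed at the temporary file.
    df_list = pd.read_html(str(path))

# The file has one table, so the list holds one DataFrame.
employees = df_list[0]
print(employees)
```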
