Read Data from the Web

Learn to read data from websites on the internet.

Read from online files (CSV and JSON)

The exploding volume of data and activity on the internet makes websites a fantastic source of data for us to utilize. If we have the Uniform Resource Locator (URL) of the remote CSV or JSON file we want, the read_csv() and read_json() functions will do the trick. Let’s say we’d like to read a raw CSV file uploaded to a public GitHub repository:

[Image: Sample of a CSV file within a GitHub repository]

We can do so by passing the URL of the remote CSV file as an argument to read_csv(), as shown in the code below:

# Import the pandas library
import pandas as pd

# Define GitHub URL containing the CSV file
url = 'https://raw.githubusercontent.com/kennethleungty/Simulated-Annealing-Feature-Selection/main/data/raw/train.csv'
# Parse URL into pandas function
df = pd.read_csv(url)
# Display first 5 rows
print(df.head())

Besides HTTP(S) URLs, read_csv() (or read_json() when dealing with JSON files) can also handle other valid URL schemes, such as ftp, s3, and gs. Let’s suppose we wish to read a CSV file from a public AWS S3 bucket belonging to the NOAA Water-Column Sonar Data Archive.

[Image: Screenshot of NOAA Sonar Data Archive webpage]

Next, we specify the S3 URL of the CSV file and establish an anonymous connection via the storage_options parameter, since we aren’t using any credentials for bucket access:

# Ensure fsspec and s3fs are installed (pip install fsspec s3fs)
# Define S3 bucket URL
s3_url = 's3://noaa-wcsd-pds/data/processed/SH1305/bottom/SaKe2013-D20130523-T080854_to_SaKe2013-D20130523-T085643.csv'
# Pass URL into pandas function
df = pd.read_csv(s3_url,
                 storage_options={"anon": True})  # Anonymous connection to S3
# Display first 5 rows
print(df.head())
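
Similarly, since read_csv() also understands the gs scheme, a CSV file in a public Google Cloud Storage bucket can be read the same way. Here is a minimal sketch (the bucket and file path below are hypothetical, and the gcsfs package is assumed to be installed):

# Ensure gcsfs is installed (pip install gcsfs)
# Hypothetical path to a CSV file in a public GCS bucket
gs_url = 'gs://some-public-bucket/data/sample.csv'
# Anonymous access to the public bucket
df = pd.read_csv(gs_url, storage_options={"token": "anon"})
# Display first 5 rows
print(df.head())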

Note: To access data in private S3 buckets, we must provide AWS credentials, typically managed with the handy boto3 package. boto3 is a popular Python software development kit (SDK) for interacting with AWS services using Python code.
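
As a minimal sketch of that scenario (the bucket path and the credential values below are hypothetical placeholders, not real values), the credentials can also be supplied through the same storage_options parameter:

# Sketch only: the bucket, file path, and credentials below are placeholders
private_url = 's3://my-private-bucket/data/records.csv'
df = pd.read_csv(private_url,
                 storage_options={"key": "YOUR_AWS_ACCESS_KEY_ID",         # placeholder
                                  "secret": "YOUR_AWS_SECRET_ACCESS_KEY"}) # placeholder
# Display first 5 rows
print(df.head())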

Beyond CSV files, we can easily read JSON files online with the read_json() function. For example, say we want to import information on the latest block on the Bitcoin blockchain, which the Blockchain.info website serves in JSON format, as shown below:

[Image: Sample of an online JSON file detailing the latest Bitcoin block]

Here is the code to read the data stored online in JSON format:

# Define URL with the JSON data
json_url = 'https://blockchain.info/latestblock'
# Parse URL into function
df = pd.read_json(json_url)
# Display first 5 rows
print(df.head())

Read from HTML tables on websites

Sometimes, we may want to retrieve data from HTML tables available on public websites. For example, we may wish to obtain data on Amazon’s financial performance from the tables displayed on Google Finance.

[Image: The AMZN financials webpage on Google Finance]

We can retrieve this tabular data directly with read_html(). Given the website URL, the function searches the page for <table> HTML tags and returns the content within those tags (note that this requires an HTML parser library such as lxml to be installed):

# URL to Amazon page on Google Finance
url = 'https://www.google.com/finance/quote/AMZN:NASDAQ'
# Retrieve HTML tables directly from website
output = pd.read_html(url)
# Display output
print(output)

After executing the code above, we’ll notice that the output is a list of table contents. We can then use the square bracket operator to access each table in the list. For example, to access the first table identified on the Google Finance page (i.e., the quarterly income statement), we run the following code:

# Access contents of first table
url = 'https://www.google.com/finance/quote/AMZN:NASDAQ'
table1 = pd.read_html(url)[0]
# Display table contents
print(table1)

To retrieve data from the other tables, we access the corresponding item at the appropriate index of the list. For example, we can retrieve the next table (i.e., the quarterly balance sheet) by incrementing the index by one, that is, pd.read_html(url)[1].
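
Continuing with the same URL, the sketch below also checks how many tables read_html() found before pulling one out by its index:

# Retrieve all tables once, then index into the resulting list
tables = pd.read_html(url)
# Number of tables detected on the page
print(len(tables))
# Access contents of the second table (the quarterly balance sheet)
table2 = tables[1]
print(table2)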

[Image: Output table from read_html()]

We can see that the output above matches the actual table on the Google Finance page shown below:

[Image: Financials table for AMZN on Google Finance page]

From the output displayed, it’s clear that we have to process the extracted data to resolve several data quality issues, such as unrecognized characters and concatenated column text. It’s a good reminder that data extracted from the web won’t always come in the perfect shape for immediate analysis, and further data cleaning is often necessary.
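
As an illustrative sketch of such cleanup (the specific characters and column labels handled below are assumptions about what the extracted table might contain, not its actual contents), we might flatten the column labels and strip non-ASCII characters:

# Illustrative cleanup only; the issues handled here are assumed, not guaranteed
table1 = pd.read_html(url)[0]
# Flatten multi-level or tuple column labels into single strings
table1.columns = [' '.join(map(str, col)).strip() if isinstance(col, tuple)
                  else str(col) for col in table1.columns]
# Drop non-ASCII characters from text columns (hypothetical garbling)
for col in table1.select_dtypes(include='object'):
    table1[col] = (table1[col].astype(str)
                   .str.encode('ascii', 'ignore')
                   .str.decode('ascii'))
print(table1.head())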

If read_html() returns too many tables, we can use the match parameter to find tables that contain a specific string. For example, we can define match='Revenue' to retrieve tables containing the text 'Revenue', as shown below:

# Retrieve only tables containing the text 'Revenue'
url = 'https://www.google.com/finance/quote/AMZN:NASDAQ'
print(pd.read_html(url, match='Revenue'))

Read from clipboard

If we want to quickly copy and paste tabular data directly from a website without using the read_html() function, we can do so with the read_clipboard() function. Instead of copying data and pasting it into an Excel spreadsheet, read_clipboard() directly reads the data we have copied (saved onto our clipboard) and then passes it into the read_csv() function for us. Let’s say we want to retrieve information about English Premier League managers from Wikipedia:

[Image: Wikipedia table of managers in the English Premier League (used under Creative Commons Attribution-ShareAlike)]

We can highlight the table contents on the website and then press “Ctrl (or Command) + C” on our keyboard to copy the data onto the clipboard. From there, we execute the read_clipboard() function (without any arguments) to transcribe the copied data into a pandas DataFrame:

# Read saved tabular data from clipboard
pd.read_clipboard()

The output from read_clipboard() will appear like this, with the clipboard data now successfully stored in a DataFrame:

[Image: Output of read_clipboard()]
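
Because the copied text is ultimately handed off to read_csv(), read_clipboard() accepts the same keyword arguments. As a small sketch (assuming, hypothetically, that the copied table is tab-delimited), we could pass the separator explicitly:

# read_clipboard() forwards keyword arguments to read_csv()
# Assumes the copied table is tab-delimited
df = pd.read_clipboard(sep='\t')
# Display first 5 rows
print(df.head())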