Read Data from the Web

Learn to read data from websites on the internet.

Read from online files (CSV and JSON)

The exploding volume of data and activity on the internet makes websites a fantastic source of data for us to utilize. If we have the Uniform Resource Locator (URL) of the remote CSV or JSON file we want, the read_csv() and read_json() functions will do the trick. Let’s say we’d like to read a raw CSV file uploaded to a public GitHub repository:

[Image: Sample of a CSV file within a GitHub repository]

We can do so by passing the URL of the remote CSV file as an argument to read_csv(), as shown in the code below:

# Import the pandas library
import pandas as pd

# Define GitHub URL containing the CSV file
url = 'https://raw.githubusercontent.com/kennethleungty/Simulated-Annealing-Feature-Selection/main/data/raw/train.csv'
# Parse URL into pandas function
df = pd.read_csv(url)
# Display first 5 rows
print(df.head())

Besides HTTP(S) URLs, read_csv() (or read_json() when dealing with JSON files) can also handle other valid URL schemes, such as ftp, s3, and gs. Let’s suppose we wish to read a CSV file from a public AWS S3 bucket belonging to the NOAA Water-Column Sonar Data Archive.

[Image: Screenshot of NOAA Sonar Data Archive webpage]

Next, we specify the S3 URL of the CSV file and establish an anonymous connection via the storage_options parameter, since we aren’t using any credentials for bucket access:

# Ensure fsspec and s3fs are installed (pip install fsspec s3fs)
# Define S3 bucket URL
s3_url = 's3://noaa-wcsd-pds/data/processed/SH1305/bottom/SaKe2013-D20130523-T080854_to_SaKe2013-D20130523-T085643.csv'
# Pass URL into pandas function
df = pd.read_csv(s3_url,
                 storage_options={"anon": True})  # Anonymous connection to S3
# Display first 5 rows
print(df.head())
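
Similarly, since read_csv() also understands the gs scheme, a CSV file in a public Google Cloud Storage bucket can be read the same way. Here is a minimal sketch (the bucket and file path below are hypothetical, and the gcsfs package is assumed to be installed):

# Ensure gcsfs is installed (pip install gcsfs)
# Hypothetical path to a CSV file in a public GCS bucket
gs_url = 'gs://some-public-bucket/data/sample.csv'
# Anonymous access to the public bucket
df = pd.read_csv(gs_url, storage_options={"token": "anon"})
# Display first 5 rows
print(df.head())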

Note: To access data in private S3 buckets, we must provide AWS credentials, typically managed with the handy boto3 package. boto3 is a popular Python software development kit (SDK) for interacting with AWS services using Python code.
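
As a minimal sketch of that scenario (the bucket path and the credential values below are hypothetical placeholders, not real values), the credentials can also be supplied through the same storage_options parameter:

# Sketch only: the bucket, file path, and credentials below are placeholders
private_url = 's3://my-private-bucket/data/records.csv'
df = pd.read_csv(private_url,
                 storage_options={"key": "YOUR_AWS_ACCESS_KEY_ID",         # placeholder
                                  "secret": "YOUR_AWS_SECRET_ACCESS_KEY"}) # placeholder
# Display first 5 rows
print(df.head())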

Beyond CSV files, we can easily read JSON files online with the read_json() function. For example, say we want to import information on the latest block on the Bitcoin blockchain, which the Blockchain.info website serves in JSON format, as shown below:

[Image: Sample of an online JSON file detailing the latest Bitcoin block]

Here is the code to read the data stored online in JSON format:

# Define URL with the JSON data
json_url = 'https://blockchain.info/latestblock'
# Parse URL into function
df = pd.read_json(json_url)
# Display first 5 rows
print(df.head())

Read from HTML tables on websites

Sometimes, we may want to retrieve data from HTML tables available on public websites. For example, we may wish to obtain data on Amazon’s financial performance from the tables displayed on Google Finance.

[Image: The AMZN financials webpage on Google Finance]

We can retrieve this tabular data directly with read_html(). Given the website URL, the function searches the page for <table> HTML tags and returns the content within those tags (note that this requires an HTML parser library such as lxml to be installed):

# URL to Amazon page on Google Finance
url = 'https://www.google.com/finance/quote/AMZN:NASDAQ'
# Retrieve HTML tables directly from website
output = pd.read_html(url)
# Display output
print(output)

After executing the code above, we’ll notice that the output is a list of table contents. We can then use the square bracket operator to access each table in the list. For example, to access the first table identified on the Google Finance page (i.e., the quarterly income statement), we run the following code:

# Access contents of first table
url = 'https://www.google.com/finance/quote/AMZN:NASDAQ'
table1 = pd.read_html(url)[0]
# Display table contents
print(table1)

To retrieve data from the other tables, we access the corresponding item at the appropriate index of the list. For example, we can retrieve the next table (i.e., the quarterly balance sheet) by incrementing the index by one, that is, pd.read_html(url)[1].
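
Continuing with the same URL, the sketch below also checks how many tables read_html() found before pulling one out by its index:

# Retrieve all tables once, then index into the resulting list
tables = pd.read_html(url)
# Number of tables detected on the page
print(len(tables))
# Access contents of the second table (the quarterly balance sheet)
table2 = tables[1]
print(table2)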

[Image: Output table from read_html()]

We can see that the output above matches the actual table on the Google Finance page shown below:

[Image: Financials table for AMZN on Google Finance page]

From the output displayed, it’s clear that we have to process the extracted data to resolve several data quality issues, such as unrecognized characters and concatenated column text. It’s a good reminder that data extracted from the web won’t always come in the perfect shape for immediate analysis, and further data cleaning is often necessary.
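
As an illustrative sketch of such cleanup (the specific characters and column labels handled below are assumptions about what the extracted table might contain, not its actual contents), we might flatten the column labels and strip non-ASCII characters:

# Illustrative cleanup only; the issues handled here are assumed, not guaranteed
table1 = pd.read_html(url)[0]
# Flatten multi-level or tuple column labels into single strings
table1.columns = [' '.join(map(str, col)).strip() if isinstance(col, tuple)
                  else str(col) for col in table1.columns]
# Drop non-ASCII characters from text columns (hypothetical garbling)
for col in table1.select_dtypes(include='object'):
    table1[col] = (table1[col].astype(str)
                   .str.encode('ascii', 'ignore')
                   .str.decode('ascii'))
print(table1.head())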

If read_html() returns too many tables, we can use the match parameter to find tables that contain a specific string. For example, we can define match='Revenue' to retrieve tables containing the text 'Revenue', as shown below:

# Retrieve only tables containing the text 'Revenue'
url = 'https://www.google.com/finance/quote/AMZN:NASDAQ'
print(pd.read_html(url, match='Revenue'))

Read from clipboard

If we want to quickly copy and paste tabular data directly from a website without using the read_html() function, we can do so with the read_clipboard() function. Instead of copying data and pasting it into an Excel spreadsheet, read_clipboard() directly reads the data we have copied (saved onto our clipboard) and then passes it into the read_csv() function for us. Let’s say we want to retrieve information about English Premier League managers from Wikipedia:

[Image: Wikipedia table of managers in the English Premier League (used under Creative Commons Attribution-ShareAlike)]

We can highlight the table contents on the website and then press “Ctrl (or Command) + C” on our keyboard to copy the data onto the clipboard. From there, we execute the read_clipboard() function (without any arguments) to transcribe the copied data into a pandas DataFrame:

# Read saved tabular data from clipboard
pd.read_clipboard()

The output from read_clipboard() will appear like this, with the clipboard data now successfully stored in a DataFrame:

[Image: Output of read_clipboard()]
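
Because the copied text is ultimately handed off to read_csv(), read_clipboard() accepts the same keyword arguments. As a small sketch (assuming, hypothetically, that the copied table is tab-delimited), we could pass the separator explicitly:

# read_clipboard() forwards keyword arguments to read_csv()
# Assumes the copied table is tab-delimited
df = pd.read_clipboard(sep='\t')
# Display first 5 rows
print(df.head())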