How to parse datasets from URLs in Python

In machine learning, we often need datasets to train our models and improve their accuracy on a task.

A common approach is to download the dataset from a web source and then read the local file in our program. This works, but it has some disadvantages:

  • Potential data inconsistency: A downloaded copy can become outdated. Unless we keep re-downloading the file, we may end up loading stale data.

  • Memory limitations: Reading a large file in one go loads the entire dataset into memory at once, which can lead to performance issues.

Alternatively, we can pass a URL directly to the pandas library and load the dataset into our code without manually downloading and saving the file first. This helps mitigate the problems mentioned above:

  • Data inconsistencies are less of a concern because pandas fetches the file from the URL every time the code runs, so we get whatever version is currently hosted there.

  • Memory limitations can be reduced because some pandas readers support chunked reading (for example, the chunksize parameter of read_csv()), which processes the data in pieces instead of loading it all at once.
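As a rough illustration of chunked reading, the sketch below processes a CSV source two rows at a time using read_csv() with its chunksize parameter. The column names and values are made up for the example, and an in-memory buffer stands in for a URL so the snippet runs offline; read_csv() also accepts an http(s) URL in the same position.

```python
import io

import pandas as pd

# Stand-in for a remote CSV file (made-up data); a URL string could be passed instead.
csv_source = io.StringIO("id,score\n1,0.5\n2,0.7\n3,0.9\n4,0.2\n5,0.4\n")

total_rows = 0
score_sum = 0.0

# With chunksize set, read_csv returns an iterator of DataFrames
# instead of one large DataFrame, so only one chunk is held at a time.
for chunk in pd.read_csv(csv_source, chunksize=2):
    total_rows += len(chunk)
    score_sum += chunk["score"].sum()

mean_score = score_sum / total_rows
print(total_rows, round(mean_score, 2))
```

Note that read_json() only supports chunksize together with lines=True (line-delimited JSON), so this technique does not apply to a regular single-document JSON file.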

Loading the dataset from URLs

pandas provides convenient functions, such as read_json() and read_csv(), that accept a URL in place of a file path. In this example, we'll use a JSON file from Google Research's Schema-Guided Dialogue (SGD) dataset.

Example

import pandas as pd
url = 'https://raw.githubusercontent.com/PacktPublishing/Mastering-spaCy/main/Chapter10/data/restaurants.json'
dataset = pd.read_json(url, encoding='utf-8')
print(dataset.head())

Explanation

  • Line 2: We store the URL where the dataset is hosted in the url variable.

  • Line 3: We use the read_json() function to read the data from the URL and store the resulting DataFrame in the dataset variable.

  • Line 4: We print the result of dataset.head(), which returns the first five rows of the dataset we have loaded.
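Because the data is fetched over the network, the read can fail if the URL is unreachable or the content is not valid JSON. The following is a minimal sketch of one way to guard against that. The helper name load_json_dataset and the sample records are hypothetical, and a small temporary file stands in for the URL so the snippet runs offline.

```python
import json
import os
import tempfile

import pandas as pd


def load_json_dataset(source):
    """Hypothetical helper: return a DataFrame, or None if the read fails."""
    try:
        return pd.read_json(source, encoding="utf-8")
    except (OSError, ValueError) as err:
        # OSError covers network errors (urllib's URLError) and missing files;
        # ValueError covers content that cannot be parsed as JSON.
        print(f"Could not load {source}: {err}")
        return None


# Offline demonstration: a small local file (made-up records) stands in for a URL.
records = [{"name": "Cafe A", "rating": 4.5}, {"name": "Cafe B", "rating": 3.8}]
with tempfile.TemporaryDirectory() as tmp_dir:
    good_path = os.path.join(tmp_dir, "sample.json")
    with open(good_path, "w", encoding="utf-8") as handle:
        json.dump(records, handle)

    loaded = load_json_dataset(good_path)
    missing = load_json_dataset(os.path.join(tmp_dir, "missing.json"))

print(loaded.shape)
print(missing)
```

Returning None keeps the sketch short; in a larger program, re-raising the error or logging it may be a better fit.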

Copyright ©2024 Educative, Inc. All rights reserved