In machine learning, we often need datasets of relevant data to train our models and improve their accuracy at a given task.
A common approach is to download the dataset from a web source and then read it into a program with file-handling techniques; a minimal sketch of this approach follows the list below. This workflow is widespread, but it comes with disadvantages:
Potential data inconsistency: A locally downloaded file can go stale, so we risk loading outdated data unless we re-download it whenever the source changes.
Memory limitations: Reading a large dataset with basic file handling is memory-intensive because the entire file is loaded into memory at once, which can lead to performance issues.
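For contrast, here is a minimal sketch of the download-first workflow described above; the URL and local file name are illustrative placeholders, not part of the original example.

import json
import urllib.request

# Hypothetical dataset location and local file name (placeholders).
url = 'https://example.com/data/restaurants.json'
urllib.request.urlretrieve(url, 'restaurants.json')  # save a local copy first

with open('restaurants.json', encoding='utf-8') as f:
    data = json.load(f)  # the whole file is parsed into memory at once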
Alternatively, we can load a dataset into our code directly from a URL with the pandas library, without saving a copy to disk first. This mitigates both problems:
We won't have to worry about data inconsistency because pandas fetches the dataset from its source every time the code runs, so we always work with the current version.
Memory limitations are also easier to manage because pandas readers support chunked reading, so large files can be processed piece by piece instead of being held in memory all at once (see the sketch after this list).
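As an illustration of the chunking idea, the sketch below reads a remote CSV file in fixed-size chunks with the chunksize parameter of read_csv; the URL and chunk size here are hypothetical.

import pandas as pd

# Hypothetical CSV URL; replace with a real dataset location.
csv_url = 'https://example.com/data/large_dataset.csv'

# With chunksize, read_csv returns an iterator of DataFrames,
# so only one chunk is held in memory at a time.
total_rows = 0
for chunk in pd.read_csv(csv_url, chunksize=10_000):
    total_rows += len(chunk)  # process each chunk, e.g. count its rows

print(total_rows)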
To load the dataset through a URL, we use the pandas library, which provides convenient functions for reading data from various sources. In this example, we'll use Google Research's Schema-Guided Dialogue (SGD) dataset.
import pandas as pd

url = 'https://raw.githubusercontent.com/PacktPublishing/Mastering-spaCy/main/Chapter10/data/restaurants.json'
dataset = pd.read_json(url, encoding='utf-8')
print(dataset.head())
Line 3: We store the URL where the dataset is hosted in the url variable.
Line 4: We use the read_json() method to read the data from the URL and store the resulting DataFrame in the dataset variable.
Line 5: We print the result of dataset.head(), which displays the first few rows of the dataset we have loaded.
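After loading, a few quick checks (a sketch, not part of the original example) can confirm the DataFrame's size and structure:

# Quick sanity checks on the loaded DataFrame
print(dataset.shape)    # (number of rows, number of columns)
print(dataset.columns)  # column labels
dataset.info()          # column dtypes and non-null counts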