Loading a CSV Dataset From a URL
Learn to import a CSV dataset from a URL.
Loading CSV files
The CSV format is popular for storing and transferring data. Files with a .csv
extension are plain text files containing data records with comma-separated values.
Let’s see how we can analyze data from a CSV file using Python by loading the file from a URL.
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/CourseMaterial/DataWrangling/main/flowerdataset.csv')
print(df.head())
Let’s review the code line by line:
Line 1: We start by importing the pandas library using import pandas as pd.
Line 2: We pass the URL of the dataset, enclosed in quotes, to the read_csv() function and save the result in the df variable.
Note: The read_csv() function returns a DataFrame, so the df variable now holds a DataFrame. A DataFrame is a tabular data structure that represents data in rows and columns.
Line 3: We print the first five records of df using the df.head() function.
Note: We can print a different number of rows by passing that number as an argument to the head() function, e.g., df.head(10) prints the first ten rows.
We observe the following facts from the output:
The output contains the first five records of the DataFrame, df. These records help us understand how the rest of the data looks.
The first column is called the index column and contains the values 0, 1, 2, 3, and 4. Each row within the DataFrame is assigned a unique index value. We usually don't use the index column when providing recommendations.
Other than that, we can see that the DataFrame, df, contains five columns: sepal_length, sepal_width, petal_length, petal_width, and species.
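The observations above can also be checked in code rather than read off the printed output. The following is a minimal sketch, assuming a small in-memory CSV that mimics the column layout described above. The SAMPLE_CSV text and its rows are made up for illustration and are not the contents of the file at the URL; io.StringIO stands in for the URL so the example runs offline.

```python
import io

import pandas as pd

# Hypothetical sample rows with the same five columns as the lesson's dataset.
SAMPLE_CSV = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
6.4,3.2,4.5,1.5,versicolor
6.3,3.3,6.0,2.5,virginica
5.8,2.7,5.1,1.9,virginica
"""

# read_csv() accepts a file-like object as well as a URL or a file path.
df = pd.read_csv(io.StringIO(SAMPLE_CSV))

print(type(df))          # the object returned by read_csv() is a DataFrame
print(df.shape)          # (number of rows, number of columns)
print(list(df.columns))  # column names taken from the header line
print(list(df.index))    # default integer index, starting at 0
print(df.head(3))        # first three rows instead of the default five
```

Replacing io.StringIO(SAMPLE_CSV) with the dataset URL should give the same kind of summary for the real file.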
Parameters
The read_csv() function has multiple parameters we can set to control how data is read from the data source. Three popular parameters are usecols, nrows, and skiprows.
We'll now see how we can apply these three parameters.
The usecols parameter
To save memory, we can load only the columns we want to work with by passing them to the usecols parameter.
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/CourseMaterial/DataWrangling/main/flowerdataset.csv', usecols=['sepal_length', 'sepal_width'])
print(df.head())
Let’s review the code line by line:
Line 1: We import the pandas library.
Line 2: While reading the dataset, we set the usecols parameter inside the read_csv() function and assign it a list containing the desired dataset columns, here ['sepal_length', 'sepal_width'].
Note: We can also pass zero-based column positions instead of names, e.g., pd.read_csv('http://bit.ly/flowerdataset', usecols=[0, 1]).
Line 3: We preview the dataset.
From the output, we see that only the desired columns, sepal_length and sepal_width, were selected. If we want to select other columns, we add them to the list assigned to usecols.
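To make the two ways of selecting columns concrete, here is a small sketch that loads the same columns once by name and once by position and compares the results. It assumes a made-up in-memory sample (SAMPLE_CSV is hypothetical, not the file at the URL) with the column layout described earlier.

```python
import io

import pandas as pd

# Hypothetical sample rows; only the column layout matches the lesson's dataset.
SAMPLE_CSV = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
6.3,3.3,6.0,2.5,virginica
5.8,2.7,5.1,1.9,virginica
"""

# Select columns by name.
by_name = pd.read_csv(io.StringIO(SAMPLE_CSV), usecols=["sepal_length", "sepal_width"])

# Select the same columns by zero-based position.
by_position = pd.read_csv(io.StringIO(SAMPLE_CSV), usecols=[0, 1])

print(by_name.head())
print(by_name.equals(by_position))  # both selections hold the same data

# The order of entries in usecols is ignored; columns keep the order of the file.
reordered = pd.read_csv(io.StringIO(SAMPLE_CSV), usecols=["sepal_width", "sepal_length"])
print(list(reordered.columns))
```

Selecting by name is usually easier to read, while positions can help when a file has no usable header.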
The nrows parameter
Another useful parameter for analysis is nrows. We use this parameter to set how many rows of data to load from the data source instead of loading the entire dataset.
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/CourseMaterial/DataWrangling/main/flowerdataset.csv', nrows=15)
print(df)
Let’s review the code line by line:
Line 1: We import the pandas library.
Line 2: When reading data from the data source using the read_csv() function, we use the nrows parameter to set the number of rows we want to work with.
Line 3: We preview the dataset.
From the output, we can see the first 15 records of our dataset. This is because we set the nrows parameter to 15.
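The behavior of nrows can be verified on a small example. The sketch below assumes a made-up in-memory sample (SAMPLE_CSV is hypothetical and stands in for the file at the URL). It shows that nrows counts data rows, not the header line, and that it can be combined with usecols.

```python
import io

import pandas as pd

# Hypothetical sample rows; only the column layout matches the lesson's dataset.
SAMPLE_CSV = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
6.3,3.3,6.0,2.5,virginica
5.8,2.7,5.1,1.9,virginica
"""

# Load only the first three data rows; the header line is not counted.
first_three = pd.read_csv(io.StringIO(SAMPLE_CSV), nrows=3)
print(len(first_three))

# Asking for more rows than the file contains simply returns every row.
everything = pd.read_csv(io.StringIO(SAMPLE_CSV), nrows=100)
print(len(everything))

# nrows and usecols can be combined to load a small slice of a large file.
small_slice = pd.read_csv(io.StringIO(SAMPLE_CSV), usecols=["species"], nrows=2)
print(small_slice)
```

Loading a handful of rows this way is a cheap way to inspect column names and types before reading a large file in full.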
The skiprows parameter
Sometimes we might want to skip the first row of a dataset if it's irrelevant to our analysis. To do this, we use the skiprows parameter as shown below.
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/CourseMaterial/DataWrangling/main/flowerdataset.csv', skiprows=1)
print(df)
Let’s review the code line by line:
Line 1: We import the pandas library.
Line 2: While using the read_csv() function to read the dataset, we pass the skiprows parameter and set its value to the number of rows we want to skip from the beginning of the dataset.
Line 3: We preview the resulting dataset.
As we can see from the output, the resulting dataset doesn't have the original column names. This is because we set the skiprows parameter to 1, which means that the first line of the CSV file (containing the column names) is skipped during the reading process. As a result, the values in the second line of the CSV file become the column names of the DataFrame, and the index starts from 0 at the third line of the file, which is the first data row of the new DataFrame.
All in all, it's important to note that skipping the header row and letting the first data row take its place can harm the accuracy of data analysis: the original column names are lost, and one record is missing from the data. Setting skiprows to 1 is appropriate only when the first line of a CSV file isn't part of the data (for example, it is empty or contains null values) and the second line contains the actual column names.
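The difference between these cases can be seen on a small example. The sketch below assumes made-up in-memory samples: SAMPLE_CSV and JUNK_FIRST_LINE_CSV are hypothetical and do not reproduce the file at the URL. It contrasts the pitfall with the case where skipping the first line is appropriate, and shows that skiprows also accepts a list of zero-based line numbers, which allows dropping a data row while keeping the header.

```python
import io

import pandas as pd

# Hypothetical sample rows; only the column layout matches the lesson's dataset.
SAMPLE_CSV = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
6.3,3.3,6.0,2.5,virginica
5.8,2.7,5.1,1.9,virginica
"""

# Pitfall: the header line is skipped, so the first data row becomes the column names.
shifted = pd.read_csv(io.StringIO(SAMPLE_CSV), skiprows=1)
print(list(shifted.columns))

# Appropriate use: a hypothetical file whose first line is not part of the data.
JUNK_FIRST_LINE_CSV = "exported file\n" + SAMPLE_CSV
clean = pd.read_csv(io.StringIO(JUNK_FIRST_LINE_CSV), skiprows=1)
print(list(clean.columns))

# A list of zero-based line numbers skips specific lines instead:
# here line 1 (the first data row) is dropped and the header on line 0 is kept.
kept_header = pd.read_csv(io.StringIO(SAMPLE_CSV), skiprows=[1])
print(kept_header.head())
```

Checking the column names right after loading, as done here with the columns attribute, is a quick way to catch a header that was skipped by mistake.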