Data Exploration
Learn the basics of the pandas DataFrame and different ways to explore DataFrames.
We'll cover the following
In this lesson, we’ll brush up on our pandas knowledge and data manipulation skills. We’ll start with data exploration techniques, revisit data wrangling methods, and finally check out data modeling techniques.
DataFrame
pandas DataFrames represent data in Excel-like two-dimensional tabular data structures, where we have rows and columns. Let’s start with the dataset import (line 7) and take a look at the top five records with the head()
function of a DataFrame (line 13).To look at the bottom five records instead, use the tail()
function.
import pandas as pdimport numpy as npimport matplotlib.pyplot as pltimport seaborn as snsdf_transactions = pd.read_csv('mock_transactions.csv', header=0)print('Dimension of the dataset')print(df_transactions.shape)print('\nDataset sample')print(df_transactions.head())# df_transactions.tail() # would show the bottom 5 records
To understand the numeric columns better, we can use the DataFrame's describe()
method. This tells us the non-null value count, minimum, maximum, and three quartiles of a column.
# describe the numeric columnsprint(df_transactions.describe())
Accessing a column
Sometimes, we need to take a deeper look at the data in individual columns.
# single columncol = df_transactions['InvoiceNo']print('data type:', type(col))
To count the unique values of a column, we can use the value_counts()
function of the pandas Series.
# count of unique customercol = df_transactions['CustomerID']print(col.value_counts())
There is also a visual way to display the same information.
plt.figure(figsize=(14, 6))sns.countplot(y=df_transactions['CustomerID'])plt.xlabel('Count')plt.ylabel('CustomerID')plt.savefig('output/output.png')
Filtering
There are several ways to filter DataFrames. We can apply the query()
function with a SQL-like query string as a parameter.
result = df_transactions.query('Quantity >= 20')print(result)
Another popular filtering option is to use boolean masking. Boolean masking is actually a two-step process where we prepare a boolean mask DataFrame based on our condition(s) in the first step. Then, we apply that mask over our DataFrame. Whatever value passes through the True
mask is our output.
mask = df_transactions['Quantity'] >= 20print('Step 1: masking')print(mask)result = df_transactions[mask]print('\nStep 2: filter with mask')print(result)
We see that in the first part of the output, rows 5, 6, 7, 9, 18, and 19 hold true. That’s why the DataFrame contains only those records in the second part of the output.