Data Exploration

Learn the basics of the pandas DataFrame and different ways to explore DataFrames.

We'll cover the following

In this lesson, we’ll brush up on our pandas knowledge and data manipulation skills. We’ll start with data exploration techniques, revisit data wrangling methods, and finally check out data modeling techniques.

DataFrame

pandas DataFrames represent data in Excel-like two-dimensional tabular data structures, where we have rows and columns. Let’s start with the dataset import (line 7) and take a look at the top five records with the head() function of a DataFrame (line 13).To look at the bottom five records instead, use the tail() function.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
df_transactions = pd.read_csv('mock_transactions.csv', header=0)
print('Dimension of the dataset')
print(df_transactions.shape)
print('\nDataset sample')
print(df_transactions.head())
# df_transactions.tail() # would show the bottom 5 records

To understand the numeric columns better, we can use the DataFrame's describe() method. This tells us the non-null value count, minimum, maximum, and three quartiles of a column.

# describe the numeric columns
print(df_transactions.describe())

Accessing a column

Sometimes, we need to take a deeper look at the data in individual columns.

# single column
col = df_transactions['InvoiceNo']
print('data type:', type(col))

To count the unique values of a column, we can use the value_counts() function of the pandas Series.

# count of unique customer
col = df_transactions['CustomerID']
print(col.value_counts())

There is also a visual way to display the same information.

plt.figure(figsize=(14, 6))
sns.countplot(y=df_transactions['CustomerID'])
plt.xlabel('Count')
plt.ylabel('CustomerID')
plt.savefig('output/output.png')

Filtering

There are several ways to filter DataFrames. We can apply the query() function with a SQL-like query string as a parameter.

result = df_transactions.query('Quantity >= 20')
print(result)

Another popular filtering option is to use boolean masking. Boolean masking is actually a two-step process where we prepare a boolean mask DataFrame based on our condition(s) in the first step. Then, we apply that mask over our DataFrame. Whatever value passes through the True mask is our output.

mask = df_transactions['Quantity'] >= 20
print('Step 1: masking')
print(mask)
result = df_transactions[mask]
print('\nStep 2: filter with mask')
print(result)

We see that in the first part of the output, rows 5, 6, 7, 9, 18, and 19 hold true. That’s why the DataFrame contains only those records in the second part of the output.