Inspect Data: Data Analysis

Learn why exploratory data analysis is so important for any data science project.

To warm up, let’s start with a few standard operations to get a better idea of what the DataFrame is all about. First, select only the rows for a single store, for instance, StoreA. The reset_index() method resets the original index. This isn’t mandatory at this point, but it’s a good way to practice the index concept, and indexing will serve us well in later lessons. Please note that column headings are case sensitive, so it must be df["Store"], not df["store"].

import pandas as pd
df = pd.read_csv('MoscowMcD.csv')
StoreSelect = df[df["Store"].isin(["StoreA"])].reset_index()

By the way, reset_index() adds the original index values as a new column named index.
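The effect is easy to verify. The sketch below uses a small synthetic DataFrame as a stand-in for MoscowMcD.csv (the store names and coordinates are hypothetical), since the CSV itself isn’t reproduced here:

```python
import pandas as pd

# Hypothetical stand-in for MoscowMcD.csv
df = pd.DataFrame({
    "Store": ["StoreA", "StoreB", "StoreA"],
    "lat": [55.75, 55.76, 55.77],
    "lon": [37.61, 37.62, 37.63],
})

# Filter to StoreA, then reset the index
StoreSelect = df[df["Store"].isin(["StoreA"])].reset_index()

# The old index positions (0 and 2) survive as a new 'index' column
print(StoreSelect.columns.tolist())
print(StoreSelect["index"].tolist())
```

Passing drop=True to reset_index() would discard the old index instead of keeping it as a column.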

It really pays to take our time at the beginning of a project to check the data quality. If we dive straight in, correcting errors in the data (and therefore possibly in the entire underlying concept or model) later on becomes even more difficult and time-consuming.

As the name suggests, exploratory data analysis (EDA) is the process of analyzing datasets to summarize their main characteristics, often with visual methods. It’s a critical process for understanding the story the data tells and for uncovering underlying patterns and relationships.

Descriptive statistics

Start with simple data analysis first. Before we apply basic statistics to the data, it’s a good idea to take a look at the data quality. With the lightweight isnull function, we can check whether any records contain null values.

Fortunately, the data is complete. That’s a good start, but it doesn’t tell us anything about whether incorrect entries might have been made. For instance, a longitude or latitude could hold an implausible value. As a first check, we can look at the minimum and maximum values of all longitudes and latitudes with the min and max functions.

import pandas as pd
df = pd.read_csv('MoscowMcD.csv')
# prints boolean values indicating whether each element in 'df' is null (True) or not (False)
print(df.isnull())
# prints the minimum value in the 'lon' column of the df
print(df['lon'].min())
# prints the minimum value in the 'lat' column of the df
print(df['lat'].min())
# prints the maximum value in the 'lat' column of the df
print(df['lat'].max())
# prints the maximum value in the 'lon' column of the df
print(df['lon'].max())

But how are we supposed to judge if the longitude and latitude information is correct? We could at least check that all longitudes lie within −180 to 180 degrees and all latitudes within −90 to 90 degrees. But even if the numbers are valid for the coordinate system, that still says nothing about whether they are correct.
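Such a bounds check could be sketched with pandas’ between method. The DataFrame below is again a hypothetical stand-in for the CSV, with one deliberately out-of-range latitude:

```python
import pandas as pd

# Hypothetical stand-in for MoscowMcD.csv; 95.0 is deliberately invalid
df = pd.DataFrame({
    "Store": ["StoreA", "StoreB"],
    "lat": [55.75, 95.0],
    "lon": [37.61, 37.62],
})

# Valid latitudes lie in [-90, 90], valid longitudes in [-180, 180]
valid = df["lat"].between(-90, 90) & df["lon"].between(-180, 180)

# Show any rows whose coordinates fall outside the valid ranges
print(df[~valid])
```

Here only StoreB is flagged, because its latitude of 95.0 exceeds the valid range.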

But surely there must be a better way than programming all these checks manually one by one, right?

The pandas_profiling package

A very handy tool for taking a first look at our data is the pandas_profiling package. This library automatically generates a standardized univariate and multivariate report for data understanding. This comprehensive report makes it easy to quickly get an overview of the dataset and identify potential issues: it can help detect outliers, correlations, missing values, and other patterns in the data. Additionally, it provides a convenient way to visualize the data for further exploration. With the profile_report function, we can also add a title to the report. The pandas_profiling package includes a minimal configuration in which the most expensive calculations, such as correlations and interactions between variables, are turned off. Since correlations between the coordinates have no relevance here, we set minimal to True. The report can be output as an HTML file.

import pandas as pd
import pandas_profiling
import os
# Optionally, we can add a title with the profile_report function
profile = pd.read_csv('MoscowMcD.csv').profile_report(title='Spatial Data Quality and EDA', minimal=True)
# Save the profile as an HTML file in the current (usercode) directory
profile.to_file(output_file='DQ.html')
# Output the HTML file in the code widget output
os.system("cat DQ.html")

The pandas_profiling package is especially useful when the data to be analyzed includes more than just three columns and 13 rows. A quick look at the overview shows that there are no missing fields. Moreover, as expected, the lat and lon columns are of numerical data type, while the Store column is categorical. It’s worth mentioning, however, that pandas_profiling might not be the optimal tool for very large datasets.

In this lesson, we learned how to conduct exploratory data analysis and why checking data quality is so important.