Exercise: Continuing Verification of Data Integrity
Explore techniques to verify data integrity by locating duplicate IDs using Boolean masks and pandas methods. Learn to identify invalid rows with all-zero features and clean your dataset by removing them. Gain practical skills to prepare data accurately for further analysis and modeling.
Data integrity verification using Boolean masks
In this exercise, we will use our knowledge of Boolean arrays to examine some of the duplicate IDs we discovered. In Exercise: Verifying Basic Data Integrity, we learned that no ID appears more than twice. We can use this knowledge to locate the duplicate IDs and examine them, and then take action to remove rows of dubious quality from the dataset. Perform the following steps to complete the exercise:
- Continuing where we left off in the previous exercise, we need to get the locations in the `id_counts` Series where the count is 2, in order to locate the duplicates. First, we load the data and get the value counts of the IDs to bring us back to where we left off in the last exercise lesson. Then we create a Boolean mask locating the duplicated IDs, store it in a variable called `dupe_mask`, and display its first five elements. Use the following commands:

```python
import pandas as pd

df = pd.read_excel('default_of_credit_card_clients'\
                   '__courseware_version_1_21_19.xls')
id_counts = df['ID'].value_counts()
id_counts.head()
dupe_mask = id_counts == 2
dupe_mask[0:5]
```

You will obtain the following output (note the ordering of IDs may be different in your ...
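The overall workflow of this exercise, locating duplicated IDs with a Boolean mask and then dropping invalid all-zero rows, can be sketched end to end on a small made-up DataFrame. This is only an illustration: the tiny `df` below, its column names (`LIMIT_BAL`, `PAY_1`), and the helper variables (`dupe_ids`, `zero_mask`, `df_clean`) are stand-ins for the real credit-card dataset and are not part of the exercise itself.

```python
import pandas as pd

# Hypothetical miniature dataset standing in for the credit-card file:
# IDs 'A' and 'B' each appear twice, and one copy of each has all-zero features.
df = pd.DataFrame({
    'ID': ['A', 'A', 'B', 'B', 'C'],
    'LIMIT_BAL': [50000, 0, 20000, 0, 30000],
    'PAY_1': [1, 0, -1, 0, 2],
})

# Boolean mask over the value counts: True where an ID appears exactly twice
id_counts = df['ID'].value_counts()
dupe_mask = id_counts == 2
dupe_ids = list(id_counts.index[dupe_mask])  # the duplicated IDs themselves

# Examine only the rows carrying a duplicated ID
dupes = df.loc[df['ID'].isin(dupe_ids), :]

# Rows whose feature columns are all zero are invalid; locate and drop them
feature_cols = df.columns.drop('ID')
zero_mask = (df[feature_cols] == 0).all(axis=1)
df_clean = df.loc[~zero_mask, :].copy()

print(sorted(dupe_ids))  # ['A', 'B']
print(df_clean.shape)    # (3, 3)
```

Indexing the `value_counts()` result with a Boolean Series, rather than looping over it, is the same masking pattern the exercise uses on the real data; it keeps only the index entries (the IDs) where the condition holds.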