Identifying Data Quality Issues

Explore techniques to identify common data quality issues such as duplicates and missing values in machine learning datasets. Understand how to use Python libraries like pandas and scikit-learn to profile and clean data effectively. This lesson helps you take foundational steps to ensure trustworthy inputs and improve model performance by applying practical data quality checks and thoughtful remediation strategies.

Data quality issues can silently undermine the reliability of any machine learning workflow. Before feature engineering or model training, practitioners must systematically identify and address problems such as duplicates and missing values. These issues, if left unchecked, can distort statistical summaries, introduce bias, and even cause data leakage. In production environments, robust data profiling is not optional. It is a foundational step that ensures downstream models receive trustworthy inputs. Python libraries such as pandas and scikit-learn offer efficient, reproducible tools for this purpose, making them essential for any applied ML pipeline.

Introduction to data quality in machine learning

Data engineering is the first stage in the ML life cycle, where raw data is ingested, transformed, and validated. At this point, data quality checks play a crucial role. Duplicates and missing values are two of the most common issues encountered in tabular datasets. If not detected early, they can propagate errors throughout the pipeline, affecting everything from exploratory data analysis (EDA) to model deployment.

Pandas provides a rich set of functions for data manipulation and profiling, while scikit-learn offers preprocessing utilities that help standardize data for modeling. By integrating these tools into your workflow, you can automate the detection of data quality issues and ensure consistency across experiments.

Practical tip: Automating data profiling checks in your data ingestion pipeline reduces manual errors and speeds up iteration cycles.

Next, let’s clarify what duplicates and missing values mean in the context of machine learning datasets.

Understanding duplicates and missing values

In tabular data, duplicates refer to repeated entries that can be exact row copies or partial duplicates where only some columns match. Missing values occur when data points are absent, represented by special markers such as NaN (Not a Number), None, or even empty strings.

These issues can appear in several forms:

  • Exact row duplicates: Entire rows that are identical across all columns

  • Partial duplicates: Rows that match on a subset of columns, often due to data entry errors or merging datasets

  • NaN/null values: Explicitly missing entries, typically encoded as np.nan or None

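  • Empty strings: Cells that appear blank but are technically present in the dataset; pandas does not treat them as missing, so .isnull() returns False for them unless they are first converted to NaN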
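The forms listed above can be told apart in a few lines of pandas. The snippet below is a minimal sketch rather than a prescribed method: the DataFrame `records` and its columns are invented for illustration, and treating whitespace-only strings as missing is an assumption that may not suit every dataset.

```python
import pandas as pd

# Hypothetical records; all names and values are invented for illustration
records = pd.DataFrame({
    "user_id": [1, 2, 2, 3, 3],
    "email": ["a@x.com", "b@x.com", "b@x.com", "c@x.com", "c@x.com"],
    "city": ["Oslo", "Rome", "Rome", "Lima", ""],
})

# Exact duplicates: every column matches an earlier row
exact_dupes = records.duplicated()

# Partial duplicates: rows that share user_id and email;
# keep=False flags every member of each matching group
partial_dupes = records.duplicated(subset=["user_id", "email"], keep=False)

# Empty strings are not counted as missing until they are converted to NaN
missing_before = int(records["city"].isna().sum())
is_blank = records["city"].str.strip().eq("")
city_normalized = records["city"].mask(is_blank)  # blank cells become NaN
missing_after = int(city_normalized.isna().sum())

print("Exact duplicate rows:", int(exact_dupes.sum()))
print("Rows in partial-duplicate groups:", int(partial_dupes.sum()))
print("Missing 'city' before/after normalization:", missing_before, missing_after)
```

Which columns define a partial duplicate is a domain decision; `user_id` and `email` are used here only as an example key.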

Failing to address these problems can lead to:

  • Distorted statistical summaries: Duplicates inflate counts and skew means or medians

  • Model bias: Missing values can cause models to learn from incomplete or unrepresentative data

  • Data leakage: Duplicates in both training and test sets can artificially boost performance metrics
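Two of these effects are easy to observe on a toy example. The sketch below assumes hypothetical `train` and `test` DataFrames with invented values; it compares a mean with and without duplicates, then looks for test rows whose feature values also occur in the training set.

```python
import pandas as pd

# Hypothetical train and test splits; values are invented for illustration
train = pd.DataFrame({
    "amount": [10.0, 25.5, 25.5, 99.0],
    "merchant": ["A", "B", "B", "C"],
    "is_fraud": [0, 1, 1, 0],
})
test = pd.DataFrame({
    "amount": [25.5, 40.0],
    "merchant": ["B", "D"],
    "is_fraud": [1, 0],
})

# Distorted summaries: the repeated row shifts the mean
mean_with_dupes = train["amount"].mean()
mean_without_dupes = train.drop_duplicates()["amount"].mean()
print("Mean amount with / without duplicates:", mean_with_dupes, mean_without_dupes)

# Leakage check: an inner join on the feature columns finds test rows that
# also occur in train. De-duplicating train first counts each test row once.
feature_cols = ["amount", "merchant"]
overlap = test.merge(
    train[feature_cols].drop_duplicates(), on=feature_cols, how="inner"
)
print("Test rows also present in train:", len(overlap))
```

Exact matching on float columns only catches identical values; near-duplicates would need a tolerance or a rounding step, which this sketch does not attempt.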

Attention: In a real-world fraud detection project, undetected duplicates in the training set led to a model that simply memorized repeated transactions, resulting in poor generalization to new data.

Understanding these risks sets the stage for systematic detection using Python tools.

A modern data pipeline highlighting profiling and data quality checks

Detecting data quality issues with pandas

Pandas streamlines the process of identifying duplicates and missing values through vectorized operations, making it suitable for both small and large datasets. Profiling your data with these methods is a best practice in the EDA phase of the ML life cycle.

Key pandas methods for data profiling

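  • .duplicated(): Returns a Boolean Series that marks each row repeating an earlier row; the first occurrence is not marked by default (controlled by the keep parameter), and the subset parameter limits the comparison to selected columns

  • .drop_duplicates(): Removes duplicate rows, keeping the first occurrence by default; keep='last' keeps the last occurrence, and keep=False drops every copy

  • .isnull(): Detects missing values (NaN, None, NaT), returning a Boolean DataFrame of the same shape; .isna() is an equivalent alias

  • .info(): Summarizes the DataFrame, showing the dtype and the count of non-null entries per column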

Note: Vectorized operations in pandas allow you to scan entire datasets for issues in a single line of code, which is much faster and less error-prone than looping through rows.
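The `keep` parameter of `.duplicated()` decides which copies get flagged, which matters when you count or inspect duplicates. A minimal sketch on an invented single-column DataFrame:

```python
import pandas as pd

# Hypothetical data with one repeated value
ids = pd.DataFrame({"user_id": [1, 2, 2, 3]})

first = ids.duplicated(keep="first")  # default: flags later occurrences only
last = ids.duplicated(keep="last")    # flags earlier occurrences only
every = ids.duplicated(keep=False)    # flags all members of a duplicate group

print(pd.DataFrame({"keep='first'": first, "keep='last'": last, "keep=False": every}))
```

Using `keep=False` is convenient for reviewing duplicate groups side by side before deciding which copy to retain.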

Best practices for profiling

  • Generate summary statistics: Use .describe() and .info() to get a quick overview of data completeness

  • Visualize missingness: Tools such as seaborn.heatmap or missingno.matrix can reveal patterns in missing data

  • Automate checks: Integrate these methods into reusable scripts or notebooks to ensure reproducibility

Practical tip: Always save profiling reports as part of your data versioning strategy. This enables traceability and easier debugging in production pipelines.
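One way to make such checks reusable and easy to store is a small function that returns a JSON-serializable summary. This is a sketch under stated assumptions: the function name `profile_quality`, the report keys, and the output file name in the final comment are all hypothetical choices, not part of pandas.

```python
import json

import pandas as pd


def profile_quality(df: pd.DataFrame) -> dict:
    """Return a small, JSON-serializable data quality summary (illustrative)."""
    return {
        "n_rows": int(len(df)),
        "n_duplicate_rows": int(df.duplicated().sum()),
        # Fraction of missing values per column, rounded for readability
        "missing_fraction": {
            col: round(float(frac), 4) for col, frac in df.isna().mean().items()
        },
    }


# Hypothetical sample data
sample = pd.DataFrame({
    "user_id": [1, 2, 2, 3],
    "age": [25, 30, 30, None],
})

report = profile_quality(sample)
report_json = json.dumps(report, indent=2)
print(report_json)

# In a pipeline, the string could be stored next to the dataset version, e.g.:
# pathlib.Path("profile_report.json").write_text(report_json)
```

Converting values to built-in `int` and `float` avoids serialization errors that NumPy scalar types can cause with `json.dumps`.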

Let’s see these methods in action with a hands-on code example.

Python
import pandas as pd

# Define a small sample DataFrame inline
df = pd.DataFrame({
    'user_id': [1, 2, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Bob', 'Charlie', 'Diana'],
    'age': [25, 30, 30, None, 22],  # None becomes NaN in a numeric column
    'income': [50000, 60000, 60000, 55000, None]
})

# Count exact duplicate rows (later copies of an earlier row)
num_duplicates = df.duplicated().sum()
print("Duplicate rows:", num_duplicates)

# Remove duplicates so they do not inflate the statistics below
df_clean = df.drop_duplicates()

# Count missing values per column
missing_per_col = df_clean.isnull().sum()
print("Missing values per column:")
print(missing_per_col)
print()

# Summary report: structure and statistics
print("DataFrame info:")
df_clean.info()
print()
print("Summary statistics:")
print(df_clean.describe(include='all'))

After running these checks, the next step is to interpret the results and decide on an appropriate remediation strategy.

Interpreting and prioritizing data issues

Interpreting data quality reports involves more than just counting issues. You must assess the impact of each problem and prioritize actions based on the dataset’s characteristics and modeling objectives.

Consider these factors when deciding how to handle duplicates and missing values:

  • Dataset size: Removing rows in a small dataset may lead to significant information loss, while in large datasets, it may have minimal impact

  • Feature importance: Missing values in critical features require more careful handling than those in less important columns

  • Downstream goals: If the model will be used for real-time predictions, imputation methods must be fast and reliable

Common trade-offs include:

  • Removing vs. aggregating duplicates: Removal is simple but may discard useful information. Aggregation (for example, averaging) can preserve data but risks introducing artifacts.

  • Deleting vs. imputing missing values: Deletion is low-risk when only a small fraction of rows is affected and the missingness is unrelated to the values themselves. Imputation preserves dataset size but can introduce bias if not done carefully.
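The two trade-offs above can be compared side by side on a toy dataset. The sketch below uses invented sensor readings; averaging as the aggregation rule and the median as the imputation value are assumptions chosen for illustration, not recommendations for every dataset.

```python
import pandas as pd

# Hypothetical readings; sensor 2 was recorded twice with different values
readings = pd.DataFrame({
    "sensor_id": [1, 2, 2, 3],
    "temperature": [20.0, 22.0, 24.0, None],
})

# Removal: keep only the first reading per sensor, discarding the second
removed = readings.drop_duplicates(subset="sensor_id", keep="first")

# Aggregation: average all readings per sensor instead of discarding any
aggregated = readings.groupby("sensor_id", as_index=False)["temperature"].mean()

# Deletion: drop sensors whose temperature is still missing
deleted = aggregated.dropna(subset=["temperature"])

# Imputation: fill the missing temperature with the column median
median_temp = aggregated["temperature"].median()
imputed = aggregated.fillna({"temperature": median_temp})

print("After removal:\n", removed)
print("After aggregation:\n", aggregated)
print("Rows after deletion:", len(deleted))
print("After median imputation:\n", imputed)
```

Removal and aggregation give different values for sensor 2, and deletion and imputation give different row counts, which is exactly the choice the list above describes.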

Attention: Overzealous removal of data can lead to underfitting or loss of rare but important patterns.

To help you compare strategies, the following table summarizes the main options.

Comparison of Common Data Preprocessing Methods

| Method | Impact on Data Integrity | Computational Cost | Typical Use Cases |
| --- | --- | --- | --- |
| Duplicate Removal | May discard valuable repeated data | Low | Large datasets, clear duplicates |
| Duplicate Aggregation | Preserves information, risk of artifacts | Medium | Time-series, grouped data |
| Missing Value Deletion | Reduces data size, risk of bias | Low | Small percentage missing, non-critical features |
| Mean/Median Imputation | Can introduce bias, preserves size | Low | Numerical features, moderate missingness |
| Model-Based Imputation | More accurate, risk of overfitting | High | Critical features, high missingness |

With these trade-offs in mind, you can make informed decisions that balance data integrity and modeling performance.
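The lesson names scikit-learn as a source of preprocessing utilities; one such utility is `SimpleImputer` from `sklearn.impute`, which covers the mean/median imputation row of the table. The following is a minimal sketch with invented arrays. Fitting on the training data only is a choice made here so that test rows do not influence the fill values.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical numeric features (columns: age, income); np.nan marks missing entries
X_train = np.array([
    [25.0, 50000.0],
    [30.0, 60000.0],
    [np.nan, 55000.0],
    [22.0, np.nan],
])
X_test = np.array([[np.nan, 58000.0]])

imputer = SimpleImputer(strategy="median")
X_train_imputed = imputer.fit_transform(X_train)  # medians learned from training rows
X_test_imputed = imputer.transform(X_test)        # same medians reused on test rows

print("Learned medians:", imputer.statistics_)
print("Imputed test row:", X_test_imputed[0])
```

Because the imputer stores its learned values in `statistics_`, the same fitted object can be applied to new data at prediction time.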

Conclusion

Systematic identification of duplicates and missing values is a non-negotiable step in any applied machine learning workflow. Pandas provides efficient, reproducible tools for profiling datasets and surfacing these issues early. However, the real value lies in interpreting the results thoughtfully and choosing remediation strategies that align with your modeling goals and data constraints. Robust data profiling not only prevents downstream errors but also lays the groundwork for advanced cleaning and feature engineering, which you will explore in the next lessons.

Note: Investing time in data profiling pays dividends throughout the ML life cycle, from EDA to deployment and monitoring.