Data Cleaning Techniques
Explore key data cleaning techniques to prepare messy datasets for machine learning. Learn how to handle missing values, duplicates, structural errors, and outliers. Understand automation steps in data cleaning workflows to ensure high data quality and reproducibility in real-world applications.
No analysis or model can succeed without clean data. In this lesson, we’ll learn how to structure your thinking and solutions for common interview questions around cleaning, transforming, and validating data in both real-time and offline workflows. Let’s get started.
Handling messy data
You’re given a raw dataset filled with inconsistencies—missing values, outliers, structural errors, and duplications. An interviewer asks how you’d clean and prepare this data for analysis while preserving data quality.
What is your process for dealing with messy data?
This question is frequently asked at LinkedIn, Spotify, and Stripe for analytics or data science roles, and especially at companies working with user-generated data (e.g., edtech platforms, surveys, and IoT).
Sample answer
Let’s explore a sample answer that touches upon the key dimensions of “messy” data and emphasizes a multi-step process. Ideally, your approach also highlights attributes such as data integrity and quality, and mentions specific techniques that you use for the types of data you are experienced in (tabular, image, etc.).
Missing data:
Start by identifying any missing values within the dataset using techniques like NA detection.
Then apply data imputation techniques such as mean/median/mode imputation, or k-nearest neighbors imputation to fill in the missing values with plausible estimates.
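For example, a minimal sketch of missing-value detection and imputation with pandas and scikit-learn might look like the following; the column names, data, and the choice between mean and k-nearest neighbors imputation are assumptions for illustration.

import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer

# Hypothetical numeric dataset with missing values
df = pd.DataFrame({'age': [25, None, 40, 31], 'income': [50000, 62000, None, 58000]})

# NA detection: count missing values per column
print(df.isnull().sum())

# Mean imputation
mean_imputed = pd.DataFrame(SimpleImputer(strategy='mean').fit_transform(df), columns=df.columns)

# k-nearest neighbors imputation as an alternative
knn_imputed = pd.DataFrame(KNNImputer(n_neighbors=2).fit_transform(df), columns=df.columns)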
Duplicate data:
Use duplicate detection algorithms to identify any duplicate records.
Then apply fuzzy matching or record linkage to catch near-duplicate records, check how well these techniques eliminate the duplicates while preserving data integrity, and move to other methods if they can't accurately catch them all.
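As a rough sketch, exact duplicates can be dropped with pandas and near-duplicates flagged with simple fuzzy matching from the standard library; the column, example values, and similarity threshold here are assumptions.

import pandas as pd
from difflib import SequenceMatcher

df = pd.DataFrame({'name': ['Acme Corp', 'Acme Corporation', 'Globex', 'Globex']})

# Exact duplicate removal
df = df.drop_duplicates()

# Fuzzy matching: flag pairs of names that are highly similar
def similar(a, b, threshold=0.7):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

names = df['name'].tolist()
near_duplicates = [(a, b) for i, a in enumerate(names) for b in names[i + 1:] if similar(a, b)]
print(near_duplicates)  # Candidates for manual review or record linkage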
Data from different sources:
Perform data integration using ETL (extract, transform, load) processes to consolidate data from various sources.
To ensure consistency, standardize the data formats using schema matching and data mapping techniques.
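A minimal sketch of that consolidation step, assuming two small sources whose column names need to be mapped to a common schema (the names and formats here are illustrative):

import pandas as pd

# Two hypothetical sources with different schemas
crm = pd.DataFrame({'customer_id': [1, 2], 'signup_date': ['2024-01-05', '2024-02-10']})
billing = pd.DataFrame({'cust_id': [3], 'created_at': ['03/15/2024']})

# Schema mapping: rename columns to the common schema
billing = billing.rename(columns={'cust_id': 'customer_id', 'created_at': 'signup_date'})

# Standardize formats per source, then consolidate
for source in (crm, billing):
    source['signup_date'] = pd.to_datetime(source['signup_date'])

combined = pd.concat([crm, billing], ignore_index=True)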
Structural errors:
Meticulously check for structural errors, such as incorrect data formats, using data profiling tools.
Then apply data transformation techniques, including parsing and reformatting, to rectify these errors and ensure data consistency.
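As one illustration, here is a sketch of profiling a column and reformatting it; the column and the kinds of errors shown are assumptions.

import pandas as pd

# Hypothetical column where numeric values arrived as inconsistent strings
df = pd.DataFrame({'price': ['19.99', '$24.50', 'N/A', '12']})

# Profiling: inspect dtypes and unique values to spot structural problems
print(df.dtypes)
print(df['price'].unique())

# Parsing and reformatting: strip stray characters, coerce unparseable entries to NaN
df['price'] = pd.to_numeric(df['price'].str.replace('$', '', regex=False), errors='coerce')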
Outliers:
Identify outliers using statistical methods like z-score and IQR (interquartile range) analysis, or visualizations such as box plots.
Depending on the context, address these outliers with techniques like capping, transformation, or even exclusion if they would skew the analysis.
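Here is a minimal sketch of both detection methods plus capping on an assumed numeric column, using the common |z| > 3 and 1.5 × IQR cutoffs:

import pandas as pd

df = pd.DataFrame({'value': [10, 12, 11, 13, 9, 120]})

# Z-score method: flag points more than 3 standard deviations from the mean
z = (df['value'] - df['value'].mean()) / df['value'].std()
z_outliers = df[z.abs() > 3]

# IQR method: flag points outside 1.5 * IQR beyond the quartiles
q1, q3 = df['value'].quantile([0.25, 0.75])
iqr = q3 - q1
iqr_outliers = df[(df['value'] < q1 - 1.5 * iqr) | (df['value'] > q3 + 1.5 * iqr)]

# Example mitigation: cap (winsorize) rather than drop
df['value_capped'] = df['value'].clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)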
You can also apply visualization techniques, which are useful for identifying patterns, anomalies, and potential issues related to missing data, including the following (a short code sketch follows this list):
Heatmaps to highlight missing data patterns across a dataset.
Bar plots to visualize the count of missing values in each column and identify which features have the most missing data.
Box plots to reveal outliers and anomalies in the data, and identify unusual distributions that may indicate data quality issues.
Histograms to show the distribution of data, including missing values. They help in understanding the overall data distribution while spotting any irregularities.
Scatter plots to visualize relationships between variables and identify any anomalies or patterns in the data.
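The following hedged sketch shows how a few of these plots might be produced with seaborn and matplotlib; the small DataFrame and its columns are assumptions for illustration.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Small illustrative dataset with some missing values and an outlier
df = pd.DataFrame({'value': [10, 12, None, 13, 120], 'label': ['a', None, 'b', 'b', 'a']})

# Heatmap to highlight missing-data patterns across the dataset
sns.heatmap(df.isnull(), cbar=False)
plt.title('Missing value pattern')
plt.show()

# Bar plot of missing-value counts per column
df.isnull().sum().plot(kind='bar', title='Missing values per column')
plt.show()

# Box plot to reveal outliers in a numeric column
sns.boxplot(x=df['value'])
plt.show()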
The figure below displays the sample workflow of the described approach, where steps can be rearranged or modified to meet the use case (e.g., removing outliers during the duplicate identification and removal stage if the use case/workflow permits).
This approach to handling messy data involves thorough identification, correction, and validation processes using specific techniques to ensure the dataset is robust, accurate, and ready for analysis.
Steps to automate data cleaning
Let’s explore a variation of this question, focused on data automation. You’re asked how you would automate common data cleaning tasks in a production pipeline. What steps offer the most value, and how would you implement them?
Frequently asked by: Any team working with pipelines, real-time dashboards, or automated reporting.
Sample answer
Automated data cleaning ensures high-quality data, which is crucial for reliable models. It saves time by reducing manual tasks, ensures consistent cleaning steps for reproducibility, and scales to handle large data volumes. This is essential for real-time processing in applications like fraud detection or IoT monitoring. Understanding these principles helps data scientists build efficient, robust solutions for real-world data challenges, which is why data scientists and machine learning practitioners can expect to come across questions on these topics.
Here’s a sample answer that allows you to demonstrate deeper technical proficiency with an interviewer on this topic.
The steps in data cleaning that typically provide the most value when automated are:
Handling missing data: This means automatically filling in or removing missing values to ensure dataset completeness.
Data transformation: This includes standardizing formats, scaling, and normalizing data for consistency.
Outlier detection and removal: This includes identifying and addressing outliers to maintain data integrity.
Data validation: This includes ensuring data meets predefined criteria and correcting errors.
Feature engineering: This includes creating new features from existing data to enhance model performance.
Make sure to mention that you will incorporate these steps throughout the data pipeline and, if applicable, the machine learning workflow. For example, handling missing data and data transformation can occur before the data is officially inserted into a production database. Automated outlier detection and data validation checks can happen at periodic stages across databases, data lakes, and other data stores, depending on the importance of the data.
Feature engineering can be incorporated with an engineer’s review to ensure the constructed features make sense for the problem statement.
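As one illustration of what such a periodic check might look like, here is a hedged sketch of an automated validation step written with plain pandas; the rules and column names are assumptions, and in practice a dedicated validation library (e.g., Great Expectations or pandera) could play this role.

import pandas as pd

def validate_batch(df):
    # Return a list of data-quality violations for a batch of records
    errors = []
    # Completeness: required columns must not contain nulls
    for col in ('user_id', 'event_time'):
        if df[col].isnull().any():
            errors.append(f'Nulls found in required column {col}')
    # Validity: event_time must parse as a timestamp
    if pd.to_datetime(df['event_time'], errors='coerce').isnull().any():
        errors.append('Unparseable timestamps in event_time')
    # Uniqueness: no duplicate primary keys
    if df['user_id'].duplicated().any():
        errors.append('Duplicate user_id values')
    return errors

# In a pipeline, run this on each batch and alert or quarantine on violations
batch = pd.DataFrame({'user_id': [1, 2, 2], 'event_time': ['2024-05-01', 'not a date', '2024-05-02']})
print(validate_batch(batch))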
Automated data cleaning in pseudocode
Design a pseudocode workflow to automate data cleaning in a real-world ETL pipeline using Python and standard libraries.
In pseudocode, could you show me how you would implement a pipeline for automated data cleaning? What libraries would you use?
Frequently asked in ETL/data engineering interviews at Shopify, Uber, and Netflix, especially during any discussion around pipeline design or ML data preprocessing.
Sample answer
You may choose to answer this question in a couple of ways. A great approach is to address how ETL (extract, transform, load) processes are applied to automate data cleaning.
Your answer can cover:
Phase 1: Extract
Data collection: Gather data from various sources such as databases, APIs, or flat files.
Example: Use Python libraries like pandas to read data from CSV files or requests to fetch data from APIs.
Phase 2: Transform
Data cleaning: Apply automated cleaning steps such as handling missing data, data transformation, outlier detection, and data validation.
Example: Use pandas for data manipulation, scikit-learn for scaling and normalization, and custom functions for outlier detection.
Phase 3: Load
Data storage: Store the cleaned data into a target database or data warehouse for further analysis.
Example: Use SQLAlchemy to load data into an SQL database or pandas to write data to a new CSV file.
import pandas as pd
from sqlalchemy import create_engine

# Extract: read raw data into a pandas DataFrame
def extract_data(file_path):
    return pd.read_csv(file_path)

# Transform: apply automated cleaning steps
def transform_data(df):
    # Handle missing data with forward fill
    df = df.ffill()
    # Data transformation (scaling a numeric column)
    df['scaled_column'] = (df['column'] - df['column'].mean()) / df['column'].std()
    # Outlier detection and removal using the IQR rule
    Q1 = df['column'].quantile(0.25)
    Q3 = df['column'].quantile(0.75)
    IQR = Q3 - Q1
    df = df[~((df['column'] < (Q1 - 1.5 * IQR)) | (df['column'] > (Q3 + 1.5 * IQR)))]
    return df

# Load: write the cleaned data to a target database
def load_data(df, db_uri, table_name):
    engine = create_engine(db_uri)
    df.to_sql(table_name, engine, if_exists='replace', index=False)

# Example usage
file_path = 'data.csv'
db_uri = 'sqlite:///cleaned_data.db'
table_name = 'cleaned_table'

# ETL process
data = extract_data(file_path)
cleaned_data = transform_data(data)
load_data(cleaned_data, db_uri, table_name)
This example demonstrates a simple ETL pipeline where we:
Extract data by reading data from a CSV file.
Transform data by cleaning it—filling missing values, scaling a column, and removing outliers.
Load data and store the cleaned data in an SQLite database.
Data quality metrics for speech detection
Imagine you’re part of a team developing a machine learning algorithm that detects when people are speaking. Ensuring high data quality is crucial for the model’s accuracy and reliability. What top data quality metrics have you seen in practice for this use case?
Sample answer
Your answer should highlight your experience with data quality metrics and how they evaluate the integrity and usefulness of data.
Some essential metrics that you will most likely be expected to cover in your answer include:
Completeness: This indicates the proportion of values in a dataset that aren't missing.
Accuracy: This measures how many values in a dataset are correct.
Consistency: This reflects how uniform the data is across various sources and formats.
Validity: This assesses how well data adheres to established rules and constraints.
Uniqueness: This measures how distinct each record in a dataset is.
Timeliness: This gauges how up-to-date and current the data is.
Let’s explore how you may connect these metrics to a project you’ve contributed to involving data analysis and machine learning. Here, we’re considering a speech detection project, where a machine learning algorithm was developed to detect when people are speaking.
Completeness: To check for this metric, you would ensure your dataset contains all necessary audio samples, including various accents, speech patterns, and environmental noises, with minimal missing data.
Accuracy: Here, you verify that your machine learning speech detection model correctly identifies when people are speaking and accurately transcribes the spoken words.
Consistency: You need to confirm that the speech detection data is consistent across different recordings, sources, and formats, ensuring uniformity in how speech is detected and transcribed.
Validity: You also need to ensure that all entries in the dataset conform to predefined rules, such as valid timestamps, accurate speaker identification, and correct speech-to-text transcription formats.
Uniqueness: For this, you need to ensure that each audio segment and its corresponding transcription is unique, avoiding duplicates that could skew the analysis or add unnecessary redundancy.
Timeliness: For this metric, you would confirm that your speech detection data is current, reflecting recent recordings and updated language models, so it remains relevant for your project.
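To ground a few of these metrics, here is a minimal sketch of how they could be computed over a hypothetical table of audio-segment metadata; the column names and rules are assumptions for illustration.

import pandas as pd

# Hypothetical metadata for labeled audio segments
segments = pd.DataFrame({
    'segment_id': ['s1', 's2', 's2', 's4'],
    'speaker_id': ['a', 'b', None, 'c'],
    'start_ts': [0.0, 1.5, 1.5, -2.0],  # seconds; a negative start is invalid
    'transcript': ['hello', 'hi there', 'hi there', ''],
})

# Completeness: share of non-missing cells across the table
completeness = 1 - segments.isnull().to_numpy().mean()

# Validity: share of rows whose timestamps satisfy the rule start_ts >= 0
validity = (segments['start_ts'] >= 0).mean()

# Uniqueness: share of segment IDs that are not duplicated
uniqueness = 1 - segments['segment_id'].duplicated().mean()

print(f'completeness={completeness:.2f}, validity={validity:.2f}, uniqueness={uniqueness:.2f}')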
Data cleaning techniques
You’re asked to walk through three to four core data cleaning techniques and provide pseudocode in Python for each.
Sample answer
Here are a few data cleaning techniques that you’ll want to cover in your answer:
Data standardization: Ensure that the data is converted into a uniform format, such as standardizing date formats, which is particularly important if you have data collected in regions where date formats can differ.
Data validation: Verify the data against predefined rules and constraints, for example, ensuring postal codes are valid. This point allows you to elaborate on more detailed possibilities, for example, validating email addresses against records in the company system if feasible.
Data imputation: Substitute missing values with estimates derived from other data points, such as the average or median of the data.
Data transformation: Modify data to make it more suitable for analysis, such as converting categorical data to numerical form, or breaking down complex data into simpler categories/bins to understand relationships between features more easily.
Below is some Python pseudocode for each technique, for the purposes of demonstrating to an interviewer how you may implement them.
# Data Standardization
# Convert all date formats to a uniform format (e.g., YYYY-MM-DD)
def standardize_dates(data):
    for record in data:
        record['date'] = convert_to_standard_format(record['date'])
    return data

# Data Validation
# Verify data against predefined rules (e.g., valid postal codes)
def validate_data(data):
    for record in data:
        if not is_valid_postal_code(record['postal_code']):
            raise ValueError("Invalid postal code found")
        # Validate email addresses if needed
        if not is_valid_email(record['email']):
            raise ValueError("Invalid email address found")
    return data

# Data Imputation
# Fill missing values with the average value of the respective column
def impute_missing_values(data):
    for column in data.columns:
        if data[column].isnull().sum() > 0:
            average_value = data[column].mean()
            data[column] = data[column].fillna(average_value)
    return data

# Data Transformation
# Convert categorical data to numerical form to agree with a predefined schema
def transform_data(data):
    for column in data.columns:
        if data[column].dtype == 'object':  # Assuming 'object' dtype for categorical data
            data[column] = convert_categorical_to_numerical(data[column])
    return data

# Main function to process data and apply the data cleaning techniques in sequence
def process_data(data):
    data = standardize_dates(data)
    data = validate_data(data)
    data = impute_missing_values(data)
    data = transform_data(data)
    return data