Training Data Generation
Explore the process of generating and cleaning the training data that is critical to fraud detection systems. Understand how to gather diverse data sources, engineer effective features, and handle class imbalance to improve model performance and resilience against evolving fraud patterns.
Training data is the true engine of any fraud detection system. While model architectures often receive the spotlight, it is the quality, diversity, and structure of the training data that determine performance in production. Fraud detection is particularly reliant on robust data pipelines, as fraudulent events are rare, evolving, and often subtle in nature. If training data is messy, biased, or incomplete, even the most advanced model will struggle.
Effective training data generation requires collecting the right inputs, carefully cleaning and preprocessing them, engineering meaningful features, addressing class imbalance, and ensuring label accuracy. This lesson guides you through each of these steps, demonstrating how they come together to form a robust, continuously improving data pipeline.
With that context, let’s begin at the source of all learning: the data itself.
Data sources
The first step in generating training data is gathering diverse, representative signals from multiple sources. Fraud detection thrives on signal diversity; no single data source is sufficient on its own.
Transactional data forms the backbone of the dataset. This includes transaction amounts, timestamps, merchants, locations, payment methods, and account identifiers. Many fraud signals emerge not from individual transactions, but from changes in these patterns, such as unusual velocity or merchant category shifts.
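To make this concrete, here is a minimal sketch of deriving two such pattern-change signals with pandas: a trailing one-hour transaction velocity per account, and a flag for merchant-category shifts. The column names ("account_id", "timestamp", "amount", "merchant_category") are illustrative assumptions, not a fixed schema.

```python
import pandas as pd

# Hypothetical transaction log; schema is illustrative only.
txns = pd.DataFrame({
    "account_id": ["a1", "a1", "a1", "a2"],
    "timestamp": pd.to_datetime([
        "2024-05-01 10:00", "2024-05-01 10:05",
        "2024-05-01 10:07", "2024-05-01 12:00",
    ]),
    "amount": [25.0, 30.0, 450.0, 12.5],
    "merchant_category": ["grocery", "grocery", "electronics", "fuel"],
}).sort_values(["account_id", "timestamp"])

# Velocity: number of transactions per account in a trailing 1-hour window.
counts = (
    txns.set_index("timestamp")
        .groupby("account_id")["amount"]
        .rolling("1h")
        .count()
)
txns["txn_count_1h"] = counts.to_numpy()

# Merchant-category shift: True when the category differs from the
# account's previous transaction (first transaction is not flagged).
prev_cat = txns.groupby("account_id")["merchant_category"].shift()
txns["category_shift"] = prev_cat.notna() & (prev_cat != txns["merchant_category"])
```

In this toy log, the third transaction for account "a1" would show both a velocity of three transactions in the past hour and a category shift, exactly the kind of combined pattern change the paragraph above describes.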
User behavioral data adds critical context. Device fingerprints, IP addresses, login geographies, session timing, and interaction patterns often expose fraud that transactional data alone cannot. Behavioral signals are particularly valuable for detecting account takeovers and automated attacks.
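A simple way to operationalize such behavioral context is to track, per account, whether a login arrives from a previously unseen device or country. The sketch below is a simplified illustration; the event fields ("account", "device", "country") are assumptions, and a production system would persist this state rather than hold it in memory.

```python
def behavioral_flags(events):
    """Given a chronological list of login events (dicts with 'account',
    'device', 'country'), flag first-seen devices and countries per account."""
    seen_devices, seen_countries = {}, {}
    flags = []
    for e in events:
        devs = seen_devices.setdefault(e["account"], set())
        ctys = seen_countries.setdefault(e["account"], set())
        flags.append({
            "new_device": e["device"] not in devs,
            "new_country": e["country"] not in ctys,
        })
        devs.add(e["device"])
        ctys.add(e["country"])
    return flags

events = [
    {"account": "a1", "device": "d1", "country": "US"},
    {"account": "a1", "device": "d1", "country": "US"},
    {"account": "a1", "device": "d2", "country": "BR"},
]
flags = behavioral_flags(events)
```

The third event, a new device from a new country on an established account, is precisely the account-takeover pattern that transactional data alone would miss.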
External data sources, such as blacklists, sanctions lists, known fraud rings, or third-party risk scores, further enrich the dataset with information unavailable from first-party logs.
In cases where fraud is extremely rare, teams may introduce synthetic examples to supplement training. However, these must be used cautiously. Poorly designed synthetic data can introduce unrealistic patterns that cause models to overfit to artificial signals rather than real fraud behavior.
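One common family of techniques interpolates between real minority-class examples rather than inventing points from scratch, which keeps synthetic samples close to observed fraud behavior. The following is a deliberately simplified SMOTE-style sketch, not a production implementation; real pipelines typically use a vetted library (e.g. imbalanced-learn) plus careful validation against the caveats above.

```python
import numpy as np

def oversample_minority(X_minority, n_synthetic, k=2, seed=0):
    """Generate synthetic minority points by interpolating between a random
    minority sample and one of its k nearest minority neighbors."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X_minority, dtype=float)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(len(X))
        # Distances from X[i] to all other minority points.
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf  # exclude the point itself
        j = rng.choice(np.argsort(d)[:k])
        lam = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(X[i] + lam * (X[j] - X[i]))
    return np.array(synthetic)

# Toy minority class: three fraud examples in a 2-D feature space.
X_fraud = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
X_new = oversample_minority(X_fraud, n_synthetic=5)
```

Because each synthetic point lies on a segment between two real fraud examples, the new samples stay within the observed fraud region, which is exactly the property that guards against the unrealistic-pattern overfitting described above.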
Finally, timeliness matters. Features used for real-time fraud detection must be derived from data that is fresh and consistently available at prediction time; stale signals can silently degrade model performance.
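A related discipline is point-in-time correctness: training features should be computed only from events strictly before the prediction timestamp, so the model never learns from information it would not have at serving time. A minimal sketch, with illustrative field names:

```python
from datetime import datetime

def features_as_of(events, prediction_time):
    """Compute account features using only events strictly before
    prediction_time, preventing future information from leaking in."""
    past = [e for e in events if e["timestamp"] < prediction_time]
    return {
        "txn_count": len(past),
        "total_amount": sum(e["amount"] for e in past),
    }

events = [
    {"timestamp": datetime(2024, 5, 1, 9, 0), "amount": 20.0},
    {"timestamp": datetime(2024, 5, 1, 11, 0), "amount": 500.0},
]
feats = features_as_of(events, datetime(2024, 5, 1, 10, 0))
```

The 11:00 transaction is excluded from a 10:00 prediction; a pipeline that accidentally included it would train on leaked future data and then silently underperform in production.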
A user suddenly logs in from a country they have never visited before and initiates multiple small transactions. What combination of data sources becomes critical?
Preprocessing and cleaning
Raw data is rarely ready for training. Preprocessing ensures the model learns from meaningful patterns rather than noise, but in fraud detection, this step requires extra care.
Typical preprocessing steps include handling missing values, normalizing numeric fields, encoding categorical variables, and removing corrupted records. However, what looks like an anomaly or inconsistency may actually be a strong fraud signal. Aggressive cleaning can unintentionally erase rare but important patterns.
This creates a key trade-off:
Over-cleaning risks removing genuine fraud signals.
Under-cleaning allows noise and errors to confuse the model.
Special attention is required for missing or inconsistent identifiers such as device IDs or user IDs. Instead of blindly dropping these records, it is often better to encode “missingness” explicitly, as missing identifiers themselves can be indicative of suspicious behavior.
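Encoding missingness explicitly can be as simple as adding an indicator column and a sentinel value instead of dropping the rows. A brief pandas sketch, with illustrative column names:

```python
import pandas as pd

# Hypothetical records where some device IDs are absent.
df = pd.DataFrame({
    "txn_id": [1, 2, 3],
    "device_id": ["d42", None, "d17"],
})

# Keep the rows: record the missingness as its own binary feature,
# then fill the original column with a sentinel category.
df["device_id_missing"] = df["device_id"].isna().astype(int)
df["device_id"] = df["device_id"].fillna("MISSING")
```

The model can now learn whether a missing device ID is itself predictive of fraud, a signal that would be destroyed by simply dropping those records.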
During preprocessing, you discover hundreds of transactions missing device IDs. Should they be removed?
Feature engineering
Feature engineering is often the highest-leverage ...