Data Sources and Collection Strategies at Scale

Explore how to develop robust data strategies for machine learning systems at scale. Understand the trade-offs between implicit and explicit signals, design event-driven data collection pipelines using technologies like Kafka and Kinesis, and incorporate external data sources while managing legal and ethical constraints. This lesson prepares you to confidently discuss data sourcing and infrastructure in ML system design interviews, ensuring your models are built on high-quality, compliant data.

We'll cover the following...

- Implicit vs. explicit signals
- Event-driven logging architectures
  - The canonical ingestion pipeline
    - Why Kafka dominates
    - The schema registry problem
- External data and third-party sources
- Designing your data strategy for interviews

You’re in a system design interview at a MAANG company. The prompt is straightforward: design a recommendation system for a streaming platform serving hundreds of millions of users. You start sketching a model architecture, but the interviewer stops you. “Before we talk about models, walk me through your data strategy. What signals are you collecting, how are you collecting them, and what does that cost you?”

This moment separates prepared candidates from everyone else. The model you can train is fundamentally bounded by the data you can collect. A sophisticated transformer-based ranker is useless if it trains on noisy, incomplete, or legally questionable signals. Data sourcing is not a preliminary step you rush through. It is the design decision that constrains every downstream choice, from your label space and objective function to your serving latency and compliance posture.

This lesson covers the four pillars of ML data strategy. You will distinguish implicit and explicit signals and reason about their trade-offs, design event-driven collection architectures with systems such as Kafka or Kinesis, evaluate when to use external data sources, and account for the legal, privacy, and ethical constraints that govern data collection at scale. The next lesson covers the cold start problem: what happens when these data sources are missing or sparse. The strategies in this lesson give you the baseline for that discussion.

Implicit vs. explicit signals

Every ML system consumes signals, and those signals fall into two broad categories based on how they are generated.

Implicit signals are behavioral data passively captured from user interactions. The user does not intend to provide feedback; the system observes what they do. Think of it like a store owner watching which aisles customers linger in vs. which they skip. No one fills out a form, but the behavior is informative. Examples include clicks, dwell time, scroll depth, purchases, skips, and add-to-cart events.

Explicit signals are intentional user feedback. The user deliberately communicates a preference. Examples include star ratings, thumbs-up or thumbs-down, written reviews, and survey responses.
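As a rough illustration, the two categories can be modeled as a single event record whose type determines whether it counts as implicit or explicit. This is a minimal sketch, not a prescribed schema: the class name `InteractionEvent`, the field names, and the event-type strings are all hypothetical, chosen only to mirror the examples listed above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class SignalKind(Enum):
    IMPLICIT = "implicit"  # observed behavior; the user did not intend to give feedback
    EXPLICIT = "explicit"  # deliberate feedback from the user


# Hypothetical event-type names that mirror the examples in the text.
IMPLICIT_EVENTS = {"click", "dwell", "scroll", "purchase", "skip", "add_to_cart"}
EXPLICIT_EVENTS = {"star_rating", "thumbs_up", "thumbs_down", "review", "survey_response"}


@dataclass(frozen=True)
class InteractionEvent:
    user_id: str
    item_id: str
    event_type: str
    timestamp_ms: int
    value: Optional[float] = None  # e.g. dwell seconds or number of stars

    @property
    def kind(self) -> SignalKind:
        """Classify the event by how the signal was generated."""
        if self.event_type in IMPLICIT_EVENTS:
            return SignalKind.IMPLICIT
        if self.event_type in EXPLICIT_EVENTS:
            return SignalKind.EXPLICIT
        raise ValueError(f"unknown event type: {self.event_type!r}")


# Usage: one passive observation and one deliberate rating for the same item.
events = [
    InteractionEvent("u1", "v42", "dwell", 1_700_000_000_000, 37.5),
    InteractionEvent("u1", "v42", "star_rating", 1_700_000_060_000, 5.0),
]
kinds = [e.kind for e in events]
```

Keeping both kinds in one record type is only one possible design choice; it lets downstream code filter or weight events by `kind` without maintaining separate tables.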

The fundamental trade-off between these two categories governs how you design labels for your ML models.

- Volume vs. sparsity: Every user who visits a page generates implicit signals, but fewer than 1% of YouTube views receive an explicit rating. Implicit signals give you coverage; explicit signals give you precision.
- Noise vs. clarity: A long dwell time on a video could mean deep engagement, or it could mean the user fell asleep. A five-star rating unambiguously communicates satisfaction. Implicit signals are abundant but ambiguous; ...
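One way to act on this trade-off when designing labels is to prefer an explicit rating whenever it exists and fall back to a down-weighted implicit signal otherwise. The sketch below assumes this approach; the thresholds, the weights, and the function names `derive_label` and `explicit_coverage` are illustrative assumptions, not values or APIs from this lesson.

```python
from typing import List, Optional, Tuple

# Illustrative constants (assumptions, not values from the lesson).
MIN_DWELL_SECONDS = 30.0  # dwell at or above this counts as a weak positive
POSITIVE_RATING = 4       # star ratings at or above this count as positive
IMPLICIT_WEIGHT = 0.3     # lower trust: dwell time is ambiguous
EXPLICIT_WEIGHT = 1.0     # full trust: the user stated a preference

Row = Tuple[Optional[float], Optional[int]]  # (dwell_seconds, star_rating)


def derive_label(dwell_seconds: Optional[float],
                 star_rating: Optional[int]) -> Optional[Tuple[int, float]]:
    """Return (label, sample_weight), or None if no usable signal exists."""
    if star_rating is not None:
        # The explicit signal wins, even when dwell time suggests otherwise.
        return (1 if star_rating >= POSITIVE_RATING else 0, EXPLICIT_WEIGHT)
    if dwell_seconds is not None:
        return (1 if dwell_seconds >= MIN_DWELL_SECONDS else 0, IMPLICIT_WEIGHT)
    return None


def explicit_coverage(rows: List[Row]) -> float:
    """Fraction of rows that carry an explicit rating."""
    if not rows:
        return 0.0
    return sum(1 for _, rating in rows if rating is not None) / len(rows)


# Usage: the third row has a long dwell but a 2-star rating, so it is labeled negative.
rows: List[Row] = [(45.0, None), (5.0, None), (120.0, 2), (None, None)]
labels = [derive_label(dwell, rating) for dwell, rating in rows]
coverage = explicit_coverage(rows)
```

The third row is the interesting case: the implicit signal alone would have produced a positive label, while the explicit rating reveals dissatisfaction. The sample weight gives a training loop a simple handle for trusting abundant implicit labels less than sparse explicit ones.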
