Data Sources and Collection Strategies at Scale
Explore how to develop robust data strategies for machine learning systems at scale. Understand the trade-offs between implicit and explicit signals, design event-driven data collection pipelines using technologies like Kafka and Kinesis, and incorporate external data sources while managing legal and ethical constraints. This lesson prepares you to confidently discuss data sourcing and infrastructure in ML system design interviews, ensuring your models are built on high-quality, compliant data.
You’re in a system design interview at a MAANG company. The prompt is straightforward: design a recommendation system for a streaming platform serving hundreds of millions of users. You start sketching a model architecture, but the interviewer stops you. “Before we talk about models, walk me through your data strategy. What signals are you collecting, how are you collecting them, and what does that cost you?” This moment separates prepared candidates from everyone else. The model you can train is fundamentally bounded by the data you can collect. A sophisticated transformer-based ranker is useless if it trains on noisy, incomplete, or legally questionable signals. Data sourcing is not a preliminary step you rush through. It is the design decision that constrains every downstream choice, from your label space and objective function to your serving latency and compliance posture.
This lesson covers the four pillars of ML data strategy. You will distinguish implicit and explicit signals and reason about their trade-offs, design event-driven collection architectures with systems such as Kafka or Kinesis, evaluate when to use external data sources, and account for the legal, privacy, and ethical constraints that govern data collection at scale. The next lesson covers the cold start problem: what happens when these data sources are missing or sparse. The strategies in this lesson give you the baseline for that discussion.
Implicit vs. explicit signals
Every ML system consumes signals, and those signals fall into two broad categories based on how they are generated.
Implicit signals are behavioral data passively captured from user interactions. The user does not intend to provide feedback; the system observes what they do. Think of it like a store owner watching which aisles customers linger in vs. which they skip. No one fills out a form, but the behavior is informative. Examples include clicks, dwell time, scroll depth, purchases, skips, and add-to-cart events.
Explicit signals are intentional user feedback. The user deliberately communicates a preference. Examples include star ratings, thumbs-up or thumbs-down, written reviews, and survey responses.
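To make the distinction concrete, here is a minimal sketch of how a collection pipeline might tag each interaction event as implicit or explicit before it reaches label generation. The event names are taken from the examples above, but the `InteractionEvent` fields, the `SignalKind` enum, and the helper functions are hypothetical names chosen for illustration, not part of any particular library or production schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Iterable, List


class SignalKind(Enum):
    IMPLICIT = "implicit"  # behavior the system observes passively
    EXPLICIT = "explicit"  # feedback the user gives deliberately


# Hypothetical event-type names, based on the examples in the text.
IMPLICIT_EVENTS = {"click", "dwell_time", "scroll_depth", "purchase", "skip", "add_to_cart"}
EXPLICIT_EVENTS = {"star_rating", "thumbs_up", "thumbs_down", "review", "survey_response"}


@dataclass(frozen=True)
class InteractionEvent:
    # Assumed minimal schema; a real pipeline would likely carry more context.
    user_id: str
    item_id: str
    event_type: str
    value: float       # e.g. seconds of dwell time, or a 1-5 star rating
    timestamp_ms: int


def classify_signal(event: InteractionEvent) -> SignalKind:
    """Map an event to its signal category, rejecting unknown event types."""
    if event.event_type in IMPLICIT_EVENTS:
        return SignalKind.IMPLICIT
    if event.event_type in EXPLICIT_EVENTS:
        return SignalKind.EXPLICIT
    raise ValueError(f"unknown event type: {event.event_type!r}")


def split_by_kind(events: Iterable[InteractionEvent]) -> Dict[SignalKind, List[InteractionEvent]]:
    """Group events into implicit and explicit buckets."""
    buckets: Dict[SignalKind, List[InteractionEvent]] = {
        SignalKind.IMPLICIT: [],
        SignalKind.EXPLICIT: [],
    }
    for event in events:
        buckets[classify_signal(event)].append(event)
    return buckets
```

Tagging events this way at ingestion lets downstream jobs treat the two categories differently, for example by counting how many events in each bucket exist per user. Raising on unknown event types is one possible design choice; a pipeline that must never drop traffic might route them to a separate bucket instead.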
The trade-offs between these two categories govern how you design labels for your ML models.
Volume vs. sparsity: Every page visit generates implicit signals, but explicit feedback is rare: reportedly fewer than 1% of YouTube views receive an explicit rating. Implicit signals give you coverage; explicit signals give you precision.
Noise vs. clarity: A long dwell time on a video could mean deep engagement, or it could mean the user fell asleep. A five-star rating, by contrast, communicates satisfaction directly. Implicit signals are abundant but ambiguous; ...