Labeling Strategies and the Active Learning Flywheel

Understand various labeling strategies such as human annotation, weak supervision, and self-supervised learning. Learn how active learning sampling techniques optimize annotation efforts and how a data flywheel architecture continuously improves ML model training through feedback loops and quality control.

We'll cover the following...

Human-in-the-loop annotation pipelines
Weak supervision and programmatic labeling
- How Snorkel combines noisy votes
Self-supervised and semi-supervised approaches
- Self-supervised learning
- Semi-supervised learning
Label noise and model quality degradation
Active learning sampling strategies
The data flywheel architecture
Designing the flywheel in interviews

When Uber’s fraud detection system processes millions of daily transactions, ground truth about whether a charge was actually fraudulent can take days or weeks to materialize. Expert annotators who can make that call cost hundreds of dollars per hour. Meanwhile, the model needs labeled data now. This tension between label quality, cost, latency, and scale is not a preprocessing detail. It is a core system design decision. In MAANG interviews, candidates who treat labeling as a pipeline architecture problem, complete with cost budgets, quality monitoring, and feedback loops, demonstrate the kind of Staff+ thinking that separates senior engineers from everyone else.

This lesson covers the main labeling strategies, including human annotation, programmatic labeling, and self-supervised approaches. You will then examine how label noise affects model quality, go deeper on active learning sampling strategies, and connect these techniques through a data flywheel architecture: a closed-loop system where production feedback helps improve future training data.

Human-in-the-loop annotation pipelines

Human annotation remains the gold standard for label quality. But a production annotation pipeline involves far more than handing spreadsheets to contractors. The system must orchestrate several tightly coupled components.

Task design: The annotation interface must present examples in a format that minimizes cognitive load and ambiguity. A well-designed task for image classification, for instance, shows the image alongside clear category definitions and edge-case examples.
Annotator selection: Different tasks demand different expertise. Medical imaging requires board-certified radiologists, while content moderation can leverage trained crowdsourced workers at lower cost.
Inter-annotator agreement: The system measures consistency across annotators using metrics like Cohen's kappa, a statistical measure of agreement between two raters that accounts for agreement occurring by chance. It ranges from -1 (systematic disagreement) through 0 (agreement no better than chance) to 1 (perfect agreement). When kappa falls below a threshold, the task design or guidelines need revision.
Adjudication workflows: When annotators disagree, the system routes the example to a senior reviewer or applies majority voting to resolve the conflict.
Quality control loops: Gold-standard examples with known labels are injected into the annotation stream to continuously monitor annotator accuracy.
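
The agreement and adjudication components above can be made concrete with a small sketch. It computes Cohen's kappa for two annotators and resolves disagreements by majority vote, returning None on a tie to signal escalation to a senior reviewer. The label lists, the function names, and the 0.6 threshold are illustrative assumptions, not values from any particular production pipeline.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items where both annotators match.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement expected from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


def majority_vote(votes):
    """Return the majority label, or None when there is a tie (escalate)."""
    if not votes:
        return None
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # tie: route to a senior reviewer
    return ranked[0][0]


# Hypothetical labels from two content-moderation annotators.
annotator_1 = ["spam", "spam", "ok", "ok", "spam", "ok", "ok", "ok"]
annotator_2 = ["spam", "ok", "ok", "ok", "spam", "ok", "spam", "ok"]

KAPPA_THRESHOLD = 0.6  # assumed threshold; real pipelines tune this per task
kappa = cohens_kappa(annotator_1, annotator_2)
needs_guideline_revision = kappa < KAPPA_THRESHOLD

print(f"kappa={kappa:.3f}, revise guidelines: {needs_guideline_revision}")
print(majority_vote(["spam", "spam", "ok"]))  # clear majority
print(majority_vote(["spam", "ok"]))          # tie, so None
```

Here the two annotators agree on 6 of 8 items (75%), but because both label "ok" most of the time, much of that agreement is expected by chance, which is why kappa comes out well below the raw agreement rate.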

Google’s Search Quality Raters illustrate this at scale. Thousands of trained raters evaluate search result relevance using detailed guidelines, and their judgments feed directly into ranking model training. Yet even Google cannot label every query-document pair manually. The fundamental trade-off is annotation quality vs. annotation velocity, and human pipelines alone cannot scale to the data volumes modern ML systems demand.

This scalability gap motivates programmatic and automated labeling approaches. The following table provides a quick-reference comparison of the major strategies.

Comparison of Labeling Strategies in Machine Learning

Strategy	Mechanism	Label Quality	Scalability	Cost	Best Use Case
Human Annotation (Expert)	Manual review by domain experts	Very High	Low	High	Safety-critical domains (e.g., medical imaging)
Crowdsourced Annotation	Distributed workers via platforms (e.g., MTurk)	Moderate	Moderate	Moderate	Content moderation, image tagging
Weak Supervision (Snorkel)	Programmatic labeling functions combined by generative model	Moderate	High	Low	Text classification, entity extraction
Self-Supervised Pretext Tasks	Model generates supervision from data structure (e.g., masked tokens)	N/A	Very High	Very Low	Pretraining general-purpose embeddings
Semi-Supervised (Pseudo-Labels)	Model labels high-confidence unlabeled examples	Variable	High	Low	Expanding labeled datasets with existing seed labels
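
The pseudo-label row in the table can be illustrated with a minimal sketch: a model scores unlabeled examples, and only predictions above a confidence threshold are kept as new training labels. The function name, the 0.95 threshold, and the example probabilities are assumptions chosen for illustration.

```python
def select_pseudo_labels(probabilities, threshold=0.95):
    """Keep unlabeled examples whose top class probability meets the threshold.

    probabilities: one list of per-class probabilities per unlabeled example.
    Returns (example_index, predicted_class) pairs for confident examples.
    """
    selected = []
    for i, probs in enumerate(probabilities):
        # Index of the most probable class for this example.
        best_class = max(range(len(probs)), key=probs.__getitem__)
        if probs[best_class] >= threshold:
            selected.append((i, best_class))
    return selected


# Hypothetical model outputs for three unlabeled examples (two classes).
model_outputs = [
    [0.98, 0.02],  # confident: class 0
    [0.55, 0.45],  # uncertain: skipped
    [0.03, 0.97],  # confident: class 1
]
pseudo = select_pseudo_labels(model_outputs)
print(pseudo)
```

The threshold is the quality lever: raising it yields fewer but cleaner pseudo-labels, which is one reason the table lists the label quality of this strategy as variable.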

With this landscape in view, let us examine the most influential programmatic approach in production ML systems today.

Weak supervision and programmatic labeling

Weak supervision replaces or augments manual annotation with noisy, programmatic signals. Instead of asking a human to label each example, engineers write labeling functions: small programs or heuristic rules that each assign a noisy label (or abstain) for a given data point, serving as imperfect voters whose outputs are later combined. These functions encode domain knowledge as code. ...
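
As a rough illustration of labeling functions as imperfect voters, the sketch below defines three hypothetical heuristics for a spam classifier and combines their votes. It is plain Python and does not use the Snorkel library. The combination step is a simple unweighted majority vote, which is a deliberately simpler stand-in for the generative label model that Snorkel uses to weight functions by estimated accuracy. All function names and rules are assumptions.

```python
from collections import Counter

ABSTAIN, HAM, SPAM = -1, 0, 1


def lf_contains_free_money(text):
    """Vote SPAM when a known spam phrase appears, otherwise abstain."""
    return SPAM if "free money" in text.lower() else ABSTAIN


def lf_many_exclamations(text):
    """Vote SPAM on heavy punctuation, otherwise abstain."""
    return SPAM if text.count("!") >= 3 else ABSTAIN


def lf_greeting(text):
    """Vote HAM when the message opens with a plain greeting."""
    return HAM if text.lower().startswith(("hi ", "hello ")) else ABSTAIN


LABELING_FUNCTIONS = [lf_contains_free_money, lf_many_exclamations, lf_greeting]


def weak_label(text):
    """Combine labeling-function votes by unweighted majority.

    Returns ABSTAIN when no function votes or when the top votes tie.
    """
    votes = [lf(text) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # conflicting signals with no majority
    return ranked[0][0]


for message in [
    "Claim your FREE MONEY now!!!",
    "Hello team, notes attached",
    "Quarterly report",
    "Hello friend, free money inside",
]:
    print(weak_label(message), message)
```

Each function is cheap to write and individually unreliable. The value comes from combining many of them, and from treating abstentions and conflicts as examples that may still need a human label.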
