Semantic Search: Data Strategy & Feature Engineering

Explore the essential components of a robust semantic search system by learning to construct relevance labels from implicit and explicit user signals, applying query preprocessing like spelling correction and rewriting, and generating high-quality training data for dual encoder models. Understand how to balance click data with dwell time and label quality to improve model training and system performance in real-world ML applications.

We'll cover the following...

Constructing relevance pairs from user signals
- Implicit signals from click logs and dwell time
- Explicit relevance labels
Query rewriting and spelling correction
- Spelling correction
- Query rewriting
Dual encoder training data construction
- Positive pair construction
- Hard negative mining
  - Mining strategies for hard negatives
  - The false negative problem
Bridging to model architecture

With the problem framing, metrics, and scale constraints for semantic search established, one point separates strong interview candidates from average ones: the quality ceiling of any semantic search system is determined by its training data, not its model architecture. When an interviewer asks you to “design a semantic search system,” they expect a well-articulated data strategy before you ever sketch a model diagram. Google Search and Amazon product search both depend on massive click log pipelines to generate the supervision signal that trains their retrieval models. The model is only as good as the pairs it learns from.

This lesson covers three data pillars that form the foundation of a production semantic search system. First, you will learn how to construct relevance labels from implicit and explicit user signals. Second, you will see how query preprocessing tasks like spelling correction and query rewriting function as upstream ML systems that clean the input before it reaches the encoder. Third, you will walk through the construction of training data for dual encoder models, including the generation of high-quality positive pairs and the mining of hard negatives from user interaction data.

Constructing relevance pairs from user signals

Semantic search models learn from query-document relevance pairs. These pairs tell the model which documents should be retrieved for a given query. In production, these pairs come from three signal sources, each with different trade-offs in volume, noise, and reliability.

Implicit signals from click logs and dwell time

The most abundant source of relevance data is click logs. When a user issues a query and clicks a result, the system records an implicit positive pair. However, raw clicks are noisy. A phenomenon called position bias (the tendency for users to click on results ranked higher on the page regardless of their actual relevance, simply because those results are more visible) inflates clicks on top-ranked results. A document in position one receives far more clicks than an equally relevant document in position five. To correct for this, production systems apply ...
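To make the position-bias problem concrete, here is a minimal sketch of turning a click log into weighted positive pairs. It uses inverse propensity weighting, one commonly used correction that is not necessarily the one this lesson goes on to describe, together with a dwell-time filter for quick bounces. The log schema, the `PROPENSITY` values, the `MIN_DWELL_SECONDS` threshold, and the function name `weighted_positive_pairs` are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical click-log rows: (query, doc_id, position, clicked, dwell_seconds)
CLICK_LOG = [
    ("running shoes", "d1", 1, True, 45.0),
    ("running shoes", "d2", 5, True, 60.0),
    ("running shoes", "d3", 1, True, 2.0),   # clicked, but a quick bounce
    ("running shoes", "d4", 3, False, 0.0),  # shown, never clicked
]

# Assumed probability that a user even looks at each rank.
# In practice these would be estimated from data; the numbers are made up.
PROPENSITY = {1: 1.0, 2: 0.6, 3: 0.4, 4: 0.3, 5: 0.2}

# Assumed cutoff: clicks with shorter dwell are treated as unsatisfied.
MIN_DWELL_SECONDS = 10.0


def weighted_positive_pairs(log, propensity, min_dwell):
    """Return {(query, doc_id): weight} for clicks that pass the dwell filter.

    Each surviving click contributes 1 / propensity[position], so a click
    at a rarely examined rank counts more than a click at rank one.
    """
    weights = defaultdict(float)
    for query, doc_id, position, clicked, dwell in log:
        if not clicked or dwell < min_dwell:
            continue
        weights[(query, doc_id)] += 1.0 / propensity[position]
    return dict(weights)


pairs = weighted_positive_pairs(CLICK_LOG, PROPENSITY, MIN_DWELL_SECONDS)
for (query, doc_id), weight in sorted(pairs.items()):
    print(query, doc_id, round(weight, 2))
```

In this toy log, the click at position five ends up with a larger weight than the click at position one, and the two-second bounce is dropped entirely. The resulting weights could be used as per-example loss weights when training on the pairs, though that choice is also an assumption of this sketch.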
