Semantic Search: Data Strategy & Feature Engineering

Explore the essential components of a robust semantic search system by learning to construct relevance labels from implicit and explicit user signals, applying query preprocessing like spelling correction and rewriting, and generating high-quality training data for dual encoder models. Understand how to balance click data with dwell time and label quality to improve model training and system performance in real-world ML applications.

With the problem framing, metrics, and scale constraints for semantic search established, a critical truth emerges that separates strong interview candidates from average ones. The quality ceiling of any semantic search system is determined by its training data, not its model architecture. When an interviewer asks you to “design a semantic search system,” they expect a well-articulated data strategy before you ever sketch a model diagram. Google Search and Amazon product search both depend on massive click log pipelines to generate the supervision signal that trains their retrieval models. The model is only as good as the pairs it learns from.

This lesson covers three data pillars that form the foundation of a production semantic search system. First, you will learn how to construct relevance labels from implicit and explicit user signals. Second, you will see how query preprocessing tasks like spelling correction and query rewriting function as upstream ML systems that clean the input before it reaches the encoder. Third, you will walk through the construction of training data for dual encoder models, including the generation of high-quality positive pairs and the mining of hard negatives from user interaction data.

Constructing relevance pairs from user signals

Semantic search models learn from query-document relevance pairs. These pairs tell the model which documents should be retrieved for a given query. In production, these pairs come from three signal sources, each with different trade-offs in volume, noise, and reliability.

Implicit signals from click logs and dwell time

The most abundant source of relevance data is click logs. When a user issues a query and clicks a result, the system records an implicit positive pair. However, raw clicks are noisy. A phenomenon called position bias (the tendency for users to click on results ranked higher on the page regardless of their actual relevance, simply because those results are more visible) inflates clicks on top-ranked results. A document in position one receives far more clicks than an equally relevant document in position five. To correct for this, production systems apply ...
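To make the click-log discussion concrete, here is a minimal sketch of how clicked impressions could become implicit positive pairs and how position bias might show up in a per-position click-through rate. The log schema, the sample rows, and the function names `implicit_positive_pairs` and `ctr_by_position` are all assumptions made for illustration, not part of any particular production pipeline. The sketch only surfaces the bias; it does not correct for it.

```python
from collections import defaultdict

# Hypothetical click-log rows: (query, doc_id, rank position shown, clicked?)
LOG = [
    ("running shoes", "d1", 1, True),
    ("running shoes", "d2", 5, False),
    ("running shoes", "d1", 1, True),
    ("running shoes", "d2", 5, True),
    ("trail shoes", "d3", 1, False),
    ("trail shoes", "d4", 5, False),
]


def implicit_positive_pairs(log):
    """Treat every clicked impression as a (query, doc_id) positive pair."""
    return {(query, doc_id) for query, doc_id, _pos, clicked in log if clicked}


def ctr_by_position(log):
    """Click-through rate per rank position, aggregated over all queries."""
    shown = defaultdict(int)
    clicks = defaultdict(int)
    for _query, _doc_id, pos, clicked in log:
        shown[pos] += 1
        clicks[pos] += int(clicked)
    return {pos: clicks[pos] / shown[pos] for pos in shown}


pairs = implicit_positive_pairs(LOG)
ctr = ctr_by_position(LOG)
```

In this toy log, position one collects a higher click-through rate than position five. If the documents shown there were equally relevant, that gap would reflect visibility rather than relevance, which is why raw clicks make noisy labels.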