Embeddings in ML System Design

Explore how embeddings power ranking and retrieval in ML systems. Understand different embedding models, approximate nearest neighbor search, and critical issues like stale indexes. Learn zero-downtime strategies such as dual-write and shadow indexes that maintain production reliability while updating embedding models.

We'll cover the following...

Pre-trained vs. task-specific embeddings
- Embedding model landscape
- The freeze vs. fine-tune decision
ANN search for embedding retrieval
- Index families and trade-offs
The stale index problem
Zero-downtime rebuild strategies
Summary

In any retrieval or ranking system, whether YouTube recommendations, Airbnb search, or e-commerce product matching, embeddings are the mechanism that converts raw content into dense vectors powering candidate retrieval and personalization. The previous lesson established feature categories and identified content embeddings as a key item feature. This lesson unpacks how those embeddings are generated, served, and maintained at scale, which is where most production complexity lives.
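To make "dense vectors powering candidate retrieval" concrete, here is a minimal sketch of exact (brute-force) retrieval by cosine similarity. The item IDs, the 3-dimensional vectors, and the function names are hypothetical and chosen for illustration; in a real system the vectors would come from a trained embedding model, and the exhaustive scan would typically be replaced by an approximate index.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve_top_k(query_vec, item_vecs, k=2):
    """Exact top-k retrieval: score every item, then keep the k best."""
    scored = [(item_id, cosine(query_vec, vec)) for item_id, vec in item_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical item embeddings (assumption: not taken from any real model).
ITEM_VECS = {
    "video_cooking": [0.9, 0.1, 0.0],
    "video_baking": [0.8, 0.3, 0.1],
    "video_soccer": [0.0, 0.2, 0.9],
}

# Hypothetical user/query embedding that lies close to the cooking items.
QUERY_VEC = [1.0, 0.2, 0.0]

top = retrieve_top_k(QUERY_VEC, ITEM_VECS, k=2)
print(top)
```

The scan costs one similarity computation per item, which is fine for a toy catalog but is the reason large catalogs move to approximate nearest neighbor indexes.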

At the interview level, candidates are expected to reason not just about which embedding model to use, but about how embeddings flow through the entire system: storage, indexing, versioning, and rebuilds. Two core design axes structure this lesson. The first is choosing and serving the right embedding model. The second is managing the embedding life cycle in production, including the stale index problem and zero-downtime rebuild strategies.

Proactively raising these operational concerns in an interview signals the kind of maturity that distinguishes senior candidates from those who stop at model selection.

Pre-trained vs. task-specific embeddings

Embedding models fall into two broad categories based on how they are trained and deployed:

Pre-trained embeddings come from models like Word2Vec, GloVe, and general-purpose BERT, which are trained on large public corpora and usable out of the box without any domain-specific training.
Task-specific embeddings come from models that are fine-tuned or trained from scratch on domain data, such as a two-tower model for YouTube video retrieval or a contrastive model for product search.

Each model family offers different trade-offs in representation quality, inference cost, and operational complexity.

Embedding model landscape

Four model families appear most frequently in system design discussions:

Word2Vec produces static word-level vectors. It is fast to train and serve, but it assigns each word a single vector regardless of the sentence it appears in, so the word “bank” gets the same vector whether it refers to a riverbank or a financial institution.
BERT generates contextual token-level embeddings by attending to the full input sequence. This yields richer representations but comes with significantly higher inference cost, making it expensive for real-time ...
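The static versus contextual distinction can be illustrated with a small sketch. Everything here is an assumption made for illustration: the 2-dimensional vectors are invented, and `toy_contextual_embed` is a hypothetical stand-in that simply mixes a token's vector with the mean of its sentence. It is not BERT's attention mechanism; it only shows that a context-dependent encoder returns different vectors for the same word in different inputs, while a static lookup cannot.

```python
# Hypothetical 2-dimensional static vectors, standing in for a trained lookup table.
STATIC_VECS = {
    "river": [0.0, 1.0],
    "bank": [0.5, 0.5],
    "loan": [1.0, 0.0],
}

def static_embed(tokens, index):
    """Static lookup: the vector depends only on the token itself."""
    return STATIC_VECS[tokens[index]]

def toy_contextual_embed(tokens, index):
    """Toy context mixing (NOT BERT attention).

    Averages the token's own vector with the mean vector of the whole input,
    so the result changes whenever the surrounding tokens change.
    """
    vecs = [STATIC_VECS[t] for t in tokens]
    dims = len(vecs[0])
    context_mean = [sum(v[d] for v in vecs) / len(vecs) for d in range(dims)]
    own = STATIC_VECS[tokens[index]]
    return [(own[d] + context_mean[d]) / 2 for d in range(dims)]

river_sentence = ["river", "bank"]
loan_sentence = ["loan", "bank"]

# Static: "bank" maps to the same vector in both sentences.
static_river = static_embed(river_sentence, 1)
static_loan = static_embed(loan_sentence, 1)

# Toy contextual: "bank" is pulled toward its neighbour in each sentence.
ctx_river = toy_contextual_embed(river_sentence, 1)
ctx_loan = toy_contextual_embed(loan_sentence, 1)

print(static_river == static_loan)
print(ctx_river, ctx_loan)
```

The static path is a single dictionary lookup per token, whereas the context-dependent path has to read the whole input before it can embed any one token. That difference in work per request is the root of the serving-cost gap described above.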
