Embeddings in ML System Design
Explore how embeddings power ranking and retrieval in ML systems. Understand different embedding models, approximate nearest neighbor search, and critical issues like stale indexes. Learn zero-downtime strategies such as dual-write and shadow indexes that maintain production reliability while updating embedding models.
In any retrieval or ranking system, whether YouTube recommendations, Airbnb search, or e-commerce product matching, embeddings are the mechanism that converts raw content into dense vectors powering candidate retrieval and personalization. The previous lesson established feature categories and identified content embeddings as a key item feature. This lesson unpacks how those embeddings are generated, served, and maintained at scale, which is where most production complexity lives.
In an interview, candidates are expected to reason not just about which embedding model to use, but also about how embeddings flow through the entire system: storage, indexing, versioning, and rebuilds. Two core design axes structure this lesson. The first is choosing and serving the right embedding model. The second is managing the embedding life cycle in production, including the stale index problem and zero-downtime rebuild strategies.
Proactively raising these operational concerns in an interview signals the kind of maturity that distinguishes senior candidates from those who stop at model selection.
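To make the idea of "dense vectors powering candidate retrieval" concrete, the sketch below ranks a handful of items by cosine similarity to a query vector. It is a minimal illustration, not a production design: the item IDs, the three-dimensional vectors, and the helper names `cosine` and `top_k` are all hypothetical, and the search is exact (brute force) rather than approximate.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query, items, k):
    """Return the IDs of the k items most similar to the query (exact search)."""
    ranked = sorted(items, key=lambda item_id: cosine(query, items[item_id]), reverse=True)
    return ranked[:k]

# Hypothetical item embeddings; real ones would come from a trained model
# and typically have hundreds of dimensions.
item_embeddings = {
    "item_a": [1.0, 0.0, 0.0],
    "item_b": [0.6, 0.8, 0.0],
    "item_c": [0.0, 1.0, 0.0],
    "item_d": [0.0, 0.0, 1.0],
}

query_embedding = [1.0, 0.05, 0.0]
candidates = top_k(query_embedding, item_embeddings, k=2)
print(candidates)
```

Scanning every item like this costs time linear in the catalog size, which is why large systems replace the exact scan with an approximate nearest neighbor index.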
Pre-trained vs. task-specific embeddings
Embedding models fall into two broad categories based on how they are trained and deployed:
Pre-trained embeddings come from models like Word2Vec, GloVe, and general-purpose BERT, which are trained on large public corpora and usable out of the box without any domain-specific training.
Task-specific embeddings come from models that are fine-tuned or trained from scratch on domain data, such as a two-tower model for YouTube video retrieval or a contrastive model for product search.
Each model family offers different trade-offs in representation quality, inference cost, and operational complexity.
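As a rough illustration of the two-tower idea mentioned above, the sketch below encodes users and items with separate "towers" into a shared embedding space and scores each pair with a dot product. Everything here is an assumption made for illustration: the towers are single linear layers, the weights `USER_WEIGHTS` and `ITEM_WEIGHTS` are hard-coded rather than learned, and the feature vectors and item IDs are invented.

```python
def linear_tower(features, weights):
    """Project a feature vector into the embedding space (one linear layer)."""
    return [sum(f * w for f, w in zip(features, row)) for row in weights]

def dot(a, b):
    """Dot product used as the user-item relevance score."""
    return sum(x * y for x, y in zip(a, b))

# Hypothetical weights; in a real system they are learned from interaction logs.
USER_WEIGHTS = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]  # 3 user features -> 2 dims
ITEM_WEIGHTS = [[1.0, 0.0], [0.0, 1.0]]            # 2 item features -> 2 dims

# Offline step: item embeddings are computed ahead of time and stored in an index.
item_features = {"cooking_video": [1.0, 0.0], "gaming_video": [0.0, 1.0]}
item_index = {
    item_id: linear_tower(features, ITEM_WEIGHTS)
    for item_id, features in item_features.items()
}

# Online step: only the user tower runs at request time.
user_features = [0.9, 0.1, 0.2]
user_embedding = linear_tower(user_features, USER_WEIGHTS)

scores = {item_id: dot(user_embedding, emb) for item_id, emb in item_index.items()}
best = max(scores, key=scores.get)
print(best, scores)
```

Because the two towers never see each other's inputs, item embeddings can be precomputed in bulk, which is what makes this architecture practical for retrieval. It also means the stored item vectors are tied to a specific version of the item tower.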
Embedding model landscape
Four model families appear most frequently in system design discussions:
Word2Vec produces static word-level vectors. It is fast to train and serve, but it assigns each word a single vector regardless of the surrounding context, so the word “bank” gets the same vector whether it refers to a riverbank or a financial institution.
BERT generates contextual token-level embeddings by attending to the full input sequence. This yields richer representations but comes with significantly higher inference cost, making it expensive for real-time ...