Feature Stores

Explore the role of feature stores in machine learning systems. Understand how they ensure point-in-time correctness to prevent data leakage, minimize training-serving skew through unified feature definitions, and promote feature reuse via centralized registries. Gain insight into common tools and learn how to effectively discuss feature store architecture in ML system design interviews.

We'll cover the following...

Point-in-time correctness
- Why does leakage happen without temporal joins?
- Handling late-arriving events
Training-serving skew and feature reuse
- How skew enters the system
  - Detecting skew with distribution metrics
- Feature reuse through a centralized registry
Feast, Tecton, and Hopsworks
Discussing feature stores in interviews
Summary

In a fraud detection system at a company like Stripe or PayPal, dozens of models consume overlapping features such as transaction velocity, merchant risk scores, and user behavioral aggregates. Each team that needs these features typically re-implements the computation logic independently. One team writes a Spark job to compute a user’s 30-day transaction count for a credit risk model. Another team implements the same feature in a separate code path for a fraud classifier. The feature values diverge silently because the implementations handle nulls, time zones, or window boundaries differently. Model performance degrades in production, and the root cause is hard to trace because the feature lineage is split across implementations.
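As a minimal sketch of how this silent divergence can arise, assume two hypothetical teams compute the same 30-day transaction count but disagree on whether the window includes its end boundary. The function names, data, and boundary choices below are illustrative assumptions, not taken from any real codebase.

```python
from datetime import datetime, timedelta

# Hypothetical transaction timestamps for a single user (illustrative data).
transactions = [
    datetime(2024, 1, 10, 9, 0),
    datetime(2024, 1, 15, 12, 0),
    datetime(2024, 1, 31, 0, 0),  # falls exactly on the as-of timestamp
]
as_of = datetime(2024, 1, 31, 0, 0)
WINDOW = timedelta(days=30)


def txn_count_team_a(events, as_of):
    """Team A (assumed): half-open window [as_of - 30d, as_of)."""
    start = as_of - WINDOW
    return sum(start <= t < as_of for t in events)


def txn_count_team_b(events, as_of):
    """Team B (assumed): window (as_of - 30d, as_of], end boundary included."""
    start = as_of - WINDOW
    return sum(start < t <= as_of for t in events)


count_a = txn_count_team_a(transactions, as_of)
count_b = txn_count_team_b(transactions, as_of)
print(count_a, count_b)  # the two "identical" features disagree
```

Both functions claim to compute the same feature, yet the transaction that lands exactly on the boundary is counted by one and not the other. Nothing fails loudly, which is why this kind of mismatch tends to surface only as degraded model performance.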

This is the exact problem that feature stores exist to solve. The previous lesson established that production ML systems combine batch, streaming, and request-time pipelines. A feature store is the centralized infrastructure layer where these heterogeneous pipelines converge, providing a unified interface for ingesting, storing, and serving features to both training and inference. It solves three core problems that interviewers expect you to articulate when discussing production ML infrastructure: point-in-time correctness, training-serving skew, and feature reuse.

Let’s examine the first and most critical correctness guarantee a feature store provides.

Point-in-time correctness

Point-in-time correctness is the guarantee that every training example uses only feature values that were actually available at the moment the prediction for that example would have been made (its event timestamp), not values recorded afterward. No future data leaks into the past.

Why does leakage happen without temporal joins?

Consider training a credit default model. Each training example represents a loan application, and the label indicates whether the borrower defaulted. One of the features is the user’s 30-day transaction count. If the training pipeline performs a naive join on user_id without respecting timestamps, it retrieves the transaction count as of today, not as of the loan application date. That count includes post-application transactions, some of which may even reflect the default event itself. The model trains on information it could never access at serving time, which inflates offline AUC while production performance collapses. ...
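The difference between a naive join and a temporal join can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a feature store implementation: the user IDs, dates, feature values, and the helper names naive_lookup and point_in_time_lookup are all hypothetical.

```python
from bisect import bisect_right
from datetime import date

# Hypothetical feature snapshots per user: (feature_date, txn_count_30d),
# kept sorted by feature_date.
feature_history = {
    "u1": [(date(2024, 2, 28), 12), (date(2024, 4, 1), 40)],
    "u2": [(date(2024, 3, 5), 7), (date(2024, 3, 20), 9)],
}

# Hypothetical loan applications: (user_id, application_date, defaulted).
applications = [
    ("u1", date(2024, 3, 1), 1),
    ("u2", date(2024, 3, 10), 0),
]


def naive_lookup(user_id):
    """Leaky: returns the most recent value, ignoring the application date."""
    return feature_history[user_id][-1][1]


def point_in_time_lookup(user_id, as_of):
    """Returns the latest value with feature_date <= as_of, or None."""
    history = feature_history[user_id]
    dates = [d for d, _ in history]
    idx = bisect_right(dates, as_of)  # count of snapshots at or before as_of
    return history[idx - 1][1] if idx > 0 else None


naive_rows = [(u, naive_lookup(u), y) for u, ts, y in applications]
pit_rows = [(u, point_in_time_lookup(u, ts), y) for u, ts, y in applications]

print(naive_rows)  # uses values computed after the application date
print(pit_rows)    # uses only values available at application time
```

In this toy data, the naive lookup gives user u1 a count of 40, a value computed a month after the application, while the point-in-time lookup gives 12. On DataFrames, pandas.merge_asof with direction="backward" and a by key performs the same kind of "latest value at or before the timestamp" join.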
