Low-Latency Serving at Scale

Explore how to design machine learning serving systems that meet strict latency SLAs, focusing on model caching, pre-computation, tail latency management, stateless server design, and load balancing. Understand how to scale inference servers horizontally with auto-scaling and cold start mitigation to ensure reliable performance under heavy production loads.

We'll cover the following...

Model caching and pre-computation
- Result and feature caching
- Pre-computation and hybrid fallback
Tail latency and response time SLAs
- Why p99 matters more than p50
- Tail latency amplification
Load balancing and horizontal scaling
- Stateless inference server design
- Load balancing strategies
- Auto-scaling and cold start mitigation
Putting it together in an interview
Conclusion

A recommendation system that ranks candidates in 20 ms on your laptop can easily blow past 500 ms when 100,000 users hit it simultaneously. The gap between “works in development” and “works in production” is almost entirely a latency engineering problem. Consider YouTube’s ranking pipeline or Uber’s ETA prediction service, where a 50 ms delay at the 99th percentile degrades millions of user experiences every hour. This lesson addresses how to keep each stage of the retrieval and ranking funnel within its latency budget under real production load.

The strategies here rest on four pillars: model caching, pre-computation, tail latency SLAs, and horizontal scaling. Industry systems like Snowflake’s two-layer serving architecture, which separates a controller layer from inference engines, exemplify the microservices approach to this problem. This lesson focuses on system-level serving strategies. The next lesson on Model Optimization for Production covers hardware-level techniques like quantization and compilation that complement what you learn here.

Model caching and pre-computation

The fastest inference call is the one you never make. Two complementary strategies make this possible, and understanding when to apply each one is essential for any ML serving design.

Result and feature caching

Result caching stores inference outputs keyed by input features in low-latency stores like Redis or Memcached. When a user-item pair score has already been computed, the system returns the cached value in sub-millisecond time instead of running the model again. The effectiveness of this approach depends on the cache hit rate: the fraction of incoming requests that find a valid cached result, which directly determines how much inference load the cache absorbs. A high hit rate means most requests skip inference entirely.
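To make the effect of the hit rate concrete, here is a minimal back-of-the-envelope sketch. The function names and the numbers in the usage note are illustrative assumptions, not measurements from any real system, and the model assumes every request pays for one cache lookup while only misses also pay for inference.

```python
def expected_latency_ms(hit_rate: float, cache_ms: float, model_ms: float) -> float:
    """Mean per-request latency when a fraction `hit_rate` is served from cache.

    Assumption: every request performs one cache lookup (cache_ms);
    only misses additionally run the model (model_ms).
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return cache_ms + (1.0 - hit_rate) * model_ms


def inference_qps(total_qps: float, hit_rate: float) -> float:
    """Requests per second that still reach the model after the cache."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return total_qps * (1.0 - hit_rate)


if __name__ == "__main__":
    # Hypothetical inputs: 0.5 ms cache lookup, 20 ms model call, 80% hit rate.
    print(expected_latency_ms(0.8, cache_ms=0.5, model_ms=20.0))
    print(inference_qps(100_000, 0.8))
```

With an assumed 0.5 ms lookup and a 20 ms model call, an 80% hit rate brings the mean down to about 4.5 ms and leaves roughly 20,000 of 100,000 requests per second for the model. Note that this is a mean: every miss still pays the full inference cost, so the cache alone does not shrink the slowest requests.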

Cache entries need a TTL (time-to-live) policy that balances freshness against hit rate. Short TTLs keep results fresh but reduce hit rates. Long TTLs improve hit rates but serve stale scores. When user context changes rapidly, such as a user switching locations in a ride-hailing app, cache invalidation becomes complex because the cached score no longer reflects the current state.
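The TTL trade-off can be sketched with a small in-memory cache. This is a simplified stand-in for a store like Redis or Memcached, not their client API; `TTLResultCache`, `score_with_cache`, and the `run_model` callback are hypothetical names introduced for illustration.

```python
import time
from typing import Callable, Dict, Hashable, Optional, Tuple


class TTLResultCache:
    """In-memory stand-in for a result cache with a per-entry TTL."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl_seconds = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._store: Dict[Hashable, Tuple[float, float]] = {}  # key -> (score, expires_at)
        self.hits = 0
        self.misses = 0

    def get(self, key: Hashable) -> Optional[float]:
        entry = self._store.get(key)
        if entry is not None and entry[1] > self.clock():
            self.hits += 1
            return entry[0]
        self._store.pop(key, None)  # drop the expired entry, if there was one
        self.misses += 1
        return None

    def put(self, key: Hashable, score: float) -> None:
        self._store[key] = (score, self.clock() + self.ttl_seconds)

    def invalidate(self, key: Hashable) -> None:
        """Explicit invalidation, e.g. when the user's context changes."""
        self._store.pop(key, None)


def score_with_cache(cache: TTLResultCache, key: Hashable,
                     run_model: Callable[[Hashable], float]) -> float:
    """Return a fresh cached score if present, otherwise run the model and cache it."""
    cached = cache.get(key)
    if cached is not None:
        return cached
    score = run_model(key)
    cache.put(key, score)
    return score
```

A shorter `ttl_seconds` makes more lookups fall through to `run_model`, which is the freshness-versus-hit-rate trade-off in miniature. The `invalidate` method covers the case where waiting for expiry is not acceptable, such as a location change, at the cost of the caller having to know which keys are affected.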

Feature caching operates one layer deeper. Instead of caching final scores, it caches expensive intermediate feature computations. Uber’s ETA system, for example, caches precomputed geographic features to avoid redundant feature store lookups on every request. This reduces per-request latency even when the final inference must run online.
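As a rough illustration of feature caching (not a description of how Uber implements it), the sketch below caches a simulated geographic feature lookup per coarse grid cell while the model itself still runs on every request. The functions `grid_cell`, `geo_features`, and `predict_eta_minutes`, along with the feature values and the toy formula, are all made up for this example.

```python
from functools import lru_cache
from typing import Tuple


def grid_cell(lat: float, lon: float, precision: int = 2) -> Tuple[float, float]:
    """Snap a coordinate to a coarse cell so nearby requests share a cache key."""
    return (round(lat, precision), round(lon, precision))


@lru_cache(maxsize=10_000)
def geo_features(cell: Tuple[float, float]) -> Tuple[float, float]:
    """Stand-in for an expensive feature store lookup (values are invented)."""
    lat, lon = cell
    return (abs(lat) / 90.0, abs(lon) / 180.0)


def predict_eta_minutes(lat: float, lon: float, distance_km: float) -> float:
    """Toy model: inference runs on every request; only the features are cached."""
    f1, f2 = geo_features(grid_cell(lat, lon))
    return distance_km * (2.0 + f1 + f2)


if __name__ == "__main__":
    predict_eta_minutes(37.7712, -122.4312, distance_km=3.0)
    predict_eta_minutes(37.7733, -122.4287, distance_km=5.0)  # same cell, cached features
    print(geo_features.cache_info())
```

Both requests fall in the same grid cell, so the second one reuses the cached features even though its final prediction differs. `functools.lru_cache` evicts by size only and has no TTL, so a production feature cache would still need an expiry policy matched to how quickly the underlying signal changes.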

Practical tip: In an interview, specify which layer you are caching (features vs. scores) and justify the TTL window based on how quickly the underlying signal changes.

Pre-computation and hybrid fallback

...
