Low-Latency Serving at Scale
Explore how to design machine learning serving systems that meet strict latency SLAs, focusing on model caching, pre-computation, tail latency management, stateless server design, and load balancing. Understand how to scale inference servers horizontally with auto-scaling and cold start mitigation to ensure reliable performance under heavy production loads.
A recommendation system that ranks candidates in 20 ms on your laptop can easily blow past 500 ms when 100,000 users hit it simultaneously. The gap between “works in development” and “works in production” is almost entirely a latency engineering problem. Consider YouTube’s ranking pipeline or Uber’s ETA prediction service, where a 50 ms delay at the 99th percentile degrades millions of user experiences every hour. This lesson addresses how to keep each stage of the retrieval and ranking funnel within its latency budget under real production load.
The strategies here rest on four pillars: model caching, pre-computation, tail latency SLAs, and horizontal scaling. Industry systems like Snowflake’s two-layer serving architecture, which separates a controller layer from inference engines, exemplify the microservices approach to this problem. This lesson focuses on system-level serving strategies. The next lesson on Model Optimization for Production covers hardware-level techniques like quantization and compilation that complement what you learn here.
Model caching and pre-computation
The fastest inference call is the one you never make. Two complementary strategies make this possible, and understanding when to apply each one is essential for any ML serving design.
Result and feature caching
Result caching stores inference outputs keyed by input features in low-latency stores like Redis or Memcached. When a user-item pair score has already been computed, the system returns the cached value in sub-millisecond time instead of running the model again. The effectiveness of this approach depends on the cache hit rate, the fraction of requests answered from the cache rather than by the model.
Cache entries need a TTL (time-to-live) policy that balances freshness against hit rate. Short TTLs keep results fresh but reduce hit rates. Long TTLs improve hit rates but serve stale scores. When user context changes rapidly, such as a user switching locations in a ride-hailing app, cache invalidation becomes complex because the cached score no longer reflects the current state.
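The TTL trade-off and the invalidation problem described above can be sketched in a few lines. This is a minimal illustration, not a production design: the in-memory dictionary stands in for a store like Redis or Memcached, and `TTLResultCache`, `score_user_item`, and the 60-second TTL are hypothetical names and values chosen for the example.

```python
import time
from typing import Callable, Dict, Hashable, Tuple


class TTLResultCache:
    """In-memory stand-in for a low-latency store (assumption: a real
    deployment would use Redis or Memcached instead of a dict)."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._store: Dict[Hashable, Tuple[float, float]] = {}  # key -> (score, expiry)
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, key: Hashable, compute: Callable[[], float]) -> float:
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and entry[1] > now:
            self.hits += 1
            return entry[0]  # fresh entry: skip the model call
        self.misses += 1
        score = compute()  # missing or expired entry: run the model
        self._store[key] = (score, now + self.ttl_seconds)
        return score

    def invalidate(self, key: Hashable) -> None:
        """Drop an entry early, e.g. when the user's context changes."""
        self._store.pop(key, None)

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


def score_user_item(user_id: int, item_id: int) -> float:
    """Hypothetical placeholder for an expensive model inference call."""
    return ((user_id * 31 + item_id) % 1000) / 1000.0


fake_now = [0.0]  # manual clock so the example runs instantly
cache = TTLResultCache(ttl_seconds=60.0, clock=lambda: fake_now[0])

cache.get_or_compute((42, 7), lambda: score_user_item(42, 7))  # miss
cache.get_or_compute((42, 7), lambda: score_user_item(42, 7))  # hit
fake_now[0] = 61.0                                             # TTL has elapsed
cache.get_or_compute((42, 7), lambda: score_user_item(42, 7))  # miss again
print(f"hit rate: {cache.hit_rate:.2f}")
```

A shorter `ttl_seconds` makes the third call's miss happen sooner, which is the freshness-versus-hit-rate trade-off in miniature. The explicit `invalidate` method covers the case where context changes before the TTL expires.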
Feature caching operates one layer deeper. Instead of caching final scores, it caches expensive intermediate feature computations. Uber’s ETA system, for example, caches precomputed geographic features to avoid redundant feature store lookups on every request. This reduces per-request latency even when the final inference must run online.
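To make the distinction from result caching concrete, here is a rough sketch in which only the expensive feature lookup is cached and the final prediction still runs on every request. Everything in it is assumed for illustration: the grid size, the feature values, and the functions `to_cell`, `geo_features`, and `predict_eta_minutes` are hypothetical and do not describe Uber's actual system.

```python
import math
from functools import lru_cache
from typing import Tuple

CELL_SIZE_DEG = 0.01  # assumed grid resolution, chosen only for illustration
lookup_count = {"feature_store": 0}  # counts simulated feature store calls


def to_cell(lat: float, lon: float) -> Tuple[int, int]:
    """Bucket coordinates into a coarse grid cell so nearby requests share a key."""
    return (math.floor(lat / CELL_SIZE_DEG), math.floor(lon / CELL_SIZE_DEG))


@lru_cache(maxsize=10_000)
def geo_features(cell: Tuple[int, int]) -> Tuple[float, float]:
    """Hypothetical stand-in for a slow feature store lookup.

    Returns (road_density, avg_speed_kmh) for the cell; the values are
    synthetic. lru_cache evicts by recency only and has no TTL, so a real
    system would also need an expiry policy for features that drift.
    """
    lookup_count["feature_store"] += 1
    road_density = (abs(cell[0]) % 10) / 10.0
    avg_speed_kmh = 20.0 + (abs(cell[1]) % 5) * 5.0
    return (road_density, avg_speed_kmh)


def predict_eta_minutes(lat: float, lon: float, distance_km: float) -> float:
    """Toy 'model' that runs online on every request using cached features."""
    road_density, avg_speed_kmh = geo_features(to_cell(lat, lon))
    effective_speed = avg_speed_kmh * (1.0 - 0.5 * road_density)
    return 60.0 * distance_km / effective_speed


# Two requests from nearby points fall in the same cell, so the
# simulated feature store is queried only once.
predict_eta_minutes(37.7749, -122.4194, distance_km=3.9)
predict_eta_minutes(37.7751, -122.4199, distance_km=1.2)
print(lookup_count["feature_store"], geo_features.cache_info())
```

Keying on a coarse cell rather than exact coordinates is the design choice that makes the cache useful: exact coordinates would almost never repeat, while a grid cell is shared by many requests.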
Practical tip: In an interview, specify which layer you are caching (features vs. scores) and justify the TTL window based on how quickly the underlying signal changes.