Model Optimization for Production

Explore essential model optimization strategies to enhance performance and reduce costs in production machine learning systems. Understand how quantization, pruning, model compilation, and target hardware choices influence latency and efficiency. This lesson equips you to make informed decisions balancing accuracy, resource use, and deployment constraints for scalable, production-ready ML services.

We'll cover the following...

Quantization for inference
- Post-training quantization vs. quantization-aware training
- Accuracy-latency trade-offs in practice
Pruning and sparsity in serving
Model compilation frameworks
- TorchScript, TensorRT, and ONNX Runtime
- The production trade-off triangle
Edge vs. cloud inference
Conclusion

You have a recommendation ranker that hits every accuracy target during offline evaluation. Then it lands on GPU inference servers handling 50K queries per second, and the p99 latency blows past the SLO budget. The model is correct. It is just too slow and too expensive to serve. This is where hardware-level model optimization becomes essential. With system-level strategies like caching, load balancing, and auto-scaling already in place from the previous lesson, the next lever for reducing per-request cost and latency is optimizing the model itself on the target hardware.

Interviewers at L5 and above expect candidates to articulate not just what model to serve but how to make it fast and cheap on specific hardware. This lesson covers four optimization axes that directly address that expectation: quantization, pruning, model compilation, and edge vs. cloud placement. Applied correctly, these techniques can cut inference cost by 2–4×, often without retraining the model (quantization-aware training is one exception).

Quantization for inference

Quantization reduces the numerical precision of model weights and activations. A standard trained model stores parameters in FP32 (32-bit floating point). Quantization converts these to lower-bit representations like FP16 (16-bit half-precision) or INT8 (8-bit integer), which consume less memory and execute faster on hardware with native support for reduced-precision arithmetic.
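To make the precision reduction concrete, here is a minimal pure-Python sketch of symmetric per-tensor INT8 quantization. The function names (`quantize_int8`, `dequantize`) and the sample weights are illustrative assumptions, not any framework's API; production toolchains typically use per-channel scales and integer kernels rather than Python lists.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    # A single floating-point scale is stored alongside the integer codes.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float values from the INT8 codes."""
    return [c * scale for c in codes]


# Hypothetical weights, for illustration only.
weights = [0.82, -1.27, 0.05, 0.4, -0.63]
q_weights, scale = quantize_int8(weights)
restored = dequantize(q_weights, scale)

# Rounding to the nearest grid point bounds the error by half a step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))

# Storage per parameter: 4 bytes in FP32, 2 in FP16, 1 in INT8.
num_params = 1_000_000
fp32_bytes = num_params * 4
fp16_bytes = num_params * 2
int8_bytes = num_params * 1
```

For values inside the clipping range, the reconstruction error is at most half the scale, and the weight storage shrinks by 2× for FP16 and 4× for INT8 relative to FP32.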

Post-training quantization vs. quantization-aware training

Two primary approaches exist, and each fits different production constraints.

Post-training quantization (PTQ): This method converts a trained FP32 model to lower precision without any retraining. A calibration dataset (a small, representative subset of production data) is run through the network to determine per-layer scaling factors that map FP32 value ranges to INT8 ranges. PTQ is fast to apply but can degrade accuracy on layers with wide activation ranges.
Quantization-aware training (QAT): This approach simulates quantization noise during the training loop itself, so the model learns weight distributions that are robust to reduced precision. QAT yields better accuracy than PTQ but requires a full training cycle, making it more expensive to apply.
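
The calibration step in PTQ and the simulated quantization in QAT can be sketched in a few lines of pure Python. This is a minimal illustration under stated assumptions: the max-absolute-value calibration rule, the helper names (`calibrate_scale`, `fake_quantize`, `mean_abs_error`), and the sample activations are all hypothetical. Real toolchains offer other calibration rules and, during QAT, also handle gradients through the rounding step, which this sketch omits.

```python
def calibrate_scale(calibration_batches):
    """PTQ calibration: derive one INT8 scale from the largest observed |activation|."""
    max_abs = max(abs(x) for batch in calibration_batches for x in batch)
    return max_abs / 127 if max_abs > 0 else 1.0


def fake_quantize(values, scale):
    """Quantize, then immediately dequantize, exposing the rounding error.

    QAT places an operation like this in the forward pass so that training
    sees the precision loss; PTQ applies the same mapping after training.
    """
    return [max(-127, min(127, round(v / scale))) * scale for v in values]


def mean_abs_error(values, scale):
    """Average absolute difference between values and their quantized form."""
    approx = fake_quantize(values, scale)
    return sum(abs(v - a) for v, a in zip(values, approx)) / len(values)


# Hypothetical activations from one layer (illustrative numbers only).
narrow_batches = [[0.11, -0.42, 0.93], [0.57, -0.88, 0.24]]
# Same layer, but one calibration sample contains a large outlier.
wide_batches = narrow_batches + [[40.0, 0.3, -0.7]]

typical = [0.11, -0.42, 0.93, 0.57, -0.88, 0.24]
narrow_scale = calibrate_scale(narrow_batches)
wide_scale = calibrate_scale(wide_batches)

narrow_error = mean_abs_error(typical, narrow_scale)
wide_error = mean_abs_error(typical, wide_scale)
```

With the outlier in the calibration set, the step size grows and typical activations land on a much coarser grid, so `wide_error` exceeds `narrow_error`. This is the accuracy degradation on wide-range layers that PTQ is prone to and that QAT lets the model adapt to during training.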

Accuracy-latency trade-offs in practice

FP16 inference typically preserves accuracy within 0.1% of ...
