
Deploying an Automatic Speech Recognition System

Explore how to deploy an Automatic Speech Recognition system at scale by estimating resources and designing modular subsystems for audio preprocessing, model inference, and post-processing. Understand how to handle massive user traffic using GPU servers and optimize bandwidth while maintaining transcription quality with Whisper V3.

In the previous lesson, we trained and evaluated Whisper V3 for automatic speech recognition (ASR). Now that we have a production-ready model, the next step is designing the system that deploys it at scale.

This lesson walks through the full system design, from resource estimation to a detailed architecture, showing how different services work together to handle real-world audio transcription traffic reliably and efficiently.

Let’s start with the resource estimation.

Resource estimation

Before designing the system, we need to estimate the resources required. We’ll look at three key areas: storage, inference servers, and network bandwidth. All estimates are based on 100 million daily active users, each submitting 10 audio clips per day.

Storage estimation

Storage requirements can be divided into two categories: stable storage (updated infrequently) and dynamic storage (scales with usage).

Stable storage

  • Model weights (Whisper V3, ~1.5 billion parameters at FP16 precision): ~3 GB

  • User profile data (100 M users × 10 KB each): ~1 TB
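The stable-storage figures above follow directly from the parameter count and user base. A quick back-of-the-envelope sketch (the FP16 byte size and per-user profile size are the lesson's assumptions):

```python
# Stable storage estimates for the ASR system (assumed figures from the lesson).
params = 1.5e9           # Whisper V3 parameter count (~1.5 billion)
bytes_per_param = 2      # FP16 precision = 2 bytes per parameter
model_gb = params * bytes_per_param / 1e9
print(f"Model weights: {model_gb:.1f} GB")    # 3.0 GB

users = 100e6            # 100 million daily active users
profile_kb = 10          # assumed profile size per user
profile_tb = users * profile_kb * 1e3 / 1e12
print(f"User profiles: {profile_tb:.1f} TB")  # 1.0 TB
```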

Dynamic storage

Each audio clip is assumed to be 30 seconds long, encoded at 256 kbps. That gives 256 × 10³ bits/s ÷ 8 = 32 × 10³ bytes/s, so 32 KB/s × 30 s = 960 KB ≈ 1 MB per clip.
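The per-clip size can be checked with the same bitrate-to-bytes conversion:

```python
# Per-clip size from the assumed bitrate and duration.
bitrate_kbps = 256       # assumed encoding bitrate
clip_seconds = 30        # assumed clip length
bytes_per_second = bitrate_kbps * 1000 / 8   # 32,000 bytes/s
bytes_per_clip = bytes_per_second * clip_seconds
mb_per_clip = bytes_per_clip / 1e6
print(f"{mb_per_clip:.2f} MB per clip")      # 0.96 MB, rounded up to ~1 MB
```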

  • Daily audio uploads: ...