
Deploying an Automatic Speech Recognition System

Explore how to deploy an Automatic Speech Recognition system at scale by estimating resources and designing modular subsystems for audio preprocessing, model inference, and post-processing. Understand how to handle massive user traffic using GPU servers and optimize bandwidth while maintaining transcription quality with Whisper V3.

In the previous lesson, we trained and evaluated Whisper V3 for automatic speech recognition (ASR). Now that we have a production-ready model, the next step is designing the system that deploys it at scale.

This lesson walks through the full system design, from resource estimation to a detailed architecture, showing how different services work together to handle real-world audio transcription traffic reliably and efficiently.

Let’s start with the resource estimation.

Resource estimation

Before designing the system, we need to estimate the resources required. We’ll look at three key areas: storage, inference servers, and network bandwidth. All estimates are based on 100 million daily active users, each submitting 10 audio clips per day.

Storage estimation

Storage requirements can be divided into two categories: stable storage (updated infrequently) and dynamic storage (scales with usage).

Stable storage

  • Model weights (Whisper V3, ~1.5 billion parameters at FP16 precision): ~3 GB

  • User profile data (100 M users × 10 KB each): ~1 TB
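The stable-storage figures above follow directly from the parameter count and user base. A quick back-of-the-envelope sketch (the FP16 byte size and per-user profile size are the lesson's assumptions):

```python
# Stable storage estimates for the ASR system (assumed figures from the lesson).
params = 1.5e9           # Whisper V3 parameter count (~1.5 billion)
bytes_per_param = 2      # FP16 precision = 2 bytes per parameter
model_gb = params * bytes_per_param / 1e9
print(f"Model weights: {model_gb:.1f} GB")    # 3.0 GB

users = 100e6            # 100 million daily active users
profile_kb = 10          # assumed profile size per user
profile_tb = users * profile_kb * 1e3 / 1e12
print(f"User profiles: {profile_tb:.1f} TB")  # 1.0 TB
```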

Dynamic storage

Each audio clip is assumed to be 30 seconds long, encoded at 256 kbps. That gives 256 × 10³ bits/s ÷ 8 = 32 × 10³ bytes/s, so 32 KB/s × 30 s = 960 KB ≈ 1 MB per clip.
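The per-clip size can be checked with the same bitrate-to-bytes conversion:

```python
# Per-clip size from the assumed bitrate and duration.
bitrate_kbps = 256       # assumed encoding bitrate
clip_seconds = 30        # assumed clip length
bytes_per_second = bitrate_kbps * 1000 / 8   # 32,000 bytes/s
bytes_per_clip = bytes_per_second * clip_seconds
mb_per_clip = bytes_per_clip / 1e6
print(f"{mb_per_clip:.2f} MB per clip")      # 0.96 MB, rounded up to ~1 MB
```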

  • Daily audio uploads: ...