
Training an Automatic Speech Recognition System

Explore the design and training of automatic speech recognition systems by understanding fundamental ASR concepts, model architectures like Whisper, and the end-to-end process from data preparation to evaluation metrics. Learn how to handle diverse languages, accents, and noisy environments to build robust speech-to-text models.

Automatic speech recognition (ASR) systems convert spoken language into text, enabling seamless human-computer interaction. ASR technology is used in virtual assistants, transcription pipelines, voice search systems, and accessibility tools. Recent advancements in deep learning and neural networks have significantly improved the accuracy and efficiency of ASR systems, making them integral to modern applications.

An abstract illustration of an automatic speech recognition system

Traditional ASR systems are built on hand-crafted features like Mel-Frequency Cepstral Coefficients (MFCCs) and statistical models such as Gaussian Mixture Models (GMMs) and Hidden Markov Models (HMMs), with acoustic, pronunciation, and language models trained separately. These systems face several challenges, including sensitivity to background noise, speaker accents, and variations in speech patterns, which reduce accuracy in real-world conditions. They also struggle with multilingual recognition, code-switching, and diverse accents, which limits accessibility and inclusivity. Modern ASR models like OpenAI’s Whisper v3 and NVIDIA’s Canary have significantly improved ASR by leveraging large-scale self-supervised learning on diverse, multilingual datasets. These models demonstrate superior robustness to noise, accents, and low-resource languages (languages that lack large datasets, standardized transcripts, or computational tools, making model training challenging), enabling more accurate transcription in real-world scenarios.
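To make the traditional feature pipeline concrete, here is a minimal MFCC-style extraction sketch using only NumPy. The frame length, hop size, filter count, and FFT size below are illustrative defaults, not values prescribed by this lesson; a production system would typically use a tuned audio library instead.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=16000, frame_len=400, hop=160, n_fft=512, n_mels=26, n_ceps=13):
    """Sketch of classic MFCC extraction: frame -> window -> power spectrum
    -> mel filterbank -> log -> DCT. Parameter values are illustrative."""
    # 1. Slice the waveform into overlapping frames and apply a Hamming window.
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    # 2. Power spectrum of each frame (zero-padded to n_fft).
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # 3. Triangular mel filterbank, equally spaced on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    # 4. Log mel energies, then a DCT-II to decorrelate -> cepstral coefficients.
    log_mel = np.log(power @ fbank.T + 1e-10)
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_mels))
    return log_mel @ dct.T  # shape: (n_frames, n_ceps)
```

For one second of 16 kHz audio with these defaults, this yields a `(98, 13)` matrix of coefficients, one 13-dimensional feature vector per 10 ms hop.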

Note: In this lesson, we will look at the fundamental concepts of ASR, the architecture of ASR models, the training process, and evaluation methods to build an end-to-end ASR system. As with any system, we will start with the requirements.

Requirements

The development of an ASR system begins with identifying the functional requirements that shape its essential behaviors and the nonfunctional requirements that determine its performance and reliability.

Functional Requirements

The functional requirements of an ASR system are:

  • Speech-to-text conversion: The model should accurately transcribe spoken language into text while accounting for pronunciation, accent, and background noise variations.

  • Audio preprocessing: The system must handle noise reduction, echo cancellation, and feature extraction.

  • Real-time processing: The ASR model must support low-latency, real-time transcription for use cases such as live captioning and virtual assistants.

  • Multilingual support: The ASR system must support recognition across multiple languages and dialects while maintaining consistent accuracy.

Nonfunctional Requirements

The nonfunctional requirements for an ASR system are:

  • Accuracy: The error rate (e.g., WER) should be minimized, particularly in noisy environments and for diverse accents and dialects.

  • Low latency: The ASR system should process and transcribe speech with minimal delay to support real-time applications.

  • Security and privacy: All user data (audio, text, and metadata) must be securely stored, encrypted, and handled according to data protection regulations.

  • Scalability: The system should handle large-scale deployments, supporting multiple users and languages simultaneously.
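The word error rate (WER) mentioned under the accuracy requirement is the standard ASR metric: the word-level Levenshtein distance between the reference and the hypothesis, divided by the number of reference words. A minimal dynamic-programming implementation:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / #reference
    words, computed with word-level edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,       # deletion
                           dp[i][j - 1] + 1,       # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[-1][-1] / max(len(ref), 1)
```

For example, `wer("the cat sat on the mat", "the cat sit on mat")` is 2/6 ≈ 0.33 (one substitution plus one deletion over six reference words). Note that WER can exceed 1.0 when the hypothesis contains many insertions.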

With our requirements decided, we can choose a model for our system.

Model Selection

Selecting the right model architecture is a key step in designing an automatic speech recognition (ASR) system. The choice directly affects how well the system generalizes, scales to large datasets, and maintains accuracy under different workload conditions, and it shapes both training complexity and deployment constraints.

We will focus on modern self-supervised learning (SSL) architectures for ASR, as they have demonstrated superior generalization and robustness over traditional methods. One such model is Whisper, a transformer-based ASR model that ...