
Integrating Speech-to-Text with Whisper v3

Explore integrating speech-to-text functionality into your AI chatbot using Whisper v3 and Gradio. Understand how to capture voice input, transcribe audio with Whisper, and incorporate it seamlessly into your multimodal chatbot interface to enable effective voice interaction.

So far, we have a chatbot that works with both text and images. Another modality we can add is voice. First, let's focus on updating our chatbot so it can take voice input from the user.

Taking voice as input

Gradio provides a simple Audio component that allows us to take audio as input. Let’s add it to a simple demo.

Running this code might open a pop-up in the browser that requests access to the microphone. Please grant access so that the chatbot can hear our voice.

import gradio as gr

def process_audio(audio):
    # This is where we will process the audio
    return "Audio recorded"

with gr.Blocks() as demo:
    audio_input = gr.Audio(sources=["microphone"])  # records audio from the user's microphone
    text_output = gr.Textbox()  # displays whatever process_audio returns

    btn = gr.Button("Process")
    btn.click(process_audio, inputs=audio_input, outputs=text_output)

demo.launch(server_name="0.0.0.0")
A simple audio input demo with Gradio

The code is simple and should be easy to understand now that we have used Gradio a few times. The only new addition is the gr.Audio component defined on line 8. It is configured with sources=["microphone"], so it records audio directly from the user's microphone.
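
For now, process_audio only returns a placeholder string, but the recording will eventually be passed to Whisper v3 for transcription. As a preview, here is a minimal sketch of how that could look, assuming the openai/whisper-large-v3 checkpoint loaded through the transformers pipeline and gr.Audio(type="filepath") so the callback receives the path to the recorded file; the exact model and wiring used later in this lesson may differ.

import gradio as gr
from transformers import pipeline

# Load a Whisper v3 speech-recognition pipeline
# (assumes the openai/whisper-large-v3 checkpoint from Hugging Face)
transcriber = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3")

def process_audio(audio):
    # With type="filepath", `audio` is the path to the recorded file
    if audio is None:
        return "No audio recorded"
    result = transcriber(audio)
    return result["text"]

with gr.Blocks() as demo:
    audio_input = gr.Audio(sources=["microphone"], type="filepath")
    text_output = gr.Textbox()

    btn = gr.Button("Transcribe")
    btn.click(process_audio, inputs=audio_input, outputs=text_output)

demo.launch(server_name="0.0.0.0")
A sketch of transcribing the recorded audio with Whisper v3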