What is Amazon Polly?

Amazon Polly is a text-to-speech (TTS) service from Amazon Web Services (AWS) that lets developers add natural-sounding speech synthesis to their applications. It uses deep learning technologies to convert text into lifelike speech, making it possible to develop voice-driven applications.

Amazon Polly lets developers control the voice, language, and speaking style of the generated audio. It offers a collection of male and female voices in 24 languages, and users can select any of these voices to tailor the audio output to their preferences and target audience.
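To make voice selection concrete, here is a minimal sketch of how an application might list the voices available for one language. It assumes the AWS SDK for Python (boto3) and its Polly `describe_voices` call; the helper name `list_voices` is hypothetical, and running it against the real service requires configured AWS credentials and a region.

```python
def list_voices(language_code, engine=None, client=None):
    """Return (voice id, gender) pairs that Polly reports for a language.

    `client` is any object exposing `describe_voices`; when omitted, a real
    boto3 Polly client is created (assumes boto3 is installed and AWS
    credentials are configured).
    """
    if client is None:
        import boto3  # imported lazily so the helper also works with a stub client
        client = boto3.client("polly")

    kwargs = {"LanguageCode": language_code}
    if engine:
        kwargs["Engine"] = engine  # e.g. "standard" or "neural"

    voices = []
    while True:
        response = client.describe_voices(**kwargs)
        voices.extend((v["Id"], v["Gender"]) for v in response["Voices"])
        token = response.get("NextToken")
        if not token:  # no further pages of results
            return voices
        kwargs["NextToken"] = token
```

Calling `list_voices("fr-FR")` would return the French voices visible to the account, from which the application can pick one for its audience.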

How does Amazon Polly work?

Amazon Polly uses deep learning technologies to analyze the provided text data. Different features of Amazon Polly work together to produce natural-sounding speech from the text. The Polly Neural text-to-speech (NTTS) voices utilize machine learning techniques to adjust intonation and rhythm, making speech lifelike.

Let’s consider an example of a mobile application that helps users learn foreign languages. As part of its functionality, the application helps users learn to pronounce words properly. Suppose the application receives a request from a user to hear the pronunciation of the word “Bonjour,” which means “hello” in French. The application sends the input text to Amazon Polly, which converts “Bonjour” into speech using a French voice, taking into account the phonetics and pronunciation rules of the French language.
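The “Bonjour” request could look roughly like the sketch below. It assumes boto3 and its Polly `synthesize_speech` call; the helper names `build_request` and `synthesize_to_file`, the choice of the French voice `Lea`, and the neural engine are illustrative assumptions, not something the example app is known to use.

```python
def build_request(text, voice_id="Lea", language_code="fr-FR", engine="neural"):
    """Build the keyword arguments for a Polly synthesize_speech call.

    The default voice and engine are assumptions for illustration; check
    which voices and engines are available in your region.
    """
    return {
        "Text": text,
        "VoiceId": voice_id,
        "LanguageCode": language_code,
        "Engine": engine,
        "OutputFormat": "mp3",
    }


def synthesize_to_file(text, path, client=None, **options):
    """Send `text` to Polly and write the returned audio stream to `path`."""
    if client is None:
        import boto3  # lazy import: only needed when calling the real service
        client = boto3.client("polly")

    response = client.synthesize_speech(**build_request(text, **options))
    with open(path, "wb") as audio_file:
        audio_file.write(response["AudioStream"].read())
    return path
```

With credentials configured, `synthesize_to_file("Bonjour", "bonjour.mp3")` would save an MP3 file that the mobile app could then play back to the learner.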

Primary features of Amazon Polly

Amazon Polly has multiple features to improve the quality of the speech produced as its output. Some of its highlighted features are as follows:

Neural and standard speech: Amazon Polly offers neural and standard text-to-speech voices. The standard engine produces natural-sounding speech, while the neural engine goes further and makes the speech more human-like.
Speech Synthesis Markup Language (SSML) support: Amazon Polly supports SSML, a standardized markup language for controlling the speech synthesis process. SSML tags add pauses, emphasis, intonation, prosody, pronunciation hints, and other vocal effects for a more natural-sounding experience.
Multiple languages and voices: It supports various languages and accents, allowing developers to choose the most suitable voice for their target audience or application context. It also includes voices of different genders, age groups, and regional accents.
Integration: Polly seamlessly integrates with other AWS services, such as Amazon S3, AWS Lambda, and Amazon Lex, enabling developers to incorporate speech synthesis into their existing AWS workflows and applications.
Speech marks: Speech marks are metadata describing where each word or sentence falls in the generated audio. The time information is given in milliseconds.
Cost-effective: Amazon Polly has a pay-as-you-go pricing model, so we only get charged for the text we convert to speech.
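The SSML and speech-mark features from the list above can be sketched together. The example below assumes boto3's `synthesize_speech` with `TextType="ssml"`, `OutputFormat="json"`, and `SpeechMarkTypes`, and assumes Polly returns speech marks as one JSON object per line. The helper names and the voice `Joanna` are hypothetical choices for illustration.

```python
import json
from xml.sax.saxutils import escape


def build_ssml(text, pause_ms=300, rate="slow"):
    """Wrap plain text in SSML: a leading pause, then a slower speaking rate."""
    return (
        "<speak>"
        f'<break time="{pause_ms}ms"/>'
        f'<prosody rate="{rate}">{escape(text)}</prosody>'
        "</speak>"
    )


def parse_speech_marks(raw):
    """Parse newline-delimited JSON speech marks into a list of dicts."""
    if isinstance(raw, bytes):
        raw = raw.decode("utf-8")
    return [json.loads(line) for line in raw.splitlines() if line.strip()]


def word_timings(marks):
    """Return (time in milliseconds, word) pairs for word-level marks only."""
    return [(m["time"], m["value"]) for m in marks if m["type"] == "word"]


def fetch_speech_marks(ssml, voice_id="Joanna", client=None):
    """Ask Polly for word and sentence speech marks instead of audio."""
    if client is None:
        import boto3  # lazy import: only needed when calling the real service
        client = boto3.client("polly")
    response = client.synthesize_speech(
        Text=ssml,
        TextType="ssml",
        VoiceId=voice_id,
        OutputFormat="json",  # speech marks are returned as JSON, not audio
        SpeechMarkTypes=["word", "sentence"],
    )
    return parse_speech_marks(response["AudioStream"].read())
```

An application could use `word_timings(fetch_speech_marks(build_ssml("Hello there")))` to highlight each word on screen as it is spoken. Escaping the text before embedding it in SSML keeps characters such as `&` from breaking the markup.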