Community Wishlist Survey 2022/Reading/IPA audio renderer/TTS investigation
Community Tech need to select a text-to-speech (TTS) engine to drive the IPA audio renderer wish. There are a few good options available to us.
Overview
TTS Engine | Type | Licence | Languages | Cost (USD per character) | SSML | Voices |
---|---|---|---|---|---|---|
phoneme-synthesis + meSpeak.js | Library | GPLv3 (open source) | 24 | N/A | Subset | 29 |
larynx | CLI/API | MIT (open source) | 9 | N/A | Subset | 50 |
espeak-ng | CLI/API | GPLv3 (open source) | 127 | N/A | Subset | 127[nb 1] |
Google Cloud | API | Closed source | 40 | 0.000004 | Full | 100 |
IBM Cloud | API | Closed source | 13 | 0.00002 | Full | 26 |
Microsoft Azure | API | Closed source | 129 | 0.000016 | Full | 270 |
Amazon AWS | API | Closed source | 22 | 0.000004 | Full | 66 |
Requirements
The TTS engine we pick should:
- accept SSML (Speech Synthesis Markup Language), a W3C standard[1]
- produce acceptable quality speech synthesis
- support as wide a range of languages as possible
Audio samples
- https://tnt-dev.toolforge.org/projects/tts (work in progress)
phoneme-synthesis + meSpeak.js
Notes
meSpeak.js (modulary enhanced speak.js) is a client-side JavaScript text-to-speech library based on the speak.js project[2]. It could possibly be included directly in an extension.
Licence
- GPLv3 (open source)
Languages & voices
24 languages (29 voices) are supported, with varying completeness[3]
- Catalan
- Czech
- German
- Greek
- English
- Esperanto
- Spanish
- Finnish
- French
- Hungarian
- Italian
- Kannada
- Latin
- Latvian
- Dutch
- Polish
- Portuguese
- Romanian
- Slovak
- Swedish
- Turkish
- Mandarin Chinese
- Cantonese Chinese
Quality
Better than larynx out of the box, but could be improved with some tweaking.
Costs
N/A
SSML
SSML support can be enabled via a flag.[2]
Notes
Has some issues with the schwa (ə).
Links
larynx
Notes
larynx would need to be run as an API on the production cluster, with an extension packaging the IPA as SSML.
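Packaging IPA as SSML could look roughly like the following. This is a minimal sketch in Python, not the extension's actual code: the function name `ipa_to_ssml`, its parameters and the default `en-US` language are hypothetical. It relies only on the `<phoneme>` element with `alphabet="ipa"` as defined in SSML 1.1; whether a given engine honours that element would still need testing.

```python
from xml.sax.saxutils import escape, quoteattr

# Namespace defined by the W3C SSML 1.1 specification.
SSML_NS = "http://www.w3.org/2001/10/synthesis"


def ipa_to_ssml(ipa: str, fallback_text: str = "", lang: str = "en-US") -> str:
    """Wrap an IPA transcription in a minimal SSML document (hypothetical helper)."""
    # Drop the slashes or square brackets that usually delimit IPA on wiki pages.
    ipa = ipa.strip().strip("/[]")
    return (
        f'<speak version="1.1" xmlns="{SSML_NS}" xml:lang={quoteattr(lang)}>'
        # The element content is the plain-text fallback for engines
        # that cannot render the phoneme string.
        f'<phoneme alphabet="ipa" ph={quoteattr(ipa)}>{escape(fallback_text)}</phoneme>'
        "</speak>"
    )


if __name__ == "__main__":
    print(ipa_to_ssml("/təˈmeɪtoʊ/", "tomato"))
```

The fallback text is optional, but including the written word gives engines with partial SSML support something to read if they ignore the `ph` attribute.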
Licence
- MIT (open source)
Languages & voices
9 languages (50 voices) are supported[4]. The voices are primarily based on Glow-TTS, a flow-based generative model trained with Monotonic Alignment Search[5]
- English
- German
- French
- Spanish
- Dutch
- Italian
- Swedish
- Swahili
- Russian
Quality
Tested: fairly poor with default settings, and will require a lot of tweaking.
Costs
N/A
SSML
Only a subset of SSML is supported; however, the most useful elements (i.e. phonemes) are available[6]
Notes
Links
- GitHub
- TheresNoTime's fork
- Languages/Voices
- SSML support
- CommTech's test installation: https://larynx-tts.wmcloud.org/openapi/
espeak-ng
Notes
meSpeak.js, mentioned above, is based on eSpeak, and eSpeak NG is a CLI application that is backwards compatible with eSpeak[7]. We would also need to run this as an API.
Licence
- GPLv3 (open source)
Languages & voices
Quality
Untested
Costs
N/A
SSML
Similar to meSpeak.js, a subset of SSML is supported.
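If eSpeak NG were run behind an API, the service would presumably invoke the CLI with SSML interpretation switched on. The following is a minimal sketch in Python, assuming the `espeak-ng` binary is on the PATH. The helper names `espeak_ng_command` and `synthesise`, the default voice and the output path are hypothetical. It uses the `-m` (interpret SSML markup), `-v` (voice) and `-w` (write WAV file) options of the CLI.

```python
import subprocess
from typing import List


def espeak_ng_command(ssml: str, voice: str = "en", wav_path: str = "out.wav") -> List[str]:
    """Build the argument list for one espeak-ng synthesis run (hypothetical helper)."""
    return [
        "espeak-ng",
        "-m",            # interpret SSML markup in the input text
        "-v", voice,     # voice (language) to synthesise with
        "-w", wav_path,  # write the audio to a WAV file instead of playing it
        ssml,
    ]


def synthesise(ssml: str, voice: str = "en", wav_path: str = "out.wav") -> None:
    """Run espeak-ng; raises CalledProcessError if the command fails."""
    subprocess.run(espeak_ng_command(ssml, voice, wav_path), check=True)


if __name__ == "__main__":
    # Only prints the command; call synthesise() on a machine with espeak-ng installed.
    print(espeak_ng_command('<speak>Hello <break time="500ms"/> world</speak>'))
```

Passing the arguments as a list rather than a shell string avoids quoting problems with the angle brackets and quotes in the SSML.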
Notes
Links
Google Cloud
Notes
API
Licence
- Proprietary
Languages & voices
40 languages (100+ voices)
Quality
As expected from a commercial service, very good with default settings. No tweaking necessary.
Costs
All costs exclude "WaveNet" (a DeepMind generative ML model[9]) voices, and are based on publicly available pricing.
Free quota
- 4 million characters per month
Then
- $0.000004 USD per character
SSML
Fully supported
Notes
Links
IBM Cloud
Notes
API
Licence
- Proprietary
Languages & voices
13 languages (26 voices) are supported[10]
- Arabic
- Chinese
- Czech
- Dutch
- English
- French
- German
- Italian
- Japanese
- Korean
- Portuguese
- Spanish
- Swedish
Quality
Untested
Costs
All costs are based on publicly available pricing.
Free quota
- 10,000 characters per month
Then
- $0.00002 USD per character
SSML
Fully supported
Notes
Links
Microsoft Azure
Notes
API
Licence
- Proprietary
Languages & voices
129 languages (270 voices) are supported
Quality
As expected from a commercial service, very good with default settings. No tweaking necessary.
Costs
All costs exclude "Custom Neural" voices, and are based on publicly available pricing.
Free quota
- 0.5 million characters per month
Then
- $0.000016 USD per character
SSML
Fully supported
Notes
Links
Amazon AWS
Notes
API
Licence
- Proprietary
Languages & voices
22 languages (66 voices) are supported
Quality
As expected from a commercial service, very good with default settings. No tweaking necessary.
Costs
All costs are based on publicly available pricing.
Free quota
- 5 million characters per month (for 12 months)
Then
- $0.000004 USD per character
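To compare the commercial services at a given usage level, the free quotas and per-character prices listed on this page can be combined into a rough monthly estimate. The sketch below is illustrative only: the names `Pricing`, `PRICING` and `monthly_cost_usd` are hypothetical, and it assumes flat per-character pricing above the free quota, with no volume tiers or premium voices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    free_chars_per_month: int  # monthly free quota, in characters
    usd_per_char: float        # price per character beyond the quota


# Figures copied from the cost sections of this page.
PRICING = {
    "Google Cloud": Pricing(4_000_000, 0.000004),
    "IBM Cloud": Pricing(10_000, 0.00002),
    "Microsoft Azure": Pricing(500_000, 0.000016),
    # The AWS free quota only applies for the first 12 months.
    "Amazon AWS": Pricing(5_000_000, 0.000004),
}


def monthly_cost_usd(provider: str, chars: int) -> float:
    """Estimate one month's cost, assuming flat per-character pricing."""
    pricing = PRICING[provider]
    billable = max(0, chars - pricing.free_chars_per_month)
    return billable * pricing.usd_per_char


if __name__ == "__main__":
    for name in PRICING:
        print(f"{name}: ${monthly_cost_usd(name, 10_000_000):.2f}")
```

The character volume passed in is a placeholder; a realistic figure would have to come from an estimate of how many IPA strings are rendered and how often results can be cached.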
SSML
Fully supported
Notes
Links
See also
Footnotes
References
- [1] "Speech Synthesis Markup Language (SSML) Version 1.1". www.w3.org. Archived from the original on 2022-03-16. Retrieved 2022-05-18.
- [2] "meSpeak.js: Text-to-Speech on the Web". www.masswerk.at. Archived from the original on 2022-04-27. Retrieved 2022-05-18.
- [3] "meSpeak – Voices & Languages". www.masswerk.at. Archived from the original on 2022-05-18. Retrieved 2022-05-18.
- [4] "Larynx: End to end text to speech system using gruut and onnx". github.com. 2022-05-18. Archived from the original on 2022-05-18. Retrieved 2022-05-18.
- [5] Kim, Jaehyeon; Kim, Sungwon; Kong, Jungil; Yoon, Sungroh (2020-10-22). "Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search". Archived from the original on 2022-05-16. Retrieved 2022-05-18.
- [6] "Larynx: SSML". github.com. 2022-05-18. Archived from the original on 2022-05-18. Retrieved 2022-05-18.
- [7] "eSpeak NG Text-to-Speech". github.com. 2022-05-18. Archived from the original on 2022-05-18. Retrieved 2022-05-18.
- [8] "eSpeak NG languages". github.com. 2022-05-18. Archived from the original on 2022-05-18. Retrieved 2022-05-18.
- [9] "Introducing Cloud Text-to-Speech powered by DeepMind WaveNet technology". Google Cloud Blog. Archived from the original on 2022-05-16. Retrieved 2022-05-18.
- [10] "IBM Cloud Text-to-speech languages". cloud.ibm.com. Archived from the original on 2021-04-15. Retrieved 2022-05-18.