Community Wishlist Survey 2022/Reading/IPA audio renderer/TTS investigation

Tracked in Phabricator:
Task T307624

Community Tech need to select a text-to-speech (TTS) engine to drive the IPA audio renderer wish. Several good options are available to us.

Overview

TTS Engine                     | Type    | Licence             | Languages | Costs (USD/character) | SSML | Voices
-------------------------------|---------|---------------------|-----------|-----------------------|------|----------
phoneme-synthesis + meSpeak.js | Library | GPLv3 (open source) | 24        | N/A                   | Y    | 29
larynx                         | CLI/API | MIT (open source)   | 9         | N/A                   | Y    | 50
espeak-ng                      | CLI/API | GPLv3 (open source) | 127       | N/A                   | Y    | 127[nb 1]
Google Cloud                   | API     | Closed source       | 40        | 0.000004              | Y    | 100
IBM Cloud                      | API     | Closed source       | 13        | 0.00002               | Y    | 26
Microsoft Azure                | API     | Closed source       | 129       | 0.000016              | Y    | 270
Amazon AWS                     | API     | Closed source       | 22        | 0.000004              | Y    | 66

Requirements

The TTS engine we pick should:

Audio samples

phoneme-synthesis + meSpeak.js

Notes

meSpeak.js (modulary enhanced speak.js) is a client-side JavaScript text-to-speech library based on the speak.js project.[2] Because it runs client-side, it could possibly be included directly in an extension; this has not been confirmed.

Licence

  • GPLv3 (open source)

Languages & voices

24 languages (29 voices) are supported, with varying completeness[3]

  • Catalan
  • Czech
  • German
  • Greek
  • English
  • Esperanto
  • Spanish
  • Finnish
  • French
  • Hungarian
  • Italian
  • Kannada
  • Latin
  • Latvian
  • Dutch
  • Polish
  • Portuguese
  • Romanian
  • Slovak
  • Swedish
  • Turkish
  • Mandarin Chinese
  • Cantonese Chinese

Quality

Better than larynx with default settings, and would likely improve further with some tweaking.

Costs

N/A

SSML

SSML support can be enabled via a flag.[2]

Notes

Has some issues rendering the schwa (ə).

larynx

Notes

larynx would need to run as an API service on the production cluster, with an extension converting IPA into SSML for it.
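
The IPA-to-SSML step could look roughly like the sketch below, which wraps a transcription in the standard SSML 1.1 phoneme element. The function name ipa_to_ssml and its parameters are illustrative only, and whether a given engine accepts this exact document should be checked against that engine's SSML documentation.

```python
from xml.sax.saxutils import escape, quoteattr


def ipa_to_ssml(ipa: str, lang: str = "en", fallback_text: str = "") -> str:
    """Wrap an IPA transcription in an SSML 1.1 <phoneme> element."""
    # Strip the slashes or square brackets that usually delimit IPA.
    ipa = ipa.strip().strip("/[]")
    # quoteattr() escapes the value and adds the surrounding quotes.
    return (
        '<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis" '
        f"xml:lang={quoteattr(lang)}>"
        f'<phoneme alphabet="ipa" ph={quoteattr(ipa)}>'
        f"{escape(fallback_text)}</phoneme>"
        "</speak>"
    )


if __name__ == "__main__":
    print(ipa_to_ssml("/həˈloʊ/", fallback_text="hello"))
```

The fallback text is what an engine would read if it ignored the ph attribute, so it is worth passing the written word where one is available.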

Licence

  • MIT (open source)

Languages & voices

9 languages (50 voices) are supported.[4] The voices are primarily based on Glow-TTS, a voice model trained using Monotonic Alignment Search.[5]

  • English
  • German
  • French
  • Spanish
  • Dutch
  • Italian
  • Swedish
  • Swahili
  • Russian
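
To see how much language coverage larynx adds over meSpeak.js, the two lists on this page can be compared directly. This is a small sketch using only the language names listed above; the variable names are illustrative.

```python
# Language lists copied from the "Languages & voices" sections of this page.
MESPEAK = {
    "Catalan", "Czech", "German", "Greek", "English", "Esperanto", "Spanish",
    "Finnish", "French", "Hungarian", "Italian", "Kannada", "Latin", "Latvian",
    "Dutch", "Polish", "Portuguese", "Romanian", "Slovak", "Swedish", "Turkish",
    "Mandarin Chinese", "Cantonese Chinese",
}
LARYNX = {
    "English", "German", "French", "Spanish", "Dutch",
    "Italian", "Swedish", "Swahili", "Russian",
}

# Languages both engines list, and those only larynx lists.
shared = sorted(MESPEAK & LARYNX)
larynx_only = sorted(LARYNX - MESPEAK)

if __name__ == "__main__":
    print("Shared:", ", ".join(shared))
    print("larynx only:", ", ".join(larynx_only))
```

Going by these lists alone, larynx adds only Swahili and Russian beyond what meSpeak.js already covers.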

Quality

Tested: fairly poor with default settings, and will require a lot of tweaking.

Costs

N/A

SSML

Only a subset of SSML is supported; however, the elements most useful to us (i.e. phonemes) are available.[6]
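
When an engine supports only part of SSML, it may help to check generated markup against an allow-list before sending it. The sketch below shows one way to do that; the SUPPORTED_ELEMENTS set is a placeholder for illustration and is not larynx's actual list, which should be taken from its SSML documentation.[6]

```python
import xml.etree.ElementTree as ET

# Placeholder allow-list for illustration only; replace it with the
# elements the chosen engine actually documents as supported.
SUPPORTED_ELEMENTS = frozenset({"speak", "s", "phoneme", "break"})


def unsupported_elements(ssml: str, supported=SUPPORTED_ELEMENTS) -> set:
    """Return the names of elements in `ssml` that are not in `supported`."""
    root = ET.fromstring(ssml)
    found = set()
    for element in root.iter():  # iter() includes the root element itself
        # Drop any "{namespace}" prefix ElementTree adds to the tag name.
        name = element.tag.rsplit("}", 1)[-1]
        if name not in supported:
            found.add(name)
    return found


if __name__ == "__main__":
    print(unsupported_elements("<speak><emphasis>hi</emphasis></speak>"))
```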

Notes

espeak-ng

Notes

meSpeak.js, mentioned above, is based on eSpeak, and eSpeak NG is a CLI application that is backwards compatible with eSpeak.[7] We would also need to run this as an API.
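
Running the CLI behind an API would mean shelling out to espeak-ng for each request. A minimal sketch of that call is below, using the -v, -m, -w and --stdin options of espeak-ng; the function names are illustrative, and how well espeak-ng handles any particular SSML element has not been tested here.

```python
import subprocess


def espeak_ng_command(voice: str, wav_path: str) -> list:
    """Build an espeak-ng argument list that reads SSML from stdin."""
    return [
        "espeak-ng",
        "-v", voice,     # voice/language name, e.g. "en"
        "-m",            # interpret SSML markup in the input
        "-w", wav_path,  # write audio to a WAV file instead of playing it
        "--stdin",       # read the text to speak from standard input
    ]


def synthesise(ssml: str, voice: str, wav_path: str) -> None:
    """Run espeak-ng; requires the espeak-ng binary to be installed."""
    subprocess.run(
        espeak_ng_command(voice, wav_path),
        input=ssml.encode("utf-8"),
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
```

Passing the arguments as a list rather than a shell string avoids quoting problems with user-supplied transcriptions.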

Licence

  • GPLv3 (open source)

Languages & voices

127[nb 1] languages[8]

Quality

Untested

Costs

N/A

SSML

Similar to meSpeak.js, a subset of SSML is supported.

Notes

Google Cloud

Notes

API

Licence

  • Proprietary

Languages & voices

40 languages (100+ voices) are supported

Quality

As expected from a commercial service, very good with default settings. No tweaking necessary.

Costs

All costs exclude "WaveNet" voices (based on a DeepMind generative neural network model[9]), and are based on publicly available pricing.

Free quota

  • 4 million characters per month

Then

  • $0.000004 USD per character

SSML

Fully supported
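
For reference, an SSML request to the Cloud Text-to-Speech REST API is a small JSON document posted to the text:synthesize method. The sketch below only builds that body and sends nothing; authentication is left out, and the helper name build_request_body and the default language code are illustrative choices.

```python
import json

# REST endpoint of the Cloud Text-to-Speech v1 synthesize method.
SYNTHESIZE_URL = "https://texttospeech.googleapis.com/v1/text:synthesize"


def build_request_body(ssml: str, language_code: str = "en-US") -> str:
    """Return the JSON body for a text:synthesize request with SSML input."""
    body = {
        "input": {"ssml": ssml},
        "voice": {"languageCode": language_code},
        "audioConfig": {"audioEncoding": "MP3"},
    }
    # ensure_ascii=False keeps IPA characters readable in the payload.
    return json.dumps(body, ensure_ascii=False)


if __name__ == "__main__":
    print(build_request_body("<speak>hello</speak>"))
```

The response carries the audio as a base64-encoded audioContent field, which the caller would need to decode before storing or serving it.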

Notes

IBM Cloud

Notes

API

Licence

  • Proprietary

Languages & voices

13 languages (26 voices) are supported[10]

  • Arabic
  • Chinese
  • Czech
  • Dutch
  • English
  • French
  • German
  • Italian
  • Japanese
  • Korean
  • Portuguese
  • Spanish
  • Swedish

Quality

Untested

Costs

All costs are based on publicly available pricing.

Free quota

  • 10,000 characters per month

Then

  • $0.00002 USD per character

SSML

Fully supported

Notes

Microsoft Azure

Notes

API

Licence

  • Proprietary

Languages & voices

129 languages (270 voices) are supported

Quality

As expected from a commercial service, very good with default settings. No tweaking necessary.

Costs

All costs exclude "Custom Neural" voices, and are based on publicly available pricing.

Free quota

  • 0.5 million characters per month

Then

  • $0.000016 USD per character

SSML

Fully supported

Notes

Amazon AWS

Notes

API

Licence

  • Proprietary

Languages & voices

22 languages (66 voices) are supported

Quality

As expected from a commercial service, very good with default settings. No tweaking necessary.

Costs

All costs are based on publicly available pricing.

Free quota

  • 5 million characters per month (for 12 months)

Then

  • $0.000004 USD per character
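
With a free quota and a per-character rate for each of the four commercial services, a rough monthly cost can be estimated for any volume. The sketch below uses only the figures from the Costs sections on this page and assumes a flat rate above the quota; the 10 million character volume and the names PRICING and monthly_cost are hypothetical.

```python
from decimal import Decimal

# (free characters per month, USD per character beyond the quota),
# copied from the "Costs" sections on this page.
PRICING = {
    "Google Cloud": (4_000_000, Decimal("0.000004")),
    "IBM Cloud": (10_000, Decimal("0.00002")),
    "Microsoft Azure": (500_000, Decimal("0.000016")),
    # The AWS free quota only applies for the first 12 months.
    "Amazon AWS": (5_000_000, Decimal("0.000004")),
}


def monthly_cost(engine: str, characters: int) -> Decimal:
    """Estimate the USD cost of synthesising `characters` in one month."""
    free_quota, rate = PRICING[engine]
    # Only characters beyond the free quota are billed.
    billable = max(0, characters - free_quota)
    return billable * rate


if __name__ == "__main__":
    volume = 10_000_000  # hypothetical monthly volume
    for engine in PRICING:
        print(f"{engine}: ${monthly_cost(engine, volume):.2f}")
```

Decimal is used instead of float so that very small per-character prices multiply out without rounding noise.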

SSML

Fully supported

Notes

See also

Footnotes

  1. Voice count is uncertain; likely at least one per language.

References

  1. "Speech Synthesis Markup Language (SSML) Version 1.1". www.w3.org. Archived from the original on 2022-03-16. Retrieved 2022-05-18. 
  2. "meSpeak.js: Text-to-Speech on the Web". www.masswerk.at. Archived from the original on 2022-04-27. Retrieved 2022-05-18.
  3. "meSpeak – Voices & Languages". www.masswerk.at. Archived from the original on 2022-05-18. Retrieved 2022-05-18. 
  4. "Larynx: End to end text to speech system using gruut and onnx". github.com. 2022-05-18. Archived from the original on 2022-05-18. Retrieved 2022-05-18. 
  5. Kim, Jaehyeon; Kim, Sungwon; Kong, Jungil; Yoon, Sungroh (2020-10-22). "Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search". Archived from the original on 2022-05-16. Retrieved 2022-05-18. 
  6. "Larynx: SSML". github.com. 2022-05-18. Archived from the original on 2022-05-18. Retrieved 2022-05-18. 
  7. "eSpeak NG Text-to-Speech". github.com. 2022-05-18. Archived from the original on 2022-05-18. Retrieved 2022-05-18. 
  8. "eSpeak NG languages". github.com. 2022-05-18. Archived from the original on 2022-05-18. Retrieved 2022-05-18. 
  9. "Introducing Cloud Text-to-Speech powered by DeepMind WaveNet technology". Google Cloud Blog. Archived from the original on 2022-05-16. Retrieved 2022-05-18. 
  10. "IBM Cloud Text-to-speech languages". cloud.ibm.com. Archived from the original on 2021-04-15. Retrieved 2022-05-18.