Community Wishlist Survey 2022/Reading/IPA audio renderer/TTS investigation
Community Tech need to select a text-to-speech (TTS) engine to drive the IPA audio renderer wish, and there are a few good options available to us.
Overview
TTS Engine | Type | Licence | Languages | Costs (USD/character) | SSML | Voices |
---|---|---|---|---|---|---|
phoneme-synthesis + meSpeak.js | Library | GPLv3 (open source) | 24 | N/A | Via a flag | 29 |
larynx | CLI/API | MIT (open source) | 9 | N/A | Subset | 50 |
espeak-ng | CLI/API | GPLv3 (open source) | 127 | N/A | Subset | 127[nb 1] |
Google Cloud | API | Closed source | 40 | 0.000004 | Full | 100+ |
IBM Cloud | API | Closed source | 13 | 0.00002 | Full | 26 |
Microsoft Azure | API | Closed source | 129 | 0.000016 | Full | 270 |
Amazon AWS | API | Closed source | 22 | 0.000004 | Full | 66 |
Requirements
The TTS engine we pick should:
- accept SSML (Speech Synthesis Markup Language), a W3C standard[1]
- produce speech synthesis of acceptable quality
- support as wide a range of languages as possible
Audio samples
- https://tnt-dev.toolforge.org/projects/tts (work in progress)
phoneme-synthesis + meSpeak.js
Notes
meSpeak.js ("modulary enhanced speak.js") is a client-side JavaScript text-to-speech library based on the speak.js project[2]. Because it runs in the browser, it could possibly be included directly in an extension.
Licence
- GPLv3 (open source)
Languages & voices
24 languages (29 voices) are supported, with varying completeness[3]
- Catalan
- Czech
- German
- Greek
- English
- Esperanto
- Spanish
- Finnish
- French
- Hungarian
- Italian
- Kannada
- Latin
- Latvian
- Dutch
- Polish
- Portuguese
- Romanian
- Slovak
- Swedish
- Turkish
- Mandarin Chinese
- Cantonese Chinese
Quality
Better than larynx out of the box, but could be improved with some tweaking.
Costs
N/A
SSML
SSML support can be enabled via a flag.[2]
Notes
Has some issues with the schwa (ə).
Links
larynx
Notes
larynx would need to be run as an API on the production cluster, with an extension packaging IPA into SSML.
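Packaging IPA as SSML could look roughly like the sketch below. It relies on the `phoneme` element with `alphabet="ipa"` from the SSML 1.1 specification; the function name `ipa_to_ssml`, the default `xml:lang` value and the fallback text are assumptions made for illustration, and each engine may expect a slightly different wrapper.

```python
from xml.sax.saxutils import escape, quoteattr


def ipa_to_ssml(ipa: str, fallback_text: str = "", lang: str = "en-US") -> str:
    """Wrap an IPA transcription in a minimal SSML document.

    Leading/trailing slashes or square brackets (e.g. "/həˈləʊ/") are dropped,
    because only the phonemic string itself belongs in the `ph` attribute.
    """
    ipa = ipa.strip().strip("/[]")
    return (
        '<speak version="1.1" '
        'xmlns="http://www.w3.org/2001/10/synthesis" '
        f"xml:lang={quoteattr(lang)}>"
        # quoteattr() adds the surrounding quotes and escapes the value
        f'<phoneme alphabet="ipa" ph={quoteattr(ipa)}>'
        f"{escape(fallback_text)}"
        "</phoneme></speak>"
    )


if __name__ == "__main__":
    print(ipa_to_ssml("/həˈləʊ/", fallback_text="hello"))
```

The fallback text is what an engine would read if it cannot interpret the `ph` attribute, so passing the written form of the word is usually safer than leaving it empty.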
Licence
- MIT (open source)
Languages & voices
9 languages (50 voices) are supported[4]. The voices are primarily based on Glow-TTS, a generative flow model for text-to-speech trained with Monotonic Alignment Search[5]
- English
- German
- French
- Spanish
- Dutch
- Italian
- Swedish
- Swahili
- Russian
Quality
Tested: fairly poor with default settings, and will require a lot of tweaking.
Costs
N/A
SSML
Only a subset of SSML is supported; however, the elements most useful to us (i.e. phonemes) are available[6]
Notes
Links
- GitHub
- TheresNoTime's fork
- Languages/Voices
- SSML support
- CommTech's test installation: https://larynx-tts.wmcloud.org/openapi/
espeak-ng
Notes
meSpeak.js, mentioned above, is based on eSpeak, and eSpeak NG is a CLI application that is backwards compatible with eSpeak[7]. We would also need to run this as an API.
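At its simplest, running eSpeak NG behind an API would mean wrapping the command-line tool. The sketch below builds the command line separately from running it, so the arguments can be inspected without the binary installed. It assumes `espeak-ng` is on the PATH and uses its `-m` (input contains SSML markup), `-v` (voice) and `-w` (write a WAV file) options; the function names, default voice and output path are hypothetical.

```python
import subprocess
from typing import List


def build_espeak_ng_command(
    ssml: str, voice: str = "en", wav_path: str = "out.wav"
) -> List[str]:
    """Return the argv list for one espeak-ng call (nothing is executed)."""
    return [
        "espeak-ng",
        "-m",            # treat the input text as SSML markup
        "-v", voice,     # voice/language to use
        "-w", wav_path,  # write audio to a WAV file instead of playing it
        ssml,
    ]


def synthesise(ssml: str, voice: str = "en", wav_path: str = "out.wav") -> str:
    """Run espeak-ng and return the WAV path.

    Raises subprocess.CalledProcessError if espeak-ng exits with an error.
    """
    subprocess.run(build_espeak_ng_command(ssml, voice, wav_path), check=True)
    return wav_path
```

Passing the arguments as a list rather than a shell string avoids quoting problems with the angle brackets and quotes that SSML contains.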
Licence
- GPLv3 (open source)
Languages & voices
127 languages (127 voices[nb 1]) are supported[8]
Quality
Untested
Costs
N/A
SSML
Similar to meSpeak.js, a subset of SSML is supported.
Notes
Links
Google Cloud
Notes
API
Licence
- Proprietary
Languages & voices
40 languages (100+ voices)
Quality
As expected from a commercial service, very good with default settings. No tweaking necessary.
Costs
All costs exclude "WaveNet" (DeepMind generative ML model[9]) voices, and are based on publicly available pricing.
Free quota
- 4 million characters per month
Then
- $0.000004 USD per character
SSML
Fully supported
Notes
Links
IBM Cloud
Notes
API
Licence
- Proprietary
Languages & voices
13 languages (26 voices) are supported[10]
- Arabic
- Chinese
- Czech
- Dutch
- English
- French
- German
- Italian
- Japanese
- Korean
- Portuguese
- Spanish
- Swedish
Quality
Untested
Costs
All costs are based on publicly available pricing.
Free quota
- 10,000 characters per month
Then
- $0.00002 USD per character
SSML
Fully supported
Notes
Links
Microsoft Azure
Notes
API
Licence
- Proprietary
Languages & voices
129 languages (270 voices) are supported
Quality
As expected from a commercial service, very good with default settings. No tweaking necessary.
Costs
All costs exclude "Custom Neural" voices, and are based on publicly available pricing.
Free quota
- 0.5 million characters per month
Then
- $0.000016 USD per character
SSML
Fully supported
Notes
Links
Amazon AWS
Notes
API
Licence
- Proprietary
Languages & voices
22 languages (66 voices) are supported
Quality
As expected from a commercial service, very good with default settings. No tweaking necessary.
Costs
All costs are based on publicly available pricing.
Free quota
- 5 million characters per month (for 12 months)
Then
- $0.000004 USD per character
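To compare the four cloud services at a given volume, the free quotas and per-character prices listed on this page can be combined as in the sketch below. It assumes the free quota is simply subtracted from the month's character count and that standard (non-WaveNet, non-Custom Neural) voices are used; the `Pricing` class, the `monthly_cost_usd` helper and the 10 million character volume are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    free_chars_per_month: int
    usd_per_char: float


# Figures copied from the cost sections of this page.
PRICING = {
    "Google Cloud": Pricing(4_000_000, 0.000004),
    "IBM Cloud": Pricing(10_000, 0.00002),
    "Microsoft Azure": Pricing(500_000, 0.000016),
    # The AWS free quota only applies for the first 12 months.
    "Amazon AWS": Pricing(5_000_000, 0.000004),
}


def monthly_cost_usd(provider: str, chars: int) -> float:
    """Estimated monthly cost after deducting the provider's free quota."""
    pricing = PRICING[provider]
    billable = max(0, chars - pricing.free_chars_per_month)
    return billable * pricing.usd_per_char


if __name__ == "__main__":
    volume = 10_000_000  # hypothetical monthly character count
    for name in PRICING:
        print(f"{name}: ${monthly_cost_usd(name, volume):.2f}")
```

At that hypothetical volume the estimates come to roughly $24 for Google Cloud, $199.80 for IBM Cloud, $152 for Microsoft Azure and $20 for Amazon AWS (while its free quota still applies).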
SSML
Fully supported
Notes
Links
See also
Footnotes
References
- [1] "Speech Synthesis Markup Language (SSML) Version 1.1". www.w3.org. Archived from the original on 2022-03-16. Retrieved 2022-05-18.
- [2] "meSpeak.js: Text-to-Speech on the Web". www.masswerk.at. Archived from the original on 2022-04-27. Retrieved 2022-05-18.
- [3] "meSpeak – Voices & Languages". www.masswerk.at. Archived from the original on 2022-05-18. Retrieved 2022-05-18.
- [4] "Larynx: End to end text to speech system using gruut and onnx". github.com. 2022-05-18. Archived from the original on 2022-05-18. Retrieved 2022-05-18.
- [5] Kim, Jaehyeon; Kim, Sungwon; Kong, Jungil; Yoon, Sungroh (2020-10-22). "Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search". Archived from the original on 2022-05-16. Retrieved 2022-05-18.
- [6] "Larynx: SSML". github.com. 2022-05-18. Archived from the original on 2022-05-18. Retrieved 2022-05-18.
- [7] "eSpeak NG Text-to-Speech". github.com. 2022-05-18. Archived from the original on 2022-05-18. Retrieved 2022-05-18.
- [8] "eSpeak NG languages". github.com. 2022-05-18. Archived from the original on 2022-05-18. Retrieved 2022-05-18.
- [9] "Introducing Cloud Text-to-Speech powered by DeepMind WaveNet technology". Google Cloud Blog. Archived from the original on 2022-05-16. Retrieved 2022-05-18.
- [10] "IBM Cloud Text-to-speech languages". cloud.ibm.com. Archived from the original on 2021-04-15. Retrieved 2022-05-18.