Community Wishlist Survey 2022/Reading/IPA audio renderer/TTS investigation

From Meta, a Wikimedia project coordination wiki
Tracked in Phabricator: Task T307624

Community Tech needs to select a text-to-speech (TTS) engine to drive the IPA audio renderer wish. Several good options are available to us.

Overview

TTS Engine                     | Type    | Licence             | Languages | Costs (USD/character) | SSML | Voices
phoneme-synthesis + meSpeak.js | Library | GPLv3 (open source) | 24        | N/A                   | Yes  | 29
larynx                         | CLI/API | MIT (open source)   | 9         | N/A                   | Yes  | 50
espeak-ng                      | CLI/API | GPLv3 (open source) | 127       | N/A                   | Yes  | 127[nb 1]
Google Cloud                   | API     | Closed source       | 40        | 0.000004              | Yes  | 100
IBM Cloud                      | API     | Closed source       | 13        | 0.00002               | Yes  | 26
Microsoft Azure                | API     | Closed source       | 129       | 0.000016              | Yes  | 270
Amazon AWS                     | API     | Closed source       | 22        | 0.000004              | Yes  | 66

Requirements

The TTS engine we pick should:

Audio samples

phoneme-synthesis + meSpeak.js

Notes

meSpeak.js (modulary enhanced speak.js) is a client-side JavaScript text-to-speech library based on the speak.js project.[2] Because it runs client-side, it could possibly be included directly in an extension.

Licence

Languages & voices

24 languages (29 voices) are supported, with varying completeness[3]

  • Catalan
  • Czech
  • German
  • Greek
  • English
  • Esperanto
  • Spanish
  • Finnish
  • French
  • Hungarian
  • Italian
  • Kannada
  • Latin
  • Latvian
  • Dutch
  • Polish
  • Portuguese
  • Romanian
  • Slovak
  • Swedish
  • Turkish
  • Mandarin Chinese
  • Cantonese Chinese

Quality

Better than larynx out of the box, and could likely be improved further with some tweaking.

Costs

N/A

SSML

SSML support can be enabled via a flag.[2]

Notes

Has some issues with the schwa (ə).

larynx

Notes

larynx would need to run as an API on the production cluster, with an extension converting IPA to SSML.
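
The IPA-to-SSML step could look roughly like the sketch below. It wraps a transcription in the `<phoneme alphabet="ipa" ph="…">` element defined by SSML 1.1. The function name `ipa_to_ssml`, the `fallback_text` parameter, and the default `en-US` language tag are illustrative assumptions, not part of any existing extension, and each engine would still need to be checked for how it handles this element.

```python
from xml.sax.saxutils import escape, quoteattr

# Namespace defined by the W3C SSML 1.1 specification.
SSML_NS = "http://www.w3.org/2001/10/synthesis"


def ipa_to_ssml(ipa: str, fallback_text: str = "", lang: str = "en-US") -> str:
    """Wrap an IPA transcription in an SSML <phoneme> element.

    fallback_text is the plain text an engine may speak if it ignores
    <phoneme>. All names here are hypothetical, for illustration only.
    """
    # IPA on wiki pages is often written as /.../ or [...]; drop those delimiters.
    ph = ipa.strip().strip("/[]")
    return (
        f'<speak version="1.1" xmlns="{SSML_NS}" xml:lang={quoteattr(lang)}>'
        f'<phoneme alphabet="ipa" ph={quoteattr(ph)}>{escape(fallback_text)}</phoneme>'
        "</speak>"
    )


print(ipa_to_ssml("/təˈmɑːtoʊ/", "tomato"))
```

Escaping the attribute and element text keeps the output well-formed even if a transcription or fallback word contains characters such as `&` or `"`.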

Licence

Languages & voices

9 languages (50 voices) are supported.[4] The voices are primarily based on Glow-TTS, a generative flow voice model trained via Monotonic Alignment Search.[5]

  • English
  • German
  • French
  • Spanish
  • Dutch
  • Italian
  • Swedish
  • Swahili
  • Russian

Quality

Tested: fairly poor with default settings, and will require a lot of tweaking.

Costs

N/A

SSML

Only a subset of SSML is supported; however, the elements most useful to us (i.e. phonemes) are available.[6]

Notes

espeak-ng

Notes

meSpeak.js, mentioned above, is based on eSpeak; eSpeak NG is a CLI application that is backwards compatible with eSpeak.[7] We would also need to run this as an API.
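
As a rough illustration of running the CLI behind an API, the sketch below shells out to `espeak-ng` and captures the WAV data it writes to stdout. It relies on the `-m` (interpret SSML markup), `-v` (voice) and `--stdout` options of the CLI. The helper names `espeak_ng_command` and `synthesize` are hypothetical, and whether a given build honours `<phoneme>` with the IPA alphabet is untested here.

```python
import subprocess
from typing import List


def espeak_ng_command(ssml: str, voice: str = "en") -> List[str]:
    """Build the argv for one espeak-ng synthesis call (hypothetical helper)."""
    return [
        "espeak-ng",
        "-m",          # interpret the input as SSML markup
        "-v", voice,   # voice / language to use
        "--stdout",    # write WAV audio to stdout instead of playing it
        ssml,
    ]


def synthesize(ssml: str, voice: str = "en") -> bytes:
    """Run espeak-ng and return the WAV bytes it writes to stdout."""
    result = subprocess.run(
        espeak_ng_command(ssml, voice),
        capture_output=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout
```

Passing the arguments as a list rather than a shell string means the SSML is never interpreted by a shell, which matters if the input originates from wiki content.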

Licence

Languages & voices

127[nb 1] languages[8]

Quality

Untested

Costs

N/A

SSML

Similar to meSpeak.js, a subset of SSML is supported.

Notes

Google Cloud

Notes

API

Licence

  • Proprietary

Languages & voices

40 languages (100+ voices)

Quality

As expected from a commercial service, very good with default settings. No tweaking necessary.

Costs

All costs exclude "WaveNet" voices (WaveNet is a DeepMind deep generative ML model[9]), and are based on publicly available pricing.

Free quota

  • 4 million characters per month

Then

  • $0.000004 USD per character

SSML

Fully supported

Notes

IBM Cloud

Notes

API

Licence

  • Proprietary

Languages & voices

13 languages (26 voices) are supported[10]

  • Arabic
  • Chinese
  • Czech
  • Dutch
  • English
  • French
  • German
  • Italian
  • Japanese
  • Korean
  • Portuguese
  • Spanish
  • Swedish
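
To see how the language lists given on this page overlap, a small set comparison helps. The sketch below copies the meSpeak.js, larynx and IBM Cloud lists from this page as written and computes the languages common to all three. Treating the differently labelled Chinese entries as distinct is an assumption made here for simplicity.

```python
# Language lists copied from the sections of this page.
MESPEAK = {
    "Catalan", "Czech", "German", "Greek", "English", "Esperanto", "Spanish",
    "Finnish", "French", "Hungarian", "Italian", "Kannada", "Latin", "Latvian",
    "Dutch", "Polish", "Portuguese", "Romanian", "Slovak", "Swedish", "Turkish",
    "Mandarin Chinese", "Cantonese Chinese",
}
LARYNX = {
    "English", "German", "French", "Spanish", "Dutch", "Italian", "Swedish",
    "Swahili", "Russian",
}
IBM = {
    "Arabic", "Chinese", "Czech", "Dutch", "English", "French", "German",
    "Italian", "Japanese", "Korean", "Portuguese", "Spanish", "Swedish",
}

# Languages every one of the three engines lists. "Chinese" (IBM) is not
# matched against "Mandarin Chinese"/"Cantonese Chinese" (meSpeak.js).
common = sorted(MESPEAK & LARYNX & IBM)
print(common)

# Languages that larynx lists but that neither of the other two does.
larynx_only = sorted(LARYNX - MESPEAK - IBM)
print(larynx_only)
```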

Quality

Untested

Costs

All costs are based on publicly available pricing.

Free quota

  • 10,000 characters per month

Then

  • $0.00002 USD per character

SSML

Fully supported

Notes

Microsoft Azure

Notes

API

Licence

  • Proprietary

Languages & voices

129 languages (270 voices) are supported

Quality

As expected from a commercial service, very good with default settings. No tweaking necessary.

Costs

All costs exclude "Custom Neural" voices, and are based on publicly available pricing.

Free quota

  • 0.5 million characters per month

Then

  • $0.000016 USD per character

SSML

Fully supported

Notes

Amazon AWS

Notes

API

Licence

  • Proprietary

Languages & voices

22 languages (66 voices) are supported

Quality

As expected from a commercial service, very good with default settings. No tweaking necessary.

Costs

All costs are based on publicly available pricing.

Free quota

  • 5 million characters per month (for 12 months)

Then

  • $0.000004 USD per character
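
To compare the four commercial services on cost, the free quotas and per-character rates quoted on this page can be combined into a monthly estimate. The sketch below does that arithmetic. The 10 million characters per month figure is a made-up example volume, not a measured or projected one, and the function name `monthly_cost` is illustrative.

```python
# provider: (free characters per month, USD per character beyond the quota)
# Figures are the ones quoted on this page.
PRICING = {
    "Google Cloud": (4_000_000, 0.000004),
    "IBM Cloud": (10_000, 0.00002),
    "Microsoft Azure": (500_000, 0.000016),
    "Amazon AWS": (5_000_000, 0.000004),  # free quota applies for 12 months only
}


def monthly_cost(provider: str, characters: int) -> float:
    """Estimated USD cost for one month at the given character volume."""
    free_quota, rate = PRICING[provider]
    billable = max(0, characters - free_quota)  # only characters above the quota are billed
    return billable * rate


# Hypothetical volume, chosen only to illustrate the calculation.
EXAMPLE_VOLUME = 10_000_000
for name in PRICING:
    print(f"{name}: ${monthly_cost(name, EXAMPLE_VOLUME):.2f}")
```

Any volume at or below a provider's free quota yields a cost of zero, so at low traffic the size of the free quota matters more than the per-character rate.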

SSML

Fully supported

Notes

See also

Footnotes

  1. Voice count unsure; likely at least 1 per language.

References

  1. "Speech Synthesis Markup Language (SSML) Version 1.1". www.w3.org. Archived from the original on 2022-03-16. Retrieved 2022-05-18. 
  2. "meSpeak.js: Text-to-Speech on the Web". www.masswerk.at. Archived from the original on 2022-04-27. Retrieved 2022-05-18.
  3. "meSpeak – Voices & Languages". www.masswerk.at. Archived from the original on 2022-05-18. Retrieved 2022-05-18. 
  4. "Larynx: End to end text to speech system using gruut and onnx". github.com. 2022-05-18. Archived from the original on 2022-05-18. Retrieved 2022-05-18. 
  5. Kim, Jaehyeon; Kim, Sungwon; Kong, Jungil; Yoon, Sungroh (2020-10-22). "Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search". Archived from the original on 2022-05-16. Retrieved 2022-05-18. 
  6. "Larynx: SSML". github.com. 2022-05-18. Archived from the original on 2022-05-18. Retrieved 2022-05-18. 
  7. "eSpeak NG Text-to-Speech". github.com. 2022-05-18. Archived from the original on 2022-05-18. Retrieved 2022-05-18. 
  8. "eSpeak NG languages". github.com. 2022-05-18. Archived from the original on 2022-05-18. Retrieved 2022-05-18. 
  9. "Introducing Cloud Text-to-Speech powered by DeepMind WaveNet technology". Google Cloud Blog. Archived from the original on 2022-05-16. Retrieved 2022-05-18. 
  10. "IBM Cloud Text-to-speech languages". cloud.ibm.com. Archived from the original on 2021-04-15. Retrieved 2022-05-18.