Lingua Libre/GSOC25

From Meta, a Wikimedia project coordination wiki

Lingua Libre extension

This section is currently a draft. You can improve it.
Lingua Libre v.2.0, Homepage

In the field of language diversity, the Wikimedia Foundation and Wikimedia France have supported LinguaLibre.org, a single-page VueJS application for rapidly recording the vocabularies of the world. Over 280 languages and 1.3 million words have been audio-recorded into Wikimedia sites through this open project.
The recent Django/VueJS/MariaDB revamp of the core app broke meaningful add-ons. These front-end features should be rebuilt on top of the new database:

  1. overall languages dashboard (legacy),
  2. versatile search page (legacy),
  3. statistics (legacy: Global stats, Languages, Speakers, Chronological),
  4. minimal bilingual dictionaries system, ideally with a minimalist micro-learning feature (legacy: 1, 2).

This will likely require extending the Lingua Libre APIs as well.
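
As a rough illustration of the statistics views listed above (global, per language, per speaker, chronological), the following minimal sketch aggregates a list of recordings in plain Python. The `Recording` fields and the `summarize` helper are assumptions made for this example only; the actual Lingua Libre database schema and API may differ.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Recording:
    # Hypothetical fields, chosen for illustration; not the real schema.
    language: str      # language code, e.g. "oci"
    speaker: str       # speaker identifier
    recorded_on: date  # day the recording was made


def summarize(recordings):
    """Return global, per-language, per-speaker and per-month counts."""
    recordings = list(recordings)
    return {
        "total": len(recordings),
        "by_language": dict(Counter(r.language for r in recordings)),
        "by_speaker": dict(Counter(r.speaker for r in recordings)),
        # Chronological view: group by year and month, e.g. "2024-01".
        "by_month": dict(
            Counter(r.recorded_on.strftime("%Y-%m") for r in recordings)
        ),
    }
```

In a real deployment these counts would more likely be computed by database queries behind an API endpoint; the sketch only shows the shape of the data a dashboard could consume.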

Lingua Libre v.2.0, Recording Studio

Lingua Libre IOT

This section is currently a draft. You can improve it.
A 2024 collaboration with the Occitan Whistle public exhibit led to the development of a prototype interactive map that plays village names in a local endangered language. We would like to create an open source toolkit for such displays and similar IRL systems.
See: https://hugolpz.github.io/NamesOfTheLand

Lingua Libre provides pioneering digital material for local and minority communities. Following the 2024 collaboration with the Occitan Whistle public exhibit and the creation of a physical interactive map, we want to develop real-life, open source IOT applications that showcase Lingua Libre linguistic data. Target reusers are cultural exhibits, municipal councils, local communities, and local Wikimedians.

Technology | Item | Works with internet | Allocated time
JS or VueJS, LeafletJS | Interactive map table | Yes | 2 weeks
JS or VueJS | Interactive poster table | Yes | 2 weeks
JS | QR code to webpages for areas with internet access | Yes | 2 weeks
Arduino, solar powered, screen (?) | IOT speaker box with preprogrammed content | Without | 6 weeks

These base demonstrators would produce physical, table-sized displays for local museums, where visitors could press the names of villages, places, or objects and hear the native-language name for each item. A complementary idea is physical play boxes along mountain hiking paths where the internet is not available. Visitors could read the minimal instructions, press the box, and hear the native-language audio for something they see.
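
One possible way to feed an interactive map like the one described above is a GeoJSON file that pairs each place with its audio recording. The sketch below builds such a file in Python. The place entry, the coordinates, the `audio` property name and the example.org URL are placeholders invented for illustration, not real Lingua Libre data or an existing toolkit format.

```python
import json


def to_feature(name, lon, lat, audio_url):
    """Build one GeoJSON Feature. GeoJSON orders coordinates as [lon, lat]."""
    return {
        "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [lon, lat]},
        # "audio" is a hypothetical property name chosen for this sketch.
        "properties": {"name": name, "audio": audio_url},
    }


def to_feature_collection(places):
    """Wrap a list of place dictionaries into a GeoJSON FeatureCollection."""
    return {
        "type": "FeatureCollection",
        "features": [to_feature(**place) for place in places],
    }


# Placeholder entry: not a real village, coordinate or recording.
places = [
    {
        "name": "Example village",
        "lon": 0.0,
        "lat": 43.0,
        "audio_url": "https://example.org/audio/example-village.ogg",
    },
]

print(json.dumps(to_feature_collection(places), ensure_ascii=False, indent=2))
```

A LeafletJS front end could load the resulting file with `L.geoJSON` and play the `audio` URL when a marker is pressed. An offline speaker box could reuse the same list, with local file paths in place of URLs.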

  • Tech stack: Arduino (or equivalent), minimal web coding ability.
  • Size: 350 hours
  • Difficulty: Intermediate
  • Mentor(s): Yug, {TBA}
  • Intern: {Username} TBA
  • Phabricator task: TBA
  • Relevant links: TBA.

Spell4Wiki & Lingua Libre

This section is currently a draft. You can improve it.
Logo of Spell4Wiki.

Align Spell4Wiki and Lingua Libre, access Lingua Libre's item lists.

  • Tech stack: Android SDK (or equivalent).
  • Size: 350 hours
  • Difficulty: Intermediate
  • Mentor(s): TBA, {TBA}
  • Intern: {Username} TBA
  • Phabricator task: TBA
  • Relevant links: TBA.

WikiSpeech & Lingua Libre integration

This section is currently a draft. Cancelled. The WikiSpeech team confirmed that their TTS project already has academic researchers working on it, with no clear need for a GSOC intern.
WikiSpeech

WikiSpeech aims to offer a hyper-multilingual, open source « Listen to this article » text-to-speech (TTS) service to all Wikipedia projects.
To do so, we want to create a solid pipeline using 1) Lingua Libre's audio sentences and textual datasets as training data, 2) automated routines to retrain the TTS ML models up to a professional level, and 3) an online API service which, given an ISO language code and a text, would return the corresponding audio stream. This online service would be open to all *.wikipedia.org queries, providing a « Listen to this article » service to all Wikipedia readers.
This project would be supported by Wikimedia Sverige (WikiSpeech), Wikimedia France (Lingua Libre) and Google (GSOC25). You will collaborate with your mentor and Lingua Libre developers.

  • Tech stack: Python/Pytorch (or equivalent), Django/Vuejs, Makefiles or alternative.
  • Size: 350 hours
  • Difficulty: Intermediate
  • Mentor(s): — (cancelled)
  • Intern: — (cancelled)
  • Phabricator task: — (cancelled)
  • Relevant links: — (cancelled)

WikiSpeech-Lingua Libre meetup

  • Date: 2025.02.12
  • Participants: André, Sebastian, Yug
  • Topic: « WikiSpeech-Lingua Libre meetup, exploring possible collaborations »

Summary: WikiSpeech's unique TTS challenge is to provide a TTS service while keeping a wiki-like correction feedback channel, so that key words in Wikipedia articles are read as accurately as possible and can be corrected rapidly, and this in multiple languages. WikiSpeech's TTS expertise is provided by Swedish machine learning (ML) research centers, which do not need a dedicated ML GSOC25 intern. Still, Yug noted that this remains possible if wanted (deadline: Mar. 24th, 2025).

Providing training data: WikiSpeech is also looking to collect reading samples, which aligns with Lingua Libre's recent ability to record sentences and texts and the possibility of sharing predefined lists (T313575).

Needed features (?): It would help to give users who notice a mispronounced word a preloaded Lingua Libre link (language + word) that opens the recorder, lets them record, and uploads with the correct tag. Feature requests can be submitted on Phabricator. A WikiSpeech web developer can easily contribute to the Lingua Libre repository (MariaDB, Django, VueJS). Lingua Libre's list generators could be opened via an API or split out into a common service.
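
A preloaded link of this kind (language + word) could be as simple as a URL with two query parameters. The sketch below shows one way to build it safely in Python. The base URL `https://lingualibre.example/record` and the parameter names `lang` and `word` are hypothetical, since no such Lingua Libre endpoint has been specified yet.

```python
from urllib.parse import urlencode

# Hypothetical recorder URL; the real Lingua Libre route is not yet defined.
RECORDER_URL = "https://lingualibre.example/record"


def recorder_link(language, word, base_url=RECORDER_URL):
    """Return a link that would preload the recorder with a language and word."""
    if not language or not word:
        raise ValueError("language and word are both required")
    # urlencode percent-encodes non-ASCII characters and special symbols.
    query = urlencode({"lang": language, "word": word})
    return f"{base_url}?{query}"
```

For instance, `recorder_link("fra", "café")` would encode the accented character, so the link stays valid when embedded in a wiki page or a feedback widget.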

Other links

Examples


Others

Title | Stack | Workload | Description | Members
Flex / FieldWorks (?) | C/C++/Django | ? | collaboration with leading lexicographic software to ease co-integration: https://github.com/sillsdev/FieldWorks | ?