
Community Wishlist/Wishes/Wikipedia Machine Translation Project

From Meta, a Wikimedia project coordination wiki
Wikipedia Machine Translation Project Open


Description

Machine translation keeps getting better. It is far from perfect, but at least for some language pairs, machine translations of the full text of Wikipedia articles may soon approach flawlessness. They still have errors, often many of them, but keep in mind that 1) they will keep improving over the years and 2) one could create ways to correct errors after machine translation (described at the bottom).

One can test DeepL to see the state of the art – here no post-translation editing was needed.

I think this is a really important project affecting many people and I hope outlining it here and now is a step towards making it a reality.

The problems addressed here or why this is needed

Wikipedia coverage (not considering measures like number of sources, number of watchers/readers, article length, …)
  • Most Wikipedias have far fewer articles than English Wikipedia. In addition, the quality or comprehensiveness is usually significantly lower.
    • The translations don't need to be based on only English Wikipedia (ENWP) – other Wikipedias could also be used as the base text, mainly for subjects specific to some country/language.
    • For example this study found that even the most extensive language editions only cover about one-third of the climate change-related articles that ENWP has.
  • Other Wikipedias are usually not as up-to-date as ENWP or the WP that is specific to the subject.
  • For manually translated articles, changes are not synced to the other-language Wikipedias. This holds whether or not machine translation was used for the translation: when the source article changes, those changes are not carried over to the other language editions. Articles get expanded with missing info, updated with newer info, overhauled, or restructured all the time.
    • For example, I wrote an article in ENWP then translated it to create a new German-language article; then some time later extensively edited the English WP article but these changes are now missing in the German article.
  • Manually translating articles is not very fulfilling as a volunteer activity – there are probably exceptions, but most people would rather write new things. For any language Wikipedia, there are countless important, notable articles that haven't been created but could simply have been translated from ENWP.
  • Smaller articles/Wikipedias are not monitored, edited, read, and checked as much as ENWP, so there's a greater likelihood of issues in these, including more bias.
  • Often, there is no article for a subject in a person's language – a machine translated version of the best or EN Wikipedia article on it is usually better than having no such article as an option. The user will be made aware that it's machine translated. Also beneficial here is that this means greater reach since content on ENWP can now be accessed also by those reading or searching the Web in other languages.
  • For subjects for which a language's Wikipedia has an article, that article may be extremely short or in a miserable state while another WP, usually ENWP, has a good-quality, comprehensive article on it. ENWP articles are usually much more comprehensive. In such cases the reader may want to, or may benefit from, also opening the machine-translated version. This also applies in general: people only gain an extra option; they can and will still read their native-language Wikipedia.

This is a proposal to improve internationalization/multilingualism and open knowledge. ENWP would probably not be the only source, even though machine translation currently works best for English and ENWP is the global Wikipedia (possible exceptions: language- or country-specific subjects, cases where quality is much better in another language, cases where no ENWP article exists).

Third-party implementation

Here is a key consideration: if the Wikimedia movement does not set up such a project, I think at some point some external actor or organization will do so, and it's better if we as Wikimedians set it up. We could then participate in decisions, better guarantee neutrality, contribute to the module described below, and so on. It could also integrate better with Wikipedia, for example via an extra button in the Languages dropdown of articles.

Wikipedias' text contents are licensed under CC BY-SA, so I think this is more a question of when and how it will be done than a question of whether it will be.

Post-machine-translation error correction module

The machine-translated pages would not be freely editable, so the articles are not forked once the machine-translated article is set up. Instead, these articles change whenever the source article changes (or a short while thereafter), so they stay in sync. People can only correct errors, and the key point is that corrections are carried over to the next revision of the translated article (after the source article has changed and the translated article is updated).
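
One way to picture how corrections could survive a re-translation is a small "correction memory" that stores each human fix and re-applies it to the regenerated text. This is only a minimal sketch under stated assumptions: the `CorrectionMemory` class, its method names, and the German sample sentence are all made up for illustration, and exact phrase matching stands in for the smarter "recognizing the phrase" logic a real system would need.

```python
import re
from dataclasses import dataclass, field


@dataclass
class CorrectionMemory:
    """Hypothetical store of human fixes, keyed by the mistranslated phrase."""
    fixes: dict = field(default_factory=dict)

    def record(self, mt_phrase: str, corrected: str) -> None:
        """Remember that `mt_phrase` should read `corrected`."""
        self.fixes[mt_phrase] = corrected

    def apply(self, mt_text: str) -> str:
        """Re-apply every recorded fix to a freshly machine-translated text."""
        if not self.fixes:
            return mt_text
        # Longest phrases first, so a longer fix wins over a shorter one
        # that starts at the same position.
        phrases = sorted(self.fixes, key=len, reverse=True)
        pattern = re.compile("|".join(re.escape(p) for p in phrases))
        # A single pass avoids re-correcting text that was already corrected.
        return pattern.sub(lambda m: self.fixes[m.group(0)], mt_text)


memory = CorrectionMemory()
memory.record("Treibhaus Gas", "Treibhausgas")  # made-up example correction

# The source article gained a sentence, so the translation was regenerated
# and the old mistranslation reappeared.
regenerated = "Methan ist ein Treibhaus Gas. Es entsteht unter anderem in Mooren."
print(memory.apply(regenerated))
```

Exact matching only covers the case where the machine translation repeats the same mistake verbatim; a sentence that changed in the source would need fuzzier matching or human review.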

  • Tools like DeepL and Google Translate don't convert wikilinks or citation templates. This system could be built so it understands/uses wikitext markup.
    • If a wikilink's article exists in the target language, then instead of translating the words, it could use the title of that article in the target language.
    • It could convert citation templates to that language's Wikipedia citation template. For example, last1= in en:Template:cite journal equates to apellidos= in es:Plantilla:Cita publicación
    • The wikitext makes it harder for machine translation tools to translate text – currently either there are more errors initially or the user has to first remove the wikitext / copy-paste the text without the wikitext and remove things like [1][2]. With a streamlined Wikipedia machine translation system, the wikitext would be removed/ignored for the resulting machine translation.
  • As noted earlier, there often are mistranslations – with the roughly envisioned system:
    • one would fix these once and then when the translated article gets updated because the WP article was changed, it keeps these previous corrections (either as is via recognizing the phrase or by 'remembering' the change made previously and applying it anew to the changed sentence)
    • there are likely further things one can do with such a system, for example making greater use of novel AI NLP models to improve the translations (see this study), or translating with the Wikipedia articles in all the different languages as a virtual context to improve translation quality, for example to correctly translate ambiguous words (see phab:T155847)
    • most errors are only 'instances' of a general flaw (common errors/issues) so many of these errors could be fixed at once or marked for likely-error in rule-type ways
      • For example, in this study human evaluation of translated texts was used for systematically identifying various issues with MT outputs. For example there can be errors in the source text or ambiguous parts whose correct translation requires common sense-like semantic language processing or context.
      • In this paper it is noted that "Machine translation researchers might find opportunities to expand the models available to Wikipedians for translating articles into their language." What is proposed here may be of interest to organizations and researchers, who could benefit greatly from collaboration and from the collective intelligence of Wikimedians identifying translation errors. That would improve the overall state of machine translation, which in turn improves the state of the MTWP.
      • Whenever a correction is made, the mistranslation could get registered so that more instances of that error in other articles at some point get 'marked' and can be fixed at scale, improving the site's overall translation quality (note: the fact that there are some spelling errors in the largest Wikipedias does not make those sites useless; the same applies here to a different extent, and the sooner work on this starts the better, I think).
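
The wikitext-aware steps from the list above (reusing the target-language article title for a wikilink, and mapping citation template names and parameters) could look roughly like the following sketch. The lookup tables are assumptions: only the `last1` to `apellidos` pair and the two template names come from the example above, the "Climate change" title pair is illustrative, and a real system would presumably fill such tables from interlanguage links and per-wiki template documentation. The regular expressions are deliberately simple and do not handle nested templates.

```python
import re

# Hypothetical lookup tables (see the assumptions stated above).
TITLE_MAP = {"Climate change": "Cambio climático"}
TEMPLATE_MAP = {"cite journal": "Cita publicación"}
PARAM_MAP = {"last1": "apellidos"}

WIKILINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]")
TEMPLATE_NAME = re.compile(r"\{\{\s*([^|{}]+?)(\s*)(?=[|}])")
PARAM_NAME = re.compile(r"(\|\s*)([^=|{}\s]+)(\s*=)")


def localize_wikilinks(wikitext: str, title_map: dict) -> str:
    """Point wikilinks at the target-language article where one is known."""
    def swap(match):
        target, label = match.group(1).strip(), match.group(2)
        local = title_map.get(target)
        if local is None:
            return match.group(0)  # no known article: leave it to the MT step
        # Keep a piped label if present; it would still need translating.
        return f"[[{local}]]" if label is None else f"[[{local}|{label}]]"
    return WIKILINK.sub(swap, wikitext)


def localize_citation(template: str, name_map: dict, param_map: dict) -> str:
    """Rename a citation template and its parameters for the target wiki."""
    def swap_name(match):
        name = match.group(1)
        return "{{" + name_map.get(name, name) + match.group(2)

    def swap_param(match):
        name = match.group(2)
        return match.group(1) + param_map.get(name, name) + match.group(3)

    return PARAM_NAME.sub(swap_param, TEMPLATE_NAME.sub(swap_name, template))


print(localize_wikilinks("See [[Climate change]] and [[Methane|CH4]].", TITLE_MAP))
print(localize_citation("{{cite journal |last1=Smith |title=X}}",
                        TEMPLATE_MAP, PARAM_MAP))
```

Unmapped links and parameters are passed through unchanged, so gaps in the tables degrade to ordinary machine translation rather than to broken markup.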

Basically, when you use machine translation and make corrections, and these corrections are transparent, they can be used to improve the MT overall and to change similar cases. As a support for machine translation, one can also run two different machine translation tools and use the diff to let the user decide which output is better or whether it needs to be changed. For example, for these subtitles I used two different MT tools and then edited the result using the diff. One 'mistake' that systematic, transparent correction can fix is that large numbers like 96 should be translated as numerals, not as "sechsundneunzig" in this case. One would make an MT rule so that large numbers are always rendered as numerals when they are numeric in the source article, and such rules can also be identified implicitly by learning from all the transparent adjustments people make.
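
Both ideas in this paragraph, diffing two MT outputs and a rule that numerals stay numerals, can be sketched with the Python standard library. The function names and sample sentences here are illustrative assumptions, not part of any existing tool; the comparison is word-based and the numeral check only looks at digit runs.

```python
import difflib
import re


def disagreements(mt_a: str, mt_b: str) -> list:
    """Return (text_a, text_b) pairs for spans where two MT outputs differ."""
    a, b = mt_a.split(), mt_b.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (" ".join(a[i1:i2]), " ".join(b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


def missing_numerals(source: str, translation: str) -> list:
    """List digit runs from the source that no longer appear as digits."""
    kept = set(re.findall(r"\d+", translation))
    return [n for n in re.findall(r"\d+", source) if n not in kept]


source = "96 articles were checked"
tool_a = "Es wurden 96 Artikel geprüft"
tool_b = "Es wurden sechsundneunzig Artikel geprüft"

# Spans a human would be asked to decide on.
print(disagreements(tool_a, tool_b))

# The rule flags the output that spelled the number out.
print(missing_numerals(source, tool_a))
print(missing_numerals(source, tool_b))
```

Only the disagreeing spans need a human decision, and each decision could be stored as a transparent correction that feeds back into rules like the numeral check.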

A new Wikimedia project or a project to add a big feature? Next steps

Maybe this will turn into a new project proposal. If you are interested in this, please comment or sign up on the talk page, since this needs a collective effort that includes developers, researchers, and editors.

I'm not affiliated with Google, DeepL or the like (a modular approach would probably be best anyway). This project would be a large boost for the public domain, free knowledge and global education and make us all far more productive. It would help ensure the Wikimedia projects benefit from AI rather than vice versa.

It sounds similar to the recently approved Abstract Wikipedia in that "in Abstract Wikipedia, people can create and maintain Wikipedia articles in a language-independent way", but it is very different.

Maybe it would not become a separate project and would instead be implemented as a kind of 'feature' or part of Wikipedia, but I think in any case a separate site would probably be best. If it became a new Wikimedia project, there could be a lot of changes on the standard Wikipedia side as well.

Assigned focus area

Unassigned.

Type of wish

Feature request

Wikipedia

new project

Affected users

Mainly Wikipedia readers without good English reading skills and editors

Other details