Research:Content Translation language imbalances

From Meta, a Wikimedia project coordination wiki

This page documents a research project in progress.
Information may be incomplete and change as the project progresses.
Please contact the project lead before formally citing or reusing results from this page.


Records produced by the Content Translation tool show that the flow of translations between Wikipedia language editions is extremely imbalanced: within a language pair, one direction of translation often far outweighs the other.[1]

We have begun testing some common-sense explanations for these translation imbalances, but have nothing conclusive to show so far. Our guesses include the relative size of the language editions, colonial relationships between languages, and technical factors such as software design.

If these imbalances are undesirable, maybe they can be counteracted by an experimental intervention.

Glossary

Specific technical terms used by this project:

Content Translation (CX) - a MediaWiki extension providing the main user interface for assisted translation between Wikipedia languages. Sometimes this term will also be used to include the server component.

CX service - the backend for Content Translation, a Node.js server which supports section editing, machine translation, and suggestions.

Draft translation - an article in the Content Translation workflow which has some in-progress text but is not yet available to readers.

Language pair - the source and target language of a translation or a suggestion.

Language proficiency - a translator's self-reported skill level in a given language, e.g. as declared using the Babel extension.

Local wiki expertise - a translator's edit count (bucketed to a range) on the target wiki.

Non-primary language(s) - the languages a person uses less often and less proficiently. Also "second language" (L2).

Primary language(s) - the language(s) a person uses most frequently and proficiently. Also called, less accurately, the "native" or "first" language (L1), or "mother tongue".

Published translation - an article which has been published to the wiki, at the end of the Content Translation workflow.

Reciprocal translation pair - the translation flows in each direction between a pair of languages.

Reverse translation - translation from a primary into a non-primary language. This is not necessarily less common than forward translation (from a non-primary into a primary language).

Source language - the language being translated from.

Suggestion - an article recommended for translation in the chosen language pair.

Target language - the language being translated into.

Translation hegemony - a measure of the overall imbalance in translation flows between one language and all others. Defined as the number of translations out of the language divided by the number of translations into it: A->* / *->A

Translation ratio - a measure of the imbalance in reciprocal translation flows between two languages. Usually viewed from the side with more outgoing translations, in other words the side having a ratio greater than 1: A->B / B->A

Universal Language Selector - a MediaWiki extension responsible for showing language pickers, and tracking which languages the user has chosen in the past.

Research questions

RQ 1: Are CX translation imbalances endogenous?

RQ 1.1: How large are the current CX imbalances?  Are there patterns?

RQ 1.2: How do organic, off-wiki translation flows compare?

RQ 1.2.1: What is a typical translation ratio in off-wiki contexts?

RQ 1.2.2: How does language proficiency correlate with CX usage?

RQ 1.2.3: How does local wiki expertise relate to CX usage?

RQ 1.3: What factors affect translation flows?

RQ 1.3.1: How does CX software impact flows?

RQ 1.3.1.1: What is the optimal initial language pair to suggest for a given translator?

RQ 1.3.1.2: Which is selected first, the article or the source language? If the article is selected first, the source language is simply the language in which the article was being read. If the source language is selected first, the article suggestions depend on which articles are available in that language.

RQ 1.3.2: Is translation ratio proportional to language readership?

RQ 1.3.3: Is translation ratio proportional to language editorship?

RQ 1.3.4: Is translation ratio proportional to language article count?

RQ 2: Are there ways to change flow ratios?

RQ 2.1: Under what conditions is this principled?  Is inaction principled?

RQ 3: What is the effect of machine translation availability on translations?

RQ 3.1: What is the effect of MT availability on translation flow?

RQ 3.1.1: When machine translation becomes available for a language pair, does translation volume increase?

RQ 3.2: What is the effect of MT availability on translation quality and acceptance?

RQ 3.2.1: Does published article quality decrease when MT is enabled?

RQ 3.2.2: Is MT quality related to the target language wiki size?

RQ 3.2.3: Are the quality arguments given for disabling machine translation into English still valid today?

RQ 4: What content is being translated?

RQ 4.1: What is the translation count by categories?

RQ 4.1.1: What type of content receives the biggest count?

RQ 4.1.2: What type of content receives the lowest count?

RQ 4.2: How much translation originates in a CX suggestion, vs. spontaneous?

RQ 4.3: How does CX make article suggestions?  What are the factors considered?  Is there personalization?

RQ 4.4: Does the existing “translate this page” section translation feature counteract language bias?

Methods

This is a draft outline of work, not a plan yet.

  • Passive analysis of Content Translation historical logs
    • Done: Compare and visualize flows between all languages. What can be observed?
    • Done: Look for correlations between translation flow and the relative values for each language in the pair: total articles, active editors, pageviews, ...
    • Done: Compare smaller subsets of languages.
    • In progress: Segment all statistics according to whether the published translation originated in a suggestion, whether machine translation was explicitly used, and whether machine translators are available externally or internally for the language pair.
  • Passive analysis of Content Translation source code.
    • Done: How are suggested language pairs chosen?
  • Instrument Content Translation with temporary, additional, structured log events.
    • task T241833: Send an event identifying which group of inputs the suggested translation source language came from.
  • Interviews with translators
    • Learn about how perceived language importance informs choice of languages
    • Learn how software design affects choice of languages and workflow
  • Experimental intervention
    • For example, changing the suggested translation direction for a limited number of users, e.g. so that they are prompted to translate away from the current language instead of into it.
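
The correlation check listed above (translation flow against the relative size of each language in a pair) could be sketched roughly as follows. This is only an illustration, not the project's analysis code: the function names are hypothetical, the use of a Pearson correlation on log-transformed ratios is our assumption, and no real counts are used.

```python
import math


def pearson(xs, ys):
    """Plain Pearson correlation coefficient; returns None if either input is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / math.sqrt(var_x * var_y)


def log_ratio_correlation(pairs):
    """Correlate translation ratio with a size ratio across language pairs.

    `pairs` is a list of (translation_ratio, size_ratio) tuples, one per
    language pair, where size_ratio could be the ratio of article counts,
    active editors, or pageviews (hypothetical inputs). Both values must be
    positive. Logs are taken so that 2:1 and 1:2 are treated symmetrically.
    """
    log_translation = [math.log(t) for t, _ in pairs]
    log_size = [math.log(s) for _, s in pairs]
    return pearson(log_size, log_translation)
```

A value near 1 would suggest that translation ratios track relative size; a value near 0 would suggest that size alone does not explain the imbalance.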

Outreachy involvement

From May through August of 2023, Wikimedia Foundation and Wikimedia Germany provided resources to hire two fantastic interns through the Outreachy program: Nathaly Toledo and Abhishek Bharjwaj. This intensive collaboration is the source of most of the material in our study so far.

Policy, Ethics and Human Subjects Research

No experiments are currently planned.

Results

Is there a translation imbalance?

Going through logs of translations published using the Content Translation tool, we can compare the number of translations in each direction of a language pair to find what we call the "translation ratio". To take an example, roughly 112,000 articles have been translated from English to Spanish but only 4,200 from Spanish to English, for a translation ratio of 28.5 : 1. Overall, English shows a dominance over other language editions which is far out of proportion to its relative size, as can be seen in the graph below.

Circular diagram with the magnitude of flows between each language.
A chord diagram showing that English is the biggest source of translations to other language Wikipedias. Data source: https://en.wikipedia.org/w/api.php?action=query&list=contenttranslationstats&format=json R code:
library(circlize)
# The argument is assumed to hold the API result reshaped into translation counts per source/target language.
chordDiagram(api.result.translate.json.pivot.selection, directional = 1, direction.type = c("diffHeight", "arrows"), link.arr.type = "big.arrow")

By another measure used in this study, "translation hegemony", we compare the total number of outgoing vs. incoming translations for a single language. English has a hegemony ratio of 41.4 : 1, meaning that about 41 articles are translated out of English into other languages for every article translated into English. Spanish, by contrast, is mostly receiving translations, with a hegemony ratio of 0.57 : 1, or roughly 1 outgoing translation for every 2 incoming translations.
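
As a minimal sketch of how the two measures relate, the following computes a translation ratio and a translation hegemony from a table of published-translation counts. The function names, the dictionary layout, and the placeholder language codes and counts are all hypothetical and chosen only for illustration; they are not taken from the Content Translation logs.

```python
def translation_ratio(flows, a, b):
    """Translation ratio A->B / B->A, or None if nothing flows from B to A."""
    forward = flows.get((a, b), 0)
    backward = flows.get((b, a), 0)
    if backward == 0:
        return None
    return forward / backward


def translation_hegemony(flows, lang):
    """Translation hegemony A->* / *->A, or None if nothing is translated into A."""
    outgoing = sum(n for (source, _), n in flows.items() if source == lang)
    incoming = sum(n for (_, target), n in flows.items() if target == lang)
    if incoming == 0:
        return None
    return outgoing / incoming


# Toy counts keyed by (source, target); "xx", "yy", "zz" are placeholder codes.
toy_flows = {
    ("xx", "yy"): 300, ("yy", "xx"): 100,
    ("xx", "zz"): 50,  ("zz", "xx"): 25,
    ("yy", "zz"): 10,  ("zz", "yy"): 40,
}
```

With these toy counts, translation_ratio(toy_flows, "xx", "yy") is 3.0, while translation_hegemony(toy_flows, "xx") is 2.8 because it pools all outgoing (350) and incoming (125) translations for "xx".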

These imbalances seem to always flow from a dominant language towards languages with a smaller number of wiki articles. Colonial relationships between languages are reproduced, for example English towards Spanish and Spanish towards Catalan (4:1). Similarly-sized languages without strong geographical or colonial relationships show very different characteristics: for example, German and Spanish are within 50% of each other in number of wiki articles, but have inverse hegemony ratios (0.57:1 for Spanish vs. 3:1 for German).

Analysis of suggested translation language algorithm

On a user's first visit to the Translations page after enabling the Content Translation beta feature, they can find suggestions about articles to translate. There are two algorithms at play: one chooses the pair of source and target languages between which to translate, and the other chooses which articles to show for translation. The analysis in this section is focused on the initial default choice of translation languages.

The code responsible for setting the default languages is CXDashboard.findValidDefaultLanguagePair, and the rough outline is that it takes all languages that the user has frequently set in the Universal Language Selector using mw.uls.getFrequentLanguageList, picks the first one, and suggests translating from that language into the current wiki's language. The exact process is more complicated:

Activity diagram detailing the Content Translation calculation to find a default suggested translation language pair.

TBD: discuss alternatives to this algorithm, such as randomizing all valid language pair permutations or recommending multiple pairs, and instrumenting the algorithm output

The target language strongly defaults to the current wiki language, and then a source language must be chosen which is different from the target. The source language defaults first to the interface language, set in MediaWiki user preferences, or to the browser's language preferences and Accept-Language headers. Languages explicitly (?) chosen with ULS are also retrieved from localStorage under 'uls-previous-languages'.
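
The behavior described above can be approximated by the sketch below. It is a simplification written for this page and not the actual CXDashboard.findValidDefaultLanguagePair code: the function name, its parameters, and the exact order in which the fallback sources are tried are assumptions.

```python
def suggest_default_pair(wiki_language, interface_language=None,
                         browser_languages=(), previous_uls_languages=()):
    """Return a (source, target) pair, or None if no valid source is found.

    The target is pinned to the current wiki's language. Candidate source
    languages are tried in an assumed order: interface language, then
    browser languages, then languages previously chosen in ULS.
    """
    target = wiki_language
    candidates = []
    if interface_language:
        candidates.append(interface_language)
    candidates.extend(browser_languages)
    candidates.extend(previous_uls_languages)

    for source in candidates:
        # The source must differ from the target to form a valid pair.
        if source != target:
            return (source, target)
    return None
```

Under these assumptions the current wiki's language never appears as the source, so every default suggestion points into the wiki the user is visiting. For example, suggest_default_pair("es", "es", ["en"]) yields ("en", "es").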

TBD: give examples of fallback values

Translate to vs. translate from workflows

There are good reasons that translators might be fluent in a smaller language and in a trade or world language, and it seems that many translators are comfortable working in either direction. It's this choice of direction which creates the imbalance seen in our research. But the choice of direction is often already decided by the time users enter the translation workflow: the two directions can be summarized as "find an article to translate into your language" vs. "translate this article from your language".

We would like to analyze translations made through these two workflows, to see if our assumption is correct that the workflows mostly correspond to a single direction of translation flow.

Resources

This section is very much in-progress, and we'll add more as we learn about it.

Production source code

Analysis code

References

  1. Content Translation provides an assisted translation environment with visual editing, intelligent template transformation, and machine translation integration. The project is mature and has been used to create over one million articles.