Abstract Wikipedia/Representation of languages

From Meta, a Wikimedia project coordination wiki

Language objects

Tracked in Phabricator: Task T263000

As of March 2021, Wikifunctions represents languages by their MediaWiki language codes (e.g. "en" for English, "ja" for Japanese). For example, the monolingual text representing the English label "Type" looks as follows, first with human-readable labels and then with ZIDs:

{
 "type": "Monolingual text",
 "language": "en",
 "text": "Type"
}

{
 "Z1K1": "Z11",
 "Z11K1": "en",
 "Z11K2": "Type"
}

Here, both "en" and "Type" are of type Z6/String.

The suggestion is to change the type of Z11K1/language from Z6/String to a new type, Z60/Natural language. Z60/Natural language has only one key, Z60K1/language code, of type Z6/String. The above representation would then change to the following, where Z1002 is the ZID of the Z60/Natural language object for English:

{
 "type": "Monolingual text",
 "language": "English",
 "text": "Type"
}

{
 "Z1K1": "Z11",
 "Z11K1": "Z1002",
 "Z11K2": "Type"
}
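To make the proposed change concrete, here is a minimal Python sketch of converting a Z11/Monolingual text from the string-code form to the ZID-reference form. The lookup table and the function names are hypothetical and exist only for illustration; the table contains just the six ZIDs proposed further down this page, and the shape of the inner Z60 value is an assumption derived from the description above.

```python
# Hypothetical lookup from MediaWiki language code to the ZID of the
# corresponding Z60/Natural language object. Only the ZIDs proposed on
# this page are listed; everything else is deliberately left out.
CODE_TO_ZID = {
    "ar": "Z1001",
    "en": "Z1002",
    "es": "Z1003",
    "fr": "Z1004",
    "ru": "Z1005",
    "zh": "Z1006",
}


def make_language_value(code):
    """Build the inner Z60 value (assumed shape: one key, Z60K1)."""
    return {"Z1K1": "Z60", "Z60K1": code}


def upgrade_monolingual_text(z11):
    """Return a copy of a Z11 whose Z11K1 is a ZID reference, not a code."""
    if z11.get("Z1K1") != "Z11":
        raise ValueError("not a Z11/Monolingual text")
    code = z11["Z11K1"]
    if code not in CODE_TO_ZID:
        raise KeyError(f"no language object known for code {code!r}")
    upgraded = dict(z11)  # shallow copy, the input is left untouched
    upgraded["Z11K1"] = CODE_TO_ZID[code]
    return upgraded
```

Calling upgrade_monolingual_text on the "en" example above yields the "Z1002" form; codes without a known language object raise an error instead of being passed through silently.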

Advantages of objects over codes

  • Instead of having to memorise language codes, we can use the same search as for every other object
  • When switching interface languages, the reader will see the language names in their own language instead of language codes they may not be familiar with
  • We can use a generic validation of the language object instead of hardcoding something that relies on MediaWiki's language system
  • The latter point is particularly pertinent when thinking about other implementations supporting Wikifunctions code: they would need to reimplement parts of MediaWiki's language system in order to show consistent behaviour

Advantages of code over objects

  • We already have an API that offers us a list of codes, and we can use the same source inside of the wiki to validate that the language codes are good
  • We are more used to "en" than "Z1002"
  • The current Z12/Multilingual text component is very fast. It is likely that rewriting the component to use objects will be much slower overall
  • MediaWiki already has really good support for languages, and for language names in different languages. We could use that instead of developing our own solution for displaying the names in different languages

Language names

Seriously? We already have all the language names in MediaWiki in all languages, and we have the language names in Wikidata, too. Having them again, a third time? How is that cool?

OK, OK, it isn't. Reusing these names would be really sweet.

Here's one thing, though: if we do some one-off coding for that which relies on being part of a running MediaWiki installation, we would basically require every evaluation engine to recreate that part of MediaWiki, or to rely on CLDR. Either of these feels burdensome.

One way we could do it is to generate the labels of the languages from MediaWiki into the data directory of WikiLambda, and then regenerate and reload them as needed.

An additional step would be to lock down editing of labels for language objects, so that any change to the labels goes through MediaWiki's established process. Without locking them down, we might have problems with a two-way sync: reconciling on-wiki changes by Wikifunctions contributors with changes coming from MediaWiki.
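The generation step could be sketched roughly as follows. This is only an illustration under stated assumptions: the input is a plain dictionary of language names keyed by interface language code (how it is exported from MediaWiki is not covered), and both function names are hypothetical. Interface languages without a known ZID are skipped rather than guessed.

```python
import json


def build_language_labels(names_by_ui_code, code_to_zid):
    """Turn {interface language code: name} into Z11/Monolingual texts.

    Both arguments are assumed inputs: the names would be exported from
    MediaWiki, and code_to_zid maps language codes to language ZIDs.
    """
    labels = []
    for ui_code in sorted(names_by_ui_code):
        zid = code_to_zid.get(ui_code)
        if zid is None:
            continue  # no language object for this code yet, so skip it
        labels.append({
            "Z1K1": "Z11",
            "Z11K1": zid,
            "Z11K2": names_by_ui_code[ui_code],
        })
    return labels


def dump_labels(labels):
    """Serialise deterministically so regenerated files diff cleanly."""
    return json.dumps(labels, ensure_ascii=False, indent=1, sort_keys=True)
```

Because the output is sorted and serialised deterministically, regenerating the file from unchanged MediaWiki data produces an identical file, which keeps the sync one-way and easy to review.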

BCP 47 mappings

We already have mappings from BCP 47 to MediaWiki language codes! Let's reuse those, instead of inventing our own.

I know.

The suggestion is similar to the one for language names: in the end it is just two pretty simple functions. These can be kept up to date through an approach similar to the one outlined above: recreating the mappings from MediaWiki, uploading them, and potentially locking down their editing.
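As a rough sketch, assuming the two functions are one per direction, they could be generated from a list of (MediaWiki code, BCP 47 tag) pairs. The helper name is hypothetical, and the sample pairs are illustrative assumptions that would need to be regenerated from MediaWiki rather than maintained by hand. Codes not in the list are assumed to map to themselves, and casing is simplified to lowercase.

```python
def build_bcp47_functions(pairs):
    """Create both mapping functions from (mediawiki_code, bcp47_tag) pairs."""
    to_tag = {}
    to_code = {}
    for code, tag in pairs:
        if code in to_tag or tag.lower() in to_code:
            raise ValueError(f"duplicate mapping for {code!r} or {tag!r}")
        to_tag[code] = tag
        to_code[tag.lower()] = code

    def mediawiki_to_bcp47(code):
        # Codes without an entry are assumed to be valid tags already.
        return to_tag.get(code, code)

    def bcp47_to_mediawiki(tag):
        # BCP 47 tags are case-insensitive, so compare in lowercase.
        lowered = tag.lower()
        return to_code.get(lowered, lowered)

    return mediawiki_to_bcp47, bcp47_to_mediawiki


# Illustrative sample only (assumption): regenerate from MediaWiki in practice.
SAMPLE_PAIRS = [
    ("zh-classical", "lzh"),
    ("zh-min-nan", "nan"),
    ("zh-yue", "yue"),
    ("be-x-old", "be-tarask"),
]
mediawiki_to_bcp47, bcp47_to_mediawiki = build_bcp47_functions(SAMPLE_PAIRS)
```

Building both directions from one list of pairs means there is a single artefact to regenerate and lock down, and inconsistent data is rejected when the functions are built.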

Language fallback

Let's solve that later. There might be different solutions for interface and for target languages.

Lists of languages

There are already several lists of languages. Here are some relevant ones:

  1. Interface languages: the languages that the user interface of MediaWiki supports, i.e. the languages in which user interface elements of MediaWiki can be rendered (this is distinct from the MediaWiki content languages, see MediaWiki manual on Language)
  2. the list of languages the Lexemes in Wikidata are in
  3. the MediaWiki content languages
  4. the list of Wikipedia language editions
  5. the English Wiktionary's list of all 8,163 language codes that are 'recognised' by their community's templates: wikt:en:Wiktionary:List of languages - there's probably a similar list on the other Wiktionaries

A similar list of lists of languages is Léa's list of lists of languages in Wikidata.

We initially considered keeping interface languages (in which languages is the interface of Wikifunctions available?) separate from target languages (in which languages can the Wikifunctions natural language generation library generate content?), but after discussion we decided that it would be more useful for everyone to keep these two lists aligned. It will probably cost a bit more, but in the end it should improve the situation for everyone.

Initial list of languages

The MediaWiki interface languages are based on the languages the MediaWiki user interface supports.

MediaWiki identifies languages using short strings similar to (but not always equivalent to) ISO 639 or BCP 47 codes. The full list of these is available through the LanguageInfo API. At the time of writing, there are 858 languages in this list. We will start with this list as our first list of languages. The source of the list of these languages is described on the talk page.
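For orientation, a request to the LanguageInfo API might be built and read as in the sketch below. The endpoint URL, the chosen properties, and the response shape (a "query" object containing "languageinfo", keyed by language code) are assumptions to be checked against the API documentation, and the function names are hypothetical. Results may also arrive in several batches, which this sketch does not handle.

```python
from urllib.parse import urlencode

# Assumed endpoint; any MediaWiki installation's api.php should work similarly.
API_ENDPOINT = "https://www.mediawiki.org/w/api.php"


def languageinfo_url(props=("code", "bcp47", "name", "autonym"),
                     endpoint=API_ENDPOINT):
    """Build a query URL for the languageinfo meta module."""
    params = {
        "action": "query",
        "meta": "languageinfo",
        "liprop": "|".join(props),
        "format": "json",
    }
    return endpoint + "?" + urlencode(params)


def extract_codes(response):
    """Return the sorted language codes from one decoded JSON response.

    Assumes the response shape described in the text above.
    """
    return sorted(response["query"]["languageinfo"])
```

Only URL construction and response parsing are shown, so the sketch runs without network access.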

The list of languages we currently support in Z12/Multilingual text and that the uselang parameter accepts is the full list of 858 languages.

Note that this is a very inclusive view of what a language is: to give just a few examples, the list includes "de-formal" as well as each of "uz", "uz-latn", and "uz-cyrl".

Assigning ZIDs

Here is the suggestion for assigning the first set of languages to ZIDs:

  • First, use the list of official working languages of the UN, and assign to them the first few ZIDs (this way there's a chance of memorizing these languages when working with the system). The order is alphabetic based on the code.
    • Z1001/Arabic (ar)
    • Z1002/English (en)
    • Z1003/Spanish (es)
    • Z1004/French (fr)
    • Z1005/Russian (ru)
    • Z1006/Chinese (zh)
  • Furthermore, starting with Z1011 and going to Z1861, we assign the other 852 languages that are in the full list of interface languages, based on the alphabetic order of their code. Any further languages will be added chronologically.
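The assignment scheme above can be sketched in a few lines of Python. The function name and parameters are hypothetical, and "alphabetic order of their code" is interpreted here as plain string sorting, which is an assumption.

```python
# The six UN working languages listed above, by MediaWiki language code.
UN_LANGUAGE_CODES = ["ar", "en", "es", "fr", "ru", "zh"]


def assign_zids(codes, un_codes=UN_LANGUAGE_CODES,
                un_start=1001, other_start=1011):
    """Map each language code to a ZID following the scheme above."""
    zids = {}
    # UN languages first, in alphabetic order of their code.
    for offset, code in enumerate(sorted(un_codes)):
        zids[code] = f"Z{un_start + offset}"
    # All remaining codes, again alphabetically, from the second block on.
    others = sorted(set(codes) - set(un_codes))
    for offset, code in enumerate(others):
        zids[code] = f"Z{other_start + offset}"
    return zids
```

Languages added later would not be passed through this function again; as stated above, they would receive the next free ZID in chronological order, so existing assignments never shift.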

Alternatives: instead of using an alphabetic order, we could also use a hash or assign them randomly. Sorting them by the number of L1 or L2 speakers is difficult due to a lack of reliable statistics.