Talk:Abstract Wikipedia/Updates/2021-04-15

English-only oriented choice

It is clear that the selection was largely based on a selection of countries where ONLY English is an official language. And it's unfortunate that these languages have low variety of aspects, meaning that they won't be sufficient for correct coverage of language features:

no RTL language
simple plural rules
simple scripts (yes Bengali-Assamese and Malayalam are simple)
nothing about agglutinating languages
nothing about mutating languages (simple morphology): testing things like lexeme forms will still not work for all the desired languages (and still not for many minority languages that the project aims to cover)

So this can just be an initial testbed, for an alpha version, but you'll need more work (or redesigns, including for the data model!) to correctly cover a significant number of languages (including languages spoken today by tens of millions of native speakers, and major languages whose communities now risk rejecting the project completely). The risk is a split or a completely different solution designed elsewhere (possibly proprietary).

So I hope that this selection will not stay as is. In my opinion

Bengali+Malayalam has one too-many options, and choosing Hindi+Urdu would have been a better choice, covering also the difficult case of weak conversions of scripts
or just Hindi, if another RTL language had been chosen (like Hebrew, Arabic or Persian, possibly Maldivian, with coverage of the additional difficulty caused by their optional diacritics and collation and correct coverage of BiDi).
Another language should have been in the Indo-Malaysian group (such as Javanese) or the larger group of Malayo-Polynesian family (such as Malagasy or Maori)
Probably a Quechuan language should have been present (with the hope that the Spanish support would come rapidly to support their efforts)

There's no doubt that the development in other major languages (notably French, German, Spanish, Russian, possibly Portuguese) will appear almost immediately in parallel with English, as long as English is not just a simple "demo" but is correctly maintained in sync, but it would have been preferable to allow another reference language than just English (which is very defective in many aspects, compared to other official UN languages: French, Spanish, Arabic, Russian, and Simplified Mandarin).

But for testing and developing "small" languages, you really need other languages than only those spoken in countries where English is official: otherwise you will still not expand the audience, and you'll keep the bias once again increased in favor of English.

So I hope that very soon, the early developments initiated in these 4 languages will also find their support by themselves in other languages that will be used to join other small communities: notably French/Spanish/Portuguese (and many related "smaller" communities including Italian, Corsican, Romansh, Romanian, or Catalan), Russian (maybe related to Polish, Ukrainian, Serbian, Bulgarian, Macedonian, Czech, Slovak and Slovenian), Turkish (maybe Azeri, Turkmen, Mongolian), Greek, Armenian, Georgian, Persian (maybe related to Uzbek and Pashto), Arabic, Wolof, Amharic, Indonesian (maybe related to Malay, Javanese, Malagasy), Hindi (maybe Urdu and other Indo-Aryan languages of India, and Nepalese, Bhutanese, or Tibetan), Vietnamese, Thai, Khmer, Burmese/Myanmar, and Chinese (maybe Korean and Japanese which have their own peculiarities, even if they are not linked to many other small communities). Besides that, some other "smaller" languages like Finnish and Hungarian require more specific efforts (with lower possibilities of help by other major languages, except for regional languages support in countries that have another major language already official and well supported).

Some small languages (like Basque, native Amerindian languages, Celtic languages) will still be challenging as they are much more isolated with lower counts of native speakers, and very different from the other major language with which they coexist on the same territory only as a minority regional language with a nationally low level of official support (and sometimes the absence of recognized academic sources or standardisation of their orthographies and good understanding of their grammatical requirements or specific features, and for many of them we severely lack open access to a large enough and productive bibliography).

I do not doubt that German will be supported by the existing Wikidata team itself at Wikimedia Deutschland (but it won't be difficult to support it as well for Dutch, Afrikaans, Danish, Swedish, and Norwegian). This will be needed to get correct coverage in Northern and Western Africa, as well as Southern and Eastern Asia, and have a less Germanic influence (with many false assumptions made in early developments that could be very costly to solve later, so that the project will in fact no longer help "closing the gap" but could in fact increase it).

-- verdy_p (talk) 00:28, 17 April 2021 (UTC)

The selected focus languages were chosen from those which applied and aligned with the criteria.

Re: It is clear that the selection was largely based on a selection of countries where ONLY English is an official language. -- As far as I can tell, English is not an official language in Bangladesh or Niger. Either way, it was not a data-point that we looked at, nor part of the criteria.

It will all expand over time. Quiddity (WMF) (talk) 23:37, 20 April 2021 (UTC)

Bengali is a language of India too, with strong relations, and in most of India, English is used instead of Hindi, which many states don't accept (this also applies to Malayalam), but English is more often a lingua franca. Niger is small compared to Nigeria, where Hausa is most prevalent.

But one important point is the bias in coverage for distinctive language features, in their grammar (plurals, cases, genders, syntax), and scripts (where is RTL?), and the way words or morphemes are composed and/or modified (notably by contextual mutations). A lot more effort will be needed to cover the essential cases, and to have enough coverage of lingua francas that will help smaller communities to start up and coordinate their efforts (notably in Africa but more importantly in Central Asia, Southeastern Asia, and Eastern Asia). Designing the Abstract Wikipedia without taking the needed features into account early enough will become a severe blocking issue later (in the Abstract Wikipedia for modeling languages, but also in the data model for Wikidata: the current design of lexemes and forms does not fit well with many languages, and the support for mutations and derivations is too minimalist to be usable for many languages; it also does not take into account many things that have been better modelled in Wiktionaries:

Maybe Wikidata is the wrong place to develop these, and better coordination would have been to make Wiktionaries interoperate without having to duplicate the efforts and data in Wikidata: the statistics of coverage in Wikidata are instructive when you compare them with the current coverage and growth in Wiktionaries). But unfortunately, the current goals have discarded Wiktionary as a viable source of data, and too much is demanded of Wikidata. We already see contradicting statements, and now huge difficulties in reconciling the efforts. The same remark applies to Wikispecies (now relegated to a legacy project: Wikispecies now has better competitors in other open data projects, but not in Wikimedia, see the Encyclopedia of Life for example; clearly we need a way to empower existing projects with more database capabilities and modelling; functions will just be a superficial layer allowing the adaptation to create "connectors" or interfaces, without needing to duplicate the data and efforts and then solve the contradictions). verdy_p (talk) 21:50, 22 April 2021 (UTC)