Abstract Wikipedia/Updates/2022-03-14
How would we generate a text in Abstract Wikipedia such as the first sentence of the English Wikipedia article about Mariya Zerova?
Mariya Yakovlevna Zerova, alternately Marija Jakovlevna Zerova, (April 7, 1902 – July 21, 1994) was a Ukrainian biologist and taxonomist known for her work in mycology.
There are plenty of interesting questions regarding generating this short sentence - the name, the biographical dates, the description. Today, let’s just focus on the name.
Given that Zerova was Ukrainian, and was born and lived in Ukraine, her name was written using the Cyrillic alphabet: “Марія Яківна Зерова”. In her English Wikipedia article, her name in the Cyrillic alphabet is given in the infobox, but not in the text of the article. There are several ways to transliterate the name from the Cyrillic alphabet to the Latin alphabet. In particular, the letter я can be transliterated as “ya” or “ja” in English, which leads to the variation given in the English Wikipedia article.
Her Wikidata item states that her first name in English is “Marija”, and not “Maria”, “Mariya”, or “Mariia” (all three of these names are written as “Марія” in Ukrainian). Names are a difficult mess, and so it is not surprising that Wikidata is having trouble representing them. A big thanks and shoutout to the hard work by the WikiProject Names on Wikidata, which aims to sort out this kind of issue. You should join them if you are interested in helping.
So, how would we get her name for Abstract Wikipedia for the different languages? Do we need Lexemes for every first name in every language, such as the Lexeme “Maria” in English? And how would we then link the given name stated on her Wikidata item to the item for that given name, and in turn link the Lexemes to that given name?
What about “Yakovlevna”, her patronym? Or “Zerova”, her family name? Both names are rarer than “Mariya”. Would we expect Lexemes for each of these names in Wikidata too, for each language individually? That seems like a lot of work.
In such cases I hope that the answer is no, and that we can figure out a way to avoid that. But what could that look like? As usual, I expect that as a community we will come up with a better solution than what I could come up with. Together we are smarter than any one of us. So think of this as a first, rough draft.
My first thought would be to have functions in Wikifunctions that take a name such as “Yakovlevna” as a string and can generate all necessary forms based on regular morphological functions. Names that have irregular forms would still be Lexemes, but if a function can create the necessary forms, we should be able to use that directly based on a string. So if we need the genitive form of the name, “Yakovlevna’s” (as in this very sentence), a function would just generate it.
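To make that idea concrete, here is a minimal sketch in Python, assuming English as the target language. The names `english_genitive` and `IRREGULAR_GENITIVE` are hypothetical and not existing Wikifunctions functions; the small dictionary merely stands in for Lexemes that would record irregular forms, and its one entry is an illustrative style choice.

```python
# Hypothetical stand-in for Lexemes: names whose genitive is not formed
# by the regular rule. The entry below is only an illustration.
IRREGULAR_GENITIVE = {"Jesus": "Jesus'"}


def english_genitive(name: str) -> str:
    """Return an English genitive form for a name given as a plain string."""
    # Names with a recorded irregular form take precedence.
    if name in IRREGULAR_GENITIVE:
        return IRREGULAR_GENITIVE[name]
    # Regular rule: append apostrophe + s (straight apostrophe for simplicity).
    return name + "'s"


print(english_genitive("Yakovlevna"))
```

A comparable function for a language with a richer case system would return a whole table of forms rather than a single string, but the split between a regular rule and recorded exceptions would stay the same.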
The same mechanism to generate the necessary forms may be helpful for many place names and other proper names. In addition, we will likely need functions that can transliterate between different alphabets, which is a hornets’ nest in itself. Transliterations can differ from target language to target language: the transliteration of “Зерова” into German would be “Serowa”, not “Zerova” as it is in English.
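A rough sketch of such a target-dependent transliteration function could look like the following. The mapping tables are deliberately tiny and only cover the letters needed for “Зерова” and “Яківна”; they are assumptions made for illustration, not a transliteration standard, and the name `transliterate` is hypothetical.

```python
# Illustrative tables only: one mapping per target language, covering just
# the letters used in the examples from the text.
UK_TO_LATIN = {
    "en": {"з": "z", "е": "e", "р": "r", "о": "o", "в": "v", "а": "a",
           "я": "ya", "к": "k", "і": "i", "н": "n"},
    "de": {"з": "s", "е": "e", "р": "r", "о": "o", "в": "w", "а": "a"},
}


def transliterate(name: str, target_language: str) -> str:
    """Transliterate a Cyrillic name using the table for the target language."""
    table = UK_TO_LATIN[target_language]
    pieces = []
    for ch in name:
        lower = ch.lower()
        if lower not in table:
            # Fail loudly instead of guessing a mapping we do not have.
            raise ValueError(f"no mapping for {ch!r} in {target_language!r}")
        latin = table[lower]
        # Preserve capitalisation of the source letter.
        pieces.append(latin.capitalize() if ch.isupper() else latin)
    return "".join(pieces)


print(transliterate("Зерова", "en"))  # Zerova
print(transliterate("Зерова", "de"))  # Serowa
```

Keeping one table per target language makes the difference between “Zerova” and “Serowa” a matter of data rather than of code.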
But that’s not all. The astute reader might have already noticed that “Yakovlevna” is not a direct transliteration of “Яківна”: that would be “Yakivna” (or “Jakivna”). What happened here?
In addition to the name being transliterated (i.e. mapped from one script to another), the name was also translated, or backformed, into the way it would be formed in Russian. The English form “Yakovlevna” is based on the Russian form “Яковлевна”, and indeed, if we look in the Russian Wikipedia, the Russian name for the biologist is “Мария Яковлевна Зерова”, a version of the name that is never mentioned in her Ukrainian Wikipedia article.
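The difference between the two routes can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the lookup table `UK_TO_RU_NAME` holds just this one patronym, and the letter tables cover only the letters that occur in it.

```python
# Illustrative lookup from a Ukrainian patronym to its Russian counterpart.
UK_TO_RU_NAME = {"Яківна": "Яковлевна"}

# Letter tables limited to this example; unmapped characters pass through.
UK_TO_EN = str.maketrans({
    "Я": "Ya", "к": "k", "і": "i", "в": "v", "н": "n", "а": "a",
})
RU_TO_EN = str.maketrans({
    "Я": "Ya", "к": "k", "о": "o", "в": "v",
    "л": "l", "е": "e", "н": "n", "а": "a",
})


def direct(name_uk: str) -> str:
    """Transliterate the Ukrainian form as it stands."""
    return name_uk.translate(UK_TO_EN)


def via_russian(name_uk: str) -> str:
    """First translate the name to its Russian form, then transliterate."""
    return UK_TO_RU_NAME[name_uk].translate(RU_TO_EN)


print(direct("Яківна"))       # Yakivna
print(via_russian("Яківна"))  # Yakovlevna
```

The first step of `via_russian` is a lookup, not a letter-by-letter rule, which is why it cannot be folded into the transliteration table.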
By the way, if you are surprised to find that names can be translated, enjoy seeing the names of “Pope John Paul II” in different languages on Wikidata by clicking on “All entered languages”.
How would Abstract Wikipedia ever figure out that it should first translate “Яківна” to Russian and then transliterate it? Is this even the right thing to do? To be honest, I am entirely stumped here. Should Ukrainian names in general first be translated to Russian variants, and then be transliterated? Let’s take two other Ukrainians who share the same given name: the President of Ukraine and the brother of the Mayor of Kyiv are both named “Володимир”, but English Wikipedia refers to the President as “Volodymyr” (a direct transliteration) and to the other as “Wladimir”. In Ukrainian, they have the same name!
I guess in many of those cases the best we can do is to rely on Wikidata, and use the labels on the items as string input, together with the structured data around given and family names. This allows us to enter and fix the data manually, item by item, where there is evidence that an individual used a different form. Only if Wikidata does not offer the necessary data would we need to use fallback functions. And the fallback functions could be different from language to language, so that her patronym can be “Яковлевна” in Russian and “Яківна” in Ukrainian.
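One possible shape for that lookup order, sketched in Python: the dictionary `LABELS` is an in-memory stand-in for the labels on a Wikidata item (it is sample data taken from the names quoted above, not output of any real API), and `name_for`, `FALLBACKS`, and `fallback_de` are hypothetical names.

```python
from typing import Callable, Dict, Optional

# Stand-in for labels stored on an item; sample data, not a real API response.
LABELS: Dict[str, str] = {
    "uk": "Марія Яківна Зерова",
    "ru": "Мария Яковлевна Зерова",
}


def fallback_de(native_name: str) -> str:
    # Placeholder: a real fallback would transliterate by German conventions.
    return f"[de transliteration of {native_name}]"


# Fallback functions can differ from language to language.
FALLBACKS: Dict[str, Callable[[str], str]] = {"de": fallback_de}


def name_for(language: str,
             labels: Dict[str, str],
             fallbacks: Dict[str, Callable[[str], str]],
             native_language: str = "uk") -> Optional[str]:
    """Prefer a manually curated label; otherwise try a per-language fallback."""
    if language in labels:
        return labels[language]
    fallback = fallbacks.get(language)
    if fallback is None:
        return None  # nothing curated and no fallback known
    return fallback(labels[native_language])


print(name_for("ru", LABELS, FALLBACKS))
print(name_for("de", LABELS, FALLBACKS))
```

Because curated labels always win, correcting a name for one person means editing data on that person's item and never touching the fallback functions.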
And maybe, just maybe, having to encode that explicitly will make us more aware of how names of people and places flow through our knowledge ecosystem, how they reflect power and inequity.
So many interesting things about just the first few words of this one sentence, and we haven’t even talked yet about whether her birth date is stated in the Gregorian, Julian, or another calendar!