User:LA2/Corpus

This is the presentation according to my approved proposal for the Wikimedia CEE Meeting 2017 in Warsaw in September.

This presentation is aimed at Wikipedians, people who already know the basics of wiki editing but who might be new to Wiktionary and Wikisource, and in particular to my novel approach of combining these two younger sister projects of Wikipedia.

There is a video of the presentation, which does not exactly follow the text below. In this video, the following web pages are shown as examples:

statssekreterare in English Wiktionary
високосный год, skottår, and sats in Russian Wiktionary
screenshots below from старый in Swedish Wiktionary
my user page on Russian Wiktionary and the play by Gorky
my user page on Ukrainian Wiktionary and Taras Bulba by Gogol
mörker in Ukrainian Wiktionary
modernization of Swedish orthography of The Overcoat by Gogol

Parallel corpora based on Wikisource to support foreign language contributions in Wiktionary

Wiktionary, the free dictionary, and Wikisource, the free library, are two spin-offs from Wikipedia, created two and three years later than the free encyclopedia. Both were conceived as recipients of Wikipedia articles that had been created in its first years but were now considered not to be sufficiently encyclopedic. Instead of weeding out such articles, they were moved to new projects. This is not the best way for a project to be created, but with time both have succeeded in finding their own way forward.

Wikisource (Викитека, Wikiźródła) is a repository of scanned books. For the purpose of this presentation, its main advantage over many other book scanning projects is that the text in Wikisource has been carefully proofread, so it can not only be searched, but also trusted to reliably present each word.

Wiktionary (Викисловарь, Wikisłownik) is an enormously ambitious project, aiming to document all words from all languages in each of its many instances. Each instance or site provides explanations in one home language. The Swedish Wiktionary provides explanations in Swedish of nearly 600,000 words from over 300 different languages, which makes it one of the twenty biggest instances of Wiktionary. More than 150 such instances have been started and 77 of them have more than 10,000 pages. These numbers are already large, but the aim is much larger. A really good dictionary would cover perhaps 200,000 words from each language. Theoretically, each instance of Wiktionary could contain hundreds of millions of pages. From that perspective, Wiktionary is still only at its very beginning.

One main difference between Wikipedia and Wiktionary is that each page in Wiktionary contains only a brief description of a word. Even a very short page can be sufficient, and it follows a fixed format with a limited set of headings. It is easy to finish one article and move on to the next. Instead of Wikipedia's tree of knowledge, where the "history of Africa" branches off into post-colonial history and the history of music in southern Africa, Wiktionary looks more like a meadow of many blades of grass: Africa, branch, colonial, history, meadow, music, northern, of, post-colonial, southern, tree.

Among the different instances (languages) of Wiktionary, there are some (English, Swedish, Danish) which use a minimalistic approach, where only a minimum of headings and templates are used, and others (German, Icelandic, Russian, Ukrainian, Polish) which use a formal structure of headings and templates even when there is no actual content under a heading. The first impression is quite different between the two groups, but if you look carefully at the actual content, the differences are not so large.

So who is attracted to Wiktionary? Isn't it foolish to join a project that is still so far from its goal? Not at all! In any language, a small number of words occur very frequently in any text, while a "long tail" of less common words appears very rarely. This is known as Zipf's law. A beginner is helped already by learning 20, 200 or 2000 words of a new language, and with 20,000 words you can get through school.
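The effect is easy to observe on any text. The following minimal Python sketch is not part of the presentation; the sample string and the function name `coverage` are made up for illustration. It counts word frequencies and reports what share of the running text is covered by the most frequent words.

```python
import re
from collections import Counter


def coverage(text: str, top_n: int) -> float:
    """Return the share of running words covered by the top_n most frequent words."""
    words = re.findall(r"\w+", text.lower())
    if not words:
        return 0.0
    counts = Counter(words)
    covered = sum(count for _, count in counts.most_common(top_n))
    return covered / len(words)


# Hypothetical sample text; a longer text, such as a novel chapter,
# shows the long tail far more clearly.
sample = (
    "the old man and the old woman sat in the garden "
    "and the birches in the garden were old"
)

for n in (1, 3, 5):
    print(f"top {n} words cover {coverage(sample, n):.0%} of the text")
```

Run on a full novel instead of the toy sentence, the same function gives a rough idea of how far a vocabulary of 20, 200 or 2000 words reaches.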

I personally contribute to Wikipedia and Wiktionary mostly in areas that I am curious about, things that I'm currently learning, rather than those where I already have much knowledge. I look up facts or words and then document them in the growing online collection of shared knowledge. This is why, ever since I started to learn Russian three years ago, I have also been contributing to the knowledge of the Russian language in Wiktionary, both Russian words in the Swedish Wiktionary and Swedish words in the Russian Wiktionary. Since each Wiktionary page follows the same structure, it is easy to copy the structure from an existing article and just fill in the blanks. You can contribute small parts, even if you are not yet fluent in a language.

Let's look now at a very simple, but still valid, entry in Wiktionary, the Russian adjective "старый" (meaning old) in the Swedish Wiktionary. The page carries the name of the word. The heading says this is a word in Russian, an adjective, and the explanation says it means "gammal" (Swedish for old). Both the explanation and the headings are in the home language, which is Swedish in the Swedish Wiktionary.

So how can this most basic entry be improved and expanded? There are many ways. We could describe the historic origin or etymology of this word. We could provide an illustration of something that is or looks old. We could provide comparative forms (older, oldest), synonyms (ancient) and antonyms (opposites: new, young) of the word, or other related words (olden, old-fashioned, oldie).

If you attended the Wikimedia CEE Meeting 2015 in Tartu, you might remember that I gave a lightning talk on how to record short audio files with the pronunciation of a word, for use in Wiktionary. So it might seem strange that the file Ru-старый.ogg is not used in this example. This was corrected after the screenshot was made.

A most fundamental aspect of Russian adjectives is that they need to be inflected according to grammatical gender and case. A table of 4 columns by 6 rows is needed for this.

Yet another addition, and the focus of this presentation, is example sentences that show how the word can be used in context. For a foreign word, one from a different language than the home language, Wiktionary should also provide a translation of each example sentence.

A problem with example sentences, however, is that the contributor's imagination is limited and often comes up with very flat and boring examples such as "An old woman sits on a chair". These are very common in Wiktionary. Wouldn't it be nice if we could suddenly amend the short article to look like this?

In these four example sentences, four unique inflections of the adjective are used. That's not all twenty-four combinations, but good enough. All four sentences are written in good Russian, all are grammatically correct, and the translations are correct and written in good Swedish. None of the sentences look flat, boring or constructed. They are taken from different walks of life:

The first shows the nominative (basic) case in plural: the old lush birches of the garden.
The second sentence is an exclamation, almost vulgar: shut up, you old bitch!
The third is nicer, saying: she welcomed him as an old friend.
The fourth gives an analogy: like acid on an old and dirty coin, using the feminine accusative case (старую) due to the preposition "на".

I think that Wiktionary would indeed be much better if more articles could be improved in this way.

But there is one big problem: As a curious beginner in Russian, contributing to Wiktionary as I learn new words, how can I possibly come up with such nice example sentences? How can I know that they are correct? And that the translations are correct? It seems like an impossible dream.

Enter the magic search engine.

I just go to my user page on the Russian Wiktionary. There is a search box, where I enter the wanted word (either "старый" or "gammal"), and these example sentences come out, ready-made, with their already perfect translations. Wow!

I must admit, I did not create this alone. I had help from four of the best experts in the field. Their names are Sigurd Agrell, Georg Procopé, Maxim Gorky, and Leo Tolstoy. The first two are Swedish translators, the other two are of course world-famous Russian writers. My example sentences are taken from the novel Anna Karenina and the play The Lower Depths.

My four helpers have this in common: All have been dead for more than 70 years. Their copyright has expired. All of their works are in the public domain. We are free to reuse them as we wish, and this is one such example of reuse.

The original Russian texts of Anna Karenina (1873–1876) and The Lower Depths (На дне, 1902) can be found in the Russian Wikisource. The Swedish translations, published in 1903 and 1926, were digitized by me in Project Runeberg, a website separate from Wikisource but very similar.

It is of course possible to find a wanted word in a novel, and then to look up the same book, chapter, and sentence in a translation of that novel. But this is a tedious process for each word. How does that magic search engine really work?

On my user page on the Russian Wiktionary there is a search box, created with the wiki code <inputbox>prefix=Участник:LA2</inputbox>. All it does is search my user page and its subpages. My user page is a diary written in English, so it does not produce any matches when you search for "старый" or "gammal". What I have added are subpages that contain text. There is one for Gorky's play The Lower Depths and another for Tolstoy's novel Anna Karenina, which, however, contains only a few chapters. There are a few more from other genres of literature. Each such subpage contains a table in wiki code with two columns, presenting one sentence per row in Russian and Swedish:

старые кудрявые березы сада, обвисшие всеми ветвями от снега, казалось, были разубраны в новые торжественные ризы.	Trädgårdens gamla hängbjörkar stodo med hela sitt grenverk inhöljt i snö liksom svepta i en ny vit högtidsskrud.
¶ Он шел по дорожке к катку и говорил себе:	¶ Han vandrade fram på vägen mot skridskobanan och tänkte för sig själv:
«Надо не волноваться, надо успокоиться.	»Du får inte vara upprörd, det gäller att vara lugn. —
О чем ты? Чего ты? Молчи, глупое», — обращался он к своему сердцу.	Vad nu? Vad vill du? Håll dig stilla, du dåraktiga!» talade han till sitt hjärta.
И чем больше он старался себя успокоить, тем все хуже захватывало ему дыхание.	Men ju mera han ansträngde sig för att komma till lugn, desto värre stockade sig andedräkten i hans bröst.

Such a compilation of a text and its translation side by side, sentence by sentence, is called a "parallel corpus". Arranging the text like this is called "aligning" the two texts.
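The lookup can also be imitated offline. The sketch below is neither the wiki's search engine nor any code used on the user page. It is a minimal Python illustration that assumes the aligned text is available as lines with the Russian and Swedish sentence separated by a tab. The names `parse_pairs` and `find_examples` are hypothetical. The plain substring match is a simplification: it finds inflected forms only when a shared stem such as "стар" is given.

```python
from typing import List, Tuple

Pair = Tuple[str, str]  # (Russian sentence, Swedish sentence)

# Two aligned rows copied from the Anna Karenina sample, as "Russian<TAB>Swedish".
ALIGNED_TEXT = "\n".join([
    "старые кудрявые березы сада, обвисшие всеми ветвями от снега, казалось, "
    "были разубраны в новые торжественные ризы.\t"
    "Trädgårdens gamla hängbjörkar stodo med hela sitt grenverk inhöljt i snö "
    "liksom svepta i en ny vit högtidsskrud.",
    "¶ Он шел по дорожке к катку и говорил себе:\t"
    "¶ Han vandrade fram på vägen mot skridskobanan och tänkte för sig själv:",
])


def parse_pairs(text: str) -> List[Pair]:
    """Split tab-separated lines into (Russian, Swedish) sentence pairs."""
    pairs = []
    for line in text.splitlines():
        if "\t" not in line:
            continue  # skip blank or unaligned lines
        russian, swedish = line.split("\t", 1)
        # Drop the paragraph mark that opens some rows.
        pairs.append((russian.lstrip("¶ ").strip(), swedish.lstrip("¶ ").strip()))
    return pairs


def find_examples(pairs: List[Pair], query: str) -> List[Pair]:
    """Return every pair whose Russian or Swedish side contains the query."""
    needle = query.casefold()
    return [
        pair for pair in pairs
        if needle in pair[0].casefold() or needle in pair[1].casefold()
    ]


pairs = parse_pairs(ALIGNED_TEXT)
for russian, swedish in find_examples(pairs, "стар"):
    print(russian)
    print(swedish)
```

Because both sides of a pair are searched, the same function works whether the query is a Russian or a Swedish word, and each hit comes back together with its translation.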

"Parallel corpus linguistics" has great use in research, most often with very large volumes of text, where statistical models of languages can be built ("statistical machine translation", SMT, an application of "big data"). Google Translate is perhaps the best known application of this. In my case, I'm not interested in large volumes and statistics, but the individual example sentences, and so I don't need very large volumes of text.

Aligning the text is an editing task that takes some time, for sure, but it is a fun job for a language learner: you edit a text in a language you don't fully understand, trying to find out which sentences correspond to each other. It sometimes happens that a translator has left out part of the text.

One particular text that is well suited for this is the Bible. Its books, chapters and verses are already numbered, as if prepared for alignment. The book of Genesis contains many basic words such as heaven and earth, darkness and light, etc. And the translations into various languages have already been carefully reviewed by many language experts.
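Because the verse numbers serve as ready-made keys, aligning two Bible translations amounts to joining them on book, chapter and verse. A minimal Python sketch of that idea follows. The function `align_by_verse` is hypothetical, the placeholder strings stand in for real verse texts, and this is not necessarily the procedure that was actually used.

```python
from typing import Dict, List, Tuple

VerseRef = Tuple[str, int, int]  # (book, chapter, verse)


def align_by_verse(
    left: Dict[VerseRef, str], right: Dict[VerseRef, str]
) -> List[Tuple[VerseRef, str, str]]:
    """Pair up the verses present in both translations, sorted by reference."""
    shared = sorted(left.keys() & right.keys())
    return [(ref, left[ref], right[ref]) for ref in shared]


# Placeholder strings only; no real verse text is reproduced here.
translation_a = {
    ("Genesis", 1, 1): "<text A of 1:1>",
    ("Genesis", 1, 2): "<text A of 1:2>",
    ("Genesis", 1, 3): "<text A of 1:3>",
}
translation_b = {
    ("Genesis", 1, 1): "<text B of 1:1>",
    ("Genesis", 1, 3): "<text B of 1:3>",
}

for ref, text_a, text_b in align_by_verse(translation_a, translation_b):
    print(ref, text_a, text_b, sep="\t")
```

A verse missing from one translation simply drops out of the result, which is the easy case; in a novel, the corresponding gap has to be found by reading.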

My Russian-Swedish parallel corpus now contains a total of 1183 kilobytes, or roughly 390 thousand characters in each language, originating from twelve source texts. The longest texts are Gorky's play The Lower Depths (На дне) and Gogol's short story The Overcoat (Шинель), which are presented in full.

To show that even a very small amount of text can yield useful results, I have also started a similar but much smaller Swedish-Ukrainian parallel corpus on my user page on the Ukrainian Wiktionary (85 kilobytes in all or 28 thousand characters in each language), using a few chapters from the Bible and from Gogol's novel Taras Bulba. Neither was originally written in Ukrainian, but Ukrainian and Swedish translations are freely available.