User:MPopov (WMF)/Notes/Incubator test wikis
Motivation: Caroline's question posted in #working-with-data in Slack
Question: How many RTL languages have test wikis in the Incubator?
Answer:
| Language directionality | Languages with 1+ test wiki(s) |
|---|---|
| Cyrillic (LTR?) | 1 (neg) |
| Vertical (Letters: TTB, Lines: LTR) | 3 |
| Left-to-right | 576 |
| Right-to-left | 31 |
Caveat: These counts only include test wikis that satisfy the following criteria:
- They are substantial (having at least 25 mainspace pages), and/or
- They are active (having had some mainspace page creation since the beginning of 2023).
Methodology
The data will come entirely from https://incubator.wikimedia.org/wiki/Incubator:Wikis, which we will "scrape" using JavaScript and analyze separately.
When the page loads, all of the test wikis are collapsed/hidden. Each one can be expanded by clicking its "[show]" link, which loads that testwiki's information. We can expand all of them at once by running the following JavaScript code in the browser console, which triggers the click event on each "[show]" link:
$("td a.att-toggle:contains('[show]')").each(function(index) {
$(this).click()
})
It will take a minute or two to load all of the testwikis' information.
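Rather than just waiting, one rough heuristic for knowing when loading has finished (a sketch, not part of the original workflow; the 5-second interval is an arbitrary choice) is to poll until the number of loaded entries, counted with the same .testwiki-language selector we use for extraction below, stops growing:

var lastCount = -1;
var poll = setInterval(function() {
    var n = $(".testwiki-language").length;
    if (n === lastCount) {
        // No new entries since the last check; assume loading is done.
        clearInterval(poll);
        console.log("Loading appears complete: " + n + " entries.");
    } else {
        lastCount = n;
    }
}, 5000); // arbitrary 5-second polling interval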
Now that they have all loaded, we can extract the ISO 639-3 language code and directionality info for each testwiki, storing those two pieces of data in the array testwiki_languages:
var testwiki_languages = [];
$(".testwiki-language").each(function() {
    var lang = {
        // The language code, e.g. "ary" for Moroccan Arabic:
        iso_639_3: $(this).find("kbd a").text(),
        // Strip the "Directionality: " label, keeping just the value:
        directionality: $(this).find("ul li:contains('Directionality')").text().replace('Directionality: ', '')
    };
    testwiki_languages.push(lang);
});
To get that data into R or Python, we need to stringify it into a JSON representation:
JSON.stringify(testwiki_languages)
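As an aside, the developer consoles in Chrome, Firefox, and Edge also provide a copy() utility that puts the string directly onto the clipboard, which can be handier for a long string:

// copy() is a console-only utility, not standard JavaScript:
copy(JSON.stringify(testwiki_languages))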
We can also copy the output by right-clicking the logged message in the console and selecting "Copy Message". For data analysis in Python you would use:
import pandas as pd

# Parse the JSON string copied from the console into a DataFrame:
testwiki_languages = pd.read_json('[{"iso_639_3":…]')
But we are going to do the analysis with R:
library(jsonlite)

# Parse the JSON string into a data frame of (iso_639_3, directionality) rows:
testwiki_languages <- fromJSON('[{"iso_639_3":…]')
(In both of these cases the full, rather lengthy string is omitted.) Finally, let's count languages by directionality, keeping in mind that, because of how we compiled our dataset, it will contain duplicates of languages when there are multiple projects incubating for the same language (e.g. Moroccan Arabic Wikibooks, Moroccan Arabic Wikiquote, Moroccan Arabic Wiktionary):
library(tidyverse)

testwiki_languages |>
    # Collapse multiple test wikis for the same language into one row:
    distinct(iso_639_3, directionality) |>
    # Tally the number of languages for each directionality:
    count(directionality)
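As a sanity check (a sketch, not part of the original notes), the same distinct-then-count logic can also be run directly in the browser console before exporting:

var seen = new Set();
var counts = {};
testwiki_languages.forEach(function(lang) {
    // De-duplicate on the (language, directionality) pair, mirroring distinct():
    var key = lang.iso_639_3 + "|" + lang.directionality;
    if (!seen.has(key)) {
        seen.add(key);
        counts[lang.directionality] = (counts[lang.directionality] || 0) + 1;
    }
});
console.table(counts);

The resulting tallies should match the answer table at the top of these notes.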