Movement Insights/Wiki comparison
The wiki comparison tool shows a simple, snapshot comparison of our wikis (unlike, say, Wikistats, which is meant to show simple trends within individual wikis or wiki groups).
Suggested statistics
This page lists suggested additional statistics for the wiki comparisons dataset. Please add your own ideas or vote for ideas that are already here!
Statistic | Notes | Votes |
---|---|---|
Mobile retention rate | Needs some data infrastructure (e.g. editor_day table) | Neil, Jan |
Uses language variants? | Not sure what the best way to get this list is | |
Fundraising revenue | Would help us understand places where major reader-facing changes could potentially impact fundraising | |
Number of associated countries | Probably based on the percentage of editors and readers from the country; needs analysis to decide where to set the threshold; perhaps should be split into major and minor countries | Jan, Dana |
Top country | | Jan, Dana |
Ratio of top associated country to second associated country | | Jan |
wiki creation date | In progress: T336999. | |
monthly registration | ||
very active editors | editors with 100 or more content edits in a month | Jan, Denis |
uses flagged revisions | ||
has ArbCom | Relevant profiles: https://office.wikimedia.org/wiki/Trust_and_Safety/ArbCom_maps ; overall: https://meta.wikimedia.org/wiki/Arbitration_Committee | Jan, Sydney |
has oversighters | https://meta.wikimedia.org/wiki/Oversight_policy#Requests_for_oversight | Jan |
has checkusers | https://meta.wikimedia.org/wiki/CheckUser_policy#Access_to_CheckUser | Jan |
Global South/Emerging Communities traffic percentage | | Jan, Dana |
speakers to editors ratio | Would require collecting (above) speaker population, but that will be very difficult to gather for anything more than top 20 or so languages; Ethnologue sells the best dataset | Jan |
speakers to mobile devices ratio | also requires speaker populations | Jan |
language health | Ethnologue's dataset also includes a measure of language vitality: https://www.sil.org/about/endangered-languages/language-vitality | Neil |
number of user account blocks, by reason | The Community health initiative looked at one week of total block counts by wiki ( https://docs.google.com/spreadsheets/d/1_4GZ2WUurxaehlNeab5mF7VDgOgHd-PXjFjsLD5tY2Q/edit#gid=1703473757 ). This is good data, but it needs to be separated by reason to be more meaningful | Sydney, Jan |
talk page edits proportion | ||
number of talk page editors | ||
median article quality | velocity and acceleration would also be interesting if they can be directly queried with ease | Adam |
volume in important articles | velocity and acceleration would also be interesting if they can be directly queried with ease | Adam |
articles injected from translation | velocity and acceleration would also be interesting if they can be directly queried with ease | Adam |
multimedia (by type) coverage | velocity and acceleration would also be interesting if they can be directly queried with ease | Adam |
mobile Android app edits | this is just to be more granular; velocity and acceleration would also be interesting if they can be directly queried with ease | Adam |
mobile iOS app edits | this is just to be more granular; velocity and acceleration would also be interesting if they can be directly queried with ease | Adam |
mobile web edits | this is just to be more granular; velocity and acceleration would also be interesting if they can be directly queried with ease | Adam |
citation (by type) coverage | velocity and acceleration would also be interesting if they can be directly queried with ease | Adam |
external referer count | velocity and acceleration would also be interesting if they can be directly queried with ease | Adam |
Logged-in page views, mobile and desktop | distribution of page views between logged-in and logged-out users | Margeigh |
% of editors who edit other wikis | might help distinguish e.g. Meta, MediaWiki.org, Commons from other wikis | |
new content pages | ||
Link to country-specific data (like that in https://docs.google.com/spreadsheets/d/1AMUiZ4z3CCSBClmEU8T6JpASJd2_D4vfR674B2c6NCg/edit#gid=0 ) | It won't be very hard to decide which countries are associated with which wikis, but it's not clear how to weight country data back up into a per-wiki value. | Neil, Adam |
Median page load time or other connection quality metric as recommended by the Performance team | e.g. like this but per wiki instead of per country: https://commons.wikimedia.org/wiki/File:Median_Wikipedia_page_load_times_by_country_(desktop%2Bmobile,_enwiki,_Dec_2015-Jan_2016).svg | Tilman, Quiddity |
Anon edits broken down by platform (mobile web, mobile app, desktop) | | Rita |
Registered users | Would make sorting and lookup easier. | |
Community growth % | For the last three years (e.g. 2019 to 2022), the last year (2021 to 2022), and the year before (2020 to 2021). | Denis |
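As a rough illustration of the three country-related suggestions in the table (number of associated countries, top country, and the top-to-second ratio), here is a minimal Python sketch. Everything in it is an assumption made for illustration: the 5% threshold, the function names, and the made-up traffic shares. The table itself notes that the real threshold still needs analysis.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical threshold: a country counts as "associated" with a wiki when it
# contributes at least this share of the wiki's readers or editors.
# The real value would need the analysis mentioned in the table.
ASSOCIATION_THRESHOLD = 0.05


def associated_countries(
    shares: Dict[str, float], threshold: float = ASSOCIATION_THRESHOLD
) -> List[Tuple[str, float]]:
    """Return (country, share) pairs at or above the threshold, largest first."""
    kept = [(country, share) for country, share in shares.items() if share >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


def top_to_second_ratio(shares: Dict[str, float]) -> Optional[float]:
    """Ratio of the largest share to the second largest, or None if undefined."""
    ranked = sorted(shares.values(), reverse=True)
    if len(ranked) < 2 or ranked[1] == 0:
        return None
    return ranked[0] / ranked[1]


# Made-up shares for an imaginary wiki, for illustration only.
example_shares = {"AA": 0.60, "BB": 0.30, "CC": 0.06, "DD": 0.04}

countries = associated_countries(example_shares)
print("Number of associated countries:", len(countries))
print("Top country:", countries[0][0])
print("Top-to-second ratio:", top_to_second_ratio(example_shares))
```

Splitting the result into major and minor countries, as the table suggests, could be done with a second, higher threshold on the same shares.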
Administration
The code generating the data snapshots lives at github.com/wikimedia-research/wiki-comparison.
We update this dataset about once a year, although it can be updated as often as we choose. When we do, we generally keep the old versions accessible in other tabs in the Google Spreadsheet, so people can perform comparisons if they wish. However, this is a "bonus" feature and we do not guarantee that old snapshots will be kept.
The update process is as follows:
- Generate a new snapshot by running data-collection/data-collection.ipynb. Note that you should always select for SNAPSHOT the latest month with a completed mediawiki_history snapshot; a snapshot ending in December is no better than one ending in March.
- Put up your changes as a pull request.
- Make a copy of the wiki comparison spreadsheet.
- Add the new snapshot to the copy by making a copy of the previous sheet and then pasting the new data (Edit > Paste special > Values only) in the columns from B to the end. This preserves all the nice formatting.
- Have someone review the data in the new snapshot as well as any changes you made to the generation code.
- When satisfied, the reviewer merges the pull request.
- Copy the new snapshot to the main spreadsheet. Make sure to protect the whole sheet (Data > Protect sheets and ranges) with the "Can edit (with warnings)" level. This ensures that people with write access to the spreadsheet do not accidentally change the data or filter it for everyone when they are just trying to use the tool for themselves.
- Announce the new snapshot to #general on the Wikimedia Foundation Slack. Include some background on what the tool is for, to entice new folks to use it.
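The first step above asks for SNAPSHOT to be the latest month with a completed mediawiki_history snapshot. A minimal Python sketch of that selection follows. It assumes snapshot labels use the YYYY-MM form and that you already have a list of completed snapshots from somewhere. The `latest_snapshot` helper and the `completed` list are hypothetical and are not taken from the notebook.

```python
import re
from typing import Iterable

# Assumption: snapshot labels look like "2024-03" (zero-padded year and month).
SNAPSHOT_PATTERN = re.compile(r"^\d{4}-\d{2}$")


def latest_snapshot(available: Iterable[str]) -> str:
    """Return the most recent YYYY-MM label among the given snapshot labels."""
    valid = [label for label in available if SNAPSHOT_PATTERN.match(label)]
    if not valid:
        raise ValueError("no snapshot labels in YYYY-MM format were given")
    # Zero-padded YYYY-MM strings sort chronologically as plain strings.
    return max(valid)


# Hypothetical list of completed snapshots; in practice it would come from
# wherever the completed mediawiki_history snapshots are listed.
completed = ["2023-12", "2024-01", "2024-02", "2024-03"]
SNAPSHOT = latest_snapshot(completed)
print(SNAPSHOT)
```

Picking the maximum rather than a fixed calendar month reflects the note in the first step: no month is preferred, only recency matters.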
Background
The following was written in March 2024 to share background on the wiki comparison tool in a Wikimedia Foundation staff newsletter:
The wiki comparison tool is a public Google Sheet where you can sort and filter the huge list of Wikimedia wikis using 26 different dimensions, from the retention rate of new editors to the percentage of page views that come from mobile apps.
It originated in 2018 as part of an effort to develop standard categories and clusters of wikis to help Foundation staff target programs and drill into metrics. Unfortunately, most of the project was canceled, but not before the Product Analytics team achieved the first step of making a simple tool to allow people to explore the source data it had collected: the wiki comparison tool.
Making the tool was actually a relatively simple task for Product Analytics. Almost all the statistics had already been calculated in one place or another (surprisingly, one of the hardest parts was actually collecting the English names for languages and wikis), and Google Sheets provided a near-perfect, zero-maintenance interface.
Despite the simplicity, many of the statistics it collects aren’t available anywhere else without manual querying by a data analyst. In an average month, about 25 users consult it at least once (the record is 122, in February 2022, when that year’s update was announced). The tool also birthed the canonical wiki dataset, which makes it easy for researchers and analysts to connect different data sets and to translate codes like “euwiki” into names like “Basque Wikipedia”.
Since its creation, the maintainers (previously Product Analytics, now Movement Insights) have updated the data each year and added a handful of new metrics. There aren’t any concrete plans to develop it further, but there are plenty of dimensions that could be added.
The tool serves the same purpose as the many tables of wikis on Meta, which are maintained by volunteers. While wiki comparison has a more useful interface and more metrics (many of which are extremely difficult to calculate without our internal data infrastructure), most of the tables on Meta are bot-updated several times a day, so they’re much faster moving. The project team hopes that in the future, these approaches can be unified so everyone in the movement gets the best of both worlds.