Research:Emerging Technical Communities

From Meta, a Wikimedia project coordination wiki

This page keeps track of the research work done as part of the technical contributors' emerging communities metric definition project.

Project Goal

Tracked in Phabricator:
Task T250284

The goal of this project is to identify emerging wiki communities that could benefit from more automation (bots/tools) to manage the growth of their content. To find meaningful metrics for identifying these communities, we did a fair amount of data exploration. Readers interested in that exploration should first read the #Explorations section below.

Based on our findings we provide a set of next steps and recommendations for the Technical Engagement team.

Recommendation

Base Assumption

In the data exploration stage we looked at different ways to dissect the data. After some trials, we excluded the bots/editors ratio as an indicator of an emerging community. We learned that technical contributors mainly work on content articles, and we found that the content size of a wiki is not correlated with its editing activity level.

As a result of our exploration we settled on identifying "emerging communities that might benefit from more tooling" by looking mainly at two variables:

  1. The number of edits on that wiki that are made by non-bots and
  2. The number of content pages that the wiki has.

We also look at the number of distinct bots that edit that wiki as a dependent variable. Our basic assumption is that, for a wiki to be healthy, automation is needed once the number of content pages is over a certain threshold.

How to identify a community that might be underserved by technical contributors (bot/tool builders)

We decided to choose "monthly non-bot edits" as the major metric to measure the need for automation/bots and "number of content pages" of a wiki as the secondary metric.

We first group wikis by their current number of "monthly bot editors", "monthly non-bot edits" and "content pages". In this classification we look for outliers: wikis with a large number of content pages and a large number of "manual" edits but few bots, or wikis with a large number of content pages but few edits and few bots overall.

Such outliers might indicate a community that needs help developing tooling to keep up with the growth of its wiki. Before reaching out to the community in question, we need to look at other external markers, such as user pageviews for that wiki and its overall edit history. These last two can be assessed via Wikistats. The "liveness" of talk pages is also an interesting marker for assessing whether there is a community behind the edits.
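The grouping step can be sketched in a few lines of Python. This is only an illustration, not the query used for this research: the function name `group_by_bot_editors` is hypothetical, and the rows are a handful copied from the table below. Wikis are grouped by their average number of monthly bot editors, and within each group the wikis with the most non-bot edits come first, so that busy wikis with few bots stand out.

```python
from itertools import groupby

# (wiki_db, content_pages, avg_monthly_bot_editors, avg_monthly_nonbot_edits)
# A handful of rows copied from the table below, in arbitrary order.
ROWS = [
    ("satwiki", 3823, 1, 2430),
    ("pdcwiki", 2054, 1, 333),
    ("tumwiki", 716, 0, 528),
    ("lrcwiki", 5528, 1, 1822),
    ("tywiki", 1288, 0, 56),
]


def group_by_bot_editors(rows):
    """Group wiki names by bot editors; busiest wikis first within a group."""
    # Sort by bot editors ascending, then by non-bot edits descending.
    ordered = sorted(rows, key=lambda row: (row[2], -row[3]))
    return {
        bots: [row[0] for row in group]
        for bots, group in groupby(ordered, key=lambda row: row[2])
    }


groups = group_by_bot_editors(ROWS)
for bots, wikis in groups.items():
    print(bots, wikis)
```

Reading each group from the top gives the candidates for a closer look at pageviews, edit history and talk page activity.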


Number of Pages, Bot Editors, Nonbot Edits by Project
wiki_db content_pages avg_monthly_bot_editors avg_monthly_nonbot_edits might_be_underserved
tumwiki 716 0 528
pntwiki 512 0 130
tywiki 1288 0 56
twwiki 767 0.5 346
dzwiki 294 0.5 99
ikwiki 669 0.5 96
kbpwiki 1800 0.5 93
satwiki 3823 1 2430 Y
lrcwiki 5528 1 1822 Y
smwiki 982 1 1091 Y
tyvwiki 2849 1 916 Y
pdcwiki 2054 1 333
iuwiki 536 1 239
gnwiki 3800 1 215
vewiki 447 1 202
rwwiki 1968 1 185
atjwiki 1207 1 148
rnwiki 704 1 134
kbdwiki 1610 1 130
koiwiki 3498 1 130
lbewiki 1252 1 124
jamwiki 1711 1 122
kgwiki 1254 1 117
kiwiki 1487 1 113
chywiki 788 1 99
xalwiki 2313 1 80
srnwiki 1175 1 80
sgwiki 292 1 59
tcywiki 1606 1.5 1515 Y
oswiki 12172 1.5 763 Y
fjwiki 966 1.5 550 Y
papwiki 2147 1.5 510 Y
omwiki 1101 1.5 428 Y
mrjwiki 10541 1.5 350
glkwiki 6271 1.5 252
biwiki 1393 1.5 230
arcwiki 1754 1.5 224
adywiki 556 1.5 207
mdfwiki 1351 1.5 132
gomwiki 4480 1.5 131
pihwiki 835 1.5 126
chwiki 542 1.5 124
ffwiki 330 1.5 112
bmwiki 750 1.5 104
nsowiki 8276 1.5 97
gotwiki 947 2 1027 Y
klwiki 852 2 836 Y
crhwiki 7281 2 791 Y
crwiki 131 2 534 Y
rmwiki 3709 2 440
ugwiki 6843 2 340
tnwiki 834 2 328
lezwiki 4076 2 291
bpywiki 25252 2 289
aywiki 5048 2 229
kvwiki 5487 2 204
avwiki 2512 2 185
krcwiki 2061 2 159
kswiki 457 2 157
towiki 1748 2 143
bxrwiki 2186 2 142
sswiki 533 2 131
novwiki 1789 2 118
tetwiki 1587 2 107
zawiki 2114 2 97
lgwiki 2384 2 92
quwiki 22374 2.5 798 Y
chrwiki 963 2.5 704 Y
akwiki 1163 2.5 646 Y
szlwiki 52141 2.5 624 Y
lfnwiki 4442 2.5 496
olowiki 3670 2.5 379
pcdwiki 4915 2.5 372
fiu_vrowiki 5636 2.5 296
udmwiki 4936 2.5 240
rmywiki 716 2.5 194
xhwiki 1379 2.5 190
stwiki 789 2.5 140
roa_rupwiki 1258 2.5 105
eewiki 384 2.5 103
piwiki 3204 2.5 69
nvwiki 14360 3 4275 Y
ruewiki 7616 3 954 Y
cuwiki 724 3 801 Y
acewiki 10426 3 531
napwiki 14673 3 522
iewiki 4889 3 416
roa_tarawiki 9275 3 358
abwiki 6119 3 354
ganwiki 6505 3 306
csbwiki 5363 3 282
gagwiki 2868 3 269
kshwiki 2869 3 246
kaawiki 2031 3 242
ltgwiki 1003 3 213
stqwiki 4086 3 199
tiwiki 379 3 182
wowiki 1610 3 160
tswiki 707 3 145
pagwiki 5065 3 114
dinwiki 283 3 108
minwiki 224155 3.5 3320 Y
gorwiki 6115 3.5 2877 Y
myvwiki 6096 3.5 2085 Y
snwiki 5738 3.5 1830 Y
extwiki 3344 3.5 1096 Y
nrmwiki 4538 3.5 900 Y
bjnwiki 3224 3.5 450
pflwiki 2669 3.5 330
miwiki 7190 3.5 311
newwiki 72938 3.5 253
nywiki 642 3.5 155
zeawiki 4756 3.5 131
lijwiki 3916 4 2410 Y
hywwiki 8135 4 1450 Y
mtwiki 3611 4 1369 Y
amwiki 15303 4 706
dvwiki 4307 4 684
scnwiki 26371 4 599
nds_nlwiki 6960 4 525
vlswiki 7180 4 521
hsbwiki 13598 4 379
mhrwiki 10249 4 356
cbk_zamwiki 3199 4 348
pamwiki 8801 4 308
nahwiki 7158 4 211
lnwiki 3284 4 193
gvwiki 5031 4 179
furwiki 3478 4 161
fywiki 43755 4.5 4224 Y
ilowiki 15116 4.5 2097 Y
emlwiki 12447 4.5 1371
zuwiki 2362 4.5 1267
dsbwiki 3276 4.5 1108
kabwiki 4773 4.5 869
maiwiki 14439 4.5 584
hakwiki 9470 4.5 582
tkwiki 7008 4.5 550
xmfwiki 13888 4.5 546
bowiki 11637 4.5 366
bat_smgwiki 16906 4.5 283
mwlwiki 4063 4.5 219
shnwiki 7712 4.5 215
jbowiki 1324 4.5 201
barwiki 31222 5 2976 Y
lmowiki 39569 5 1876 Y
frpwiki 3744 5 1346
pnbwiki 53355 5 1316
iowiki 29914 5 1296
hawiki 5200 5 1151
bclwiki 10600 5 1060
vepwiki 6467 5 991
angwiki 3265 5 881
pmswiki 64718 5 745
lowiki 4499 5 554
fowiki 13335 5 461
kwwiki 4002 5 337
tpiwiki 1646 5 186
nawiki 1506 5 134
bugwiki 14180 5 119
cvwiki 43102 5.5 2784 Y
suwiki 60170 5.5 1121
liwiki 12923 5.5 922
sawiki 11546 5.5 508
sewiki 7823 5.5 439
igwiki 1569 5.5 378
map_bmswiki 13694 5.5 291
inhwiki 1341 5.5 177
mnwiki 21412 6 4568 Y
alswiki 27180 6 3386 Y
azbwiki 239129 6 2835 Y
wuuwiki 32388 6 2516
kmwiki 9981 6 2091
frrwiki 12261 6 1935
htwiki 59536 6 1682 Y
scwiki 6733 6 1449
vowiki 124531 6 1431 Y
hifwiki 10105 6 1116
mznwiki 13456 6 686
ladwiki 3560 6 227
ocwiki 87024 6.5 12238 Y
lbwiki 58269 6.5 8042 Y
newiki 34726 6.5 5443 Y
cewiki 254526 6.5 4697 Y
kywiki 80854 6.5 2040
wawiki 14012 6.5 1841
warwiki 1264396 6.5 1834
pswiki 12665 6.5 1306
yiwiki 15139 6.5 1226
cowiki 5852 6.5 583
gdwiki 15175 6.5 476
hawwiki 3973 6.5 436
mgwiki 92969 7 8254 Y
diqwiki 15635 7 5882 Y
brwiki 68214 7 4652
zh_classicalwiki 10371 7 3217
jvwiki 57874 7 2098
sowiki 7101 7 1283
cdowiki 15467 7 203
mkwiki 105766 7.5 75476 Y
kuwiki 28023 7.5 7172 Y
bawiki 51988 7.5 5807
ndswiki 64924 7.5 4137
iswiki 49790 7.5 4000
gawiki 52812 7.5 3291
sdwiki 15163 7.5 2649
yowiki 32719 7.5 1513
zh_min_nanwiki 405551 7.5 1109
guwiki 29141 8 3455
siwiki 19834 8 3095
pawiki 36239 8 2935
iawiki 22362 8 2221
sahwiki 13673 8 1366
slwiki 168230 8.5 15410
mrwiki 57781 8.5 12535
vecwiki 23748 8.5 6565
knwiki 26679 8.5 4676
swwiki 58794 8.5 4166
anwiki 37270 8.5 3441
orwiki 15734 8.5 2104
bhwiki 7102 8.5 1868
zh_yuewiki 83593 9 17333
kkwiki 230591 9 14666
uzwiki 135629 9 7790
cebwiki 5378840 9 2978
tgwiki 100942 9 1926
eowiki 279486 9.5 20873
bswiki 82666 9.5 15327
mlwiki 70020 9.5 13154
lawiki 132745 9.5 8519
nnwiki 152455 9.5 6955
tewiki 70038 10 17618
ltwiki 199439 10 17479
aswiki 6734 10 6951
tlwiki 72043 10 5347
etwiki 208492 10.5 28644
afwiki 90635 10.5 13853
shwiki 451628 10.5 9586
mywiki 48767 10.5 5290
lvwiki 101446 11 13286
ckbwiki 26234 11 9130
ttwiki 89605 11 5137
arzwiki 463039 11.5 179321
hywiki 269598 12 57729
azwiki 158810 12 38740
glwiki 163664 12 23183
elwiki 177802 12.5 66828
mswiki 339121 12.5 27868
kawiki 136978 13 25684
skwiki 233263 13 23350
sqwiki 88501 13 10853
be_x_oldwiki 70083 13 7961
cywiki 130734 13 6486
scowiki 57108 13.5 5783
bgwiki 262016 14 44673
hrwiki 217717 14 26278
astwiki 107097 14 16710
thwiki 137526 15 58176
bewiki 190241 15 21483
dawiki 259037 15.5 31229
tawiki 133938 15.5 16676
hiwiki 141852 16.5 50048
bnwiki 88079 17.5 79484
fiwiki 484396 17.5 70931
srwiki 633849 17.5 62021
euwiki 356115 18.5 40484
urwiki 155983 19 15506
viwiki 1245372 20.5 143494
rowiki 408401 21 51774
nowiki 533468 21.5 74627
trwiki 351245 24 279516
idwiki 530593 25 106620
huwiki 469697 25 104727
cawiki 645863 25.5 98543
simplewiki 161360 25.5 63158
ptwiki 1030618 29 353946
hewiki 265017 29.5 282846
svwiki 3731172 30.5 129248
plwiki 1408471 33.5 256043
nlwiki 2009297 33.5 229962
ukwiki 1013285 34 229973
cswiki 453435 34 128027
arwiki 1042259 35.5 252587
zhwiki 1116711 37 518090
kowiki 493669 38 228542
fawiki 723024 39 230308
jawiki 1204380 43.5 515075
itwiki 1604413 46.5 598670
eswiki 1596408 51.5 987905
ruwiki 1620370 58.5 632978
dewiki 2429468 62.5 908090
frwiki 2211038 69 1071503
enwiki 6071412 319.5 5692081

What is the ideal number of bots?

Given that the relationship between non-bot edits and bot editors is not linear (Figure 1, Figure 2), we use percentiles to define the suggested number of bot editors that matches a wiki's editing activity level.

Figure 1: Correlation of non-bot edits and bot editors (Data timeframe: 2020-04-01~2020-04-30)
Figure 2: Correlation of non-bot edits and bot editors in low value area (Data timeframe: 2020-04-01~2020-04-30)
Table 1: Percentile of monthly nonbot edits, content pages, monthly bot editors (Data timeframe: 2020-04-01~2020-05-31)
Percentile | Nonbot_edits in April | Nonbot_edits in May | Content pages | Avg of monthly bot_editors
0.25 | 221 | 246 | 2869 | 2.5
0.5 | 962 | 983 | 10541 | 5
0.75 | 6250 | 6384 | 82666 | 8.5
1 | 5621913 | 5762249 | 6071412 | 319.5

* Metric definitions:
Nonbot edits: number of edits made in the given month by users who are not bots (by user group or by user name). Edits that have been reverted or deleted are included.
Content pages: the total number of existing (non-deleted) pages in content namespaces across all wikis.
Monthly bot editors: the number of bots (by group or by name) that have edited in the given month.
Average of monthly bot editors: the average of monthly bot editors over 2 consecutive months. Bot editors in emerging communities fluctuate month by month, so an average of 2 consecutive months gives a picture of the overall bot activity in that community. The data in Table 1 is the average of April 2020 and May 2020.
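As a rough illustration of these definitions, the sketch below averages two monthly counts and computes percentiles over a list of per-wiki values. The source does not say which percentile method was used, so the linear-interpolation rule here is an assumption, the function names are hypothetical, and `SAMPLE` is made-up toy data rather than figures from this research.

```python
def average_of_two_months(first_month, second_month):
    """Average of a metric over two consecutive months."""
    return (first_month + second_month) / 2


def percentile(values, p):
    """Percentile for p in [0, 1], using linear interpolation (an assumption)."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 <= p <= 1:
        raise ValueError("p must be between 0 and 1")
    ordered = sorted(values)
    position = (len(ordered) - 1) * p
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


# Toy values, not real data.
SAMPLE = [50, 10, 40, 20, 30]

for p in (0.25, 0.5, 0.75, 1):
    print(p, percentile(SAMPLE, p))
```

A wiki with no bot editors in one month and one in the next averages 0.5, which is why the bot editor column contains half values.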


Table 1 shows the 25th, 50th, 75th and 100th percentile of each metric. The percentiles of non-bot edits in two consecutive months (April 2020 and May 2020) are very consistent: the 25th percentile is 200+ edits, the 50th percentile is 900+ edits, and the 75th percentile is 6000+ edits. Our suggested ideal number of monthly bot editors for each percentile group is simplified as shown in Table 2. For a community with 6000+ monthly non-bot edits, the ideal number of monthly bot editors is 9. For a community with 900+ monthly non-bot edits, it is 5. For a community with 200+ monthly non-bot edits, it is 3.

Table 2: Suggested ideal number of bots
Percentile | Monthly nonbot edits | Content pages | Suggested ideal monthly bot editors
0.25 | 200 | 2800 | 3
0.5 | 900 | 10000 | 5
0.75 | 6000 | 80000 | 9
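Table 2 can be read as a simple lookup from monthly non-bot edits to a suggested number of bot editors. The sketch below is one possible reading of it: the function names are hypothetical, "200+" is assumed to mean "200 or more", and the gap calculation is not the rule that produced the "Y" flags in the table of wikis above, which the source does not spell out.

```python
# (minimum monthly non-bot edits, suggested monthly bot editors), from Table 2,
# ordered from the highest tier down.
TIERS = [(6000, 9), (900, 5), (200, 3)]


def suggested_bot_editors(nonbot_edits):
    """Return the Table 2 suggestion, or None below the lowest tier."""
    for min_edits, bots in TIERS:
        if nonbot_edits >= min_edits:
            return bots
    return None  # Table 2 gives no suggestion below 200 edits


def bot_editor_gap(nonbot_edits, bot_editors):
    """How many bot editors short of the suggestion a wiki is (0 if none)."""
    suggested = suggested_bot_editors(nonbot_edits)
    if suggested is None:
        return 0
    return max(0, suggested - bot_editors)


# satwiki: 2430 monthly non-bot edits, 1 bot editor on average.
print(suggested_bot_editors(2430), bot_editor_gap(2430, 1))
```

A positive gap is only a prompt for further checks, such as pageviews and talk page activity, before reaching out to a community.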

Explorations

The Technical Engagement team had a few questions about technical contributors in wiki communities. While the definition of technical contributors includes a variety of contributions in very different technical areas, this research focuses on contributors who write tooling to help with edits on a wiki. This tooling is normally referred to as "bots": automated scripts that run on our cloud environment and patrol Wikipedia, doing tasks such as removing vandalism by reverting edits.

Is the ratio of bots/editors high in emerging communities but low on established communities?

Comparing the bots/editors ratio in emerging and established communities, it seems that a high bots/editors ratio is not a strong indicator that a community is emerging. Established communities tend to have a low bots/editors ratio because they usually have a large number of human editors. However, an emerging community can also have a low bots/editors ratio when its number of bots is very small. For example, in Table 3, German Wikipedia (dewiki), an established community, has a bots/editors ratio of 0.09%. Hindi Wikipedia (hiwiki), an emerging community, has a bots/editors ratio of 0.10%. The two ratios are very close even though the wikis are at different stages of development.

Table 3: Number of editors, edits, pages, and bot/editor ratio per Wikimedia project* (Data timeframe: April 1, 2020 through April 30, 2020)
wiki_db | editors | bot_editors | bot_editor_ratio | edits | content_pages
dewiki | 80531 | 71 | 0.09% | 168897 | 2429468
hiwiki | 15449 | 15 | 0.10% | 20805 | 141852

* Metric definitions:
Editors: number of registered users who made edits on the given wiki in the given month.
Bot Editors: number of users who are bots by user group or by user name and made edits on the given wiki in the given month.
Bot Editor Ratio: bot editors/editors
Edits: number of edits made on the given wikis during the given month. Edits that have been reverted or deleted are included among total edits.
Content pages: number of existing (non-deleted) and non-redirected pages in content namespaces on the given wikis.
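The ratios in Table 3 follow directly from the definition above. A minimal sketch, using the editor and bot editor counts from Table 3 (the function name `bot_editor_ratio` is hypothetical):

```python
def bot_editor_ratio(bot_editors, editors):
    """Bot editors divided by registered editors for one wiki and month."""
    if editors == 0:
        raise ValueError("editors must be greater than zero")
    return bot_editors / editors


# Counts taken from Table 3 (April 2020).
dewiki_ratio = bot_editor_ratio(71, 80531)
hiwiki_ratio = bot_editor_ratio(15, 15449)

print(f"dewiki: {dewiki_ratio:.2%}")
print(f"hiwiki: {hiwiki_ratio:.2%}")
```

The two results differ by about one hundredth of a percentage point, although dewiki has roughly five times as many editors and bot editors as hiwiki.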


Figure 3 is a scatter diagram of bots and editors on all Wikipedia projects, and Figure 4 zooms in on the low value area. Dots in the upper-right corner represent established Wikipedia communities; dots in the lower-left corner represent emerging ones. Figure 3 and Figure 4 show that bots and editors do not have a linear relationship. The bot/editor ratio can be the same in the high value area and the low value area. Therefore, the bots/editors ratio is not a good indicator of whether a community is emerging or established, and it cannot serve as the metric for whether a community has enough tooling to thrive.

Figure 3: Correlation of editors and bot editors (Data timeframe: April 1, 2020 through April 30, 2020)
Figure 4: Correlation of editors and bot editors in low value area (Data timeframe: April 1, 2020 through April 30, 2020)

What are the bots doing? What are the types of their contributions?

The spreadsheet includes bot edits by namespace across all projects from 2020.01.01 to 2020.05.31. It shows that 65.8% of bot edits across all projects are made to content pages. The share of content edits among bot edits varies between 0.02% and 100% depending on the project. I listed a few interesting cases in Table 4. On English Wikipedia, 49.87% of bot edits are content edits. On Wikimedia Commons, 97% of bot edits are file edits. On Wiktionary and Wikidata, bots mainly focus on content editing.

Table 4: Bot edits by namespace* (Data timeframe: 2020.01.01 ~ 2020.05.31)
project | project_family | Category | Category talk | Content | File | File talk | Help | Help talk | MediaWiki | MediaWiki talk | Other | Project | Project talk | Talk | Template | Template talk | User | User talk | Grand Total | Content Edits%
en.wikipedia | wikipedia | 91833 | 16781 | 1962241 | 89944 | 991 | 50 | 399 | 37 | 313 | 38016 | 764770 | 15027 | 209610 | 45891 | 10989 | 496897 | 190703 | 3934492 | 49.87%
ar.wikipedia | wikipedia | 596209 | 80865 | 2619849 | 7552 | 7 | 77 | 5 | 0 | 4 | 5475 | 30403 | 179 | 93504 | 125069 | 22977 | 17220 | 1328785 | 4928180 | 53.16%
commons.wikimedia | commons | 331504 | 388 | 5021 | 20624623 | 1543 | 160 | 0 | 29 | 48 | 27676 | 91794 | 1001 | 143 | 6975 | 63 | 224111 | 34532 | 21349611 | 0.02%
ca.wiktionary | wiktionary | 0 | 0 | 40412 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 40412 | 100.00%
www.wikidata | wikidata | 5 | 0 | 54007654 | 0 | 0 | 4910 | 10 | 32 | 36 | 744939 | 264339 | 745 | 923 | 2107 | 7 | 91969 | 1038 | 55118714 | 97.98%

* Metric definition:
Bot Edits: number of edits made in the given month by users who are bots by user group or by user name. Edits that have been reverted or deleted are included.
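The "Content Edits%" column in Table 4 is the Content namespace count divided by the grand total. A small sketch that recomputes it from the table's own figures (the names `BOT_EDITS` and `content_edit_share` are hypothetical):

```python
# project -> (bot edits in the Content namespace, bot edits in all namespaces),
# copied from Table 4.
BOT_EDITS = {
    "en.wikipedia": (1962241, 3934492),
    "ar.wikipedia": (2619849, 4928180),
    "commons.wikimedia": (5021, 21349611),
    "ca.wiktionary": (40412, 40412),
    "www.wikidata": (54007654, 55118714),
}


def content_edit_share(content_edits, total_edits):
    """Share of bot edits that were made to content pages, as a percentage."""
    if total_edits == 0:
        raise ValueError("total_edits must be greater than zero")
    return 100 * content_edits / total_edits


shares = {
    project: round(content_edit_share(content, total), 2)
    for project, (content, total) in BOT_EDITS.items()
}
for project, share in shares.items():
    print(f"{project}: {share:.2f}%")
```

The same division with the File column (20624623 out of 21349611) gives the roughly 97% share of file edits quoted for Commons.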

What’s the metric to identify the community which needs more technical supportive work?

Given what bots do, a community with a high volume of edits, or with existing content that needs to be maintained, will likely need more bot support. In measurable terms, the candidate metrics are the number of monthly edits and the total number of content pages. Because the number of monthly edits is inflated by existing bots, I chose non-bot edits to reflect the amount of organic editing. I also observed that monthly non-bot edits are not correlated with total content pages in some communities. The outliers in Figure 5 represent communities that have a large number of content pages but a low level of monthly editing now.

Figure 5: Correlation of non-bot monthly edits and content pages (Data timeframe: 2020-04-01~2020-04-30 on all wikipedia projects)

* Metric definitions:
Non-bot edits: number of edits made by human users, who are not bots by user name or group, on the given wikis during the given month.
Content pages: number of existing (non-deleted) and non-redirected pages in content namespaces on the given wikis.

Take one of the outliers, newwiki (Newari Wikipedia). It has more than 60 thousand content pages, which makes it a medium-sized Wikipedia. But its history shows that the pages were mainly created by bots. The number of non-bot edits has never grown, and when bots are not active on newwiki, monthly edits stay flat at a low level. Should we provide more bot support to a community that does not have many organic editors? I have no answer yet, but this case led me to choose monthly non-bot edits as the major metric to measure the need for bots.

Figure 6: History of content edits, bot content edits, total content pages on newwiki
Figure 7: History of editors on newwiki
Figure 8: History of bot editors on newwiki

When does a wiki community need to start thinking about bots? How does the editor trend correlate with the growth of bot editing?

* Data timeframe: 2001~2020-05-31
* Metric definitions:
Total content edits: number of content edits made in the wiki during the given month.
Bot content edits: number of content edits made by bot users by group or by name in the wiki during the given month.
Total content pages: the cumulative total number of content pages created without being deleted by the end of the given month.
Editors: number of registered users who made edits in the given month in the given wiki.


We built a Superset dashboard to explore this data (WMF internal only): https://superset.wikimedia.org/r/263

We studied ruwiki (a medium-sized Wikipedia), rowiki (a small Wikipedia) and svwiki (a large Wikipedia).

On ruwiki, when bot editing became active (> 1k) in September 2004, the number of editors was 268.

Figure 9: History of content edits, bot content edits, total content pages on ruwiki
Figure 10: History of editors on ruwiki

On rowiki, when bot editing became active (> 1k) in July 2005, the number of editors was 134.

Figure 11: History of content edits, bot content edits, total content pages on rowiki
Figure 12: History of editors on rowiki

On svwiki, when bot editing became active (> 1k) in June 2005, the number of editors was 624.

Figure 13: History of content edits, bot content edits, total content pages on svwiki
Figure 14: History of editors on svwiki

Among the three wikis, only ruwiki has a stable monthly editing pattern (in terms of non-bot edits). Svwiki and rowiki still rely on bots for their edits. It seems that the growth of human editors is not correlated with the growth of bot editing. There is also no clear answer to the question of when it is best to introduce bot editing into a community, since wikis follow different growth trajectories for many reasons.
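The "became active" month quoted for each wiki can be found by scanning a monthly history for the first month that crosses the threshold. The sketch below assumes that "> 1k" refers to monthly bot content edits, which the source does not state explicitly. The function name `first_active_month` is hypothetical and the history is made-up data for an imaginary wiki, not figures from the dashboard.

```python
def first_active_month(history, threshold=1000):
    """Return (month, editors) for the first month whose bot content edits
    exceed the threshold, or None if the threshold is never crossed.

    history: iterable of (month, bot_content_edits, editors) tuples in
    chronological order.
    """
    for month, bot_content_edits, editors in history:
        if bot_content_edits > threshold:
            return month, editors
    return None


# Hypothetical monthly history for an imaginary wiki.
EXAMPLE_HISTORY = [
    ("2005-01", 40, 85),
    ("2005-02", 310, 97),
    ("2005-03", 1000, 110),
    ("2005-04", 2750, 126),
    ("2005-05", 900, 131),
]

print(first_active_month(EXAMPLE_HISTORY))
```

Comparing the editor count at that month across wikis, as done for ruwiki, rowiki and svwiki above, shows how much the starting point varies.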

Next Steps
