Research:Emerging Technical Communities
This page is currently a draft. More information pertaining to this may be available on the talk page.
This page tracks the research work done as part of the technical contributors emerging communities metric definition project.
Project Goal
The goal of this project is to identify emerging wiki communities that can benefit from more automation (bots/tools) to manage the growth of their content. To find meaningful metrics that would let us identify these emerging communities, we did quite a bit of data exploration. Readers interested in the data exploration journey should first read the #Explorations section below.
Based on our findings we provide a set of next steps and recommendations for the Technical Engagement team.
Recommendation
Base Assumption
In the data exploration stage we looked at different ways to dissect the data. After some trials we excluded the bots/editors ratio as an indicator of an emerging community. We learned that technical contributors mainly worked on content articles, and recognized that the content size of a wiki is not correlated with its editing activity level.
As a result of our exploration we settled on identifying "emerging communities that might benefit from more tooling" by looking mainly at two variables:
- The number of edits on that wiki that are made by non-bots, and
- The number of content pages that the wiki has.
We also look at the number of distinct bots that edit the wiki as a dependent variable. Our basic assumption is that, for a wiki to be healthy, automation is needed once the number of content pages exceeds a certain threshold.
How to identify a community that might be underserved by technical contributors (bot/tool builders)
We chose "monthly non-bot edits" as the major metric for measuring the need for automation/bots, and the "number of content pages" of a wiki as the secondary metric.
We first group wikis by their current number of "monthly bot editors", "monthly non-bot edits" and "content pages". In this classification we look for outliers: wikis with a large number of content pages and a large number of "manual" edits but few bots, or wikis with a large number of content pages but few edits and few bots overall.
Such an outlier might indicate a community that needs help developing tooling to keep up with the growth of its wiki. Before reaching out to the community in question, we need to look at other external markers, such as user pageviews for that wiki and its overall edit history. Both can be assessed via Wikistats. The "liveness" of talk pages is another useful marker for assessing whether there is a community behind the edits.
Table: Number of Pages, Bot Editors, Nonbot Edits by Project (data not shown)
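The two outlier patterns described above can be sketched as a small classifier. This is only an illustration: the cutoff values, the `WikiStats` record and the `outlier_kind` function are hypothetical names and numbers chosen for the example, not part of the analysis itself.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WikiStats:
    wiki_db: str
    content_pages: int
    monthly_nonbot_edits: int
    monthly_bot_editors: int


# Hypothetical cutoffs, chosen only to make the example concrete.
LARGE_CONTENT_PAGES = 80_000
HIGH_NONBOT_EDITS = 6_000
LOW_NONBOT_EDITS = 200
FEW_BOT_EDITORS = 3


def outlier_kind(w: WikiStats) -> Optional[str]:
    """Return which outlier pattern a wiki matches, or None if it matches neither."""
    # Both patterns require a large wiki with few bot editors.
    if w.content_pages < LARGE_CONTENT_PAGES:
        return None
    if w.monthly_bot_editors > FEW_BOT_EDITORS:
        return None
    if w.monthly_nonbot_edits >= HIGH_NONBOT_EDITS:
        # Many pages, many manual edits, few bots.
        return "many_manual_edits_few_bots"
    if w.monthly_nonbot_edits <= LOW_NONBOT_EDITS:
        # Many pages, but few edits and few bots overall.
        return "large_but_quiet"
    return None
```

A wiki flagged by such a check is only a candidate: pageviews, edit history and talk page activity still need to be reviewed before reaching out to the community.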
What is the ideal number of bots?
Given that the relationship between non-bot edits and bot editors is not linear (Figure 1, Figure 2), we use percentiles to define the suggested number of bot editors a wiki needs to match its editing activity level.
Percentile | Nonbot_edits in April | Nonbot_edits in May | Content page | Avg of monthly bot_editors |
---|---|---|---|---|
0.25 | 221 | 246 | 2869 | 2.5 |
0.5 | 962 | 983 | 10541 | 5 |
0.75 | 6250 | 6384 | 82666 | 8.5 |
1 | 5621913 | 5762249 | 6071412 | 319.5 |
* Metric definition:
Table 1 shows the 25th, 50th, 75th and 100th percentile of each metric. The percentiles of non-bot edits in two consecutive months (April 2020 and May 2020) are very consistent: the 25th percentile is 200+ edits, the 50th percentile is 900+ edits, and the 75th percentile is 6000+ edits. Our suggested ideal number of monthly bot editors for each percentile group is rounded from the averages in Table 1 and shown in Table 2. For a community with 6000+ monthly non-bot edits, the ideal number of monthly bot editors is 9. For a community with 900+ monthly non-bot edits, it is 5. For a community with 200+ monthly non-bot edits, it is 3.
Percentile | Monthly nonbot edits | Content pages | Suggested Ideal monthly bot editors |
---|---|---|---|
0.25 | 200 | 2800 | 3 |
0.5 | 900 | 10000 | 5 |
0.75 | 6000 | 80000 | 9 |
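Table 2 can be read as a simple lookup from monthly non-bot edits to a suggested number of bot editors. The sketch below encodes only the three thresholds from the table; the function names and the "gap" helper are illustrative assumptions, and wikis below 200 monthly non-bot edits get no suggestion because the table does not cover them.

```python
# (minimum monthly non-bot edits, suggested monthly bot editors), from Table 2,
# ordered from the highest tier down.
SUGGESTED_BOT_EDITORS = [(6000, 9), (900, 5), (200, 3)]


def suggested_bot_editors(monthly_nonbot_edits):
    """Return the suggested number of monthly bot editors, or None below the lowest tier."""
    for min_edits, bots in SUGGESTED_BOT_EDITORS:
        if monthly_nonbot_edits >= min_edits:
            return bots
    return None  # Table 2 gives no suggestion below 200 monthly non-bot edits.


def bot_editor_gap(monthly_nonbot_edits, monthly_bot_editors):
    """Hypothetical helper: how many bot editors a wiki is short of the suggestion."""
    suggested = suggested_bot_editors(monthly_nonbot_edits)
    if suggested is None:
        return 0
    return max(0, suggested - monthly_bot_editors)
```

For example, a wiki with 1,000 monthly non-bot edits and 2 monthly bot editors falls in the 900+ tier, so `bot_editor_gap(1000, 2)` evaluates to 3.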
Explorations
The Technical Engagement team had a few questions about technical contributors in wiki communities. While the definition of technical contributors covers a variety of contributions in very different technical areas, this research focuses on contributors who write tooling to help with edits on a wiki. This tooling is normally referred to as "bots": automated scripts that run on our cloud environment and patrol Wikipedia, performing tasks such as removing vandalism by reverting edits.
Is the ratio of bots/editors high in emerging communities but low on established communities?
Comparing the bots/editors ratio in emerging communities and established communities, it seems that a high bots/editors ratio is not a strong indicator that a community is emerging. Established communities tend to have a low bots/editors ratio because they usually have a large number of human editors. However, some emerging communities can also have a low bots/editors ratio when their number of bots is very small. For example, in Table 3, German Wikipedia (dewiki), an established community, has a bots/editors ratio of 0.09%. Hindi Wikipedia (hiwiki), an emerging community, has a bots/editors ratio of 0.10%. The two ratios are very close even though the wikis are at different stages of development.
wiki_db | editors | bot_editors | bot_editor_ratio | edits | content_pages |
---|---|---|---|---|---|
dewiki | 80531 | 71 | 0.09% | 168897 | 2429468 |
hiwiki | 15449 | 15 | 0.10% | 20805 | 141852 |
* Metrics definition:
Figure 3 is a scatter diagram of bots and editors on all Wikipedia projects. Figure 4 zooms in on the low-value area. Dots in the upper-right corner represent the established Wikipedia communities. Dots in the lower-left corner represent the emerging Wikipedia communities. Figure 3 and Figure 4 show that bots and editors do not have a linear relationship. The bots/editors ratio can be the same in the high-value area and the low-value area. Therefore, the bots/editors ratio is not a good indicator of whether a community is emerging or established, and it cannot serve as the metric for whether a community has enough tooling to thrive.
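The ratios in Table 3 can be reproduced directly from the editor and bot editor counts. This is a minimal sketch; the function name is an assumption made for the example, and the ratio is taken to be bot editors divided by editors, which matches the values shown in the table.

```python
def bot_editor_ratio(bot_editors, editors):
    """Share of bot editors relative to editors, as a fraction."""
    if editors <= 0:
        raise ValueError("editors must be positive")
    return bot_editors / editors


# (editors, bot_editors) as listed in Table 3.
table_3 = {
    "dewiki": (80531, 71),
    "hiwiki": (15449, 15),
}

for wiki_db, (editors, bot_editors) in table_3.items():
    print(f"{wiki_db}: {bot_editor_ratio(bot_editors, editors):.2%}")
# dewiki: 0.09%
# hiwiki: 0.10%
```

The two wikis differ by a factor of about five in both editors and bot editors, so the ratio ends up nearly identical and carries no information about their stage of development.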
What are the bots doing? What are the types of their contributions?
The spreadsheet includes bot edits by namespace across all projects from 2020-01-01 to 2020-05-31. It shows that 65.8% of bot edits across all projects are made to content pages. The share of bot edits made to content pages varies between 0.02% and 100% from project to project. I listed a few interesting cases in the table below. On English Wikipedia, 49.87% of bot edits are content edits. On Wikimedia Commons, 97% of bot edits are file edits. On Wiktionary and Wikidata, bots mainly focus on content editing.
project | project_family | Category | Category talk | Content | File | File talk | Help | Help talk | MediaWiki | MediaWiki talk | Other | Project | Project talk | Talk | Template | Template talk | User | User talk | Grand Total | Content Edits% |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
en.wikipedia | wikipedia | 91833 | 16781 | 1962241 | 89944 | 991 | 50 | 399 | 37 | 313 | 38016 | 764770 | 15027 | 209610 | 45891 | 10989 | 496897 | 190703 | 3934492 | 49.87% |
ar.wikipedia | wikipedia | 596209 | 80865 | 2619849 | 7552 | 7 | 77 | 5 | 0 | 4 | 5475 | 30403 | 179 | 93504 | 125069 | 22977 | 17220 | 1328785 | 4928180 | 53.16% |
commons.wikimedia | commons | 331504 | 388 | 5021 | 20624623 | 1543 | 160 | 0 | 29 | 48 | 27676 | 91794 | 1001 | 143 | 6975 | 63 | 224111 | 34532 | 21349611 | 0.02% |
ca.wiktionary | wiktionary | 0 | 0 | 40412 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 40412 | 100.00% |
www.wikidata | wikidata | 5 | 0 | 54007654 | 0 | 0 | 4910 | 10 | 32 | 36 | 744939 | 264339 | 745 | 923 | 2107 | 7 | 91969 | 1038 | 55118714 | 97.98% |
* Metric definition:
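The "Content Edits%" column in the table above is the number of bot edits in the content namespace divided by the grand total of bot edits for the project. The sketch below recomputes it from the table's own figures; the function and variable names are assumptions made for the example.

```python
def content_edit_share(content_edits, total_edits):
    """Fraction of a project's bot edits that were made to content pages."""
    if total_edits <= 0:
        raise ValueError("total_edits must be positive")
    return content_edits / total_edits


# (Content, Grand Total) bot edits per project, copied from the table above.
bot_edits = {
    "en.wikipedia": (1962241, 3934492),
    "ar.wikipedia": (2619849, 4928180),
    "commons.wikimedia": (5021, 21349611),
    "ca.wiktionary": (40412, 40412),
    "www.wikidata": (54007654, 55118714),
}

for project, (content, total) in bot_edits.items():
    print(f"{project}: {content_edit_share(content, total):.2%}")
# en.wikipedia: 49.87%
# ar.wikipedia: 53.16%
# commons.wikimedia: 0.02%
# ca.wiktionary: 100.00%
# www.wikidata: 97.98%
```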
What is the metric to identify a community that needs more technical support work?
Given the function of bots, a community with a high volume of edits, or with a large amount of existing content to maintain, will likely need more bot support. Mapped to measurable metrics, the candidates are the number of monthly edits and the total number of content pages. Because the number of monthly edits is inflated by existing bots, I chose non-bot edits to reflect the amount of organic editing. I also observed that monthly non-bot edits are not correlated with total content pages in some communities. Those outliers in Figure 3 represent communities that have a large number of total content pages but currently have a low level of monthly editing.
* Metrics definition:
Take a look at one of the outliers, newwiki (Newari Wikipedia). It has more than 60 thousand content pages, which makes it a medium-sized Wikipedia. But its history shows that the pages were mainly created by bots, and the number of non-bot edits has never grown. When bots are not active on newwiki, monthly edits stay flat at a low level. Should we provide more bot support to a community that does not have many organic editors? I have no answer yet, but this case led me to choose monthly non-bot edits as the major metric for measuring the need for bots.
When does a wiki community need to start thinking about bots? How does the editor trend correlate with the growth of bot editing?
* Data timeframe: 2001 to 2020-05-31
We built a Superset dashboard to explore this data (WMF internal only): https://superset.wikimedia.org/r/263
We studied ruwiki (a medium-sized Wikipedia), rowiki (a small Wikipedia) and svwiki (a large Wikipedia).
On ruwiki, when bot editing became active (> 1k) in September 2004, the number of editors was 268.
On rowiki, when bot editing became active (> 1k) in July 2005, the number of editors was 134.
On svwiki, when bot editing became active (> 1k) in June 2005, the number of editors was 624.
Among the three wikis, only ruwiki has a stable monthly editing pattern in terms of non-bot edits. Svwiki and rowiki still rely on bots to create edits. It seems that the growth of human editors is not correlated with the growth of bot editing. There is also no clear answer to the question of when it is best to introduce bot editing into a community. Wikis have different growth trajectories for many reasons.
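The "became active" dates above can be found by scanning a wiki's monthly bot edit counts for the first month above a threshold. The sketch below assumes that "> 1k" means more than 1,000 bot edits in a month, which the text does not state explicitly. The function name is hypothetical and the sample series is made up for illustration; it is not real data from any wiki.

```python
def first_active_month(monthly_bot_edits, threshold=1000):
    """Return the earliest 'YYYY-MM' month whose bot edit count exceeds the threshold.

    monthly_bot_edits maps 'YYYY-MM' strings to bot edit counts. Zero-padded
    'YYYY-MM' strings sort chronologically, so a plain sort is enough.
    Returns None if no month exceeds the threshold.
    """
    for month in sorted(monthly_bot_edits):
        if monthly_bot_edits[month] > threshold:
            return month
    return None


# Made-up series for a hypothetical wiki, for illustration only.
example_series = {
    "2010-01": 120,
    "2010-02": 640,
    "2010-03": 1850,
    "2010-04": 2400,
}
print(first_active_month(example_series))  # 2010-03
```

The number of editors in the returned month can then be looked up in the same monthly data to compare wikis, as was done for ruwiki, rowiki and svwiki.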