Research talk:Are the bots really fighting/Work log/2017-03-21
Tuesday, March 21, 2017
Staeiou
Updated work on the comment parser in this Jupyter notebook. This uses the heuristic that comments with wiki language codes in between punctuation indicate an interwiki link update, but it categorizes those as "interwiki link cleanup -- suspected". I think it might take peeking into content diffs to really be confident that those are actually interwiki link actions.
Comment parsing is a good way to find new, potentially interesting cases. I've found a few more and included them in the parser too. However, I still think I'm missing some, and I want to go and manually look for cases of bot-vs-bot reverts I remember from various BAG/ANI/etc threads and see how the diffs appear in the dataset.
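For readers who want to see the shape of that heuristic, here is a minimal sketch, not the notebook's actual code. The names `LANG_CODES`, `BORDER`, and `classify_comment` are made up for illustration, the code list is a small sample, and the set of bordering punctuation is an assumption.

```python
import re

# Hypothetical sample of language codes; the real parser presumably uses a fuller list.
LANG_CODES = {"en", "de", "fr", "it", "es", "ja"}

# Assumed set of punctuation that may border a language code.
BORDER = r"[\[\]:,;()]"

# Longest codes first so the alternation never prefers a shorter prefix.
_codes = "|".join(sorted(map(re.escape, LANG_CODES), key=len, reverse=True))

# A code counts only when punctuation sits on both sides (optional spaces allowed).
_PATTERN = re.compile(rf"(?<={BORDER})\s*(?:{_codes})\s*(?={BORDER})")


def classify_comment(comment):
    """Return the 'suspected' label if the comment looks like an interwiki update."""
    if _PATTERN.search(comment):
        return "interwiki link cleanup -- suspected"
    return None
```

A comment such as `robot Adding: [[de:Berlin]], [[fr:Berlin]]` gets the suspected label, while `Reverted edits by ExampleBot` gets `None`. The label stays "suspected" because the comment text alone cannot confirm what the diff changed.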
I ran the notebook with Halfak's updated bot2bot dataset based on this Quarry query that joins to get rev_comments. This gives us the following breakdown:
All namespaces
type | count | percent |
---|---|---|
interwiki link cleanup | 180293 | 37.54% |
fixing double redirect | 90013 | 18.74% |
AIV helperbot | 77390 | 16.11% |
interwiki link cleanup -- suspected | 58577 | 12.2% |
other w/ per justification | 19761 | 4.11% |
deleted revision | 16046 | 3.34% |
other | 10414 | 2.17% |
archiving | 8268 | 1.72% |
clearing sandbox | 5080 | 1.06% |
other w/ revert in comment | 3992 | 0.83% |
moving category | 3302 | 0.69% |
protection template cleanup | 2819 | 0.59% |
category redirect cleanup | 1517 | 0.32% |
orphan template cleanup | 1028 | 0.21% |
mathbot mathlist updates | 519 | 0.11% |
other redirect | 352 | 0.07% |
botfight: reverting CommonsDelinker | 318 | 0.07% |
botfight: 718bot vs ImageRemovalBot | 173 | 0.04% |
redirect tagging/sorting | 163 | 0.03% |
link syntax fixing | 111 | 0.02% |
botfight: infoboxneeded | 96 | 0.02% |
template cleanup | 68 | 0.01% |
template tagging | 24 | 0.0% |
commons image migration | 5 | 0.0% |
ns0 only
type | count | percent |
---|---|---|
interwiki link cleanup | 82244 | 38.3% |
fixing double redirect | 81907 | 38.14% |
interwiki link cleanup -- suspected | 36265 | 16.89% |
deleted revision | 3545 | 1.65% |
protection template cleanup | 2631 | 1.23% |
moving category | 1987 | 0.93% |
other | 1622 | 0.76% |
orphan template cleanup | 1020 | 0.48% |
category redirect cleanup | 977 | 0.45% |
other w/ revert in comment | 519 | 0.24% |
mathbot mathlist updates | 515 | 0.24% |
other w/ per justification | 480 | 0.22% |
botfight: reverting CommonsDelinker | 222 | 0.1% |
other redirect | 183 | 0.09% |
botfight: 718bot vs ImageRemovalBot | 170 | 0.08% |
redirect tagging/sorting | 163 | 0.08% |
botfight: infoboxneeded | 96 | 0.04% |
link syntax fixing | 85 | 0.04% |
template cleanup | 68 | 0.03% |
template tagging | 24 | 0.01% |
commons image migration | 3 | 0.0% |
clearing sandbox | 1 | 0.0% |
Staeiou (talk) 00:30, 21 March 2017 (UTC)
Update: using lang codes might have an issue
I just realized lang codes contain very common two-letter English words: it, or, an, is. This might make it problematic to use this as a heuristic. I also included commons, meta, and simple. There are probably a lot of false positives in that. I might need to narrow which punctuation counts as a valid bordering character, or just use this as a way to filter candidates that will be examined more closely with diffs. Also, maybe only call that function for bots approved for interwiki tasks. Staeiou (talk) 00:51, 21 March 2017 (UTC)
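One way those mitigations could be combined is sketched below. This is an illustration under assumptions, not the notebook's code. The bot allowlist `INTERWIKI_BOTS` and its one entry are hypothetical. Requiring the explicit `[[code:` link form for the ambiguous codes is my own assumption about how to narrow the bordering characters. A match only marks a candidate for a closer look at the diff.

```python
import re

# Codes named in the note as colliding with common English words.
AMBIGUOUS_CODES = {"it", "or", "an", "is"}
# Hypothetical sample of other codes, plus the extra prefixes mentioned in the note.
OTHER_CODES = {"de", "fr", "ja", "commons", "meta", "simple"}

# Hypothetical allowlist; in practice this would come from bot approval records.
INTERWIKI_BOTS = {"ExampleInterwikiBot"}


def _alternation(codes):
    # Longest first so a longer code is tried before any shorter one.
    return "|".join(sorted(map(re.escape, codes), key=len, reverse=True))


# Loose rule: assumed punctuation on both sides, optional spaces in between.
LOOSE = re.compile(
    rf"(?<=[\[\]:,;()])\s*(?:{_alternation(OTHER_CODES)})\s*(?=[\]:,;()])"
)
# Strict rule for ambiguous codes: only the explicit [[code: link form.
STRICT = re.compile(rf"\[\[(?:{_alternation(AMBIGUOUS_CODES)}):")


def is_interwiki_candidate(comment, bot_name, approved_bots=INTERWIKI_BOTS):
    """Flag a comment for closer inspection with diffs; not a final classification."""
    if bot_name not in approved_bots:
        return False
    return bool(LOOSE.search(comment) or STRICT.search(comment))
```

With this sketch, `Well, it, is fine` is no longer flagged, while `robot Adding: [[it:Roma]]` still is. The stricter rule would miss comments that list ambiguous codes without link brackets, so it trades some recall for fewer false positives.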