Research:Bot user

From Meta, a Wikimedia project coordination wiki

Identifying user accounts that automatically perform repetitive tasks is useful for analytics and research, since it makes it possible to focus on the actions of human editors.

Strategies

Bot flags

Many projects use a user group to track bot accounts and give them special permissions, e.g. user_groups.ug_group = "bot". Where Wikimedia communities apply this flag consistently, the user_groups table offers an efficient way to identify bot accounts.

Since the bot user group often grants special permissions, some projects remove accounts from the group when they become inactive or the community decides they should no longer be allowed (for example, MiszaBot I and Broadbot on the English Wikipedia). For this reason, it's also important to check whether accounts had a bot flag in the past, that is, whether user_former_groups.ufg_group = "bot".
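The two checks above can be combined into a single lookup. The following is a minimal sketch, not a production query: it uses an in-memory SQLite database as a stand-in for the wiki's MariaDB replica, the user IDs are made up, and the user ID columns ug_user and ufg_user are assumed here (only ug_group and ufg_group are named in the text above).

```python
import sqlite3


def ever_flagged_bot_ids(conn):
    """Return the IDs of users that hold, or once held, the "bot" group."""
    rows = conn.execute(
        """
        SELECT ug_user FROM user_groups WHERE ug_group = 'bot'
        UNION
        SELECT ufg_user FROM user_former_groups WHERE ufg_group = 'bot'
        """
    ).fetchall()
    return {row[0] for row in rows}


# Toy stand-in for the two tables; IDs and memberships are invented.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE user_groups (ug_user INTEGER, ug_group TEXT);
    CREATE TABLE user_former_groups (ufg_user INTEGER, ufg_group TEXT);
    INSERT INTO user_groups VALUES (1, 'bot'), (2, 'sysop');
    INSERT INTO user_former_groups VALUES (3, 'bot'), (2, 'rollbacker');
    """
)

print(sorted(ever_flagged_bot_ids(conn)))  # [1, 3]
```

User 1 is a current bot and user 3 a former one, so both are returned; user 2 never held the flag.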

There is also a global bot user right, which is stored in the centralauth database.

Username regex

Many (most?) bots use usernames containing the string "bot". Wikistats has historically identified as a bot any user whose name contains either of the following:

  • "bot" (in any case) followed by anything other than a letter, a digit, or an underscore, including the end of the string (that is, a word boundary)
  • "bot" (in any case) preceded and followed by spaces (represented in the dumps by underscores).

In Perl terms:

if (($user =~ /bot\b/i) || ($user =~ /_bot_/i))

Wikistats excludes three users with bot-like names who have contacted the maintainer to say they are not bots: Paucabot, Niabot, and Marbot. There are other false positives (like some users named "Talbot") but adding to this list is not a priority since the effect on the statistics is small. In MariaDB terms, this filter can be applied with the following code:

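where
    -- "\\b" in a SQL string literal yields the regex \b (word boundary);
    -- double the backslashes again if the query is embedded in another language's string literal
    convert(user_name using utf8) regexp "bot\\b" and
    user_name not in ("Paucabot", "Niabot", "Marbot")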

Only one regular expression is necessary because spaces are not converted to underscores in MariaDB, which means "bot" followed by a space is a word boundary. User names are stored in MariaDB as byte strings, so the regular expression would be case sensitive if they were not transformed into character strings using convert.
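For analyses done outside the database, the same heuristic can be sketched in Python. This is an illustrative port under stated assumptions rather than the Wikistats code itself: the names BOT_NAME, KNOWN_HUMANS and looks_like_bot are invented here, and both patterns are kept so that the function works on names with spaces as well as on dump-style names with underscores.

```python
import re

# "bot" at a word boundary, or "bot" between underscores (dump-style spaces).
BOT_NAME = re.compile(r"bot\b|_bot_", re.IGNORECASE)

# Users with bot-like names who are reported above not to be bots.
KNOWN_HUMANS = frozenset({"Paucabot", "Niabot", "Marbot"})


def looks_like_bot(user_name: str) -> bool:
    """Apply the username heuristic; false positives such as "Talbot" remain."""
    if user_name in KNOWN_HUMANS:
        return False
    return BOT_NAME.search(user_name) is not None


for name in ["MiszaBot I", "Some_bot_account", "Paucabot", "Talbot", "Botanist"]:
    print(name, looks_like_bot(name))
```

"Talbot" is still flagged, which illustrates the known false positives, while "Botanist" is not, because "bot" is followed by a letter there.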

Proposals for the future

For analytics and possibly attribution (e.g. see this thread), it would be useful to have a 100% reliable group identifying bots. The bot group does not serve this purpose: some bots stay out of it because they want to appear in RecentChanges, while some wikis give the bot group to humans making repetitive edits (e.g. dealing with a spam flood without clogging RecentChanges). One possibility is the following three groups:

  1. Group containing all bots, regardless of whether they want to appear in RecentChanges by default. This should not grant any rights.
  2. Group containing bots that do not want to appear in RecentChanges by default. Has the bot right.
  3. Group (often temporary) for humans that temporarily do not want to appear in RecentChanges by default. Has the bot right. This is sometimes called "flood". See meta:Flood flag and d:Wikidata:Flooders.

For simplicity, it would be preferable to standardize this across wikis. If that is not possible, we should document the variations and how to map them to a canonical editor class for analytics purposes.
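If the three groups were adopted, mapping an account to an editor class could look roughly like the sketch below. The group names "all-bots" and "flood" and the function editor_class are hypothetical placeholders chosen for illustration; the proposal does not fix any names.

```python
# Hypothetical group names for the three proposed groups.
ALL_BOTS = "all-bots"  # group 1: every bot, grants no rights
HIDDEN_BOT = "bot"     # group 2: bots hidden from RecentChanges by default
FLOOD = "flood"        # group 3: humans temporarily hidden from RecentChanges


def editor_class(groups):
    """Map an account's group memberships to "bot" or "human"."""
    groups = set(groups)
    if groups & {ALL_BOTS, HIDDEN_BOT}:
        return "bot"
    # Flood-flagged accounts are humans, so they fall through to "human".
    return "human"


print(editor_class(["all-bots", "bot"]))  # bot
print(editor_class(["sysop", "flood"]))   # human
```

Under this scheme the flood group never affects the classification, which is the point of separating it from the bot group.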

See also
