Community Wishlist Survey 2017/Miscellaneous/Overhaul spam-blacklist

Overhaul spam-blacklist

Problem: The current blacklist system is archaic; it does not allow for levels of blacklisting and is confusing to editors. A main problem is that the spam blacklist is indiscriminate of namespace (a frequently recurring comment is that it should be possible to discuss a link in talk namespaces, though not to use it in content namespaces). The blacklist is also a black-and-white choice: blocking additions only by non-autoconfirmed editors, or allowing additions only by admins, is not possible. Giving warnings is not possible either (on en.wikipedia we implemented XLinkBot, which reverts and warns; giving a warning to IPs and 'new' editors that a certain link is in violation of policies/guidelines would be a less bitey solution).

Who would benefit: The community at large

Proposed solution: Basically, replace the current mw:Extension:SpamBlacklist with a new extension based on mw:Extension:AbuseFilter, by taking out the 'conditions' parsing from the AbuseFilter and replacing it with parsing only of regexes that match added external links (technically, the current AbuseFilter is capable of doing what would be needed, except that in this form it is extremely heavyweight to use for the number of regexes that are on the blacklists). Expansions could be added in the form of whitelisting fields, namespace selectors, etc.
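As an illustration of why regex-only matching can stay lightweight, here is a minimal Python sketch (not MediaWiki code). It assumes the blacklist is a list of regex fragments, one per rule, and that the links added by an edit are already known; the function names and the URL prefix are hypothetical.

```python
import re

def compile_blacklist(rules):
    """Join all rule fragments into one alternation so each link is scanned once."""
    if not rules:
        return None
    # Assumed prefix: scheme plus any host characters before the rule fragment.
    return re.compile(
        r"https?://[a-z0-9.\-]*(?:" + "|".join(rules) + ")", re.IGNORECASE
    )

def blocked_links(added_links, pattern):
    """Return the added links that match the combined blacklist pattern."""
    if pattern is None:
        return []
    return [url for url in added_links if pattern.search(url)]

pattern = compile_blacklist([r"\bexample-spam\.com\b", r"\bcheap-pills\.example\b"])
print(blocked_links(
    ["https://www.example-spam.com/buy", "https://en.wikipedia.org/wiki/Spam"],
    pattern,
))
```

The point of the sketch is that no rule language has to be interpreted: the cost per edit is one regex scan per added link, regardless of how many rules are on the list.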

Expanded solution

Take the current AbuseFilter, rename it to SpamFilter, and take out all the code that interprets the rules ('conditions').
Make two fields to replace the 'conditions' field:
- one text field for regexes that block added external links (the blacklist). Can contain many rules (one on each line, like the current spam-blacklist).
- one text field for regexes that override the block (a whitelist overriding this blacklist field; that is generally simpler and cleaner than writing a complex regex, and not everybody is a specialist on regexes).
Add namespace choice (checkboxes like in search, so one can choose not to blacklist something in one particular namespace, with the addition of 'all', 'content namespaces only' and 'talk namespaces only').
- Some links are fine in discussions but should not be used in mainspace; others are a total no-no
- Some image links are fine in the file-namespace to tell where it came from, but not needed in mainspace
Add user status choice (checkboxes for the different roles, or like the page-protection levels)
- Disallow IPs and new users from using a certain link (e.g. to stop spammers from creating socks, while leaving it free to most users).
Leave all the other options:
- Discussion field for evidence (or better, a talk-page like function)
- Enabled/disabled/deleted - when a rule is not needed, turn it off; when it is obsolete, delete it
- 'Flag the edit in the edit filter log' - maybe nice to be able to turn it off, to get rid of the real rubbish that doesn't need to be logged
- Rate limiting - catch editors that start spamming an otherwise reasonably good link
- Warn - could be a replacement for en:User:XLinkBot
- Prevent the action - as is the current blacklist/whitelist function
- Revoke autoconfirmed - make sure that spammers are caught and checked
- Tagging - for combining certain rules to be checked by RC patrollers.
- I would consider adding a button to auto-block editors on certain typical spambot-domains (a function currently taken by one of Anomie's bots on en.wikipedia).
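The fields listed above can be sketched as a small data model. This is only an illustration in Python under stated assumptions: the class name `SpamRule`, its field names and the `check_edit` helper are all hypothetical, and namespaces are given as MediaWiki namespace numbers (0 = main, 1 = Talk, 6 = File).

```python
import re
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class SpamRule:
    blacklist: list                                   # regexes that block an added link
    whitelist: list = field(default_factory=list)     # regexes that override the block
    namespaces: Optional[set] = None                  # None means all namespaces
    exempt_groups: set = field(default_factory=set)   # user groups the rule skips
    action: str = "prevent"                           # e.g. "warn", "prevent", "tag"

    def matches(self, url, namespace, user_groups):
        if self.namespaces is not None and namespace not in self.namespaces:
            return False
        if self.exempt_groups & set(user_groups):
            return False
        if any(re.search(p, url, re.IGNORECASE) for p in self.whitelist):
            return False
        return any(re.search(p, url, re.IGNORECASE) for p in self.blacklist)

def check_edit(rules, added_links, namespace, user_groups):
    """Return (url, action) for each added link hit by the first matching rule."""
    hits = []
    for url in added_links:
        for rule in rules:
            if rule.matches(url, namespace, user_groups):
                hits.append((url, rule.action))
                break
    return hits

# Hypothetical rule: warn new users who add the link in mainspace,
# but leave talk pages, autoconfirmed users and the /about page alone.
rule = SpamRule(
    blacklist=[r"example\.com"],
    whitelist=[r"example\.com/about"],
    namespaces={0},
    exempt_groups={"autoconfirmed"},
    action="warn",
)
print(check_edit([rule], ["http://example.com/offer"], 0, ["*"]))
```

Each rule remains a plain list of regexes plus a few cheap set checks, which is the sense in which this would be lighter than interpreting full AbuseFilter conditions.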

This should overall be much more lightweight than the current AbuseFilter (all it does is regex-testing as the spam-blacklist does, only it has to cycle through maybe thousands of filters). One could consider expanding it to have rules blocked or enabled on only certain pages (for heavily abused links that actually should only be used on their own subject page). Another consideration would be to have a 'custom reply' field, telling the editor who gets blocked by the filter why the edit was blocked.

Possible expanded features:

block or whitelist links matching regexes on specific pages (disallow linking throughout except on the subject page)
block or whitelist links matching regexes when added by a specific user/IP/IP-range (disallow specific users from using a domain)
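A rough Python sketch of how these two scoping features and the 'custom reply' idea could combine. The rule layout (a dict with `pattern`, `allowed_pages`, `blocked_users` and `message` keys) is an assumption made for illustration only.

```python
import ipaddress
import re

def user_in_scope(user, scope):
    """True if `user` (a username or an IP address) is listed in `scope`.

    `scope` holds usernames, IP addresses or CIDR ranges (assumed format).
    """
    for entry in scope:
        if entry == user:
            return True
        try:
            network = ipaddress.ip_network(entry, strict=False)
            if ipaddress.ip_address(user) in network:
                return True
        except ValueError:
            continue  # entry is not a range, or user is not an IP address
    return False

def check_link(url, page_title, user, rule):
    """Return the rule's custom message if the link is blocked, else None."""
    if not re.search(rule["pattern"], url, re.IGNORECASE):
        return None
    if page_title in rule.get("allowed_pages", ()):
        return None  # e.g. the article about the site itself
    blocked_users = rule.get("blocked_users")
    if blocked_users is not None and not user_in_scope(user, blocked_users):
        return None  # the rule only targets specific users or ranges
    return rule.get("message", "This link is blacklisted.")

# Hypothetical rule: the link may only be used on its own subject page.
rule = {
    "pattern": r"example\.org",
    "allowed_pages": {"Example"},
    "message": "This link may only be used on the article 'Example'.",
}
print(check_link("http://example.org/", "Another article", "SomeUser", rule))
```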

More comments:

Phabricator tickets: task T6459 (where I proposed this earlier)

Proposer: Dirk Beetstra ^{T C} (en: U, T) 11:35, 14 November 2017 (UTC)

Translations: none yet

Discussion

I agree, the size of the current blacklists is difficult to work with; I would be blacklisting a lot more spam otherwise. A split of the current blacklists is also desired:

I still want to see a single, centralized, publicly available, machine readable spam blacklist for all the spammers, bots, black hat SEOs and other lowlifes so that they can be penalized by Google and other search engines. This list must continue to be exported to prevent spam on other websites. Autoblocking is also most useful here.
The same goes for URL shorteners and redirects -- this list would also be useful elsewhere. This is one example where the ability to hand out customized error messages (e.g. "hey, you added a URL shortener; use the original URL instead") is useful.

The remaining domains might belong on a private list with all the options described above.
Please consider integrating the extension into core MediaWiki; it is already bundled with the installer. MER-C (talk) 11:57, 14 November 2017 (UTC)
- Do note that there are a lot of domains on the blacklist which are not there because of 'lowlifes': quite a number of pornographic sites are blacklisted because of uncontrollable abuse, not because they were spammed, let alone by site owners or their SEOs. URL shorteners are likewise blocked because of their nature and abuse, not because they are spam themselves. In those cases I actually agree with complaints that these sites are penalized for being on the blacklists. I do agree that a full list of those domains that are due to the SEO/spammers/bots and other lowlifes should be publicly visible (note: COIBot and LiWa3 collect all the blacklists in off-wiki files for referencing purposes; it would be rather easy to publish those collective records on-wiki as public information). --Dirk Beetstra ^{T C} (en: U, T) 12:12, 14 November 2017 (UTC)
Another suggestion: one needs to have the option to match against norm(added_lines) instead, to deal with continued spamming of blacklisted links. I've seen forum spam that needs this solution; we need an equivalent here as well. MER-C (talk) 12:28, 14 November 2017 (UTC)
- Check, but I think that that type of parsing is (partially?) in the current blacklist. I have seen XLinkBot-evasion by using hex-codes (which I subsequently coded into the bots). --Dirk Beetstra ^{T C} (en: U, T) 12:31, 14 November 2017 (UTC)
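To illustrate the hex-code evasion mentioned here, a small Python sketch of normalizing a link before matching. This is a simplified stand-in, not AbuseFilter's norm() and not XLinkBot's actual code; the function name `normalize_link` is hypothetical.

```python
import re
from urllib.parse import unquote

def normalize_link(url):
    """Undo percent-encoding (also when applied repeatedly) and lowercase the link."""
    previous = None
    while previous != url:
        previous = url
        url = unquote(url)
    return url.lower()

rule = re.compile(r"example\.com")
evasive = "http://EX%61MPLE.com/page"

print(bool(rule.search(evasive)))                  # raw link slips past the rule
print(bool(rule.search(normalize_link(evasive))))  # normalized link is caught
```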
@Beetstra: For the sake of clarity: do you want to replace the AbuseFilter extension, or do you want to add a new extension based on AbuseFilter? --Vachovec1 (talk) 21:20, 14 November 2017 (UTC)
- This proposes to replace mw:Extension:SpamBlacklist with this functionality. MER-C (talk) 03:03, 15 November 2017 (UTC)
- @Vachovec1: I want to add a new extension based on AbuseFilter (that seems to me the most logical start, as the functionality in the AbuseFilter is quite appropriate, but too heavy for this), to replace the current spam-blacklist. --Dirk Beetstra ^{T C} (en: U, T) 05:22, 15 November 2017 (UTC)
  - OK. Then I would propose to start the section Proposed solution with something like: "Replace the mw:Extension:SpamBlacklist with a new extension based on mw:Extension:AbuseFilter.", to make it crystal clear. --Vachovec1 (talk) 10:52, 15 November 2017 (UTC)
    - Done, language may need some tweaking though. Thanks for the suggestion. --Dirk Beetstra ^{T C} (en: U, T) 12:07, 15 November 2017 (UTC)

My issue with this (as with supposed “spam-fighting” in general) is that it causes way too much collateral damage, both to users and to content. Many useful sites are blacklisted purely because a user is banned, and if a user gets globally banned the link 🔗 gets globally blacklisted and removed from any Wikimedia property, even if it was used as a source 100% of the time. Now let's imagine that a year or so later someone wants to add content using that same link (which is now called a “spamlink”): this user will be indefinitely banned simply for sourcing content. I think 🤔 that having unsourced content is a larger risk to Wikimedia projects than alleged “spam” has ever been. This is especially worrisome for mobile users (who will inevitably become the largest userbase): when you attempt to save an edit, it doesn't even tell you why your edit won't save but simply says “error”, so a user might attempt to save it again and then get blocked for “spamming”. Abuse filters currently don't function 100% accurately, and having editors leave the project forever simply because they attempted to use “the wrong 👎🏻” reference is bonkers. Sent 📩 from my Microsoft Lumia 950 XL with Microsoft Windows 10 Mobile 📱. --Donald Trung (Talk 🤳🏻) (My global lock 😒🌏🔒) (My global unlock 😄🌏🔓) 10:15, 15 November 2017 (UTC)

Also, after a link has been blacklisted, someone might attempt to translate a page and get blocked. The potential for collateral damage is very high; how would this "feature" keep collateral damage to a minimum? --Donald Trung (Talk 🤳🏻) (My global lock 😒🌏🔒) (My global unlock 😄🌏🔓) 10:15, 15 November 2017 (UTC)

@Donald Trung: that is not going to change. Actually, this suggestion gives more freedom in how to blacklist and whitelist material. The current system is black-and-white; this gives many shades of grey to the blacklisting system. In other words, your comments are related to the current system.

Regarding the second part of your comment: yes, that is the intended use of the system. If a link is spammed to one page, then translating that page does not make it a good link on the translation (and this situation could actually also be avoided in the new system). --Dirk Beetstra ^{T C} (en: U, T) 10:39, 15 November 2017 (UTC)

The blacklist currently prevents us from adding a link to a site from the article about that site. This is irrational. Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 14:03, 15 November 2017 (UTC)
- @Pigsonthewing: What do you mean? Do I have an unclear sentence? If it is what I think, then I would like per-article exceptions (though that is a less important feature of it). --Dirk Beetstra ^{T C} (en: U, T) 14:29, 15 November 2017 (UTC)
- Ah, I think I get it: you are describing a shortcoming of the current system. That is indeed one of the problems, though there are cases where we do not want to allow it (e.g. malware sites), or where the link is blacklisted more broadly (we blacklist all of .onion, which is then indeed not linkable on .onion, but also not on subject X whose official website is a .onion). But the obvious cases are there indeed. I would like to have the possibility to blanket-whitelist for specific cases, like <subject>.com on <subject>, allowing full (primary) referencing on that single page. It is now sometimes silly that we have to allow an /about link to a site on the subject's wiki page to avoid nullifying the blacklist regex, or a whole set of specific whitelistings to allow sourcing on their own page. For heavily abused sites, I would like to allow whitelisting only for a very specific target ('you can only use this link on <subject> and nowhere else'). --Dirk Beetstra ^{T C} (en: U, T) 14:35, 15 November 2017 (UTC)

Or just add an option to AbuseFilter to compare against a regexp list that's on a wikipage. (Would require some thought in that we might want to expose the matching rule in the error message and logs, but otherwise easy.)

More generally, it would be nice if we could standardize on AbuseFilter instead of having five or six different anti-abuse systems with fractured UX and capabilities. That's a bit beyond CommTech's scope though. --Tgr (WMF) (talk) 23:54, 18 November 2017 (UTC)
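A minimal Python sketch of the idea of reading a regexp list from a wiki page and exposing the matching rule. It assumes one regex per line with `#` starting a comment (so a literal `#` inside a regex is not supported here); `parse_rule_list` and `first_match` are hypothetical names, not AbuseFilter functions.

```python
import re

def parse_rule_list(wikitext):
    """Parse a one-regex-per-line page into (line number, rule, compiled pattern)."""
    rules = []
    for number, line in enumerate(wikitext.splitlines(), start=1):
        rule = line.split("#", 1)[0].strip()
        if not rule:
            continue
        try:
            rules.append((number, rule, re.compile(rule, re.IGNORECASE)))
        except re.error:
            continue  # skip an invalid regex instead of breaking the whole list
    return rules

def first_match(rules, url):
    """Return (line number, rule) of the first rule matching the URL, or None."""
    for number, rule, pattern in rules:
        if pattern.search(url):
            return number, rule
    return None

page = "# example list\nexample\\.com\n(unclosed\nspam\\.test  # forum spam\n"
print(first_match(parse_rule_list(page), "http://spam.test/viagra"))
```

Keeping the rules separate makes it possible to name the matching rule in an error message or log entry, at the cost of one regex test per rule rather than one combined scan.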

No, User:Tgr (WMF), using the current AbuseFilter for this is going to massively overload the servers: it will still interpret the whole rule, and we would probably have hundreds if not thousands of separate filters for this. It also would not allow for whitelisting (unless, again, you write a full rule with even more overload), namespace exclusion (unless ..), or user-level exclusion (unless ..).

Making the AbuseFilter more modular may be an idea .. please read my suggestions above as a detailed request for capabilities. I am not familiar enough with the code of the AbuseFilter to see how far this would need to go. --Dirk Beetstra ^{T C} (en: U, T) 11:00, 20 November 2017 (UTC)

Voting

Support per my comment above. Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 20:47, 27 November 2017 (UTC)
Support MER-C (talk) 01:54, 28 November 2017 (UTC)
Support --Liuxinyu970226 (talk) 13:12, 28 November 2017 (UTC)
Support Sadads (talk) 13:43, 28 November 2017 (UTC)
Support Thomas Obermair 4 (talk) 21:57, 28 November 2017 (UTC)
Support Darylgolden (talk) 14:24, 29 November 2017 (UTC)
Support MGChecker (talk) 22:14, 29 November 2017 (UTC)
Support --Dirk Beetstra ^{T C} (en: U, T) 08:01, 30 November 2017 (UTC)
Support Jo-Jo Eumerus (talk, contributions) 11:25, 2 December 2017 (UTC)
Support Galobtter (talk) 12:48, 3 December 2017 (UTC)
Support enL3X1 ¡‹delayed reaction›¡ 16:10, 3 December 2017 (UTC)
Support This will fix multiple problems at once. -- SMcCandlish ☺ ☏ ¢ ≽^ʌⱷ҅_ᴥⱷ^ʌ≼ 07:37, 4 December 2017 (UTC)
Support Doc James (talk · contribs · email) 02:32, 5 December 2017 (UTC)
Support Blacklist support needs some attention to make sure the system scales well and allows discussing individual links, etc. Also, in the past I had issues with pages that had public-domain images downloaded from a website which was later added to a blacklist. As a result, it was impossible to add or remove categories on the file without tripping the blacklist, even for an admin. That issue might have been fixed, but we need a flexible system that can distinguish new text from old text, and in which an objection can be overruled if needed. --Jarekt (talk) 14:34, 7 December 2017 (UTC)
@Jarekt: I guess what you mention has been resolved (pages with blacklisted links are editable, as long as one does not add the link again). What you describe is however one of the examples I had in mind - there are certain links which are fine on talkpages or on file-description pages, but not in mainspace (though those cases are limited). --Dirk Beetstra ^{T C} (en: U, T) 12:46, 10 December 2017 (UTC)
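The behaviour described here (a page stays editable as long as the blacklisted link is not added again) amounts to checking only the links that are new in an edit. A simplified Python sketch, assuming links are extracted with a plain URL regex rather than by MediaWiki's parser; `URL_RE` and `added_links` are hypothetical names.

```python
import re

# Simplified URL pattern; MediaWiki's own link extraction is more involved.
URL_RE = re.compile(r'https?://[^\s\[\]<>"|}]+', re.IGNORECASE)

def added_links(old_text, new_text):
    """Return links present in the new revision but not in the old one."""
    old = set(URL_RE.findall(old_text))
    return [url for url in URL_RE.findall(new_text) if url not in old]

old = "Source: http://blocked.example/img.jpg"
new = old + "\n[[Category:Maps]]"
print(added_links(old, new))  # adding a category introduces no new links
```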
Support Ahm masum (talk) 21:27, 7 December 2017 (UTC)
Support X:: black ::X (talk) 10:42, 10 December 2017 (UTC)
Support -- Luchesar • T/C 13:51, 11 December 2017 (UTC)
Support -- NickK (talk) 17:01, 11 December 2017 (UTC)