Grants talk:TPS/Ladsgroup/Wikimania/2015/Report
ORES and bad words
Hi Amir, and thank you for the detailed report and for everything you did at Wikimania!
I noticed that you were working on the list of bad words for Ukrainian (uk), but could you please explain how it works? I looked at the link (assuming the number is the probability of being a bad word) and noticed that the collection of Ukrainian words is very misleading. While some words (e.g. #302, #298, #293, #292, #290, #283 etc.) are really bad, there are many good Ukrainian words (e.g. #301 is "artist's", #279 is "pre-Mongolian", #275 is an administrative unit of Poland, #274 means "prognosist's") and a lot of non-Ukrainian words, both good and bad (e.g. #300 is Russian for "s*ck", #296 is Russian for "sh*t", #290 is Russian for "p*ssy", but #299 is Russian for "promote", #291 is Russian for "continue working", #287 is Russian for "search engine"; further down, #48 is Serbian for "hospitals" and #12 is Belarusian for "not needed"). It is a great idea to develop a tool like ORES, but could you please tell me where you got this list and how to fix it? Thanks -- NickK (talk) 13:15, 4 August 2015 (UTC)
- Hey, thank you. This list is based on words that are common in reverted edits but uncommon in non-reverted (good) edits. For most languages the result is good, but for some languages a very strict abuse filter makes the result worse than for others; the only languages with this problem are Persian, Ukrainian and Indonesian. But don't worry! Every list goes through human review before it is merged into Revscoring. We have already done that for several languages, including Hebrew and Dutch, and even languages with very good results still have some false positives (e.g. French had 6 false positives out of 250). Another thing: sometimes a word is tricky. We had "izan" for Dutch, which turned out to be "nazi" spelled backwards. Usernames of prominent contributors also came up in several reports, and that is still vandalism. Amir (talk) 14:14, 4 August 2015 (UTC)
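A minimal sketch of the idea described above (words common in reverted edits but uncommon in good edits), for illustration only. It is not the actual script used to build the list: the function name `candidate_bad_words`, the tokenizer, the `min_count` threshold and the crude +1 smoothing are all assumptions.

```python
import re
from collections import Counter

# Hypothetical tokenizer: lowercase runs of word characters (Unicode-aware).
TOKEN = re.compile(r"\w+")


def tokenize(text):
    return [t.lower() for t in TOKEN.findall(text)]


def candidate_bad_words(reverted_texts, kept_texts, min_count=2, top_n=10):
    """Rank words by how much more often they appear in reverted edits
    than in kept (non-reverted) edits. Purely illustrative."""
    reverted = Counter(w for t in reverted_texts for w in tokenize(t))
    kept = Counter(w for t in kept_texts for w in tokenize(t))
    reverted_total = sum(reverted.values()) or 1
    kept_total = sum(kept.values()) or 1

    scores = {}
    for word, count in reverted.items():
        if count < min_count:
            continue  # too rare in reverted edits to say anything
        # Crude +1 smoothing (an assumption) so that words absent from
        # kept edits do not cause a division by zero.
        rate_reverted = (count + 1) / (reverted_total + 1)
        rate_kept = (kept[word] + 1) / (kept_total + 1)
        scores[word] = rate_reverted / rate_kept

    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]


ranked = candidate_bad_words(
    ["foo bar foo", "foo baz bar"],   # text added by reverted edits
    ["bar baz qux", "bar bar"],       # text added by kept edits
)
```

A ranking like this knows nothing about meaning: any word that mostly shows up in reverted edits scores high, whatever the reason for the revert, which is why the human review step matters.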
- OK, I got the point. Firstly, in Ukrainian Wikipedia we have strong filters on obscene language (they are not public for obvious reasons, but I can send you the code), so most obscene words are already filtered out. Secondly, there is a problem with Russian: some users add text in Russian to Ukrainian Wikipedia, and it gets reverted unless it is a quote, which is why you got a lot of good Russian words like "promote" or "search engine". Thirdly, I can associate some words with problematic contributors (e.g. #12, Belarusian for "not needed", is due to a user who added long quotes without translations, and #275 is the result of an edit war over the correct translation of the name of this administrative unit). So I would be glad to know when you will publish these lists for human review. P.S. If there is a better page for discussing this, we can continue there -- NickK (talk) 15:33, 4 August 2015 (UTC)
- The steps are: first, review the list and split it into three parts: (1) words that are not okay to use anywhere, like curse words (e.g. "sh*t"); (2) words that are not okay to use in the article namespace but are okay in talk namespaces, like "Hey", "lol", "haha"; (3) words that are okay to use everywhere. Then the native speaker should put the list on this page, and after that we make a patch and add the words to revscoring. If you're familiar with GitHub, simply create another file like this and make a PR; we would be happy to review it :) There is also a Phabricator task you can subscribe to. Best, Amir (talk) 21:56, 4 August 2015 (UTC)
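The three-way split above could be represented roughly as follows. This is a hedged sketch, not revscoring's actual data format or API: `WordClass`, `is_flagged` and the `reviewed` mapping are hypothetical names. The only outside fact used is that MediaWiki's main (article) namespace has number 0.

```python
from enum import Enum


class WordClass(Enum):
    """Hypothetical labels a native-speaker reviewer assigns to each word."""
    BAD_EVERYWHERE = 1   # e.g. curse words
    INFORMAL = 2         # fine on talk pages, not in articles ("lol", "haha")
    OK = 3               # fine everywhere (false positives from the raw list)


ARTICLE_NAMESPACE = 0  # MediaWiki main/article namespace


def is_flagged(word, namespace, reviewed):
    """Return True if `word` should count against an edit made in `namespace`.

    `reviewed` maps lowercase words to a WordClass; words not in the
    mapping are treated as OK (an assumption of this sketch).
    """
    word_class = reviewed.get(word.lower(), WordClass.OK)
    if word_class is WordClass.BAD_EVERYWHERE:
        return True
    if word_class is WordClass.INFORMAL:
        return namespace == ARTICLE_NAMESPACE
    return False


reviewed = {
    "sh*t": WordClass.BAD_EVERYWHERE,
    "lol": WordClass.INFORMAL,
    "promote": WordClass.OK,
}
```

With this mapping, "lol" is flagged in an article (namespace 0) but not on a talk page (namespace 1), while a word reviewed as OK is never flagged.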