Research talk:Revision scoring as a service/2014
Portuguese stemmer
For future reference: I opened a few issues on NLTK about the Portuguese Snowball stemmer, which is used by Revision-Scoring. Helder 21:03, 4 October 2014 (UTC)
Badword lists
If you know of any other lists of badwords, please add them to the subsections below. Helder 12:38, 7 October 2014 (UTC)
English
- Revision-Scoring/.../english.py
- w:en:User:Lupin/badwords
- w:en:WP:Huggle/Config#Prediction
- wikt:en:Category:English vulgarities
- shutterstock/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words
- en:User:DeltaQuad/UAA/Blacklist
- Offensive/Profane Word List from Luis von Ahn's Research Group
Portuguese
- w:User:Salebot/Config (best viewed with a script)
- w:pt:WP:Software/Anti-vandal tool/badwords
- w:pt:User:Alchimista/Expressões.css
- w:pt:WP:Huggle/Config#Previsão
- w:pt:WP:Projetos/AntiVandalismo/Expressões problemáticas
- wikt:pt:Categoria:Obscenidade (Português)
- Abuse filters:
- shutterstock/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words
Turkish
- w:tr:Vikipedi:Huggle/Config#Prediction
- shutterstock/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words
- tr:Kullanıcı:Manco Capac/badwords
Spanish
Structure
I strongly support Option 2: Models as external services. The collected data can have many different uses, probably more than we can imagine now, and even for the same use different developers can take different approaches. For example, machine learning can be implemented with different learning algorithms. Danilo.mac talk 14:21, 8 December 2014 (UTC)
- We can also make a mixed service, with a public database and an API to query raw data, and a standard score processing service. Danilo.mac talk 14:40, 8 December 2014 (UTC)
- How would we integrate the models that already work? Also, where does each interface begin and end in such a relationship? My concern is that those models (e.g. Huggle, ClueBot NG, etc.) were not designed as reusable tools, which implies work on both sides at the beginning: a common interface between the revision service and those models. It's an interesting option, Danilo; I am just raising what I think are the next questions. To me, a good approach could be: a) provide a common interface that any model can connect to, and then b) include a standard score service, as you said, Danilo. --Jonas AGX (talk) 18:37, 8 December 2014 (UTC)
- I think the application could have an API for raw data, for example "?list=page&..." to get the data collected for a page, "?list=user&..." to get the evaluations made by a user, "?list=all&wiki=..." to list all evaluations for a wiki, etc., and an API for processed data, for example "?estimate=revision&wiki=Xwiki&rev=12345" to get the estimated probability that a revision is valid or invalid (vandalism) and that it was made in good or bad faith. The response could be something like "{rev: 12345, valid: 0.03854, gfaith: 0.78432}". Similarly, "?estimate=user&wiki=Xwiki&user=userid" could return an estimate of how damaging and how good-faith a user is. External applications could use either the processed data or the raw data, depending on what the application does. Danilo.mac talk 20:21, 8 December 2014 (UTC)
- Hey guys. I think that building our own service is the best way to get started, but that doesn't mean that we need to hide the revision hand-coding data from other people. I agree that it could be very useful for other projects. I think that we should publish it openly.
- Also, we can use abstraction and interfaces to make the implementation of new "scorer" types easy. I have a proposal for that on Trello. Once He7d3r and ToAruShiroiNeko have had a chance to review it for craziness, I'll flesh it out on the wiki. --Halfak (WMF) (talk) 23:30, 8 December 2014 (UTC)
- Halfak, it's good to know that everybody agrees on an open database. As for the code design, I think it depends on the algorithm we use. Bayesian classification, for example, works with a list of probabilities generated from a list of params of the scored revisions (e.g. user is IP, number of words added, article age, etc.). In that case we could keep the params in a database and retrieve the probabilities with an SQL query, and the code could classify a revision using only a dict of probabilities and the list of params of that revision. Another idea, which I'm not sure is a good one, is to use Twisted to deal with the large number and different types of requests the application will receive. I have only used Twisted in an IRC bot, but it might also be a good option for handling API requests. Danilo.mac talk 16:16, 10 December 2014 (UTC)
┌─────────────────────────────────┘
Danilo.mac, I think that a Bayesian classifier will be trained in much the same way as any other ML classifier. I don't think that storing ML configs in a DB is worthwhile, since models tend not to fit the structure imposed by relational DBs. We'll get more flexibility if we let different models serialize themselves as "model files". This is the strategy that I used in wikiclass and it seems to work nicely. See [1] and the model file I produced for classifying pages on enwiki.
Re. Twisted, I think that is an interesting idea, but I don't want to over-complicate things. I think a simple Flask app will serve our purposes, but it seems that we can run any WSGI app within a Twisted server, so we might be imagining the same thing. --Halfak (WMF) (talk) 16:48, 10 December 2014 (UTC)
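For illustration only, here is a minimal sketch of the "model file" idea, assuming a model needs to persist nothing more than a dict of learned parameters. `ToyModel`, `to_file`, `weights` and `threshold` are hypothetical names chosen for this example; this is not the wikiclass or revscores implementation.

```python
import io
import pickle


class ToyModel:
    """Hypothetical stand-in for a trained model (not the revscores API)."""

    def __init__(self, weights, threshold=0.5):
        self.weights = dict(weights)  # feature name -> learned weight
        self.threshold = threshold

    def to_file(self, f):
        # Write plain data only, so the file imposes no relational schema.
        pickle.dump({"weights": self.weights, "threshold": self.threshold}, f)

    @classmethod
    def from_file(cls, f):
        # Rebuild the model from the state written by to_file().
        state = pickle.load(f)
        return cls(state["weights"], state["threshold"])


# Round trip through an in-memory "model file".
buffer = io.BytesIO()
ToyModel({"user_is_anon": 1.2, "words_added": -0.1}).to_file(buffer)
buffer.seek(0)
restored = ToyModel.from_file(buffer)
```

Each model type decides what goes into its own file, which is the flexibility that a fixed database schema would not give.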
- We talked on IRC, but just for the record: I agree that a database is not a good fit because it is not flexible; it is better to keep the data in TSV and pickle files. Danilo.mac talk 00:50, 13 December 2014 (UTC)
- For convenience: IRC log from 10 December 2014 (after 16:54:47). Helder 10:05, 13 December 2014 (UTC)
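As a rough sketch of the processed-data API proposed earlier in this thread, the snippet below routes an "?estimate=revision&..." query to a JSON response. Everything here is an assumption made for illustration: `handle_query` and `fake_estimate` are hypothetical names, and the returned numbers are simply the example values from the proposal, not real estimates.

```python
import json
from urllib.parse import parse_qs


def fake_estimate(wiki, rev_id):
    # Placeholder scorer: echoes the example values from the proposal.
    return {"rev": rev_id, "valid": 0.03854, "gfaith": 0.78432}


def handle_query(query_string):
    """Route an '?estimate=revision&wiki=...&rev=...' query to a JSON string."""
    parsed = parse_qs(query_string.lstrip("?"))
    params = {key: values[0] for key, values in parsed.items()}
    if params.get("estimate") == "revision":
        result = fake_estimate(params["wiki"], int(params["rev"]))
        return json.dumps(result)
    # Raw-data queries ("?list=...") are left out of this sketch.
    return json.dumps({"error": "unsupported query"})


response = handle_query("?estimate=revision&wiki=Xwiki&rev=12345")
```

A real service would put this routing behind a web framework and replace `fake_estimate` with a call to a trained scorer.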
Progress report 2014-12-15
Hey folks,
He7d3r, Danilo.mac and I made some substantial progress over the weekend.
- We settled on a design for the abstract scoring system [2]
- We fleshed out the design in code [3]
- We implemented a simple LinearSVC scorer and model (see [4])
- We also demonstrated that the model/scorer pair works on a very simple test [5] (see [6])
As I post this, Danilo.mac is working on extracting features for a large set of real revisions. We'll be able to use this to test a useful classifier.
Next, I hope to address issues we have with our feature extractor structure. Stay tuned. --EpochFail (talk) 23:16, 15 December 2014 (UTC)
- Hey @EpochFail: did you save a copy of that draft of a script to "give me a list of edits which are likely vandalism but which were not reverted yet"? Helder 00:28, 16 December 2014 (UTC)
- I did. It's a little pseudo-code-ish. See below --EpochFail (talk) 00:33, 16 December 2014 (UTC)
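from datetime import datetime, timedelta, timezone
from itertools import groupby

from mw import api
from mw.lib import reverts
from revscores import APIExtractor
from revscores.scorers import LinearSVC

# Load a trained model and pair it with an API-backed feature extractor.
model = LinearSVC.MODEL.from_file(open("ptwiki.model", 'rb'))
session = api.Session("https://pt.wikipedia.org/w/api.php")
extractor = APIExtractor(session)
scorer = LinearSVC(extractor, model)

# Query window: the last week. Passing ISO 8601 strings is an assumption
# about what the query accepts.
now = datetime.now(timezone.utc)
one_week_ago = now - timedelta(days=7)
revisions = session.revisions.query(after=one_week_ago.strftime("%Y-%m-%dT%H:%M:%SZ"),
                                    before=now.strftime("%Y-%m-%dT%H:%M:%SZ"))

# groupby() only groups adjacent items, so revisions must arrive ordered by page.
page_revisions = groupby(revisions, key=lambda r: r['pageid'])

for page_id, page_revs in page_revisions:
    detector = reverts.Detector()  # one revert detector per page history
    for rev in page_revs:
        revert = detector.process(rev['sha1'], rev)
        if revert is None:  # no revert happened
            score = scorer.score(rev['revid'])
            if score > .5:
                print(rev['pagetitle'])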
Progress report: 2014-12-27
Hey folks,
We'll start doing these progress reports on Fridays. Even so, I'm a day late because of holiday celebrations. In the last week and a half, we:
- Merged some changes to make testing classifiers easier: [7].
- Improved the structure of the feature extractor so that it is easier to build on: [8]
- Regretfully, there are some limitations on what you can do with Python decorators and pickling, so we needed to make some other changes to make sure that models could be serialized. [9]
- Danilo and Helder have been hard at work testing classification models against ptwiki data, but we're still debugging some issues. See [10] and [11] for recent progress.
- We started some discussions about revision scoring on azwiki: az:Vikipediya:Kənd_meydanı
- We put together an IPython notebook demonstrating how a new scorer can be trained and used. [12].
There are likely more bits that I have missed. I invite He7d3r and とある白い猫 to check. :) Happy holidays! --EpochFail (talk) 17:35, 27 December 2014 (UTC)
- Danilo and I read the paper Multilingual Vandalism Detection using Language-Independent & Ex Post Facto Evidence this week, and I also read Cross-Language Learning from Bots and Users to detect Vandalism on Wikipedia.
- I also tested the requirements for running the current code on Linux Mint 17.
- Helder 20:03, 27 December 2014 (UTC)