Research talk:Automated classification of draft quality/Work log/2016-09-26
Monday, September 26, 2016
I've finally got some time to look at this problem. My goal today is to run a few queries that will allow me to extract a random sample of article creations from a recent time period (this year?) with labels for spam, vandalism, or "other". I'll focus on English Wikipedia for now. I'll be using the deletion log to get the sample of bad pages. We'll need to figure out some page_id bounds for gathering a sample of good pages too.
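One rough way to get those page_id bounds (a sketch, not a settled approach) would be to look at the page-creating revisions in the sample window, assuming rev_parent_id = 0 marks the first revision of a page:

SELECT MIN(rev_page), MAX(rev_page)
FROM revision
WHERE
rev_timestamp BETWEEN "20150901" AND "20160901" AND
rev_parent_id = 0; -- first revision of each page

page_ids aren't strictly ordered by creation time, so bounds derived this way would only be approximate.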
First things first, how do we find the bad page creations?
Query: https://quarry.wmflabs.org/query/12780
Referencing: en:Wikipedia:Criteria_for_speedy_deletion
I think we're generally interested in:
- WP:CSD#G3 -- Pure vandalism and blatant hoaxes
- WP:CSD#G10 -- Pages that disparage, threaten or harass
- WP:CSD#G11 -- Unambiguous advertising
- WP:CSD#A11 -- Obviously invented
With this query, we get 52,810 results, which is a pretty good set of "positive" examples:
SELECT log_id, log_title, log_comment, log_namespace
FROM logging
WHERE
log_type = "delete" AND
log_action = "delete" AND -- page deletions only (not restores or revision deletions)
log_timestamp BETWEEN "20150901" AND "20160901" AND
log_comment LIKE "[[WP:CSD#%" AND -- cheap prefix filter before the regexp
log_comment REGEXP "WP:CSD#(G3|G10|G11|A11)\\|"; -- the bad-faith criteria above
So that seems to work nicely. Now we need a sample of articles that were deleted for innocuous reasons and articles that weren't deleted at all. --EpochFail (talk) 22:28, 26 September 2016 (UTC)
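For the innocuously deleted set, one possible sketch is to invert the regexp and take CSD deletions that don't match the bad-faith criteria:

SELECT log_id, log_title, log_comment, log_namespace
FROM logging
WHERE
log_type = "delete" AND
log_action = "delete" AND
log_timestamp BETWEEN "20150901" AND "20160901" AND
log_comment LIKE "[[WP:CSD#%" AND
log_comment NOT REGEXP "WP:CSD#(G3|G10|G11|A11)\\|";

And for articles that weren't deleted at all, once we have page_id bounds, something like the following could draw a rough random sample using the page_random column (the bounds here are placeholders, not real values):

SELECT page_id, page_title
FROM page
WHERE
page_namespace = 0 AND
page_id BETWEEN <min_page_id> AND <max_page_id> AND -- bounds still to be determined
page_random < 0.01; -- ~1% pseudo-random sample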
Oooh. I made a table of the deletion reasons we get in this set with this query: https://quarry.wmflabs.org/query/12782 (a sketch of how such a breakdown could be computed follows the table)
deletion_reason | COUNT(*)
---|---
attack | 3427
hoax | 2132
spam | 42498
vandalism | 10144
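Roughly, a breakdown like this could be produced by bucketing the deletion comments with a CASE expression. This is only a sketch: it assumes G10 maps to "attack", G11 to "spam", and that the hoax/vandalism split within G3 hinges on the word "hoax" appearing in the comment; the actual quarry query may differ.

SELECT
CASE
WHEN log_comment REGEXP "WP:CSD#G10\\|" THEN "attack"
WHEN log_comment REGEXP "WP:CSD#G11\\|" THEN "spam"
WHEN log_comment LIKE "%hoax%" THEN "hoax"
ELSE "vandalism"
END AS deletion_reason,
COUNT(*)
FROM logging
WHERE
log_type = "delete" AND
log_action = "delete" AND
log_timestamp BETWEEN "20150901" AND "20160901" AND
log_comment REGEXP "WP:CSD#(G3|G10|G11|A11)\\|"
GROUP BY deletion_reason;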