Research talk:Automated classification of draft quality/Work log/2016-09-27
Wednesday, September 28, 2016
Today I'm working to gather page creations (not deleted) that occurred in the last year. Unfortunately, there's no historical log of page creations. But I can filter for revisions that have no parent, i.e. those where rev_parent_id is 0.
SELECT rev_timestamp, page_title, rev_len, rev_user_text
FROM revision
INNER JOIN page ON
rev_page = page_id
WHERE
rev_timestamp BETWEEN "20150927" AND "20160927" AND
rev_parent_id = 0 AND
page_namespace = 0
LIMIT 10;
rev_timestamp | page_title | rev_len | rev_user_text |
---|---|---|---|
20150926000023 | General_Todorov | 37 | Ketiltrout |
20150926000028 | Scott_hoying | 982 | Lwp2004 |
20150926000319 | Parque_de_la_Bombilla_(Mexico_City) | 772 | Josedricoa |
20150926000435 | Mogilno_Falsification | 1591 | Tymek |
20150926000643 | Temple_of_Venus | 316 | LlywelynII |
20150926000727 | Motorslug | 2921 | Soul Crusher |
20150926000736 | Temple_of_Venus_(Baalbek) | 27 | LlywelynII |
20150926000840 | John_R._McDermott | 36 | CactusWriter |
20150926000940 | The_Hard_Easy | 60 | 23W |
20150926001001 | Conference_of_Secretaries_of_World_Christian_Communions | 1260 | 1549bcp |
OK. So, I'm thinking that we can get a sample of good pages this way.
Ultimately, I think we'll want a representative sample of pages that are:
- Not deleted
- Deleted for less concerning reasons (e.g. no assertion of importance)
- Deleted for immediately concerning reasons
  - Spam
  - Vandalism
  - Attack
  - Hoax
I think I'd like to lump the first two together, but first I'll need to sample each group separately. I want the following columns:
- page_title
- creation_rev_id
- creation_timestamp
- archived (did we find the page in the `archive` table?)
- creation_quality (OK, spam, vandalism, attack, hoax)
We should only query for pages created at least 30 days ago so that they will have had a chance to be deleted. See my work in R:Wikipedia article creation for justification of the 30-day threshold. Here's a query to get the good creations: https://quarry.wmflabs.org/query/12795 "English article creations that have survived at least 1 month"
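For reference, here is a rough sketch of the shape a surviving-creations query could take. It is not the text of the Quarry query linked above. It assumes the same `revision` and `page` tables as the earlier query, and the upper date bound is an illustrative literal chosen to be roughly 30 days before the end of the one-year window.

```sql
-- Sketch only: article creations that still exist (i.e. are not deleted),
-- returned with the columns listed above.
SELECT
  page_title,
  rev_id AS creation_rev_id,
  rev_timestamp AS creation_timestamp,
  FALSE AS archived,            -- found in `revision`, not in `archive`
  "OK" AS creation_quality      -- surviving pages are labeled OK
FROM revision
INNER JOIN page ON
  rev_page = page_id
WHERE
  rev_parent_id = 0 AND         -- first revision of the page
  page_namespace = 0 AND        -- articles only
  -- Upper bound is an assumed cutoff about 30 days before 20160927
  rev_timestamp BETWEEN "20150927" AND "20160828";
```

Hard-coding the cutoff keeps the sample reproducible; a relative cutoff would shift every time the query is re-run.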
Here's a query to get all of the deleted article creations in the same time period: https://quarry.wmflabs.org/query/12796 I set creation_quality to NULL there because I'll need to join this with the logging table later in order to get a deletion reason for labeling. --EpochFail (talk) 00:51, 28 September 2016 (UTC)
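As a rough sketch of how the deleted creations and the later join against the logging table might look (again, not the text of the Quarry query), the two statements below assume the `archive` columns `ar_title`, `ar_rev_id`, `ar_timestamp`, `ar_parent_id` and `ar_namespace`, and the `logging` columns `log_type`, `log_action`, `log_namespace`, `log_title`, `log_timestamp` and `log_comment` as they existed in the 2016 schema. The date bounds are illustrative literals.

```sql
-- Sketch 1: deleted article creations, same columns as the surviving set.
SELECT
  ar_title AS page_title,
  ar_rev_id AS creation_rev_id,
  ar_timestamp AS creation_timestamp,
  TRUE AS archived,             -- found in the `archive` table
  NULL AS creation_quality      -- unknown until a deletion reason is joined in
FROM archive
WHERE
  ar_parent_id = 0 AND          -- first revision of the deleted page
  ar_namespace = 0 AND
  ar_timestamp BETWEEN "20150927" AND "20160828";

-- Sketch 2: attach deletion log comments to those creations.
-- A title can be deleted more than once, so this may return several
-- log rows per creation; picking one is left to the labeling step.
SELECT
  ar_title AS page_title,
  ar_rev_id AS creation_rev_id,
  log_timestamp AS deletion_timestamp,
  log_comment AS deletion_reason
FROM archive
INNER JOIN logging ON
  log_namespace = ar_namespace AND
  log_title = ar_title
WHERE
  ar_parent_id = 0 AND
  ar_namespace = 0 AND
  ar_timestamp BETWEEN "20150927" AND "20160828" AND
  log_type = "delete" AND
  log_action = "delete" AND
  log_timestamp >= ar_timestamp; -- only deletions after the creation
```

The deletion reason is free text, so mapping it onto spam, vandalism, attack or hoax would still need a separate matching step.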