Research talk:Wikipedia article creation/Work log/Wednesday, November 13th
Add topicWednesday, November 13th
[edit]I'm sitting down to work on this late today due to meetings and email. Here's my plan
- Clean up my understand of page curation action
- Plot a timeseries of page creations, deletions and curation actions
- Study the AfC workflow and devise a data extraction plan
Page curation actions
[edit]> select distinct log_action from logging where log_type = "pagetriage-curation" and log_timestamp >= "201311"; +------------+ | log_action | +------------+ | reviewed | | tag | | delete | | unreviewed | +------------+ 4 rows in set (0.06 sec)
"reviewed" and "unreviewed" are straightfoward. It appears that we get a "tag" action when a user adds a set of cleanup templates to the page. But does "delete" mean delete?
Let's get a page curation "delete" action
> select log_page, log_namespace, log_title from logging where log_type="pagetriage-curation" and log_action = "delete" limit 1; +----------+---------------+------------+ | log_page | log_namespace | log_title | +----------+---------------+------------+ | 36943599 | 0 | Ryan_Water | +----------+---------------+------------+ 1 row in set (0.12 sec)
OK. Now let's see what type of delete actions have affected this page.
> select log_timestamp, log_action, log_type, log_page, log_namespace, log_title from logging where log_action = "delete" and log_namespace = 0 and log_title = "Ryan_Water"; +----------------+------------+---------------------+----------+---------------+------------+ | log_timestamp | log_action | log_type | log_page | log_namespace | log_title | +----------------+------------+---------------------+----------+---------------+------------+ | 20120907071509 | delete | pagetriage-curation | 36943599 | 0 | Ryan_Water | | 20120907071509 | delete | pagetriage-deletion | 36943599 | 0 | Ryan_Water | | 20120907071645 | delete | delete | 0 | 0 | Ryan_Water | +----------------+------------+---------------------+----------+---------------+------------+ 3 rows in set (0.04 sec)
Ok... So it looks like "pagetriage-curation" with "delete" isn't really a delete. I bet it adds a Prod or CSD tag.
> select ar_rev_id, ar_timestamp, ar_comment from archive where ar_title = "Ryan_Water" and ar_namespace = 0; +-----------+----------------+-----------------------------------------------------------------------------+ | ar_rev_id | ar_timestamp | ar_comment | +-----------+----------------+-----------------------------------------------------------------------------+ | 511186784 | 20120907071335 | [[WP:AES|←]]Created page with 'Ryan water is your GOD' | | 511186883 | 20120907071508 | Nominated page for deletion using [[Wikipedia:Page Curation|Page Curation]] | +-----------+----------------+-----------------------------------------------------------------------------+ 2 rows in set (0.24 sec)
Deletion event lacks page_id
[edit]OK. So upon review, it looks like the actual deletion event does not store the page_id. This breaks down a *huge* set of assumptions. I'm going to have go back and revisit my main queries.
This leaves me with a couple of options.
- generate the deletion timeseries based on which pages are currently in the
archive
table- This sucks because you can't tell when the pages were deleted
- generate raw counts of creations, deletions and restorations
- This sucks because pages can be deleted and restored several times
I think I'll do both.
The following should get me the number of pages created per day with a count of the number of pages that are currently archived.
SELECT LEFT(creation.rev_timestamp, 8) AS date, page_namespace, COUNT(*) AS pages, SUM(archived) AS archived FROM halfak.nov13_page AS page INNER JOIN halfak.nov13_creation AS creation USING (page_id) GROUP BY 1,2;
Yikes. I just went to run this and saw that there are zero non-deleted pages in my table. It turns out that the second part of my page gathering query -- the part that gets pages that haven't been deleted -- failed to run. I just kicked that off again and sequenced it with a query to regenerate the creations table too.
Articles for Creation workflow
[edit]While I'm waiting for those queries to finish, I pinged TheOriginalSoni to come chat.
- Draft submissions
-
- All pages in en:Category:Draft_AfC_submissions
- Time create == Timestamp of first revision
- All pages in en:Category:Draft_AfC_submissions
- Currently pending submissions
-
- All pages in en:Category:Pending_AfC_submissions
- Time submitted == ???
- All pages in en:Category:Pending_AfC_submissions
- Accepted submissions
-
- All ns=0 pages corresponding to ns=1 pages in en:Category:Accepted_AfC_submissions
- Time accepted == Timestamp of first log_type="move" and log_action="move" log event where page was placed in ns=0
- All ns=0 pages corresponding to ns=1 pages in en:Category:Accepted_AfC_submissions
- Decline submissions
-
- All pages in subcategories of en:Category:Declined_AfC_submissions (note, will have to enumerate)
- Time declined == ???
- All deleted pages with ns=5 (Wikipedia_talk) and page_title prefixed by "Articles_for_Creation/"
- Time declined == Timestamp of first log_type="delete" and log_action="delete" log event where page with exact title was deleted.
- All pages in subcategories of en:Category:Declined_AfC_submissions (note, will have to enumerate)
This approach misses:
- Accepted submissions
-
- All pages where the Template containing en:Category:Accepted_AfC_submissions was removed.
- Not sure how often this might happen, but there's no good way to look for the category historically.
- All pages where the Template containing en:Category:Accepted_AfC_submissions was removed.
- Declined submissions
-
- User sandbox pages that have been deleted won't have "Articles_for_Creation".
- These types of pages should be an ignorable minority.
- User sandbox pages that have been deleted won't have "Articles_for_Creation".