Jump to content

Research talk:Wikipedia article creation/Work log/Wednesday, November 13th

Add topic
From Meta, a Wikimedia project coordination wiki

Wednesday, November 13th

[edit]

I'm sitting down to work on this late today due to meetings and email. Here's my plan

  1. Clean up my understand of page curation action
  2. Plot a timeseries of page creations, deletions and curation actions
  3. Study the AfC workflow and devise a data extraction plan

Page curation actions

[edit]
> select distinct log_action from logging where log_type = "pagetriage-curation" and log_timestamp >= "201311";
+------------+
| log_action |
+------------+
| reviewed   |
| tag        |
| delete     |
| unreviewed |
+------------+
4 rows in set (0.06 sec)

"reviewed" and "unreviewed" are straightfoward. It appears that we get a "tag" action when a user adds a set of cleanup templates to the page. But does "delete" mean delete?

Let's get a page curation "delete" action

> select log_page, log_namespace, log_title from logging where log_type="pagetriage-curation" and log_action = "delete" limit 1;
+----------+---------------+------------+
| log_page | log_namespace | log_title  |
+----------+---------------+------------+
| 36943599 |             0 | Ryan_Water |
+----------+---------------+------------+
1 row in set (0.12 sec)


OK. Now let's see what type of delete actions have affected this page.

> select log_timestamp, log_action, log_type, log_page, log_namespace, log_title from logging where log_action = "delete" and log_namespace = 0 and log_title = "Ryan_Water";
+----------------+------------+---------------------+----------+---------------+------------+
| log_timestamp  | log_action | log_type            | log_page | log_namespace | log_title  |
+----------------+------------+---------------------+----------+---------------+------------+
| 20120907071509 | delete     | pagetriage-curation | 36943599 |             0 | Ryan_Water |
| 20120907071509 | delete     | pagetriage-deletion | 36943599 |             0 | Ryan_Water |
| 20120907071645 | delete     | delete              |        0 |             0 | Ryan_Water |
+----------------+------------+---------------------+----------+---------------+------------+
3 rows in set (0.04 sec)

Ok... So it looks like "pagetriage-curation" with "delete" isn't really a delete. I bet it adds a Prod or CSD tag.

> select ar_rev_id, ar_timestamp, ar_comment from archive where ar_title = "Ryan_Water" and ar_namespace = 0;
+-----------+----------------+-----------------------------------------------------------------------------+
| ar_rev_id | ar_timestamp   | ar_comment                                                                  |
+-----------+----------------+-----------------------------------------------------------------------------+
| 511186784 | 20120907071335 | [[WP:AES|←]]Created page with 'Ryan water is your GOD'                      |
| 511186883 | 20120907071508 | Nominated page for deletion using [[Wikipedia:Page Curation|Page Curation]] |
+-----------+----------------+-----------------------------------------------------------------------------+
2 rows in set (0.24 sec)

Deletion event lacks page_id

[edit]

OK. So upon review, it looks like the actual deletion event does not store the page_id. This breaks down a *huge* set of assumptions. I'm going to have go back and revisit my main queries.

This leaves me with a couple of options.

  1. generate the deletion timeseries based on which pages are currently in the archive table
    • This sucks because you can't tell when the pages were deleted
  2. generate raw counts of creations, deletions and restorations
    • This sucks because pages can be deleted and restored several times

I think I'll do both.

The following should get me the number of pages created per day with a count of the number of pages that are currently archived.

SELECT
    LEFT(creation.rev_timestamp, 8) AS date,
    page_namespace,
    COUNT(*) AS pages,
    SUM(archived) AS archived
FROM halfak.nov13_page AS page
INNER JOIN halfak.nov13_creation AS creation USING (page_id)
GROUP BY 1,2;

Yikes. I just went to run this and saw that there are zero non-deleted pages in my table. It turns out that the second part of my page gathering query -- the part that gets pages that haven't been deleted -- failed to run. I just kicked that off again and sequenced it with a query to regenerate the creations table too.


Articles for Creation workflow

[edit]

While I'm waiting for those queries to finish, I pinged TheOriginalSoni to come chat.

Draft submissions
Currently pending submissions
Accepted submissions
  • All ns=0 pages corresponding to ns=1 pages in en:Category:Accepted_AfC_submissions
    • Time accepted == Timestamp of first log_type="move" and log_action="move" log event where page was placed in ns=0
Decline submissions
  • All pages in subcategories of en:Category:Declined_AfC_submissions (note, will have to enumerate)
    • Time declined == ???
  • All deleted pages with ns=5 (Wikipedia_talk) and page_title prefixed by "Articles_for_Creation/"
    • Time declined == Timestamp of first log_type="delete" and log_action="delete" log event where page with exact title was deleted.

This approach misses:

Accepted submissions
  • All pages where the Template containing en:Category:Accepted_AfC_submissions was removed.
    • Not sure how often this might happen, but there's no good way to look for the category historically.
Declined submissions
  • User sandbox pages that have been deleted won't have "Articles_for_Creation".
    • These types of pages should be an ignorable minority.