Research talk:Wikipedia article creation/Work log/Monday, November 25th
Add topicMonday, November 25th
[edit]Today, I'm going to try to replicate the change in the proportion of deleted pages for non-English wikis. The real problem is that the database servers for non-English Wikipedia do not allow for temporary table -- the kind of scratch space I need in order to move forward.
My plan is to extract exactly what I need to a local database server that I can use to perform the requisite joins. I've figured out how to extract page info as well as user stats, but I can't seem to figure out how to extract the revision that created the page so I can join with the user table and split newcomers from oldtimers.
Page status query
|
---|
SELECT
page_id,
page_namespace,
page_title,
revisions,
first_revision.
last_revision,
archived,
(
archived AND
UNIX_TIMESTAMP(last_revision) - UNIX_TIMESTAMP(first_revision) < 60*60*24*30
) AS archived_quickly,
creation.rev_id AS creating_rev_id,
creation.rev_timestamp AS creating_rev_timestamp,
user_id AS creator_id,
IFNULL(creatore.user_name, creation.rev_user_text) AS creator_name,
IFNULL(direct_create.log_action, indirect_create.log_action) AS creator_creation_action,
user_registration AS creator_registration
FROM (
SELECT
ar_page_id AS page_id,
ar_namespace AS page_namespace,
ar_title AS page_title,
COUNT(*) AS revisions,
MIN(ar_timestamp) AS first_revision,
MAX(ar_timestamp) AS last_revision,
True AS archived,
MIN(ar_rev_id) AS first_rev_id
FROM archive
WHERE ar_timestamp < "20131105000000"
UNION
SELECT
page_id,
page_namespace,
page_title,
COUNT(*) AS revisions,
MIN(rev_timestamp) AS first_edit,
MAX(rev_timestamp) AS last_edit,
False AS archived,
MIN(rev_id) AS first_rev_id
FROM revision
INNER JOIN page ON page_id = rev_page
WHERE rev_timestamp < "20131105000000"
GROUP BY 1,2,3
) as page
LEFT JOIN revision creation ON rev_id = first_rev_id
LEFT JOIN user creator ON rev_user = user_id
LEFT JOIN logging AS direct_create ON
direct_create.log_type = "newusers" AND
creator.user_id = direct_create.log_user AND
direct_create.log_action IN ("create", "autocreate", "newusers")
LEFT JOIN logging AS indirect_create ON
indirect_create.log_type = "newusers" AND
indirect_create.log_action IN ("byemail", "create2") AND
REPLACE(creator.user_name, " ", "_") = indirect_create.log_title;
|
Alright. This one is a monster. Here's what it is supposed to get:
(one row per page)
- page_id: Page identifier (if not archived)
- page_namespace: "
- page_title: "
- revisions: Number of revisions saved
- first_revision: Date of the first revision
- last_revision: Date of the last revision
- archived: True if in archive table, false otherwise
- archived_quickly: True if archived within 30 days of creation (judging by last rev_timestamp)
- creating_rev_id: rev_id of the creating revision
- creating_rev_timestamp: "
- creator_id: user_id of the creator (or NULL for anons)
- creator_name: user_name of the creator (or IP for anons)
- creator_action: log_action for relevant account creations
- creator_registration: registration date of creator (or NULL for anons)
Time to see how quickly I can get in trouble for unleashing this.
19:39, 25 November 2013 (UTC)