Research talk:AfC processes and productivity/Work log/2014-04-05
Add topicSaturday, April 5th
[edit]Today, my goal is to get a data processing workflow together that will produce a datasets to work with.
- AfC status change
- * page_id
- * page_namespace
- * page_title
- * status
- * timestamp
- * rev_id
- * duration
- * old_status
I have two primary datasources for getting text: the dumps and the API. Given the speed with which I can process dumps, my plan is to process a recent dump first and then to try to request the rest of the data (deleted pages) from the API.
I've already gathered all "AfC pages" from the database with the following SQL:
CREATE TABLE halfak.afc_page_20140331
SELECT
page_id,
page_namespace,
page_title,
archived,
identified_by
FROM (
(
SELECT
page_id,
page_namespace,
page_title,
FALSE AS archived,
"namespace and title" AS identified_by
FROM page
WHERE page_namespace = 5
AND page_title LIKE "Articles_for_creation/%"
)
UNION
(
SELECT
ar_page_id,
ar_namespace,
ar_title,
TRUE AS archived,
"namespace and title (archive)" AS identified_by
FROM archive
WHERE ar_namespace = 5
AND ar_title LIKE "Articles_for_creation/%"
GROUP BY 1,2,3
)
UNION
(
SELECT
main_page.page_id,
main_page.page_namespace,
main_page.page_title,
FALSE AS archived,
"categories" AS identified_by
FROM page talk_page
INNER JOIN page main_page ON
main_page.page_namespace = 0 AND
main_page.page_title = talk_page.page_title
INNER JOIN categorylinks ON
cl_from = talk_page.page_id AND
cl_to IN (
"Accepted AfC submissions",
"Declined_AfC_submissions",
"Pending_AfC_submissions",
"Draft_AfC_submissions"
)
WHERE talk_page.page_namespace = 1
GROUP BY 1,2,3
)
) AS group_set
GROUP BY 1,2,3,4;
--EpochFail (talk) 16:36, 5 April 2014 (UTC)
Since I'm going to have to process main article pages and talk page independently and there is no guarantee which one I'll see first in the dumps, I left the duration computation for a second pass over the dataset. For now, I'm just extracting the following fields:
- page_id
- page_namespace
- page_title
- rev_id
- timestamp
- status
- source (either "id match" or "talk page")
I pushed the updated code up to the repo here: https://github.com/halfak/Articles-for-Creation --EpochFail (talk) 18:31, 5 April 2014 (UTC)