Research talk:Newcomer task suggestions/Work log/2014-09-30
Add topicTuesday, September 30th
[edit]Today, I'm generating data to address the following question:
How do the types of edits made in each condition differ?
- Compare proportion of main namespace edits between conditions.
- Compare the rate of section edits between conditions (parse comment).
- Compare the # of bytes changed between conditions (rev_len - parent.rev_len).
- Compare the length of pages at at the time edit between conditions (parent.rev_len).
Here are the queries I came up with:
experimental_user_revision
|
---|
SELECT
DATABASE() AS wiki,
user_id,
revision.rev_id,
revision.rev_comment,
page_namespace,
CAST(revision.rev_len AS INT) -
CAST(IFNULL(parent.rev_len, 0) AS INT) AS bytes_changed,
IFNULL(parent.rev_len, 0) AS previous_bytes
FROM staging.tr_experimental_user
INNER JOIN revision FORCE INDEX (user_timestamp) ON
user_id = rev_user AND
revision.rev_timestamp BETWEEN
user_registration AND
DATE_FORMAT(DATE_ADD(user_registration, INTERVAL 7 DAY),
"%Y%m%d%H%i%S")
LEFT JOIN revision parent ON
parent.rev_id = revision.rev_parent_id
INNER JOIN page ON
revision.rev_page = page_id
WHERE
wiki = DATABASE();
|
experimental_user_edit_stats
|
---|
SELECT
wiki,
user_id,
SUM(rev_id IS NOT NULL) AS revisions_saved,
SUM(page_namespace = 0) AS mainspace_edits,
SUM(rev_comment LIKE "/* % */%") AS main_section_edits,
SUM(IF(page_namespace = 0 AND bytes_changed > 0, bytes_changed, 0)) AS
main_bytes_added ,
SUM(IF(page_namespace = 0 AND bytes_changed < 0, bytes_changed, 0)) AS
main_bytes_removed,
SUM(bytes_changed) AS main_bytes_changed,
AVG(IF(page_namespace = 0, previous_bytes, NULL)) AS
mean_mainspace_previous_bytes
FROM staging.tr_experimental_user
INNER JOIN staging.tr_experimental_user_revision USING (wiki, user_id)
GROUP BY wiki, user_id;
|
The first query was hell to optimize. Suffice it to say that MySQL needed to be coerced into using an index through the removal of another. See [1] if you are curious.
Now, I think that I'll actually try to generate the per-user stats in R since I only have about 50k rows to deal with. --Halfak (WMF) (talk) 22:09, 30 September 2014 (UTC)
So there's a problem with comparing proportions between users. Users who made many edits have higher weight than editors who have made few edits. To control for this, I randomly sampled one edit per user to generate stats from. Given the large amount of observations, this shouldn't be a problem.
Compare proportion of main namespace edits between conditions
[edit]Except for the low numbers of observations in small wikis, there's no consistent trends here.
Compare the rate of section edits between conditions (parse comment)
[edit]No consistent trends here either.
Compare the # of bytes changed between conditions (rev_len - parent.rev_len)
[edit]Again, no consistent trends. The only differences appear in the cases where there are vanishingly few observations.
Compare the length of pages at at the time edit between conditions (parent.rev_len)
[edit]Still no trend. It looks like, regardless of the intervention, newcomers edit the same proportion of mainspace articles. They make the same proportion of section edits. They change about the same amount of content per edit. And they edit the same size pages. We're here again with no effect. :\ --Halfak (WMF) (talk) 22:28, 30 September 2014 (UTC)