Jump to content

Research:Generating a common editor sample

From Meta, a Wikimedia project coordination wiki
  • Produce a first attempt at a random sample of users that we can use throughout the summer for qualitative observation. (The qualitative observations can eventually build datasets that can be used as training data for models.)
  • Using this new sample, do the same kind of analysis I did last week on editing sprints as a first iteration of adding new aggregate data to this sample.

Steps:

1.

[edit]

The query below took a whole day to run on toolserver. My goal was to get a sample with at least 1000 users per year in the "100's" and "1000's" (of edits) categories of each year.

use enwiki_p;
create table u_zexley_p.newsamp
SELECT /* SLOW_OK */ r.rev_user as user_id, count(rev_id) as rev_count, min(rev_id) as min_rev_id,max(rev_id) as max_rev_id
FROM revision as r
GROUP BY r.rev_user
ORDER BY RAND()
LIMIT 200000;

I should have done some better guestimation before running the query. My 200,000 limit got me only ~1300 total 1000's and only ~5000 100's. After talking with Aaron, it occurred to him to make an aggregate table that indexes the whole user table by edit count, date of first edit, and maybe also edit count in first year, month, etc.... That would be useful! He's seeing how long that will take. In the mean time, I'm just running another query that will hopefully give me enough people in those categories, but Aaron says this will take the same whole day to run probably:

use enwiki_p;
insert into u_zexley_p.newsamp
SELECT /* SLOW_OK */ r.rev_user as user_id, count(rev_id) as rev_count, min(rev_id) as min_rev_id,max(rev_id) as max_rev_id
FROM revision as r
GROUP BY r.rev_user
HAVING count(rev_id) > 99
ORDER BY RAND()
LIMIT 50000;