Research talk:Measuring edit productivity/Work log/2015-09-29
Wednesday, September 30, 2015
So, I've been working with the output a little bit and I found a problem. It turns out that the diff algorithm behaves strangely when you re-use the abstract segment tree in two diffs. While I've already fixed the issue and added tests, the problem is in the diff algorithm -- the beginning of the pipeline -- so that means I need to re-run the diffs that I had previously run. *sigh*. So, regretfully, I won't be able to do much of an interesting analysis with this dataset in the short term. However, I did spend some time honing my analysis techniques on the data I do have, so I'll take a little bit of time to go over that here.
All of the plots I am about to post are based on data that may be slightly or extremely wrong. Consume them with caution.
First, let's look at the survival of tokens by the number of seconds they remain visible. I can imagine two good strategies for looking at this: a density plot of the time of removal and a death hazard plot.
Well, that's interesting. I suppose it makes sense that a hazard plot will look nearly identical to a density plot of the death times, but I didn't expect that they'd look identical! One feature that jumps out is the cycle of peaks every 24 hours. I bet that's because of the cyclical patterns of activity (and watchlist views) that happen as the earth turns. It looks like the vast majority of the density/hazard goes away after a few hours. That corresponds to my previous observations about quality control in enwiki (see When the Levee Breaks: Without Bots, What Happens to Wikipedia’s Quality Control Processes?).
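For reference, here is a minimal sketch of how a binned density and a binned hazard relate to each other. It is not the code behind the plots above; the function name, the bin edges and the way tokens that were never removed are counted (`n_censored`) are all assumptions made for illustration.

```python
from bisect import bisect_right


def density_and_hazard(death_times, n_censored, bin_edges):
    """Binned density and hazard of token removal (hypothetical helper).

    death_times: seconds visible for each token that was removed.
    n_censored:  number of tokens never seen removed (assumed to
                 outlive every bin).
    bin_edges:   ascending edges; bin i covers [edges[i], edges[i+1]).
    """
    n_bins = len(bin_edges) - 1
    deaths = [0] * n_bins
    early = 0  # removals before the first edge
    for t in death_times:
        if t < bin_edges[0]:
            early += 1
            continue
        i = bisect_right(bin_edges, t) - 1
        if i < n_bins:  # removals past the last edge stay "at risk"
            deaths[i] += 1

    total = len(death_times) + n_censored
    at_risk = total - early
    density, hazard = [], []
    for d in deaths:
        # Density divides by every token; hazard divides only by the
        # tokens still visible at the start of the bin.
        density.append(d / total if total else 0.0)
        hazard.append(d / at_risk if at_risk else 0.0)
        at_risk -= d
    return density, hazard


# Toy data: four removals and six tokens that were never removed.
density, hazard = density_and_hazard(
    death_times=[10, 20, 70, 3700],
    n_censored=6,
    bin_edges=[0, 60, 3600, 86400],
)
```

The two curves differ only in the denominator, so they keep roughly the same shape for as long as the at-risk count stays close to the total number of tokens.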
OK. On to the number of revisions that tokens persist before they are removed (just doing the hazard this time).
Here, we see a much more regular pattern. It looks like most of the hazard drops away after a few follow-up revisions, but the hazard continues to fall with additional revisions. So it looks like a good threshold for damage would be set at 3-5 revisions, but assuming the hazard decay represents some quality aspect of the contribution, more can be learned about that aspect by observing long-term revision persistence. It'll be fun to look at what revisions these tokens are a part of to dig in further. It'll also be fun to see how this changes once I complete a new run of diffs against enwiki.
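As a rough illustration of the per-revision hazard and the damage threshold discussed above, here is a small sketch. The `(persisted, removed)` token representation, both function names and the default threshold of 5 are assumptions for the example, not part of the actual pipeline.

```python
from collections import Counter


def revision_hazard(tokens, max_revisions):
    """Per-revision removal hazard (hypothetical helper).

    tokens: list of (persisted, removed) pairs, where `persisted` is the
            number of follow-up revisions a token survived and `removed`
            is False if it was still present at the last observed
            revision (censored).
    Returns hazard[k] for k in 0..max_revisions-1: of the tokens observed
    at follow-up revision k+1 after surviving k, the share removed there.
    """
    deaths, censored = Counter(), Counter()
    for persisted, removed in tokens:
        (deaths if removed else censored)[persisted] += 1

    at_risk = len(tokens)
    hazard = []
    for k in range(max_revisions):
        # Tokens censored after k revisions were never observed at
        # revision k+1, so they leave the risk set first.
        at_risk -= censored[k]
        hazard.append(deaths[k] / at_risk if at_risk else 0.0)
        at_risk -= deaths[k]
    return hazard


def survived_threshold(persisted, removed, threshold=5):
    """True/False if the outcome is known, None if censored too early."""
    if persisted >= threshold:
        return True
    return False if removed else None


# Toy data: most removals happen in the first follow-up revision.
tokens = [(0, True)] * 4 + [(1, True)] * 2 + [(2, True)] + [(5, False)] * 3
hazard = revision_hazard(tokens, max_revisions=4)
```

Returning `None` for tokens that were censored before reaching the threshold keeps "not yet known" separate from "removed early", which matters for recent revisions that have had few follow-up edits.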