Research talk:Measuring edit productivity/Work log/2015-04-15
Wednesday, April 15, 2015
The diff job finished! Here are the Hadoop stats:
 File System Counters
   FILE: Number of bytes read=11992158169291
   FILE: Number of bytes written=11342016265314
   FILE: Number of read operations=0
   FILE: Number of large read operations=0
   FILE: Number of write operations=0
   HDFS: Number of bytes read=634498881239
   HDFS: Number of bytes written=337764574821
   HDFS: Number of read operations=13317
   HDFS: Number of large read operations=0
   HDFS: Number of write operations=4000
 Job Counters
   Launched map tasks=2439
   Launched reduce tasks=2000
   Data-local map tasks=2438
   Rack-local map tasks=1
   Total time spent by all maps in occupied slots (ms)=446505506700
   Total time spent by all reduces in occupied slots (ms)=26616033150
   Total time spent by all map tasks (ms)=44650550670
   Total time spent by all reduce tasks (ms)=2661603315
   Total vcore-seconds taken by all map tasks=44650550670
   Total vcore-seconds taken by all reduce tasks=2661603315
   Total megabyte-seconds taken by all map tasks=228610819430400
   Total megabyte-seconds taken by all reduce tasks=13627408972800
 Map-Reduce Framework
   Map input records=583741359
   Map output records=415592383
   Map output bytes=8579023624969
   Map output materialized bytes=3778907328006
   Input split bytes=508991
   Combine input records=0
   Combine output records=0
   Reduce input groups=415592383
   Reduce shuffle bytes=3778907328006
   Reduce input records=415592383
   Reduce output records=415592383
   Spilled Records=1246338388
   Shuffled Maps =4878000
   Failed Shuffles=0
   Merged Map outputs=4878000
   GC time elapsed (ms)=183793619
   CPU time spent (ms)=45173296420
   Physical memory (bytes) snapshot=6270283120640
   Virtual memory (bytes) snapshot=14864163971072
   Total committed heap usage (bytes)=8861103685632
 Shuffle Errors
   BAD_ID=0
   CONNECTION=0
   IO_ERROR=0
   WRONG_LENGTH=0
   WRONG_MAP=0
   WRONG_REDUCE=0
 File Input Format Counters
   Bytes Read=634498372248
 File Output Format Counters
   Bytes Written=337764574821
 15/04/09 14:00:04 INFO streaming.StreamJob: Output directory: /user/halfak/streaming/enwiki-20141106/diffs-snappy
 real 8408m36.464s
 user 4m50.086s
 sys 7m25.772s
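To put those figures in more familiar units, here is a small Python sketch that converts the `real` line and the "CPU time spent (ms)" counter into days. The helper name `parse_time_field` is made up for this illustration; only the two input numbers come from the output above.

```python
import re

def parse_time_field(value):
    """Convert a `time`-style duration such as '8408m36.464s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", value)
    if match is None:
        raise ValueError("unrecognized duration: %r" % value)
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)

# Wall-clock time of the streaming job as reported by `time`.
real_seconds = parse_time_field("8408m36.464s")
real_days = real_seconds / 86400  # a little under 6 days

# "CPU time spent (ms)" counter, summed over all map and reduce tasks.
cpu_seconds = 45173296420 / 1000
cpu_days = cpu_seconds / 86400  # a little over 520 days

# Rough average number of CPU-seconds consumed per wall-clock second.
avg_parallelism = cpu_seconds / real_seconds
```

By this arithmetic the job ran for a bit under six days of wall-clock time and burned roughly 90 CPU-seconds per elapsed second. That ratio is only a rough indicator, since the `real` figure also includes any time the job spent waiting.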
I had to implement some diff timeouts in order to get it to finish. For that reason, there are some edits that have no diff. I wasn't able to find them with a simple grep for "diff: null", so I'm just going to kick off the persistence job and see how it goes while I prepare to analyze the diffs. --Halfak (WMF) (talk) 16:15, 15 April 2015 (UTC)
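The timeout logic itself isn't shown in this log. As a rough illustration only (not the code used in the job), one way to bound diff time in Python is to run the diff in a worker thread and stop waiting after a deadline. The names `diff_with_timeout` and `sequence_ops` and the `ops`/`time` fields are assumptions for this sketch, and `difflib` stands in for whatever diff algorithm the job actually uses.

```python
import difflib
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

def sequence_ops(a, b):
    """Stand-in diff: opcodes from difflib's SequenceMatcher."""
    return difflib.SequenceMatcher(None, a, b, autojunk=False).get_opcodes()

def diff_with_timeout(a, b, timeout, diff_func=sequence_ops):
    """Run diff_func(a, b), giving up after `timeout` seconds.

    Assumption: a timed-out diff is recorded with ops=None.
    """
    start = time.monotonic()
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(diff_func, a, b)
    try:
        ops = future.result(timeout=timeout)
    except FutureTimeout:
        ops = None
    finally:
        # Do not block on the worker; it keeps running in the background.
        pool.shutdown(wait=False)
    return {"ops": ops, "time": time.monotonic() - start}

result = diff_with_timeout("a b c".split(), "a c d".split(), timeout=5.0)
```

A thread cannot be killed from outside, so this version only stops waiting and the abandoned diff still runs to completion. A job at this scale would more likely need a process- or signal-based timeout that actually interrupts the work.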
I had to make some modifications, but the script is now started. In the meantime, I want to (1) confirm that all the diffs are in fact not "null" and (2) plot the diff timing data that I extracted. --Halfak (WMF) (talk) 16:28, 15 April 2015 (UTC)
Well... it looks like I should have been looking for "ops: null". Oh well. Let's grab a sample and start working with it.
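Rather than grepping for a literal string, the check can also be done by parsing each record. This is a sketch that assumes one JSON document per line with a `diff` object holding an `ops` field. That layout is guessed from the "ops: null" remark and is not confirmed by this log, and `count_missing_diffs` is a made-up name.

```python
import json

def count_missing_diffs(lines):
    """Return (total, missing), where missing counts records with null ops.

    Assumption: each line is a JSON object with a "diff" object that has
    an "ops" field. A record with no "diff" at all also counts as missing.
    """
    total = 0
    missing = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        doc = json.loads(line)
        total += 1
        diff = doc.get("diff") or {}
        if diff.get("ops") is None:
            missing += 1
    return total, missing

# Two made-up records: one with an (empty) op list, one with null ops.
sample = [
    '{"id": 1, "diff": {"ops": [], "time": 0.02}}',
    '{"id": 2, "diff": {"ops": null, "time": 10.0}}',
]
total, missing = count_missing_diffs(sample)
```

Parsing avoids depending on the serializer's exact quoting and spacing, which is the kind of detail that makes a plain grep miss.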
So, I randomly sampled 100k revisions from the first reducer. That might result in some bias. I'm not sure. So I'll do some analysis on this while I pull a larger sample.
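One way to avoid depending on a single reducer's output is to stream every output file through a reservoir sampler, which draws a uniform sample in one pass without knowing the total count in advance. The sketch below is the textbook technique (Algorithm R) and not necessarily how the larger sample was pulled; `reservoir_sample` is a name chosen for this example.

```python
import random

def reservoir_sample(items, k, rng=None):
    """Uniformly sample up to k items from an iterable in one pass."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(items):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir

# Example: 5 items out of a stream of 10,000 stand-in revision IDs.
picked = reservoir_sample(range(10000), 5, random.Random(42))
```

In practice `items` would be the concatenated lines of all reducer part files, so every revision gets the same chance of being selected regardless of which reducer wrote it.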
OK.
Well, that looks fast to me.
Let's look at some stats.
 quantile(diff_stats$diff.time)
 #   0%  25%  50%  75% 100%
 # 0.00 0.02 0.05 0.13 3.85
 summary(diff_stats$truncated)
 #  False
 # 100000
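For anyone checking these numbers outside of R, here is a small Python sketch of the same quantile rule. R's `quantile` defaults to linear interpolation between order statistics (its "type 7"), which is what this mirrors. The timing values are invented for the example and are not taken from the sample.

```python
import math

def quantile(values, p):
    """Quantile by linear interpolation (R's default, type 7)."""
    xs = sorted(values)
    if not xs:
        raise ValueError("quantile of empty data")
    h = (len(xs) - 1) * p          # fractional index into the sorted data
    lo = math.floor(h)
    hi = min(lo + 1, len(xs) - 1)  # clamp at the last element for p = 1
    return xs[lo] + (h - lo) * (xs[hi] - xs[lo])

# Made-up diff times, for illustration only.
diff_times = [0.01, 0.02, 0.05, 0.13, 3.85]
five_numbers = [quantile(diff_times, p) for p in (0, 0.25, 0.5, 0.75, 1)]
```

With five values the quantile positions land exactly on the data points, so `five_numbers` comes out equal to the sorted input.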
Cool. It looks like we're performing about right for my expectations. Now I'm just waiting for the proper sample to finish. --Halfak (WMF) (talk) 18:40, 15 April 2015 (UTC)
Looks like the persistence generator failed. That was because I changed the format of diffs in order to track stats. I've released a new version of mwstreaming (0.5.5) to fix this and restarted the job. --Halfak (WMF) (talk) 18:41, 15 April 2015 (UTC)
Well, it's all running, but this is going to take at least a couple of hours, so I'm going to go work on other things. If the sample finishes today, I'll update here. If not, look for future worklogs. --Halfak (WMF) (talk) 20:18, 15 April 2015 (UTC)
Update from the FUTURE! The proper sample completed. It looks like stats didn't change in any meaningful way, but I did update the #Diff time density plot above. --Halfak (WMF) (talk) 17:38, 16 April 2015 (UTC)