Research talk:Measuring edit productivity/Work log/2015-09-18
Latest comment: 9 years ago by Halfak (WMF) in topic Friday, September 18, 2015
Friday, September 18, 2015
OK! So I'm running a job on stat1003 and I've learned about two issues.
- The first is that the output queue used in para to parallelize the processing work needs a fixed size, or memory is going to become a huge issue. When I run the job on a single file (no output queue), memory usage is minimal.
- The second follows from the first: the mappers can produce output far faster than the bzip2 stream can write it. That means we need to multiprocess the bzip2 compression. I filed a feature request to add that, and I'll primarily be digging into it today.
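To make both issues concrete, here is a minimal sketch of one way they could be addressed together. It is not para's actual code or API: the function name `write_bz2_parallel` and the parameters `workers` and `max_pending` are hypothetical. It assumes the output can be split into independent byte chunks, that a file of concatenated bzip2 streams is acceptable to downstream readers, and that CPython's `bz2` releases the GIL while compressing, so threads give real parallelism. The bounded `pending` deque plays the role of the fixed-size output queue: the producer is forced to wait once `max_pending` chunks are in flight, so memory stays capped.

```python
import bz2
import io
from collections import deque
from concurrent.futures import ThreadPoolExecutor


def write_bz2_parallel(chunks, out, workers=4, max_pending=8):
    """Compress byte chunks in parallel and write them to `out` in order.

    Each chunk becomes its own bzip2 stream; the streams are concatenated.
    At most `max_pending` chunks are held in memory at any time.
    """
    pending = deque()  # futures, oldest first, so output order is preserved
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for chunk in chunks:
            if len(pending) >= max_pending:
                # Queue is full: block on the oldest chunk before taking more.
                out.write(pending.popleft().result())
            pending.append(pool.submit(bz2.compress, chunk))
        # Drain whatever is still in flight.
        while pending:
            out.write(pending.popleft().result())


def demo():
    # Hypothetical stand-in for mapper output.
    chunks = [(b"revision %d\n" % i) * 1000 for i in range(20)]
    buf = io.BytesIO()
    write_bz2_parallel(chunks, buf)
    # bz2.decompress reads concatenated streams back as one payload.
    return bz2.decompress(buf.getvalue()) == b"".join(chunks)
```

Compressing each chunk as a separate stream costs a little compression ratio compared with one continuous stream, but it is what lets the work be spread across workers without sharing compressor state.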