Research talk:Measuring edit productivity/Work log/2015-01-5
Monday, January 5, 2015
Just got back from the holiday and I'm picking up where I left off. It looks like the text-trimming script I ran worked as expected, so next I want to retry the revision stats job on simplewiki. First, let's check the change in file size.
[halfak@stat1002: ~/projects/persistence] $ du -hs /mnt/hdfs/user/halfak/streaming/simplewiki-20141122/persistence-notext-snappy/
23G	/mnt/hdfs/user/halfak/streaming/simplewiki-20141122/persistence-notext-snappy/
[halfak@stat1002: ~/projects/persistence] $ du -hs /mnt/hdfs/user/halfak/streaming/simplewiki-20141122/persistence-snappy
11T	/mnt/hdfs/user/halfak/streaming/simplewiki-20141122/persistence-snappy
Well... hmm. That's a pretty massive difference (23G without the text vs. 11T with it). :) It makes sense, since the diffs are a naturally compact representation of the changes. This should substantially reduce the storage space needed to sort and partition the data. Time to try again. --Halfak (WMF) (talk) 17:58, 5 January 2015 (UTC)
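For a rough sense of scale, here is a minimal sketch that turns the two `du -hs` figures above into a reduction factor. The helper names (`parse_du_size`, `reduction_factor`) are made up for illustration and are not part of the persistence scripts. It assumes `du -h` reports sizes with binary (power-of-1024) suffixes, and since `du -h` rounds its output, the result is only approximate.

```python
# Binary (power-of-1024) multipliers for the suffixes printed by `du -h`.
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_du_size(text):
    """Convert a `du -h` style size such as '23G' into an approximate byte count."""
    text = text.strip()
    suffix = text[-1].upper()
    if suffix in UNITS:
        return float(text[:-1]) * UNITS[suffix]
    # Assumption: a value with no suffix is already a byte count.
    return float(text)


def reduction_factor(before, after):
    """Return how many times smaller `after` is than `before`."""
    return parse_du_size(before) / parse_du_size(after)


# Sizes copied from the `du -hs` output above.
factor = reduction_factor("11T", "23G")
print(f"roughly {factor:.0f}x smaller")  # prints: roughly 490x smaller
```

So the trimmed dataset is around 1/490th the size of the original, give or take the rounding in the `du` output.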