Research talk:Measuring edit productivity/Work log/2014-12-23
Tuesday, December 23, 2014
Hey folks. Lots of work in hadoop. Before I start thinking out loud, let's recap.
I figured out how to do some chronological operations in hadoop. See fitting hadoop streaming into my python workflow.
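To unpack what I mean by "chronological operations": Hadoop Streaming only groups values by key, so to walk a page's revisions in order I emit a composite key and let the shuffle do the sorting. Here's a rough sketch of the pattern (not the actual code from the repo; the JSON field names like rev["page"]["id"] and rev["timestamp"] are just illustrative assumptions):

 import json
 import sys


 def mapper():
     # Emit <page_id> TAB <timestamp> TAB <revision JSON> so the shuffle
     # can partition by page and sort each page's revisions by time.
     for line in sys.stdin:
         rev = json.loads(line)
         sys.stdout.write("%s\t%s\t%s\n" %
                          (rev["page"]["id"], rev["timestamp"], line.strip()))


 def reducer():
     # With the job partitioning on field 1 and sorting on fields 1-2,
     # each page's revisions arrive here in chronological order.
     current_page = None
     for line in sys.stdin:
         page_id, timestamp, rev_json = line.rstrip("\n").split("\t", 2)
         if page_id != current_page:
             current_page = page_id
             # ... reset per-page state here ...
         # ... compare this revision against the previous one here ...


 if __name__ == "__main__":
     mapper() if sys.argv[1] == "map" else reducer()

The job just needs KeyFieldBasedPartitioner set up to partition on the first field and sort on the first two; see the work log linked above for how that fits together.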
I wrote a library for stream processing mediawiki data. See https://github.com/halfak/MediaWiki-Streaming.
I've been running these streaming utilities on a couple of datasets:
- simplewiki-20141122-pages-meta-history
- enwiki-20141106-pages-meta-history
I'm testing everything on simplewiki before I move on to enwiki. The basic flow is:
xml --> json --> diffs --> persistence --> revstats
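For context on the diffs --> persistence step: the idea is to walk each page's revisions chronologically and track how long the tokens each revision adds survive into later revisions. The real utilities work off token-level diffs; the toy sketch below only keys on token presence, so treat it as an illustration of what gets measured, not how it's computed:

 def token_persistence(revision_tokens):
     """Given one page's revisions as chronologically ordered token lists,
     return {(rev_index, token): number of later revisions that token
     persisted through}.  A crude stand-in for real content persistence,
     which follows individual token instances through diffs."""
     persistence = {}
     alive = {}  # token -> index of the revision that introduced it
     for i, tokens in enumerate(revision_tokens):
         current = set(tokens)
         # Credit tokens that survived into this revision; drop the rest.
         for token, introduced_at in list(alive.items()):
             if token in current:
                 key = (introduced_at, token)
                 persistence[key] = persistence.get(key, 0) + 1
             else:
                 del alive[token]
         # Start tracking tokens that this revision introduces.
         for token in current:
             if token not in alive:
                 alive[token] = i
                 persistence.setdefault((i, token), 0)
     return persistence


 # token_persistence([["foo"], ["foo", "bar"], ["bar"]])
 # -> {(0, "foo"): 1, (1, "bar"): 1}

The revstats step is then mostly aggregation over numbers like these, per revision.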
Right now, I have simplewiki at <persistence> and enwiki working on <diffs>. I'm trying to work out what is going on with the filesize of simplewiki's persistence output. In my tests on small sample datasets, I found that persistence compressed to about the same size as diffs, but that was with bzip2. In hadoop, I've been working with Snappy-compressed files, and the persistence output is nearly 3 orders of magnitude bigger than I expected. So, I'm running a test: I'm taking one of the result part files and recompressing it with bzip2.
 hdfs dfs -text /user/halfak/streaming/simplewiki-20141122/persistence-snappy/part-00048.snappy \
     | bzip2 -c > /a/halfak/part-00048.bz2
We'll see how that goes. --Halfak (WMF) (talk) 22:41, 23 December 2014 (UTC)