Research talk:Measuring edit productivity/Work log/2014-12-23
Tuesday, December 23, 2014
Hey folks. Lots of work in hadoop. Before I start thinking out loud, let's recap.
I figured out how to do some chronological operations in hadoop. See fitting hadoop streaming into my python workflow.
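To unpack what I mean by "chronological operations": Hadoop Streaming only groups values by key, so to walk a page's revisions in order I emit a composite key and let the shuffle do the sorting. Here's a rough sketch of the pattern (not the actual code from the repo; the JSON field names like rev["page"]["id"] and rev["timestamp"] are just illustrative assumptions):

 import json
 import sys


 def mapper():
     # Emit <page_id> TAB <timestamp> TAB <revision JSON> so the shuffle
     # can partition by page and sort each page's revisions by time.
     for line in sys.stdin:
         rev = json.loads(line)
         sys.stdout.write("%s\t%s\t%s\n" %
                          (rev["page"]["id"], rev["timestamp"], line.strip()))


 def reducer():
     # With the job partitioning on field 1 and sorting on fields 1-2,
     # each page's revisions arrive here in chronological order.
     current_page = None
     for line in sys.stdin:
         page_id, timestamp, rev_json = line.rstrip("\n").split("\t", 2)
         if page_id != current_page:
             current_page = page_id
             # ... reset per-page state here ...
         # ... compare this revision against the previous one here ...


 if __name__ == "__main__":
     mapper() if sys.argv[1] == "map" else reducer()

The job just needs KeyFieldBasedPartitioner set up to partition on the first field and sort on the first two; see the work log linked above for how that fits together.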
I wrote a library for stream processing mediawiki data. See https://github.com/halfak/MediaWiki-Streaming.
I've been running these streaming utilities on a couple of datasets:
- simplewiki-20141122-pages-meta-history
- enwiki-20141106-pages-meta-history
I'm testing everything on simplewiki before I move on to enwiki. The basic flow is:
xml --> json --> diffs --> persistence --> revstats
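For context on the diffs --> persistence step: the idea is to walk each page's revisions chronologically and track how long the tokens each revision adds survive into later revisions. The real utilities work off token-level diffs; the toy sketch below only keys on token presence, so treat it as an illustration of what gets measured, not how it's computed:

 def token_persistence(revision_tokens):
     """Given one page's revisions as chronologically ordered token lists,
     return {(rev_index, token): number of later revisions that token
     persisted through}.  A crude stand-in for real content persistence,
     which follows individual token instances through diffs."""
     persistence = {}
     alive = {}  # token -> index of the revision that introduced it
     for i, tokens in enumerate(revision_tokens):
         current = set(tokens)
         # Credit tokens that survived into this revision; drop the rest.
         for token, introduced_at in list(alive.items()):
             if token in current:
                 key = (introduced_at, token)
                 persistence[key] = persistence.get(key, 0) + 1
             else:
                 del alive[token]
         # Start tracking tokens that this revision introduces.
         for token in current:
             if token not in alive:
                 alive[token] = i
                 persistence.setdefault((i, token), 0)
     return persistence


 # token_persistence([["foo"], ["foo", "bar"], ["bar"]])
 # -> {(0, "foo"): 1, (1, "bar"): 1}

The revstats step is then mostly aggregation over numbers like these, per revision.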
Right now, I have simplewiki at <persistence> and enwiki working on <diffs>. I'm trying to work out what is going on with the filesize of simplewiki's persistence output. In my tests on small sample datasets, I found that persistence compressed to about the same size as diffs, but that was with bzip2. In hadoop, I've been working with Snappy-compressed files, and the persistence output is nearly 3 orders of magnitude bigger than I expected. So, I'm running a test: I'm taking one of the result part files and recompressing it with bzip2.
 hdfs dfs -text /user/halfak/streaming/simplewiki-20141122/persistence-snappy/part-00048.snappy \
     | bzip2 -c > /a/halfak/part-00048.bz2
We'll see how that goes. --Halfak (WMF) (talk) 22:41, 23 December 2014 (UTC)