Research talk:Measuring edit productivity/Work log/2015-09-16
Wednesday, September 16, 2015
It's been a while, but I haven't put this project down. I spent most of my hours on this project honing my utilities for processing content persistence. See pythonhosted.org/mwpersistence. I've also been working with other researchers who are using similar strategies to track content, to try to converge on a general approach.
Anyway, it's time to get some analysis done, so that's why I'm here today. See https://github.com/halfak/measuring-edit-productivity for code that I'll be referencing.
So, first things first: I'm updating the Makefile so that I can use a set of Snappy files that I pulled from the hadoop cluster to stat1003 and try processing the data in single-server mode (there's a sketch of what that looks like at the end of this log).
Before anything else, I need to be able to process our Snappy-compressed files. See Phab:T112770. --Halfak (WMF) (talk) 17:31, 16 September 2015 (UTC)
- Regretfully, this is a blocker for me, so I'm going to go to hadoop and re-compress these files as bz2. *sigh* --Halfak (WMF) (talk) 21:14, 16 September 2015 (UTC)
I've learned a couple of things.
- Hadoop's Snappy compression uses its own block-based container format rather than standard Snappy framing, and therefore will not work with snzip anyway
- It's better if I just recompress the files as bz2 in hadoop
- In order to preserve page partitioning and chronological order, I have to make hadoop re-sort the data -- even though it is already sorted. (See the sketch just after this list for what that sort amounts to.)
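To make that last point concrete: the streaming job below expresses the re-sort with KeyFieldBasedPartitioner/KeyFieldBasedComparator options whose -k flags behave like Unix sort's. Here is a minimal local sketch of the same per-page, chronological ordering; revisions.json is a hypothetical input file, and the mwstream json2tsv call is lifted straight from the job's mapper:

# Prefix each JSON revision with page.id, timestamp, and rev id as tab-separated
# sort keys, sort by page id (numeric), then timestamp, then rev id (numeric),
# then strip the keys off again; this is the same shape as the Hadoop job below.
./mwstream json2tsv page.id timestamp id - < revisions.json \
    | sort -t$'\t' -k1,1n -k2,2 -k3,3n \
    | cut -f4 > revisions.sorted.json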
Basically, I'm done with Snappy. I'll be converting my whole workflow to bz2 asap.
For now, I've kicked off a new job to do the recompression. --Halfak (WMF) (talk) 22:31, 16 September 2015 (UTC)
(Note: posting from the next morning)
Here's the script that I wrote:
#!/bin/bash
# Recompress a JSON revision dataset to bz2 with a Hadoop Streaming job,
# re-partitioning by page and re-sorting chronologically along the way.

# Gather command line args
job_name=$1
input=$2
output=$3

echo "Zipping up virtualenv"
cd /home/halfak/venv/3.4/
zip -rq ../3.4.zip *
cd -
cp /home/halfak/venv/3.4.zip virtualenv.zip

echo "Moving virtualenv.zip to HDFS"
hdfs dfs -put -f virtualenv.zip /user/halfak/virtualenv.zip

echo "Running hadoop job"
# Output is BZip2 (BLOCK) rather than Snappy.  The mapper emits page.id,
# timestamp, and rev id as tab-separated key fields; KeyFieldBasedPartitioner
# partitions on page.id and KeyFieldBasedComparator sorts by page, then
# timestamp, then rev id, so each page's revisions stay together in
# chronological order.  The reducer (cut -f4) strips the key fields back off.
hadoop jar /opt/hadoop/share/hadoop/tools/lib/hadoop-*streaming*.jar \
    -D mapreduce.job.name=$job_name \
    -D mapreduce.output.fileoutputformat.compress=true \
    -D mapreduce.output.fileoutputformat.compress.type=BLOCK \
    -D mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.BZip2Codec \
    -D mapreduce.task.timeout=6000000 \
    -D stream.num.map.output.key.fields=3 \
    -D mapreduce.partition.keypartitioner.options='-k1,1n' \
    -D mapreduce.job.output.key.comparator.class="org.apache.hadoop.mapred.lib.KeyFieldBasedComparator" \
    -D mapreduce.partition.keycomparator.options='-k1,1n -k2,2 -k3,3n' \
    -D mapreduce.reduce.speculative=false \
    -D mapreduce.reduce.env="LD_LIBRARY_PATH=virtualenv/lib/" \
    -D mapreduce.map.env="LD_LIBRARY_PATH=virtualenv/lib/" \
    -D mapreduce.map.memory.mb=1024 \
    -D mapreduce.reduce.memory.mb=1024 \
    -D mapreduce.reduce.vcores=2 \
    -D mapreduce.job.reduces=2000 \
    -files hadoop/mwstream \
    -archives 'hdfs:///user/halfak/virtualenv.zip#virtualenv' \
    -input $input \
    -output $output \
    -mapper "bash -c './mwstream json2tsv page.id timestamp id -'" \
    -reducer "bash -c 'cut -f4'" \
    -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
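For reference, the script takes a job name, an HDFS input path, and an HDFS output path as its three positional arguments, so an invocation looks something like the following. The script filename and both HDFS paths are made up for illustration; only the three-argument shape comes from the script itself.

# Hypothetical invocation (script name and HDFS paths are made up):
bash recompress_bz2.sh enwiki-json-bz2-recompress \
    /user/halfak/enwiki-diffs-snappy \
    /user/halfak/enwiki-diffs-bz2

As far as I can tell from the mapper/reducer pair, the trailing '-' in the json2tsv call passes the whole JSON record through as a fourth field, so after the reducer's cut -f4 the bz2 output contains exactly the original JSON documents, just re-partitioned by page and re-sorted.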
Everything went as planned and I'm pulling the data down to our stat1003 as I type. --Halfak (WMF) (talk) 14:36, 17 September 2015 (UTC)
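One nice property of the bz2 output for the single-server processing mentioned at the top of this log is that it works with ordinary tools: bzcat decompresses concatenated bzip2 streams, so the reducer part files can be piped straight into local commands. A rough sketch with made-up local paths; only the mwstream invocation is taken from the job script above:

# Peek at the first re-sorted revision document to sanity-check the download.
bzcat /srv/halfak/enwiki-bz2/part-00000.bz2 | head -n 1

# Feed several chunks through a local pipeline; the same json2tsv call from
# the Hadoop mapper works unchanged as a plain shell pipe.
bzcat /srv/halfak/enwiki-bz2/part-000*.bz2 \
    | ./mwstream json2tsv page.id timestamp id - \
    | head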