Research talk:Measuring edit productivity/Work log/2015-09-16
Wednesday, September 16, 2015
It's been a while, but I haven't put this project down. I've spent most of my hours on it honing my utilities for processing content persistence. I've been working with other researchers who use similar strategies to track content, to try to converge on a general strategy.
Anyway, it's time to get some analysis done, so that's why I'm here today. See for code that I'll be referencing.
So, first things first, I'm updating the Makefile to allow me to use a set of Snappy files that I pulled from the Hadoop cluster to stat1003 so that I can try processing the data in single-server mode.
To do that, I need to be able to process our Snappy-compressed files. See Phab:T112770. --Halfak (WMF) (talk) 17:31, 16 September 2015 (UTC)
- Regrettably, this is a blocker for me. So I'm going to go to Hadoop and recompress these files as bz2. *sigh* --Halfak (WMF) (talk) 21:14, 16 September 2015 (UTC)
I've learned a couple of things.
- Hadoop's Snappy compression is special and therefore will not work with snzip anyway
- It's better if I just recompress the files as Bz2 in hadoop
- In order to preserve page partitioning and chronological order, I have to make hadoop re-sort the data -- even though it is already sorted.
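A minimal sketch of what single-server reading could look like once the files are bz2, assuming the Hadoop output is a directory of ordinary `*.bz2` part files that stock `bzip2` can read. The `stream_bz2` helper and the path in the usage comment are hypothetical, not part of my utilities.

```shell
#!/bin/bash
# stream_bz2 DIR: decompress every *.bz2 file in DIR to stdout,
# one file after another, in glob (file name) order.
stream_bz2() {
    local dir=$1
    local f
    for f in "$dir"/*.bz2; do
        # Skip the unexpanded pattern when the directory has no .bz2 files
        [ -e "$f" ] || continue
        bzip2 -dc "$f"
    done
}

# Usage (hypothetical path): peek at the first record
# stream_bz2 /path/to/recompressed | head -n 1
```

This only concatenates the part files. It says nothing about ordering across files, so anything that depends on order should rely on the order within each file.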
Basically, I'm done with Snappy. I'll be converting my whole workflow to bz2 asap.
For now, I've kicked off a new job to do the recompression. --Halfak (WMF) (talk) 22:31, 16 September 2015 (UTC)
(Note: posting from the next morning)
Here's the script that I wrote:
#!/bin/bash
# Gather command line args
job_name=$1
input=$2
output=$3

echo "Zipping up virtualenv"
cd /home/halfak/venv/3.4/
zip -rq ../ *
cd -
cp /home/halfak/venv/

echo "Moving to HDFS"
hdfs dfs -put -f /user/halfak/;

echo "Running hadoop job"
hadoop jar /opt/hadoop/share/hadoop/tools/lib/hadoop-*streaming*.jar \
    -D$job_name \
    -D mapreduce.output.fileoutputformat.compress=true \
    -D mapreduce.output.fileoutputformat.compress.type=BLOCK \
    -D \
    -D mapreduce.task.timeout=6000000 \
    -D \
    -D mapreduce.partition.keypartitioner.options='-k1,1n' \
    -D mapreduce.job.output.key.comparator.class="org.apache.hadoop.mapred.lib.KeyFieldBasedComparator" \
    -D mapreduce.partition.keycomparator.options='-k1,1n -k2,2 -k3,3n' \
    -D mapreduce.reduce.speculative=false \
    -D mapreduce.reduce.env="LD_LIBRARY_PATH=virtualenv/lib/" \
    -D"LD_LIBRARY_PATH=virtualenv/lib/" \
    -D \
    -D mapreduce.reduce.speculative=false \
    -D mapreduce.reduce.memory.mb=1024 \
    -D mapreduce.reduce.vcores=2 \
    -D mapreduce.job.reduces=2000 \
    -files hadoop/mwstream \
    -archives 'hdfs:///user/halfak/' \
    -input $input \
    -output $output \
    -mapper "bash -c './mwstream json2tsv timestamp id -'" \
    -reducer "bash -c 'cut -f4'" \
    -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
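To make the sort options concrete, here is a rough local analogue of what the comparator and reducer settings amount to on a small sample. Hadoop's key-field options follow the conventions of Unix `sort`, so the same `-k1,1n -k2,2 -k3,3n` spec can be tried on the command line. The column layout (three key columns followed by the JSON document in column 4), the sample rows, and the `resort_sample` helper are assumptions for illustration, not part of the job.

```shell
#!/bin/bash
# Local analogue of the shuffle and reduce steps, for small samples only.
# Assumed input: tab-separated lines with three key columns followed by
# the JSON document in column 4.
resort_sample() {
    # Same key spec as mapreduce.partition.keycomparator.options:
    # column 1 numeric, column 2 as text, column 3 numeric.
    # Then keep only column 4, as the reducer's `cut -f4` does.
    LC_ALL=C sort -t $'\t' -k1,1n -k2,2 -k3,3n | cut -f4
}

# Made-up rows: key 1, key 2 (a timestamp), key 3, payload
printf '10\t2015-09-16T17:31:00Z\t3\t{"id": 3}\n2\t2015-09-16T21:14:00Z\t2\t{"id": 2}\n2\t2015-09-16T17:31:00Z\t1\t{"id": 1}\n' | resort_sample
```

The numeric flag on column 1 is what puts key 2 ahead of key 10; a plain text comparison would sort "10" first. In the real job the partitioner option `-k1,1n` additionally keeps all rows that share column 1 in the same reducer, which this single-process sketch gets for free.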
Everything went as planned and I'm pulling the data down to our stat1003 as I type. --Halfak (WMF) (talk) 14:36, 17 September 2015 (UTC)