WSoR datasets/revision diff
Location
The diffdb can be downloaded from dumps.wikimedia.org.
Fields
hadoop21@beta:~/wikihadoop/diffs$ /usr/lib/hadoop-beta/bin/hdfs dfs -cat /usr/hadoop/out-10-bzip2/part-00000 | head -n 3
133350337 11406585 0 'National security and homeland security presidential directive' 1180070193 u'Begin' False 308437 u'Badagnani' 0:1:u"The '''[[National Security and Homeland Security Presidential Directive]]''' (NSPD-51/HSPD-20), signed by President [[George W. Bush]] on May 9, 2007, is a [[Presidential Directive]] giving the [[President of the United States]] near-total control over the United States in the event of a catastrophic event, without the oversight of [[United States Congress|Congress]].\n\nThe signing of this Directive was generally unnoticed by the U.S. media as well as the U.S. Congress. It is unclear how the National Security and Homeland Security Presidential Directive will reconcile with the [[National Emergencies Act]], signed in 1976, which gives Congress oversight during such emergencies.\n\n==External links==\n*[http://www.whitehouse.gov/news/releases/2007/05/20070509-12.html National Security and Homeland Security Presidential Directive], from White House site\n\n==See also==\n*[[National Emergencies Act]]\n*[[George W. Bush]]\n\n{{US-stub}}"
133350707 11406585 0 'National security and homeland security presidential directive' 1180070344 None False 308437 u'Badagnani' 906:1:u'National Security Directive]]\n*[['
133350794 11406585 0 'National security and homeland security presidential directive' 1180070386 None False 308437 u'Badagnani' 613:-1:u'signed' 613:1:u'a U.S. federal law passed'
Each row represents a revision from the April 2011 XML dump of the English Wikipedia. There *should* be a row for every revision that had not been deleted when that dump was produced; however, at this time some cleanup is still needed to remove duplicates and fill in missing revision diffs.
rev_id
: The identifier of the revision being described (PRIMARY KEY)
page_id
: The identifier of the page being revised
namespace
: The identifier of the namespace of the page
title
: The title of the page being revised
timestamp
: The time the revision took place as a Unix epoch timestamp in seconds
comment
: The edit summary left by the editor
minor
: Minor status of the edit (boolean)
user_id
: The identifier of the editor who saved the revision
user_text
: The username of the editor who saved the revision
diffs
: Tab-separated diff operations. Each diff operation has three parts, separated by colons:
  position
  : The position in the article text at which the operation took place
  action
  : Did the operation add or remove some text? ("1" for add, "-1" for remove)
  content
  : The text operated on. For added text, this is the content to add. For removed text, this is the content that was removed.
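The three-part operations described above can be replayed against a revision's parent text. The sketch below is a minimal illustration, not the differ's own code: the function name `apply_ops` is hypothetical, and it assumes that operations are applied in the order listed and that each position is a character offset into the text as it stands after the earlier operations in the same row. The page does not state which convention the differ uses, so check this against real data before relying on it.

```python
from typing import Iterable, Tuple

# One decoded diff operation: (position, action, content).
Op = Tuple[int, int, str]


def apply_ops(text: str, ops: Iterable[Op]) -> str:
    """Apply diff operations to `text` one after another.

    Assumption (not confirmed by the dataset documentation): each position
    is an offset into the text produced by the preceding operations.
    """
    for position, action, content in ops:
        if action == 1:
            # Add: splice the content in at the given offset.
            text = text[:position] + content + text[position:]
        elif action == -1:
            # Remove: the removed content should be found at the offset.
            end = position + len(content)
            if text[position:end] != content:
                raise ValueError("removed content not found at %d" % position)
            text = text[:position] + text[end:]
        else:
            raise ValueError("unknown action: %r" % (action,))
    return text


if __name__ == "__main__":
    parent = "Act]], signed in 1976"
    ops = [(7, -1, "signed"), (7, 1, "a U.S. federal law passed")]
    print(apply_ops(parent, ops))
```

The check in the remove branch makes a wrong position convention fail loudly instead of silently producing corrupted text.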
Each row can have zero or more diff operations. Values in the result set have been encoded using Python's repr() function and can be decoded in Python with the eval() function.
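As a rough illustration of decoding, the sketch below parses one row into typed fields. It assumes the columns are separated by tab characters in the order listed under Fields, and that each diff operation has the form position:action:repr(content). The names `Revision`, `DiffOp` and `parse_row` are hypothetical. It uses the standard library's `ast.literal_eval` in place of `eval()`, since it accepts only literals and does not execute arbitrary code. The `u'...'` prefix is accepted by Python 3.3 and later; byte-string values written by Python 2 that contain `\x` escapes may need re-decoding, which this sketch does not attempt.

```python
import ast
from typing import List, NamedTuple, Optional


class DiffOp(NamedTuple):
    position: int
    action: int      # 1 for add, -1 for remove
    content: str


class Revision(NamedTuple):
    rev_id: int
    page_id: int
    namespace: int
    title: str
    timestamp: int
    comment: Optional[str]
    minor: bool
    user_id: int
    user_text: str
    diffs: List[DiffOp]


def parse_row(line: str) -> Revision:
    """Decode one tab-separated row of the dataset (assumed layout)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 9:
        raise ValueError("expected at least 9 fields, got %d" % len(fields))
    # The first nine columns are repr()-encoded scalars.
    meta = [ast.literal_eval(field) for field in fields[:9]]
    diffs = []
    for raw in fields[9:]:
        # Split on the first two colons only; the content may contain colons.
        position, action, content = raw.split(":", 2)
        diffs.append(DiffOp(int(position), int(action), ast.literal_eval(content)))
    return Revision(*meta, diffs)


if __name__ == "__main__":
    sample = "\t".join([
        "133350794", "11406585", "0",
        "'National security and homeland security presidential directive'",
        "1180070386", "None", "False", "308437", "u'Badagnani'",
        "613:-1:u'signed'", "613:1:u'a U.S. federal law passed'",
    ])
    print(parse_row(sample))
```

Because repr() escapes tabs and newlines inside strings, splitting a row on literal tab characters should not cut through a field's content.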
Reproduction
- Install Hadoop, WikiHadoop and the differ.
- beta.wikilytics.org, gamma.wikilytics.org and delta.wikilytics.org (managed by Diederik van Liere) have Hadoop 0.21, WikiHadoop 0.1 and the differ installed.
- Log in to the Hadoop master node.
- Download the Wikipedia dump files compressed in bz2 from the dump distribution site. Make sure to choose the dumps with full edit histories (pages-meta-historyN.xml.bz2).
- Copy the dump files into HDFS using
/usr/lib/hadoop-beta/bin/hdfs dfs -copyFromLocal enwiki*.xml
- Launch a Hadoop job for each dump file using the command below.
screen -S j01diffs /usr/lib/hadoop-beta/bin/hadoop jar hadoop-0.22-streaming.jar -Dmapreduce.task.timeout=0 -Dmapred.reduce.tasks=0 -Dmapreduce.input.fileinputformat.split.minsize=290000000 -D mapreduce.map.output.compress=true -input /enwiki-20110405-pages-meta-history1.xml.bz2 -output /usr/hadoop/out-01 -mapper ~/wikihadoop/diffs/revision_differ.py -inputformat org.wikimedia.wikihadoop.StreamWikiDumpInputFormat
- With 3 nodes and 24 cores in total, one dump file of EN wiki takes approximately 20-24 hours to process.
- If you want to extract the dataset as an ordinary file, accumulate the dataset rows into one file (diffs.tsv) using
/usr/lib/hadoop-beta/bin/hdfs dfs -cat /usr/hadoop/out-*/part-* > diffs.tsv
- There are some duplicates in the results [16]. If you want to exclude those duplicates, use
/usr/lib/hadoop-beta/bin/hdfs dfs -cat /usr/hadoop/out-*/part-* | sort -n -k2 -k1 -u -T ~/tmp/ > diffs.tsv
instead. Note that ~/tmp needs to be a directory large enough to contain all the results, whose sizes are shown by
/usr/lib/hadoop-beta/bin/hdfs dfs -du /usr/hadoop/out-*/part-*
- This may take from several hours to one day depending on the size. The result is on the order of 400 GB for EN wiki.
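For readers who prefer to drop duplicates in Python rather than with sort -u, the sketch below keeps the first row of each run of rows that share a (page_id, rev_id) pair. It assumes the input is already ordered by that key and that rev_id and page_id are the first two tab-separated columns. The function names are hypothetical, and this is an alternative illustration, not the procedure used to build the published dataset.

```python
import itertools
from typing import Iterable, Iterator, Tuple


def row_key(line: str) -> Tuple[int, int]:
    """Return (page_id, rev_id) for a row; rev_id is column 1, page_id column 2."""
    rev_id, page_id = line.split("\t", 2)[:2]
    return int(page_id), int(rev_id)


def dedupe_sorted(lines: Iterable[str]) -> Iterator[str]:
    """Yield one row per key from input already sorted by (page_id, rev_id)."""
    for _key, group in itertools.groupby(lines, key=row_key):
        # Keep only the first row of each run of identical keys.
        yield next(group)


if __name__ == "__main__":
    rows = [
        "133350337\t11406585\t0\t'A'\n",
        "133350337\t11406585\t0\t'A'\n",
        "133350707\t11406585\t0\t'A'\n",
    ]
    for row in dedupe_sorted(rows):
        print(row, end="")
```

Because it only compares neighbouring rows, the generator runs in constant memory, but it removes duplicates only when equal keys are adjacent.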
Notes
The dataset being generated is incomplete in two ways.
- Duplicate entries exist for an estimated less than 0.02% of revisions [17].
- Some revisions failed to be diffed and are marked with 'diff_fail'.
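Both issues can be measured on a sample of the output with a short scan such as the sketch below. It assumes that rev_id is the first tab-separated column and that the literal text diff_fail appears somewhere in an affected row; the page does not say which column carries the marker, so the substring test is a guess. The function name `summarize` is hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def summarize(lines: Iterable[str]) -> Dict[str, int]:
    """Count rows, rev_ids that occur more than once, and rows with 'diff_fail'."""
    seen = Counter()
    total = 0
    failed = 0
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        total += 1
        rev_id = line.split("\t", 1)[0]
        seen[rev_id] += 1
        # Assumption: the marker appears verbatim somewhere in the row.
        if "diff_fail" in line:
            failed += 1
    duplicated = sum(1 for count in seen.values() if count > 1)
    return {
        "rows": total,
        "duplicated_rev_ids": duplicated,
        "diff_fail_rows": failed,
    }


if __name__ == "__main__":
    sample = [
        "1\t10\t0\t'a'\t0:1:u'x'\n",
        "1\t10\t0\t'a'\t0:1:u'x'\n",
        "2\t10\t0\t'a'\tdiff_fail\n",
    ]
    print(summarize(sample))
```

The counter holds every rev_id in memory, so this is suited to a single part file or a sample, not the full EN wiki output.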