WSoR datasets/revert
The revert dataset contains information about reverting edits and the revision that was reverted back to. Reverts are identified by looking for revisions with identical content in the history of a page.
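The identical-content check can be sketched roughly as follows. This is a minimal illustration, not the script used to build the dataset: the function name `find_reverts`, the use of SHA-1 checksums, the `(rev_id, text)` input shape, and the decision to ignore edits identical to the immediately preceding revision are all assumptions. The look-back of 15 revisions mirrors the `< 15` limit described for `revs_reverted`.

```python
import hashlib


def find_reverts(revisions, max_reverted=15):
    """Yield (rev_id, rvtto_id, revs_reverted) for one page's history.

    `revisions` is an iterable of (rev_id, text) pairs for a single page,
    in chronological order (an assumed input shape).
    """
    history = []  # (rev_id, checksum) of earlier revisions, oldest first
    for rev_id, text in revisions:
        checksum = hashlib.sha1(text.encode("utf-8")).hexdigest()
        recent = history[-max_reverted:]
        # Walk backwards; `distance` is the number of revisions that sit
        # between the candidate and the current revision.
        for distance, (old_id, old_checksum) in enumerate(reversed(recent)):
            if old_checksum == checksum:
                # distance == 0 means the edit changed nothing relative to
                # its parent; this sketch does not count that as a revert.
                if distance >= 1:
                    yield (rev_id, old_id, distance)
                break
        history.append((rev_id, checksum))


page = [(1, "a"), (2, "b"), (3, "a"), (4, "c"), (5, "d"), (6, "a")]
reverts = list(find_reverts(page))
```

Here revision 3 restores the content of revision 1 (one revision reverted), and revision 6 restores the content of revision 3 (two revisions reverted).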
Location
db42:halfak.revert_20110115
Fields
halfak@internproxy:~$ mysql -h db42 -e "EXPLAIN revert_20110115;SELECT * FROM revert_20110115 LIMIT 3" halfak
+---------------+------------+------+-----+---------+-------+
| Field         | Type       | Null | Key | Default | Extra |
+---------------+------------+------+-----+---------+-------+
| rev_id        | int(11)    | YES  | UNI | NULL    |       |
| rvtto_id      | int(11)    | YES  |     | NULL    |       |
| is_vandalism  | tinyint(1) | YES  | MUL | NULL    |       |
| revs_reverted | int(11)    | YES  |     | NULL    |       |
+---------------+------------+------+-----+---------+-------+
+--------+----------+--------------+---------------+
| rev_id | rvtto_id | is_vandalism | revs_reverted |
+--------+----------+--------------+---------------+
|     60 |       43 |            0 |             1 |
|    309 |      308 |            0 |             2 |
|    323 |      318 |            0 |             1 |
+--------+----------+--------------+---------------+
The table contains one row for every reverting revision in a page that was included in the January 2011 database dump of enwiki.
rev_id
: The reverting revision's identifier.
rvtto_id
: The identifier of the revision that was reverted to.
is_vandalism
: True when the D_STRICT/D_LOOSE regexp matched the reverting revision's comment, False otherwise.
revs_reverted
: The number of revisions between the reverting revision and the reverted-to revision. This represents the number of revisions that were reverted. Note that this is limited to < 15 to remove noise.
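As a rough illustration of how these four fields might be consumed, the sketch below wraps rows in a named tuple and tallies them. The names `RevertRow` and `summarize` are hypothetical, and the rows are assumed to arrive as plain tuples in the column order shown above (for example, from a database cursor). The sample values are the three rows from the query output.

```python
from collections import Counter, namedtuple

# Field order follows the table definition shown above.
RevertRow = namedtuple(
    "RevertRow", ["rev_id", "rvtto_id", "is_vandalism", "revs_reverted"]
)


def summarize(rows):
    """Count reverts, vandalism-flagged reverts, and reverts by size."""
    rows = [RevertRow(*r) for r in rows]
    return {
        "reverts": len(rows),
        "vandalism_reverts": sum(1 for r in rows if r.is_vandalism),
        "by_revs_reverted": dict(Counter(r.revs_reverted for r in rows)),
    }


sample = [(60, 43, 0, 1), (309, 308, 0, 2), (323, 318, 0, 1)]
summary = summarize(sample)
```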
Reproduction
To reproduce this dataset, install Wikimedia Utilities and run this script.
For example:
$ python reverts.py --output_prefix=test1/ enwiki.*.7z
Aug-29 16:51:36 INFO Starting test1/ run...
Aug-29 16:51:36 INFO Creating output file: test1/revert.tsv
Aug-29 16:51:36 INFO Creating output file: test1/reverted.tsv
Aug-29 16:51:36 INFO Prcoessing...
Aug-29 16:51:36 INFO Processing dump file enwiki-20110829-pages-meta-history.sample.xml.
|.||||..|.|....|......|........|..........|...........|.|.|.|.|...|.|||...|.|.|
This will produce two output files that can be loaded into the database: test1/revert.tsv and test1/reverted.tsv. Create the table to load into:
CREATE TABLE revert (
    rev_id        INT UNSIGNED,
    rvtto_id      INT UNSIGNED,
    is_vandalism  BOOL,
    revs_reverted INT(2)
);
Then load the output file:
$ mysqlimport --local --skip-opt halfak test1/revert.tsv
This loads the base dataset into MySQL so it can be denormalized. To denormalize, run the following statement:
CREATE TABLE revert_denorm
SELECT
    rvtg.rev_id         AS reverting_id,
    rvtg.rev_user_text  AS reverting_username,
    rvtg.rev_user       AS reverting_user_id,
    rvtg.rev_comment    AS reverting_comment,
    rvtto.rev_id        AS revertto_id,
    rvtto.rev_user_text AS revertto_username,
    rvtto.rev_user      AS revertto_user_id,
    rvtto.rev_comment   AS revertto_comment,
    revert.revs_reverted
FROM revert
LEFT JOIN revision rvtg
    ON revert.rev_id = rvtg.rev_id
LEFT JOIN revision rvtto
    ON revert.rvtto_id = rvtto.rev_id;
Finally, move the denormalized table over the original.
DROP TABLE revert;
ALTER TABLE revert_denorm RENAME TO revert;
Enjoy!
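To try the denormalization step without a MySQL server, the same join can be sketched against an in-memory SQLite database. This is only an approximation: SQLite requires `CREATE TABLE ... AS SELECT`, the `revision` table here has just the four columns the query uses, and the usernames and comments are made-up placeholder data. The second revert row deliberately has no matching `revision` rows, to show what the LEFT JOINs produce in that case.

```python
import sqlite3


def denormalize(conn):
    """Join revert against revision twice, then swap the tables."""
    conn.executescript("""
        CREATE TABLE revert_denorm AS
        SELECT
            rvtg.rev_id         AS reverting_id,
            rvtg.rev_user_text  AS reverting_username,
            rvtg.rev_user       AS reverting_user_id,
            rvtg.rev_comment    AS reverting_comment,
            rvtto.rev_id        AS revertto_id,
            rvtto.rev_user_text AS revertto_username,
            rvtto.rev_user      AS revertto_user_id,
            rvtto.rev_comment   AS revertto_comment,
            revert.revs_reverted
        FROM revert
        LEFT JOIN revision rvtg  ON revert.rev_id = rvtg.rev_id
        LEFT JOIN revision rvtto ON revert.rvtto_id = rvtto.rev_id;
        DROP TABLE revert;
        ALTER TABLE revert_denorm RENAME TO revert;
    """)


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE revision (rev_id INTEGER, rev_user INTEGER,
                           rev_user_text TEXT, rev_comment TEXT);
    CREATE TABLE revert (rev_id INTEGER, rvtto_id INTEGER,
                         is_vandalism INTEGER, revs_reverted INTEGER);
""")
# Placeholder revision rows (hypothetical users and comments).
conn.executemany("INSERT INTO revision VALUES (?, ?, ?, ?)", [
    (43, 1, "ExampleUserA", "initial text"),
    (60, 2, "ExampleUserB", "rv to earlier version"),
])
conn.executemany("INSERT INTO revert VALUES (?, ?, ?, ?)", [
    (60, 43, 0, 1),
    (309, 308, 0, 2),  # no matching revision rows in this toy database
])
denormalize(conn)
rows = conn.execute(
    "SELECT reverting_id, reverting_username, revertto_id, "
    "revertto_username, revs_reverted FROM revert ORDER BY revs_reverted"
).fetchall()
```

Because the identifiers are selected from the joined `revision` aliases rather than from `revert`, a revert whose revisions are missing from `revision` keeps its `revs_reverted` value but ends up with NULL identifiers in the denormalized table.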
Notes
The current dataset is complete up to the January 2011 dump.