Research talk:Automated classification of edit quality/Work log/2016-02-28
Sunday, February 28, 2016
I should start keeping my edit quality-related work logs here. I'm going to start with my work to extract a labeling set for the Norwegian, Hebrew, and Vietnamese Wikipedias. For each wiki, I'm trying to build a set of 2.5k "needs review" and 2.5k "trusted" edits for labeling.
OK. So first of all, I need to pre-label enough edits that we get at least 2.5k that "need review". I usually start with a random sample of 20k edits, but some wikis are so dominated by bots and privileged users that I need bigger samples. In this case, nowiki and viwiki needed 100k-edit samples, while 20k was enough for hewiki. Here's how the pre-labeling process worked out:
$ cat datasets/nowiki.prelabeled_revisions.100k_2015.tsv | grep True | wc
   7351   23881  155890
$ cat datasets/nowiki.prelabeled_revisions.100k_2015.tsv | grep False | wc
  92642  370568 2686618
$ cat datasets/viwiki.prelabeled_revisions.100k_2015.tsv | grep True | wc
   8141   25911  167614
$ cat datasets/viwiki.prelabeled_revisions.100k_2015.tsv | grep False | wc
  91849  367396 2663621
$ cat datasets/hewiki.prelabeled_revisions.20k_2015.tsv | grep True | wc
   4166   13401   87151
$ cat datasets/hewiki.prelabeled_revisions.20k_2015.tsv | grep False | wc
  15798   63192  458142
wiki | need review | reverted | trusted |
---|---|---|---|
nowiki | 7351 (7.4%) | 1597 (1.6%) | 92642 (92.6%) |
viwiki | 8141 (8.1%) | 1031 (1.0%) | 91849 (91.9%) |
hewiki | 4166 (20.9%) | 773 (3.9%) | 15798 (79.1%) |
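As a rough check on why 20k edits was not enough for nowiki and viwiki, the arithmetic can be sketched as below. The counts are copied from the `wc` output above; the function names and the "expected count" estimate are illustrative assumptions, not part of the actual pre-labeling pipeline.

```python
import math

# (needs_review, trusted) line counts per wiki, copied from the wc output above.
counts = {
    "nowiki": (7351, 92642),
    "viwiki": (8141, 91849),
    "hewiki": (4166, 15798),
}

TARGET = 2500  # "needs review" edits wanted per wiki


def needs_review_rate(needs_review, trusted):
    """Share of pre-labeled edits that were flagged as needing review."""
    return needs_review / (needs_review + trusted)


def min_sample_size(rate, target=TARGET):
    """Smallest sample whose *expected* number of flagged edits reaches target.

    This is only an expectation; a real sample should leave some headroom.
    """
    return math.ceil(target / rate)


for wiki, (flagged, trusted) in counts.items():
    rate = needs_review_rate(flagged, trusted)
    print(f"{wiki}: {rate:.1%} need review; "
          f"about {min_sample_size(rate)} edits needed for {TARGET}")
```

Under this estimate, a 20k sample comfortably covers hewiki but falls well short for nowiki and viwiki, which matches the decision to draw 100k edits for those two wikis.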
OK. Now to generate the 5k sets of 2.5/2.5k needing review/trusted. Here's the basic pattern demonstrated on nowiki's prelabeled set:
(printf 'rev_id\tneeds_review\treason\n'; \
 (cat datasets/nowiki.prelabeled_revisions.100k_2015.tsv | grep True | \
  shuf -n 2500; \
  cat datasets/nowiki.prelabeled_revisions.100k_2015.tsv | grep False | \
  shuf -n 2500 \
 ) | shuf \
) > datasets/nowiki.revisions_to_review.5k_2015.tsv
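The same balanced-sampling pattern can be sketched in Python. This is a minimal illustration, not the command that was actually run: it assumes each pre-labeled row is a `(rev_id, needs_review, reason)` tuple with `needs_review` stored as the string `True` or `False`, and the function name `balanced_sample` is hypothetical. Unlike `grep True`, it matches on the label column only, so a reason that happens to contain the word "True" cannot leak into the wrong class.

```python
import random


def balanced_sample(rows, n_per_class, seed=None):
    """Draw n_per_class flagged and n_per_class trusted rows, then shuffle.

    rows: iterable of (rev_id, needs_review, reason) tuples, where
    needs_review is the string "True" or "False" (assumed layout).
    Raises ValueError if either class has fewer than n_per_class rows.
    """
    rows = list(rows)
    rng = random.Random(seed)
    flagged = [r for r in rows if r[1] == "True"]
    trusted = [r for r in rows if r[1] == "False"]
    sample = rng.sample(flagged, n_per_class) + rng.sample(trusted, n_per_class)
    rng.shuffle(sample)  # interleave the two classes, like the final `shuf`
    return sample


if __name__ == "__main__":
    # Synthetic stand-in for a pre-labeled file: 10% of rows are flagged.
    demo = [(str(i), "True" if i % 10 == 0 else "False", "demo") for i in range(1000)]
    picked = balanced_sample(demo, 50, seed=0)
    print(len(picked), sum(1 for r in picked if r[1] == "True"))
```

The `ValueError` from `random.sample` is a useful guard here: it fails loudly when a sample is too small to supply 2.5k "needs review" edits, whereas `shuf -n 2500` silently returns fewer lines.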
And here are the three datasets:
$ wc *.revisions_to_review.*
  5001  18048 124856 hewiki.revisions_to_review.5k_2015.tsv
  5001  18107 125387 nowiki.revisions_to_review.5k_2015.tsv
  5001  17964 124035 viwiki.revisions_to_review.5k_2015.tsv
 15003  54119 374278 total
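The 5001 lines per file are one header plus 5000 rows, but `wc` alone does not confirm the 2.5k/2.5k split. A quick check could look like the sketch below. It assumes the `rev_id`, `needs_review`, `reason` header written when the files were generated; the helper name `label_counts` is hypothetical.

```python
import csv
import io
from collections import Counter


def label_counts(tsv_file):
    """Count rows per needs_review value in a tab-separated file with a header."""
    # QUOTE_NONE keeps stray quote characters in the reason column literal.
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    return Counter(row["needs_review"] for row in reader)


if __name__ == "__main__":
    # Tiny in-memory stand-in; a real run would pass
    # open("hewiki.revisions_to_review.5k_2015.tsv", newline="") instead.
    demo = io.StringIO(
        "rev_id\tneeds_review\treason\n"
        "1\tTrue\tdemo reason\n"
        "2\tFalse\tdemo reason\n"
    )
    print(label_counts(demo))
```

For a correctly built file, both the `True` and `False` counts should come out at 2500.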
Now to load them into Wiki labels, I'll need to make sure all the language assets are in order. --EpochFail (talk) 20:35, 28 February 2016 (UTC)
Updating UI.
Here's the pull request for updating the Wikilabels UI: https://github.com/wiki-ai/wikilabels/pull/95 --EpochFail (talk) 20:43, 28 February 2016 (UTC)
Here's the pull request for updating the damaging_and_goodfaith form: https://github.com/wiki-ai/wikilabels-wikimedia-config/pull/11 --EpochFail (talk) 20:48, 28 February 2016 (UTC)
Loading the campaigns
OK. Looks like we've deployed successfully. Now I'm back to loading the campaigns into the database.
u_wikilabels=> INSERT INTO campaign (name, wiki, form, view, created, labels_per_task, tasks_per_assignment, active) VALUES ('איכות ערוכה ( 5k מאוזן )', 'hewiki', 'damaging_and_goodfaith', 'DiffToPrevious', NOW(), 1, 50, True);
INSERT 0 1
u_wikilabels=> INSERT INTO campaign (name, wiki, form, view, created, labels_per_task, tasks_per_assignment, active) VALUES ('Sửa chất lượng ( 5k cân bằng)', 'viwiki', 'damaging_and_goodfaith', 'DiffToPrevious', NOW(), 1, 50, True);
INSERT 0 1
u_wikilabels=> INSERT INTO campaign (name, wiki, form, view, created, labels_per_task, tasks_per_assignment, active) VALUES ('Edit kvalitet ( 5k balansert)', 'nowiki', 'damaging_and_goodfaith', 'DiffToPrevious', NOW(), 1, 50, True);
INSERT 0 1
u_wikilabels=> SELECT id, name, wiki FROM campaign WHERE wiki IN ('hewiki', 'nowiki', 'viwiki');
 id |             name              |  wiki
----+-------------------------------+--------
 25 | איכות ערוכה ( 5k מאוזן )      | hewiki
 26 | Sửa chất lượng ( 5k cân bằng) | viwiki
 27 | Edit kvalitet ( 5k balansert) | nowiki
(3 rows)
OK. Time to do some loading.
halfak@wikilabels-01:~/datasets$ cat hewiki.revisions_to_review.5k_2015.tsv | /srv/wikilabels/venv/bin/wikilabels task_inserts 25 | psql -h wikilabels-database --user u_wikilabels u_wikilabels -W
Password for user u_wikilabels:
INSERT 0 5000
halfak@wikilabels-01:~/datasets$ cat viwiki.revisions_to_review.5k_2015.tsv | /srv/wikilabels/venv/bin/wikilabels task_inserts 26 | psql -h wikilabels-database --user u_wikilabels u_wikilabels -W
Password for user u_wikilabels:
INSERT 0 5000
halfak@wikilabels-01:~/datasets$ cat nowiki.revisions_to_review.5k_2015.tsv | /srv/wikilabels/venv/bin/wikilabels task_inserts 27 | psql -h wikilabels-database --user u_wikilabels u_wikilabels -W
Password for user u_wikilabels:
INSERT 0 5000
OK. We should be good to go.
- no:Wikipedia:Etiketter (confirmed)
- he:ויקיפדיה:סיווג עריכות (confirmed, but with some encoding issues; see Phab:T128339)
- vi:Wikipedia:Nhãn (confirmed)
I'm declaring victory for today. --EpochFail (talk) 21:30, 28 February 2016 (UTC)