Research talk:Automated classification of edit quality/Work log/2016-04-13
Wednesday, April 13, 2016
Generating prelabeled data for Hungarian and Swedish today.
$ cat datasets/huwiki.revisions_for_review.5k_2016.tsv | grep True | wc
1848 5879 38077
$ cat datasets/huwiki.revisions_for_review.5k_2016.tsv | grep reverted | wc
285 1140 7980
Looks like we don't have enough True observations for Hungarian. So, we'll need to boost that to ~40k observations. Here's the updated query: http://quarry.wmflabs.org/query/8811 Once that finishes, I'll try again. For now, let's check on Swedish.
$ wc datasets/svwiki.revisions_for_review.20k_2016.tsv
4024 14935 104654 datasets/svwiki.revisions_for_review.20k_2016.tsv
$ cat datasets/svwiki.revisions_for_review.20k_2016.tsv | grep True | wc
1523 4932 32127
$ cat datasets/svwiki.revisions_for_review.20k_2016.tsv | grep reverted | wc
286 1144 8008
Same story here. Let's boost the observations to 40k as well. Here's the updated query: http://quarry.wmflabs.org/query/8810
Now to go back to huwiki.
$ wc datasets/huwiki.revisions_for_review.5k_2016.tsv
5001 17898 123527 datasets/huwiki.revisions_for_review.5k_2016.tsv
$ cat datasets/huwiki.revisions_for_review.5k_2016.tsv | grep True | wc
2500 7895 51000
$ cat datasets/huwiki.revisions_for_review.5k_2016.tsv | grep reverted | wc
340 1360 9520
Looks good.
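For what it's worth, the three manual checks could be rolled into one small shell function. This is only a sketch: `summarize_labels` is a hypothetical name, not something in the repo, and it matches `True` and `reverted` anywhere on a line, exactly as the `grep` calls above do, without assuming anything about the column layout of the TSV.

```shell
#!/bin/sh
# Hypothetical helper (not part of the project): report the line counts that
# the manual `wc` and `grep ... | wc` checks give, in a single call.
summarize_labels() {
    file="$1"
    # Total lines in the file, including any header row.
    # `tr` strips the padding some `wc` implementations add.
    total=$(wc -l < "$file" | tr -d ' ')
    # `grep -c` prints the number of matching lines; it exits non-zero when
    # nothing matches, so `|| true` keeps that from aborting under `set -e`.
    n_true=$(grep -c 'True' "$file" || true)
    n_reverted=$(grep -c 'reverted' "$file" || true)
    printf '%s\ttotal=%s\tTrue=%s\treverted=%s\n' \
        "$file" "$total" "$n_true" "$n_reverted"
}
```

Calling `summarize_labels datasets/huwiki.revisions_for_review.5k_2016.tsv` would print one tab-separated line per file, which makes it easier to eyeball several wikis at once. The counts are the first column of the corresponding `wc` outputs, nothing more.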
Now for svwiki.
$ wc datasets/svwiki.revisions_for_review.5k_2016.tsv
5001 18097 125232 datasets/svwiki.revisions_for_review.5k_2016.tsv
$ cat datasets/svwiki.revisions_for_review.5k_2016.tsv | grep True | wc
2500 8094 52705
$ cat datasets/svwiki.revisions_for_review.5k_2016.tsv | grep reverted | wc
453 1812 12684
Cool! Ready to go. --EpochFail (talk) 17:13, 13 April 2016 (UTC)