Research talk:Revision scoring as a service/Work log/2016-02-07
Sunday, February 07, 2016
Today, I'm trying to label 500K human edits in Wikidata. I'm querying using this:
SELECT
  rev_id,
  trusted.ug_user IS NOT NULL AS trusted_user
FROM revision
LEFT JOIN user ON rev_user = user_id
-- Users in any of the groups that we treat as trusted
LEFT JOIN user_groups trusted ON
  trusted.ug_user = rev_user AND
  trusted.ug_group IN (
    'bureaucrat', 'checkuser', 'flood', 'ipblock-exempt',
    'oversight', 'property-creator', 'rollbacker', 'steward',
    'sysop', 'translationadmin', 'wikidata-staff'
  )
-- Bot accounts; these are dropped by the WHERE clause below
LEFT JOIN user_groups bot ON
  bot.ug_user = rev_user AND
  bot.ug_group = 'bot'
WHERE
  rev_timestamp BETWEEN "2015" AND "2016" AND
  bot.ug_user IS NULL
ORDER BY RAND()
LIMIT 500000;
I re-wrote the above query so that I could more easily filter edits that we'd need to check for *reverted* status. Here's my updated version:
SELECT
  rev_id,
  -- The first matching rule wins; NULL means the edit still needs a reverted check
  IF(rev_comment RLIKE '/\* clientsitelink-(remove|update):', "client_edit",
  IF(rev_comment RLIKE '/\* wbmergeitems-(to|from):', "merge_edit",
  IF(trusted.ug_user IS NOT NULL, "trusted_user",
  IF(user.user_editcount IS NOT NULL AND user.user_editcount >= 1000, "trusted_edits", NULL
  )))) AS exclusion_criteria
FROM revision
LEFT JOIN user ON rev_user = user_id
-- Users in any of the groups that we treat as trusted
LEFT JOIN user_groups trusted ON
  trusted.ug_user = rev_user AND
  trusted.ug_group IN (
    'bureaucrat', 'checkuser', 'flood', 'ipblock-exempt',
    'oversight', 'property-creator', 'rollbacker', 'steward',
    'sysop', 'translationadmin', 'wikidata-staff'
  )
-- Bot accounts; these are dropped by the WHERE clause below
LEFT JOIN user_groups bot ON
  bot.ug_user = rev_user AND
  bot.ug_group = 'bot'
WHERE
  rev_timestamp BETWEEN "2015" AND "2016" AND
  bot.ug_user IS NULL
ORDER BY RAND()
LIMIT 500000;
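The nested IF() calls in the query above amount to a first-match-wins classification. As a rough illustration only, here is a minimal Python sketch of the same ordering. The function name `exclusion_criteria` and its arguments are hypothetical (they are not part of editquality); the patterns and the 1000-edit threshold are taken from the query.

```python
import re

# Patterns adapted from the RLIKE clauses in the query; they are meant to
# match the "/* clientsitelink-..." and "/* wbmergeitems-..." comment prefixes.
CLIENT_EDIT_RE = re.compile(r"/\* clientsitelink-(remove|update):")
MERGE_EDIT_RE = re.compile(r"/\* wbmergeitems-(to|from):")

TRUSTED_EDIT_COUNT = 1000  # threshold used in the query


def exclusion_criteria(comment, in_trusted_group, user_editcount):
    """Hypothetical helper: return the first matching exclusion label,
    or None when the revision still needs its reverted status checked."""
    comment = comment or ""  # a NULL comment matches neither pattern
    if CLIENT_EDIT_RE.search(comment):
        return "client_edit"
    if MERGE_EDIT_RE.search(comment):
        return "merge_edit"
    if in_trusted_group:
        return "trusted_user"
    if user_editcount is not None and user_editcount >= TRUSTED_EDIT_COUNT:
        return "trusted_edits"
    return None
```

Because the checks run in order, a client edit made by a trusted user is reported as "client_edit", just as in the SQL.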
So, I think that I'll just use grep to filter the query output when passing it to label_reverted (e.g. grep -P "[0-9]+\tNULL" | editquality label_reverted ...).
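For anyone who would rather do this filtering step in Python than in grep, a minimal sketch follows. It assumes, as the grep pattern does, that the query output is a TSV in which a SQL NULL is written as the literal text NULL. The name `needs_labeling` is hypothetical.

```python
import re

# Same pattern as the grep -P call: a run of digits, a tab, then NULL.
# Like grep, this is an unanchored search over each line.
NEEDS_LABEL_RE = re.compile(r"[0-9]+\tNULL")


def needs_labeling(lines):
    """Hypothetical helper: yield only the rows with no exclusion criteria."""
    for line in lines:
        if NEEDS_LABEL_RE.search(line):
            yield line


# Made-up rows for illustration; these are not real revision IDs.
sample = [
    "rev_id\texclusion_criteria\n",
    "101\tNULL\n",
    "102\tclient_edit\n",
    "103\ttrusted_edits\n",
    "104\tNULL\n",
]
kept = list(needs_labeling(sample))
```

The header row is dropped as a side effect, since no digit precedes its tab.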
I see that Ladsgroup started some work for filtering "reverted" status by the reverting edit's comment. See https://github.com/wiki-ai/editquality/pull/14
While the query is running, I'll start working on the code now to get it up to spec. :) --EpochFail (talk) 16:45, 7 February 2016 (UTC)
Labeling the sample
$ cat datasets/revision_sample.nonbot_with_exclusions.500k_2015.tsv | grep -P "[0-9]+\tNULL" | wc
22628 45256 339420
So, we have 22.6k revisions that need to be labeled. User:Ladsgroup and I just beefed up the label_reverted script so that we can exclude "reverts" that come from client edits. Now, it's time to run it on this set of edits. --EpochFail (talk) 18:27, 7 February 2016 (UTC)
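The idea of ignoring "reverts" that come from client edits can be illustrated with a small check on the reverting edit's comment. This is only a sketch under assumptions, not the actual label_reverted change: it assumes client edits are recognized by the same clientsitelink comment prefix used in the sampling query, and the name `counts_as_revert` is hypothetical.

```python
import re

# Assumption: client-driven edits carry the same comment prefix that the
# sampling query uses for its "client_edit" exclusion.
CLIENT_EDIT_RE = re.compile(r"/\* clientsitelink-(remove|update):")


def counts_as_revert(reverting_comment):
    """Hypothetical helper: False when the reverting edit looks like a
    client edit, True otherwise (including when there is no comment)."""
    return not CLIENT_EDIT_RE.search(reverting_comment or "")
```

With a check like this, a sitelink update that happens to restore an earlier state of an item would not cause the edits before it to be labeled as reverted.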