Research talk:Revision scoring as a service/Work log/2016-02-07

Sunday, February 07, 2016


Today, I'm trying to label 500K human edits in Wikidata. I'm querying using this:

SELECT
  rev_id,
  trusted.ug_user IS NOT NULL AS trusted_user
FROM revision
LEFT JOIN user ON rev_user = user_id
LEFT JOIN user_groups trusted ON
  trusted.ug_user = rev_user AND
  trusted.ug_group IN (
    'bureaucrat', 'checkuser', 'flood', 'ipblock-exempt',
    'oversight', 'property-creator', 'rollbacker', 'steward',
    'sysop', 'translationadmin', 'wikidata-staff'
  )
LEFT JOIN user_groups bot ON
  bot.ug_user = rev_user AND
  bot.ug_group = 'bot'
WHERE
  rev_timestamp BETWEEN "2015" AND "2016" AND
  bot.ug_user IS NULL
ORDER BY RAND()
LIMIT 500000;

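I rewrote the above query so that I could more easily pick out the edits that we'd need to check for *reverted* status. Each row now gets an exclusion_criteria column, which is NULL when none of the exclusions apply. Here's my updated version: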

SELECT
  rev_id,
  IF(rev_comment RLIKE '/\* clientsitelink-(remove|update):', "client_edit",
    IF(rev_comment RLIKE '/\* wbmergeitems-(to|from):', "merge_edit",
    IF(trusted.ug_user IS NOT NULL, "trusted_user",
    IF(user.user_editcount IS NOT NULL AND user.user_editcount >= 1000, "trusted_edits", NULL
  )))) AS exclusion_criteria
FROM revision
LEFT JOIN user ON rev_user = user_id
LEFT JOIN user_groups trusted ON
  trusted.ug_user = rev_user AND
  trusted.ug_group IN (
    'bureaucrat', 'checkuser', 'flood', 'ipblock-exempt',
    'oversight', 'property-creator', 'rollbacker', 'steward',
    'sysop', 'translationadmin', 'wikidata-staff'
  )
LEFT JOIN user_groups bot ON
  bot.ug_user = rev_user AND
  bot.ug_group = 'bot'
WHERE
  rev_timestamp BETWEEN "2015" AND "2016" AND
  bot.ug_user IS NULL
ORDER BY RAND()
LIMIT 500000;

So, I think that I'll just use grep to filter this when passing it to label_reverted (e.g. grep -P "[0-9]+\tNULL" | editquality label_reverted ...). I see that Ladsgroup started some work on filtering "reverted" status by the reverting edit's comment. See https://github.com/wiki-ai/editquality/pull/14
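To make the grep step concrete, here is a minimal sketch run against a tiny made-up sample. The file name example_sample.tsv and its four rows are hypothetical; only the grep pattern and the two-column layout (rev_id, exclusion_criteria) come from the query above. It assumes GNU grep, since -P is not available everywhere.

```shell
# Hypothetical sample in the query's output format:
# rev_id <TAB> exclusion_criteria
printf '101\tNULL\n102\ttrusted_user\n103\tclient_edit\n104\tNULL\n' > example_sample.tsv

# Keep only the rows where no exclusion criterion applied; these are
# the revisions that would still need a reverted-status check.
needs_check=$(grep -P "[0-9]+\tNULL" example_sample.tsv)
echo "$needs_check"
```

In the real pipeline the output of grep would be piped on to label_reverted instead of being captured in a variable.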

While the query is running, I'll start working on the code to get it up to spec. :) --EpochFail (talk) 16:45, 7 February 2016 (UTC)

Labeling the sample

$ cat datasets/revision_sample.nonbot_with_exclusions.500k_2015.tsv | grep -P "[0-9]+\tNULL" | wc
  22628   45256  339420
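The three numbers from wc are lines, words, and bytes, so the first one is the count of revisions left after filtering. A similar one-liner can show how many rows fall under each exclusion criterion. This is only a sketch on made-up data: the file name example_breakdown.tsv and its rows are hypothetical, and it was not run on the real sample.

```shell
# Hypothetical sample in the same two-column format:
# rev_id <TAB> exclusion_criteria
printf '101\tNULL\n102\ttrusted_user\n103\tclient_edit\n104\tNULL\n105\ttrusted_user\n106\tNULL\n' > example_breakdown.tsv

# Count rows per value of the second column, printed as
# criterion <TAB> count.
breakdown=$(cut -f2 example_breakdown.tsv | sort | uniq -c | awk '{print $2 "\t" $1}')
echo "$breakdown"
```

The order of the output lines depends on the locale used by sort, so it is safer to look rows up by name than by position.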

So, we have 22.6k revisions that need to be labeled. User:Ladsgroup and I just beefed up the label_reverted script so that we can exclude "reverts" that come from client edits. Now, it's time to run it on this set of edits. --EpochFail (talk) 18:27, 7 February 2016 (UTC)
