Research talk:Revision scoring as a service/Work log/2015-05-14
Thursday, May 14, 2015
editToday, I'm trying to figure out how I can effectively reduce the size of our tasks for labeling. My plan is to identify all of the revisions that are almost certainly not vandalism. My plan is to look at the user_groups table for sysops, bots, rollbacker and patroller rights and compare those to revert rates.
So, I've taking the 20k revision datasets and ran them through ore's 'label_reverted' utility to gather which edits have been reverted. My plan is to look at the proportion of edits for each group and filter the edits by users who belong to the practically-never-reverted groups.
Note that I'm impatient, so I started this analysis when the script had only completed looking up the reverted status of 6179 of the 20k edits.
> select distinct ug_group from rev_reverted_20k_sample INNER JOIN enwiki.revision USING(rev_id) INNER JOIN enwiki.user_groups ON rev_user = ug_user WHERE wiki = 'enwiki'; +--------------------+ | ug_group | +--------------------+ | bot | | autoreviewer | | sysop | | ipblock-exempt | | reviewer | | rollbacker | | filemover | | templateeditor | | eponline | | abusefilter | | bureaucrat | | massmessage-sender | | import | | checkuser | | oversight | | accountcreator | | epcoordinator | | epcampus | | epinstructor | +--------------------+ 19 rows in set (1 min 53.37 sec)
OK. Let's scope out the rates at which these user types have their edits reverted.
> select ug_group, COUNT(*) AS revisions, SUM(reverted = "True") AS reverted, SUM(reverted = "True")/COUNT(*) AS prop from rev_reverted_20k_sample INNER JOIN enwiki.revision USING(rev_id) INNER JOIN enwiki.user_groups ON rev_user = ug_user WHERE wiki = 'enwiki' GROUP BY ug_group ORDER BY prop DESC; +--------------------+-----------+----------+--------+ | ug_group | revisions | reverted | prop | +--------------------+-----------+----------+--------+ | massmessage-sender | 18 | 1 | 0.0556 | | eponline | 75 | 3 | 0.0400 | | oversight | 28 | 1 | 0.0357 | | bot | 787 | 25 | 0.0318 | | rollbacker | 949 | 27 | 0.0285 | | checkuser | 37 | 1 | 0.0270 | | abusefilter | 85 | 2 | 0.0235 | | reviewer | 1442 | 29 | 0.0201 | | epcampus | 51 | 1 | 0.0196 | | sysop | 526 | 10 | 0.0190 | | templateeditor | 131 | 2 | 0.0153 | | autoreviewer | 1122 | 17 | 0.0152 | | ipblock-exempt | 153 | 2 | 0.0131 | | epcoordinator | 10 | 0 | 0.0000 | | filemover | 182 | 0 | 0.0000 | | import | 2 | 0 | 0.0000 | | epinstructor | 17 | 0 | 0.0000 | | accountcreator | 64 | 0 | 0.0000 | | bureaucrat | 7 | 0 | 0.0000 | +--------------------+-----------+----------+--------+ 19 rows in set (12.27 sec)
Looks like bots get reverted *a lot* more often than I thought. Let's review some of those.
> select CONCAT("https://en.wikipedia.org/wiki/?diff=", rev_id) from rev_reverted_20k_sample INNER JOIN enwiki.revision USING(rev_id) INNER JOIN enwiki.user_groups ON rev_user = ug_user WHERE wiki = 'enwiki' AND ug_group = "bot" AND reverted = "True"; +--------------------------------------------------------+ | CONCAT("https://en.wikipedia.org/wiki/?diff=", rev_id) | +--------------------------------------------------------+
- https://en.wikipedia.org/wiki/?diff=610159528
- https://en.wikipedia.org/wiki/?diff=631287010
- https://en.wikipedia.org/wiki/?diff=616939273
- https://en.wikipedia.org/wiki/?diff=607764034
- https://en.wikipedia.org/wiki/?diff=645625795
- https://en.wikipedia.org/wiki/?diff=650582641
- https://en.wikipedia.org/wiki/?diff=644879750
- https://en.wikipedia.org/wiki/?diff=610470434
- https://en.wikipedia.org/wiki/?diff=634433430
- https://en.wikipedia.org/wiki/?diff=641163961
- https://en.wikipedia.org/wiki/?diff=653989798
- https://en.wikipedia.org/wiki/?diff=637104279
- https://en.wikipedia.org/wiki/?diff=624527434
- https://en.wikipedia.org/wiki/?diff=642540479
- https://en.wikipedia.org/wiki/?diff=645491547
- https://en.wikipedia.org/wiki/?diff=616937755
- https://en.wikipedia.org/wiki/?diff=614474379
- https://en.wikipedia.org/wiki/?diff=619444774
- https://en.wikipedia.org/wiki/?diff=633824749
- https://en.wikipedia.org/wiki/?diff=615208956
- https://en.wikipedia.org/wiki/?diff=655466589
- https://en.wikipedia.org/wiki/?diff=627909158
- https://en.wikipedia.org/wiki/?diff=630177778
- https://en.wikipedia.org/wiki/?diff=605125433
- https://en.wikipedia.org/wiki/?diff=640027043
25 rows in set (0.02 sec)
Yeah. None of those were damaging. They look like regular bot activities for the most part. They sometimes get caught up in reverts of other vandalism. I bet that this is also how rollbackers get reverted
> select CONCAT("https://en.wikipedia.org/wiki/?diff=", rev_id) from rev_reverted_20k_sample INNER JOIN enwiki.revision USING(rev_id) INNER JOIN enwiki.user_groups ON rev_user = ug_user WHERE wiki = 'enwiki' AND ug_group = "rollbacker" AND reverted = "True"; +--------------------------------------------------------+ | CONCAT("https://en.wikipedia.org/wiki/?diff=", rev_id) | +--------------------------------------------------------+
- https://en.wikipedia.org/wiki/?diff=607764034
- https://en.wikipedia.org/wiki/?diff=630029440
- https://en.wikipedia.org/wiki/?diff=607495217
- https://en.wikipedia.org/wiki/?diff=618993453
- https://en.wikipedia.org/wiki/?diff=610470434
- https://en.wikipedia.org/wiki/?diff=604468936
- https://en.wikipedia.org/wiki/?diff=633622555
- https://en.wikipedia.org/wiki/?diff=632698565
- https://en.wikipedia.org/wiki/?diff=653411592
- https://en.wikipedia.org/wiki/?diff=633562045
- https://en.wikipedia.org/wiki/?diff=640998141
- https://en.wikipedia.org/wiki/?diff=615592915
- https://en.wikipedia.org/wiki/?diff=624527434
- https://en.wikipedia.org/wiki/?diff=645491547
- https://en.wikipedia.org/wiki/?diff=655033146
- https://en.wikipedia.org/wiki/?diff=615208956
- https://en.wikipedia.org/wiki/?diff=647808597
- https://en.wikipedia.org/wiki/?diff=650442033
- https://en.wikipedia.org/wiki/?diff=639009718
- https://en.wikipedia.org/wiki/?diff=623387475
- https://en.wikipedia.org/wiki/?diff=618309794
- https://en.wikipedia.org/wiki/?diff=645127720
- https://en.wikipedia.org/wiki/?diff=637571432
- https://en.wikipedia.org/wiki/?diff=653758345
- https://en.wikipedia.org/wiki/?diff=635947023
- https://en.wikipedia.org/wiki/?diff=625493402
- https://en.wikipedia.org/wiki/?diff=629187946
+--------------------------------------------------------+ 27 rows in set (1.07 sec)
Yup. Same story here. It looks like we can at least remove these guys. OK. I think that this makes sense for enwiki. I'd like to filter:
- oversight
- bot
- rollbacker
- checkuser
- abusefilter
- reviewer
- sysop
- templateeditor
- autoreviewer
- ipblock-exempt
- filemover
- import
- accountcreator
- bureaucrat
I wonder what proportion that will get.
> SELECT COUNT(*) FROM rev_reverted_20k_sample WHERE wiki = "enwiki"; +----------+ | COUNT(*) | +----------+ | 6179 | +----------+ 1 row in set (0.01 sec) > SELECT COUNT(DISTINCT rev_id) FROM rev_reverted_20k_sample INNER JOIN enwiki.revision USING(rev_id) INNER JOIN enwiki.user_groups ON rev_user = ug_user WHERE wiki = 'enwiki' AND ug_group IN ("oversight", "bot", "rollbacker", "checkuser", "abusefilter", "reviewer", "sysop", "templateeditor", "autoreviewer", "ipblock-exempt", "filemover", "import", "accountcreator", "bureaucrat"); +------------------------+ | COUNT(DISTINCT rev_id) | +------------------------+ | 2976 | +------------------------+ 1 row in set (1 min 20.34 sec) > select 2976/6179; +-----------+ | 2976/6179 | +-----------+ | 0.4816 | +-----------+ 1 row in set (0.01 sec)
Well, that looks like it's going to get rid of about 50% of the work. That's pretty substantial.
It'll be harder for me to check the other wikis since I'm not a native speaker, but if we see similar proportions of reverted edits there, then I think we should be good to go-ahead.
Portuguese
edit> select ug_group, COUNT(*) AS revisions, SUM(reverted = "True") AS reverted, SUM(reverted = "True")/COUNT(*) AS prop from rev_reverted_20k_sample INNER JOIN ptwiki.revision USING(rev_id) INNER JOIN ptwiki.user_groups ON rev_user = ug_user WHERE wiki = 'ptwiki' GROUP BY ug_group ORDER BY prop DESC; +------------------+-----------+----------+--------+ | ug_group | revisions | reverted | prop | +------------------+-----------+----------+--------+ | interface-editor | 51 | 3 | 0.0588 | | ipblock-exempt | 34 | 2 | 0.0588 | | epinstructor | 97 | 5 | 0.0515 | | bureaucrat | 466 | 17 | 0.0365 | | checkuser | 336 | 12 | 0.0357 | | eliminator | 412 | 13 | 0.0316 | | rollbacker | 2662 | 80 | 0.0301 | | sysop | 2483 | 59 | 0.0238 | | epcoordinator | 273 | 6 | 0.0220 | | autoreviewer | 3704 | 81 | 0.0219 | | oversight | 96 | 2 | 0.0208 | | bot | 3954 | 13 | 0.0033 | | eponline | 5 | 0 | 0.0000 | | epcampus | 6 | 0 | 0.0000 | +------------------+-----------+----------+--------+ 14 rows in set (1 min 57.71 sec)
Turkish
edit> select ug_group, COUNT(*) AS revisions, SUM(reverted = "True") AS reverted, SUM(reverted = "True")/COUNT(*) AS prop from rev_reverted_20k_sample INNER JOIN trwiki.revision USING(rev_id) INNER JOIN trwiki.user_groups ON rev_user = ug_user WHERE wiki = 'trwiki' GROUP BY ug_group ORDER BY prop DESC; +------------+-----------+----------+--------+ | ug_group | revisions | reverted | prop | +------------+-----------+----------+--------+ | sysop | 720 | 22 | 0.0306 | | patroller | 3052 | 59 | 0.0193 | | technician | 639 | 10 | 0.0156 | | autoreview | 937 | 12 | 0.0128 | | bureaucrat | 181 | 2 | 0.0110 | | bot | 1649 | 1 | 0.0006 | | checkuser | 29 | 0 | 0.0000 | | oversight | 21 | 0 | 0.0000 | +------------+-----------+----------+--------+ 8 rows in set (48.72 sec)
Persian
edit> select ug_group, COUNT(*) AS revisions, SUM(reverted = "True") AS reverted, SUM(reverted = "True")/COUNT(*) AS prop from rev_reverted_20k_sample INNER JOIN fawiki.revision USING(rev_id) INNER JOIN fawiki.user_groups ON rev_user = ug_user WHERE wiki = 'fawiki' GROUP BY ug_group ORDER BY prop DESC; +----------------+-----------+----------+--------+ | ug_group | revisions | reverted | prop | +----------------+-----------+----------+--------+ | templateeditor | 53 | 5 | 0.0943 | | epcampus | 61 | 5 | 0.0820 | | epcoordinator | 61 | 5 | 0.0820 | | eponline | 74 | 5 | 0.0676 | | ipblock-exempt | 96 | 5 | 0.0521 | | epinstructor | 165 | 6 | 0.0364 | | OTRS-member | 153 | 5 | 0.0327 | | sysop | 479 | 15 | 0.0313 | | rollbacker | 559 | 13 | 0.0233 | | bureaucrat | 102 | 2 | 0.0196 | | Image-reviewer | 115 | 2 | 0.0174 | | autopatrol | 1480 | 23 | 0.0155 | | uploader | 7309 | 81 | 0.0111 | | patroller | 4092 | 35 | 0.0086 | | botadmin | 3278 | 23 | 0.0070 | | abusefilter | 2643 | 16 | 0.0061 | | bot | 5301 | 27 | 0.0051 | | eliminator | 127 | 0 | 0.0000 | +----------------+-----------+----------+--------+ 18 rows in set (44.21 sec)
End!
editOK. That's all for now. Ping User:He7d3r, User:とある白い猫, User:Ladsgroup. Please review and confirm whether you think it makes sense to filter edits from users by user_group and if so, give me a list of user_groups that you think make sense. --EpochFail (talk) 04:28, 14 May 2015 (UTC)
- I filtered out bots and I think it would be pretty helpful for us but about other groups I doubt that. Lots of old users and sysops made vandalism when they were newbies. Considering those vandalism as ok edits because the user made it is sysop now, damages our work. Check out proportion of reverted edits in groups like 'crats in pt.wp. Amir (talk) 12:59, 14 May 2015 (UTC)
- @Amir: The different proportion might be just because we only got 7 edits by bureaucrats on enwiki, versus 466 on ptwiki.
- I wonder if we get only 81/3704 reverts in the autoreviewer group because they make less questionable edits, or if patrollers tend to ignore autoreviewers more then they should...
- Also, what if sysop edits get more reverts just because they are the only users dealing with vandalism from people who like to revert the reverts?
- For now, I would remove only bots from ptwiki. Helder 13:57, 14 May 2015 (UTC)
Follow-up
editI want to quickly check how many edits we can pre-label if we only look at bots. Generally, I think that is too conservative, but it would be good to know if it is still going to gather the majority.
enwiki: 1301/10297 = 12.6%
|
---|
> select count(*) from rev_reverted_20k_sample where wiki = "enwiki"; +----------+ | count(*) | +----------+ | 10297 | +----------+ 1 row in set (0.01 sec) > SELECT COUNT(DISTINCT rev_id) FROM rev_reverted_20k_sample INNER JOIN enwiki.revision USING(rev_id) INNER JOIN enwiki.user_groups ON rev_user = ug_user WHERE wiki = 'enwiki' AND ug_group = "bot"; +------------------------+ | COUNT(DISTINCT rev_id) | +------------------------+ | 1301 | +------------------------+ 1 row in set (3 min 8.08 sec) |
ptwiki: 3954/19989 = 19.8%
|
---|
> SELECT COUNT(DISTINCT rev_id) FROM rev_reverted_20k_sample INNER JOIN ptwiki.revision USING(rev_id) INNER JOIN ptwiki.user_groups ON rev_user = ug_user WHERE wiki = 'ptwiki' AND ug_group = "bot"; +------------------------+ | COUNT(DISTINCT rev_id) | +------------------------+ | 3954 | +------------------------+ 1 row in set (2 min 18.47 sec) mysql:research@analytics-store.eqiad.wmnet [staging]> select 3954/19989; +------------+ | 3954/19989 | +------------+ | 0.1978 | +------------+ 1 row in set (0.00 sec) |
trwiki: 1649/8792 = 18.8%
|
---|
> select count(*) from rev_reverted_20k_sample where wiki = "trwiki"; +----------+ | count(*) | +----------+ | 8792 | +----------+ 1 row in set (0.00 sec) > SELECT COUNT(DISTINCT rev_id) FROM rev_reverted_20k_sample INNER JOIN trwiki.revision USING(rev_id) INNER JOIN trwiki.user_groups ON rev_user = ug_user WHERE wiki = 'trwiki' AND ug_group = "bot"; +------------------------+ | COUNT(DISTINCT rev_id) | +------------------------+ | 1649 | +------------------------+ 1 row in set (55.75 sec) |
fawiki: 5301/8211 = 64.6%
|
---|
> select count(*) from rev_reverted_20k_sample where wiki = "fawiki";+----------+ | count(*) | +----------+ | 8211 | +----------+ 1 row in set (0.00 sec) > SELECT COUNT(DISTINCT rev_id) FROM rev_reverted_20k_sample INNER JOIN fawiki.revision USING(rev_id) INNER JOIN fawiki.user_groups ON rev_user = ug_user WHERE wiki = 'fawiki' AND ug_group = "bot"; +------------------------+ | COUNT(DISTINCT rev_id) | +------------------------+ | 5301 | +------------------------+ 1 row in set (51.85 sec) |
Wow! There are *a lot* of bot edits in fawiki! --EpochFail (talk) 16:22, 14 May 2015 (UTC)
- Ping User:He7d3r, User:とある白い猫, User:Ladsgroup. What do you think? --EpochFail (talk) 16:24, 14 May 2015 (UTC)
- Persian Wikipedia relies heavily on bots to do almost anything. This wiki has four (or three) admin bots but it has only 30 human admins ;)Amir (talk) 17:06, 14 May 2015 (UTC)
- I would go with bots for now (even they can cause wrong labelling of revisions, due to bot errors - e.g. Salebot sometimes reverts some good-faith edits - no idea how often that happens). On the other hand, if the idea is that the pre-labelling would just make these edits to not appear in the queue of tasks until the other (more important) revisions are labelled by humans, then I see no problem in extending the list to bot+autoreviewer+sysop+bureaucrat+rollbacker+checkuser+eliminator+oversight for example. Helder 18:29, 14 May 2015 (UTC)
- I noticed the sudden jump from ~500 labels to ~11000 on ptwiki, so I assume this is now done? Helder 16:20, 16 May 2015 (UTC)
For Persian I suggest: templateeditor+OTRS-member+sysop+rollbacker+bureaucrat+Image-reviewer+autopatrol+patroller+botadmin+abusefilter+bot+eliminator Amir (talk) 11:56, 16 May 2015 (UTC)