Research talk:Revision scoring as a service/Work log/2016-01-31
Sunday, January 31, 2016
Got some of the edits labeled for vandalism (488536, so about 50%) and got impatient, so I had an early look. I randomized the order of the edits being processed, so this should be roughly representative.
> select anon_user, trusted_user, trusted_edits, client_edit, merge_edit, COUNT(*) AS edits, SUM(reverted) AS reverted, SUM(reverted)/COUNT(*) AS prop FROM wikidata_nonbot_reverted_sample GROUP BY 1,2,3,4,5 ORDER BY edits;
+-----------+--------------+---------------+-------------+------------+--------+----------+--------+
| anon_user | trusted_user | trusted_edits | client_edit | merge_edit | edits  | reverted | prop   |
+-----------+--------------+---------------+-------------+------------+--------+----------+--------+
|         0 |            1 |             0 |           0 |          1 |      4 |        0 | 0.0000 |
|         0 |            1 |             0 |           1 |          0 |      9 |        0 | 0.0000 |
|         1 |            0 |             0 |           0 |          1 |     22 |        0 | 0.0000 |
|         0 |            1 |             0 |           0 |          0 |     34 |        0 | 0.0000 |
|         0 |            1 |             1 |           1 |          0 |    414 |        1 | 0.0024 |
|         0 |            0 |             0 |           0 |          1 |    866 |        8 | 0.0092 |
|         0 |            1 |             1 |           0 |          1 |   3355 |       10 | 0.0030 |
|         0 |            0 |             1 |           0 |          1 |   3994 |       20 | 0.0050 |
|         0 |            0 |             0 |           1 |          0 |   4012 |       64 | 0.0160 |
|         0 |            0 |             1 |           1 |          0 |   5664 |       44 | 0.0078 |
|         1 |            0 |             0 |           0 |          0 |   6914 |      499 | 0.0722 |
|         0 |            0 |             0 |           0 |          0 |  15546 |      123 | 0.0079 |
|         0 |            1 |             1 |           0 |          0 | 195032 |      222 | 0.0011 |
|         0 |            0 |             1 |           0 |          0 | 252670 |      891 | 0.0035 |
+-----------+--------------+---------------+-------------+------------+--------+----------+--------+
14 rows in set (0.42 sec)
It looks like anonymous edits clearly have the highest revert probability. Second, it looks like client edits show up here, but I can't see how they are vandalism or would even need to be reviewed. We'll want to think about those, but I think we can just exclude them by default. If they are vandalism, they are vandalism on the originating wiki. Merges are also an interesting case here, but it's hard to see what's going on with all of these dimensions broken out, so let's look at them by themselves.
Merge edits
> select merge_edit, COUNT(*) AS edits, SUM(reverted) AS reverted, SUM(reverted)/COUNT(*) AS prop FROM wikidata_nonbot_reverted_sample WHERE NOT client_edit GROUP BY merge_edit;
+------------+--------+----------+--------+
| merge_edit | edits  | reverted | prop   |
+------------+--------+----------+--------+
|          0 | 470196 |     1735 | 0.0037 |
|          1 |   8241 |       38 | 0.0046 |
+------------+--------+----------+--------+
2 rows in set (0.30 sec)
It looks like merge edits are "reverted" at about the same rate as regular edits. Luckily, only 38 of them were reverted in the entire set (!!!) so we can review those manually.
- wikidata:Special:Diff/199024218 -- Good faith mistake -- Should have been redirected instead of merged.
- wikidata:Special:Diff/195570481 -- Good edit -- arguably a good merge between potentially synonymous species names. No comment given with the revert as to why the merge should not happen.
- wikidata:Special:Diff/214979731 -- Good faith mistake -- Should have been redirected instead of merged.
- wikidata:Special:Diff/260089984 -- Good faith mistake -- Should have been redirected instead of merged.
- wikidata:Special:Diff/199731052 -- Good faith mistake -- Should have been redirected instead of merged.
- wikidata:Special:Diff/257429145 -- Good edit -- Looks like this was a merge in prep for a redirect
- wikidata:Special:Diff/200811759 -- Good edit -- Not sure what is wrong here.
- wikidata:Special:Diff/281483086 -- Good faith mistake -- Merging disambig.
- wikidata:Special:Diff/276945045 -- Good faith mistake -- Merging category page.
- wikidata:Special:Diff/253777080 -- Good faith mistake -- Merging category page.
- wikidata:Special:Diff/200811697 -- Good faith mistake -- Merging category page.
- wikidata:Special:Diff/207039725 -- Good faith mistake -- Merging spice into the plant from which the spice is extracted.
- wikidata:Special:Diff/258415268 -- Good edit -- Not actually reverted.
- wikidata:Special:Diff/206710538 -- Good faith mistake -- Merging two distinct paintings
- wikidata:Special:Diff/256849045 -- Good faith mistake -- Merging disambig.
- wikidata:Special:Diff/218615699 -- Good edit that could have been better -- Merged from when it should have been into
- wikidata:Special:Diff/253449661 -- Good faith mistake -- Merging "Villy" and "Willy".
- wikidata:Special:Diff/185523169 -- Good faith mistake -- Merging disambig.
- wikidata:Special:Diff/267098110 -- Good faith mistake -- Merging disambig.
- wikidata:Special:Diff/221768345 -- Good edit -- Merging two identical categories.
- wikidata:Special:Diff/212592822 -- Good edit -- Merging two synonyms.
- wikidata:Special:Diff/186720228 -- Good edit -- Content dispute.
- wikidata:Special:Diff/207621833 -- Good edit -- No explanation for revert.
- wikidata:Special:Diff/222826331 -- Good edit -- Merging synonyms.
- wikidata:Special:Diff/276780232 -- Good edit -- Merging same names.
- wikidata:Special:Diff/226786565 -- Good edit -- Merges misspelling into proper spelling.
- wikidata:Special:Diff/260047383 -- Good faith mistake -- Merging disambig.
- wikidata:Special:Diff/254584474 -- Good edit -- Merging synonyms.
- wikidata:Special:Diff/215383193 -- Good faith mistake -- Merging disambig.
- wikidata:Special:Diff/223066069 -- Good edit -- Merging synonyms.
- wikidata:Special:Diff/240210730 -- Good edit -- Merging person with dictionary item about them.
- wikidata:Special:Diff/201675373 -- Good edit -- Merging synonyms.
- wikidata:Special:Diff/260701745 -- Good faith mistake -- Same author and title, different paintings.
- wikidata:Special:Diff/216910166 -- Good faith mistake -- Same name, similar birthdays, different people.
- wikidata:Special:Diff/275261555 -- Good faith mistake -- Merging an item about two authors into each of their individual entries.
- wikidata:Special:Diff/209586001 -- Good edit -- Not sure what is going on here.
- wikidata:Special:Diff/245826673 -- Good faith mistake -- Process vs. outcome of the process.
- wikidata:Special:Diff/260728996 -- Good faith mistake -- Two different art exhibits with the same name.
OK! No intentional damage and a lot of the "good faith reverteds" were actually kind of debatable. I had to learn a bit about Wikidata merge policy in order to figure out what was right/wrong here. --EpochFail (talk) 22:04, 31 January 2016 (UTC)
Trusted users
OK. Now to look into edits by "trusted" users. These are users who have saved a lot of edits or have attained a high-level user right in Wikidata.
> select trusted_edits OR trusted_user AS trusted, COUNT(*) AS edits, SUM(reverted) AS reverted, SUM(reverted)/COUNT(*) AS prop FROM wikidata_nonbot_reverted_sample WHERE NOT client_edit AND NOT merge_edit GROUP BY trusted;
+---------+--------+----------+--------+
| trusted | edits  | reverted | prop   |
+---------+--------+----------+--------+
|       0 |  22460 |      622 | 0.0277 |
|       1 | 447736 |     1113 | 0.0025 |
+---------+--------+----------+--------+
2 rows in set (0.33 sec)
So, they get reverted about an order of magnitude less often than non-trusted users. They also save the vast majority of the "human" edits in Wikidata. That's great -- if we can ignore their reverts as not really being for "damage". So I think it is time to do some spot checking. We should probably be aiming at ~100 edits to get a sense for how often (if at all) such reverted edits are actually damaging and/or vandalism. --EpochFail (talk) 22:55, 31 January 2016 (UTC)
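A quick sanity check of the "order of magnitude" and "vast majority" claims, using only the counts from the query output above. The dictionary layout and the `revert_rate` helper are illustrative names, not part of the original analysis.

```python
# Counts copied from the trusted/non-trusted query output
# (client edits and merge edits already excluded by the WHERE clause).
non_trusted = {"edits": 22460, "reverted": 622}
trusted = {"edits": 447736, "reverted": 1113}


def revert_rate(group):
    """Proportion of a group's edits that were reverted."""
    return group["reverted"] / group["edits"]


# How many times more often non-trusted edits are reverted.
rate_ratio = revert_rate(non_trusted) / revert_rate(trusted)

# Share of these human edits that were saved by trusted users.
trusted_share = trusted["edits"] / (trusted["edits"] + non_trusted["edits"])

print(f"non-trusted revert rate: {revert_rate(non_trusted):.4f}")
print(f"trusted revert rate:     {revert_rate(trusted):.4f}")
print(f"rate ratio:              {rate_ratio:.1f}x")
print(f"trusted share of edits:  {trusted_share:.1%}")
```

The ratio comes out at roughly 11x, and trusted users account for about 95% of the edits in this slice.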
I'm going to do this one in etherpad: https://etherpad.wikimedia.org/p/wikidata_reverted_trusted_user_edits --EpochFail (talk) 23:01, 31 January 2016 (UTC)
I have evaluated 70 of the hundred and run out of steam. Gonna call it a night and pick this back up tomorrow. --EpochFail (talk) 00:37, 1 February 2016 (UTC)
Here we go!
- wikidata:Special:Diff/263613867 -- Good edit that could have been better -- Moved "instance of" to "heritage"
- wikidata:Special:Diff/222450531 -- Good edit -- Reverted without comment
- wikidata:Special:Diff/263019641 -- Good edit that could have been better -- Moved "instance of" to "heritage"
- wikidata:Special:Diff/189582382 -- Good edit -- Reverted without comment
- wikidata:Special:Diff/263629329 -- Good edit that could have been better -- Moved "instance of" to "heritage"
- wikidata:Special:Diff/262569901 -- Good edit -- Reverted by bot
- wikidata:Special:Diff/258171559 -- Good faith mistake -- Adds "is given name" to disambig page.
- wikidata:Special:Diff/263464571 -- Good edit that could have been better -- Moved "instance of" to "heritage"
- wikidata:Special:Diff/276191409 -- Good faith mistake -- Adds IMDb identifier for the wrong person.
- wikidata:Special:Diff/286554028 -- Good edit that could have been better -- "Settlement" --> "Village"
- wikidata:Special:Diff/263027418 -- Good edit that could have been better -- Moved "instance of" to "heritage"
- wikidata:Special:Diff/263500121 -- Good edit that could have been better -- Moved "instance of" to "heritage"
- wikidata:Special:Diff/263012887 -- Good edit that could have been better -- Moved "instance of" to "heritage"
- wikidata:Special:Diff/263629537 -- Good edit that could have been better -- Moved "instance of" to "heritage"
- wikidata:Special:Diff/208221899 -- Good faith mistake -- Adds broken person identifier
- wikidata:Special:Diff/286558300 -- Good edit that could have been better -- "Settlement" --> "Village"
- wikidata:Special:Diff/286956395 -- Good edit that could have been better -- "Settlement" --> "Village"
- wikidata:Special:Diff/200994695 -- Talk page edit -- Archived
- wikidata:Special:Diff/189573738 -- Good edit that could have been better -- "Said to be the same as" in the wrong direction
- wikidata:Special:Diff/262556439 -- Good edit -- Reverted without comment
- wikidata:Special:Diff/263014849 -- Good edit that could have been better -- Moved "instance of" to "heritage"
- wikidata:Special:Diff/263605638 -- Good edit that could have been better -- Moved "instance of" to "heritage"
- wikidata:Special:Diff/263015199 -- Good edit that could have been better -- Moved "instance of" to "heritage"
- wikidata:Special:Diff/287048828 -- Good edit that could have been better -- "Settlement" --> "Village"
- wikidata:Special:Diff/263505849 -- Good edit that could have been better -- Moved "instance of" to "heritage"
- wikidata:Special:Diff/189570766 -- Good edit that could have been better -- "Said to be the same as" in the wrong direction
- wikidata:Special:Diff/263527401 -- Good edit that could have been better -- Moved "instance of" to "heritage"
- wikidata:Special:Diff/262490186 -- Good edit -- Reverted without comment
- wikidata:Special:Diff/218701566 -- Good edit -- Removed via client edit
- wikidata:Special:Diff/263593153 -- Good edit that could have been better -- Moved "instance of" to "heritage"
- wikidata:Special:Diff/244575443 -- Good edit that could have been better -- "Gaelic games" --> "Gaelic football"
- wikidata:Special:Diff/263549948 -- Good edit that could have been better -- Moved "instance of" to "heritage"
- wikidata:Special:Diff/269780452 -- Good edit that could have been better -- Adds "country" attribute to a person
- wikidata:Special:Diff/197224987 -- Good edit that could have been better -- Adds "country" attribute to a person
- wikidata:Special:Diff/222472446 -- Good edit -- Reverted without comment
- wikidata:Special:Diff/263457744 -- Good edit that could have been better -- Moved "instance of" to "heritage"
- wikidata:Special:Diff/263612485 -- Good edit that could have been better -- Moved "instance of" to "heritage"
- wikidata:Special:Diff/263477270 -- Good edit that could have been better -- Moved "instance of" to "heritage"
- wikidata:Special:Diff/196424111 -- Good faith mistake -- Adds given name to category page
- wikidata:Special:Diff/263497763 -- Good edit that could have been better -- Moved "instance of" to "heritage"
- wikidata:Special:Diff/263011234 -- Good edit that could have been better -- Moved "instance of" to "heritage"
- wikidata:Special:Diff/286965456 -- Good edit that could have been better -- "Settlement" --> "Village"
- wikidata:Special:Diff/263830059 -- Good edit -- Reverted without comment
- wikidata:Special:Diff/219629167 -- Good edit -- Reverted without comment
- wikidata:Special:Diff/263588797 -- Good edit that could have been better -- Moved "instance of" to "heritage"
- wikidata:Special:Diff/270756006 -- Good edit that could have been better -- Adds "silent film" and "short film" when could have added "silent short film"
- wikidata:Special:Diff/263515124 -- Good edit that could have been better -- Moved "instance of" to "heritage"
- wikidata:Special:Diff/189567105 -- Good edit that could have been better -- "Said to be the same as" in the wrong direction
- wikidata:Special:Diff/200920956 -- Good edit -- Removed via client edit
- wikidata:Special:Diff/263610625 -- Good edit that could have been better -- Moved "instance of" to "heritage"
- wikidata:Special:Diff/287040424 -- Good edit that could have been better -- "Settlement" --> "Village"
- wikidata:Special:Diff/197249868 -- Good edit that could have been better -- "instance of" neighborhood -- one of the artist's major works was a neighborhood
- wikidata:Special:Diff/252821177 -- Good faith mistake -- Adds duplicate coordinate
- wikidata:Special:Diff/236331210 -- Good edit -- Reverted without comment
- wikidata:Special:Diff/274949949 -- Good edit that could have been better -- Changes an external link to an equally valid link
- wikidata:Special:Diff/263464978 -- Good edit that could have been better -- Moved "instance of" to "heritage"
- wikidata:Special:Diff/247514844 -- Good edit -- Removed via client edit
- wikidata:Special:Diff/270762241 -- Good edit that could have been better -- Adds "silent film" and "short film" when could have added "silent short film"
- wikidata:Special:Diff/263024032 -- Good edit that could have been better -- Moved "instance of" to "heritage"
- wikidata:Special:Diff/263578197 -- Good edit that could have been better -- Moved "instance of" to "heritage"
- wikidata:Special:Diff/263836661 -- Good edit -- Adds Facebook link that breaks Facebook
- wikidata:Special:Diff/277190192 -- Good faith mistake -- References the wrong Belarusian football player
- wikidata:Special:Diff/263020589 -- Good edit that could have been better -- Moved "instance of" to "heritage"
- wikidata:Special:Diff/263025186 -- Good edit that could have been better -- Moved "instance of" to "heritage"
- wikidata:Special:Diff/263043580 -- Good edit that could have been better -- Moved "instance of" to "heritage"
- wikidata:Special:Diff/278445018 -- Good faith mistake -- Adds occupation to a historic site
- wikidata:Special:Diff/221078312 -- Good edit -- Removed via client edit
- wikidata:Special:Diff/263009992 -- Good edit that could have been better -- Moved "instance of" to "heritage"
- wikidata:Special:Diff/263604485 -- Good edit that could have been better -- Moved "instance of" to "heritage"
- wikidata:Special:Diff/285846127 -- Good edit that could have been better -- Adds "as: trainee" instead of "position held: trainee"
- wikidata:Special:Diff/218109136 -- Project namespace edit
- wikidata:Special:Diff/248536356 -- Good edit -- Order series rather than just a series
- wikidata:Special:Diff/263527911 -- Good edit that could have been better -- Moved "instance of" to "heritage"
- wikidata:Special:Diff/263522068 -- Good edit that could have been better -- Moved "instance of" to "heritage"
- wikidata:Special:Diff/208895390 -- Good edit -- Adds a local identifier
- wikidata:Special:Diff/270455892 -- Good edit -- No comment revert
- wikidata:Special:Diff/263025853 -- Good edit that could have been better -- Moved "instance of" to "heritage"
- wikidata:Special:Diff/263528450 -- Good edit that could have been better -- Moved "instance of" to "heritage"
- wikidata:Special:Diff/263451520 -- Good edit that could have been better -- Moved "instance of" to "heritage"
- wikidata:Special:Diff/263592008 -- Good edit that could have been better -- Moved "instance of" to "heritage"
- wikidata:Special:Diff/263559365 -- Good edit that could have been better -- Moved "instance of" to "heritage"
- wikidata:Special:Diff/262571103 -- Good edit -- No comment revert
- wikidata:Special:Diff/286230816 -- Good faith mistake -- Adds wikimedia disambig to a non-disambig item
- wikidata:Special:Diff/263580337 -- Good edit that could have been better -- Moved "instance of" to "heritage"
- wikidata:Special:Diff/254959435 -- Good edit -- Removed via client edit
- wikidata:Special:Diff/286958509 -- Good edit that could have been better -- "Settlement" --> "Village"
- wikidata:Special:Diff/275985516 -- Good edit that could have been better -- Adds individual person identifier to author pair
- wikidata:Special:Diff/211465832 -- Good edit -- No comment revert
- wikidata:Special:Diff/263610910 -- Good edit that could have been better -- Moved "instance of" to "heritage"
- wikidata:Special:Diff/283919185 -- Good faith mistake -- Sets a too specific property type
- wikidata:Special:Diff/263030891 -- Good edit that could have been better -- Moved "instance of" to "heritage"
- wikidata:Special:Diff/248224052 -- Good edit -- Reverting author is maybe a vandal
- wikidata:Special:Diff/263455502 -- Good edit that could have been better -- Moved "instance of" to "heritage"
- wikidata:Special:Diff/262494536 -- Good edit -- No comment revert
- wikidata:Special:Diff/263551948 -- Good edit that could have been better -- Moved "instance of" to "heritage"
- wikidata:Special:Diff/284344890 -- Good edit that could have been better -- Extended description too much and adds punct
- wikidata:Special:Diff/263002149 -- Good edit that could have been better -- Moved "instance of" to "heritage"
- wikidata:Special:Diff/286956645 -- Good edit that could have been better -- "Settlement" --> "Village"
- wikidata:Special:Diff/263512657 -- Good edit that could have been better -- Moved "instance of" to "heritage"
- wikidata:Special:Diff/201209968 -- Good edit that could have been better -- Adds "country" attribute to a person
--EpochFail (talk) 16:22, 1 February 2016 (UTC)
I'm just realizing that there are a few "good faith mistakes" that are really "good edits that could be better". E.g. adding a country to a person is good, but it would be better if it was added to their nationality. Gonna fix that now. --EpochFail (talk) 16:25, 1 February 2016 (UTC)
So, 7 out of 98 reverted main namespace edits were even "damaging" and they were all clearly good-faith mistakes. I think that we can declare this set as not worth review. --EpochFail (talk) 16:29, 1 February 2016 (UTC)
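To get a rough sense of how much uncertainty a sample this small carries, here is a minimal sketch of a 95% Wilson score interval for 7 damaging edits out of 98. This was not part of the original analysis: it assumes the reviewed edits were a random sample of the reverted trusted-user edits, and `wilson_interval` is a hypothetical helper name.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 ~ 95%)."""
    p = successes / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half_width, center + half_width


# 7 damaging edits among the 98 reviewed main-namespace edits.
low, high = wilson_interval(7, 98)
print(f"observed: {7 / 98:.1%}, 95% interval: {low:.1%} to {high:.1%}")
```

Under those assumptions the interval runs from roughly 3.5% to 14%, so the share of damaging edits among reverted trusted-user edits is small but not pinned down precisely by this sample.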
Summary
OK. Time to summarize what I think we have learned. It looks like we can exclude Merges and Trusted user edits from review (if we are looking for intentional vandalism) but we should probably include them if we are looking for "good-faith mistakes" too. It also seems like there are a few common patterns that we can probably pick up on. I added some feature requests for them:
- https://github.com/wiki-ai/editquality/issues/10 -- Add "is_wikimedia_category" and "is_wikimedia_disambig" to feature set for wikidata
- https://github.com/wiki-ai/editquality/issues/11 -- Add "adds_given_name" and "adds_country" for Wikidata
- https://github.com/wiki-ai/editquality/issues/12 -- Add "creates_redirect" to Wikidata
- https://github.com/wiki-ai/editquality/issues/13 -- Add reverting comment filter to label_reverted
One thing that I also realized is that there are a lot of reverts that are really content moves. E.g., if one adds "country", "USA" to a person, the edit that moves that value to "nationality" would appear as two edits: one that removes the whole claim and another that adds a new claim of "nationality" "USA". This looks like a revert, but it is really just an improvement to a good but imperfect edit.
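The content-move pattern described above could be sketched roughly as follows. This is only an illustration under simplifying assumptions: `Claim`, `Edit`, and `looks_like_content_move` are hypothetical names, and real Wikidata claims carry far more structure (qualifiers, references, typed values) than a property/value pair.

```python
from typing import FrozenSet, NamedTuple


class Claim(NamedTuple):
    prop: str   # e.g. "country"
    value: str  # e.g. "USA"


class Edit(NamedTuple):
    added: FrozenSet[Claim]
    removed: FrozenSet[Claim]


def looks_like_content_move(removal: Edit, followup: Edit) -> bool:
    """True if a value removed under one property is re-added under another."""
    return any(
        r.value == a.value and r.prop != a.prop
        for r in removal.removed
        for a in followup.added
    )


# The example from the text: "country" "USA" is removed, then
# "nationality" "USA" is added in a second edit.
removal = Edit(added=frozenset(), removed=frozenset({Claim("country", "USA")}))
followup = Edit(added=frozenset({Claim("nationality", "USA")}), removed=frozenset())
plain_revert = Edit(added=frozenset(), removed=frozenset())

print(looks_like_content_move(removal, followup))      # content move
print(looks_like_content_move(removal, plain_revert))  # nothing re-added
```

A check like this would only flag candidates; whether the removal should still count as a revert for labeling purposes is a separate decision.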
So, how many edits does this remove from our set?
> select NOT (trusted_edits OR trusted_user OR client_edit OR merge_edit) AS needs_review, COUNT(*) AS edits, SUM(reverted) AS reverted, SUM(reverted)/COUNT(*) AS prop FROM wikidata_nonbot_reverted_sample GROUP BY needs_review;
+--------------+--------+----------+--------+
| needs_review | edits  | reverted | prop   |
+--------------+--------+----------+--------+
|            0 | 466076 |     1260 | 0.0027 |
|            1 |  22460 |      622 | 0.0277 |
+--------------+--------+----------+--------+
2 rows in set (0.25 sec)
It looks like about 4.60% of edits (22460 of 488536) actually might need review. That's great! --EpochFail (talk) 18:06, 1 February 2016 (UTC)
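The query output above supports two different percentages depending on the denominator, so it is worth spelling out the arithmetic. The variable names below are illustrative; the counts are copied from the table.

```python
# Counts copied from the needs_review query output.
excluded = {"edits": 466076, "reverted": 1260}
needs_review = {"edits": 22460, "reverted": 622}

total_edits = excluded["edits"] + needs_review["edits"]

# Share of ALL sampled edits that still need review.
share_of_all = needs_review["edits"] / total_edits

# Ratio of needs-review edits to excluded edits (a different quantity).
ratio_to_excluded = needs_review["edits"] / excluded["edits"]

# Share of all reverted edits that fall inside the needs-review set.
reverted_share = needs_review["reverted"] / (
    needs_review["reverted"] + excluded["reverted"]
)

print(f"total edits:                 {total_edits}")
print(f"needs review / all edits:    {share_of_all:.2%}")
print(f"needs review / excluded:     {ratio_to_excluded:.2%}")
print(f"reverts inside review set:   {reverted_share:.1%}")
```

As a share of all 488536 sampled edits the review set is about 4.60%; the 4.82% figure is what you get when dividing by the excluded edits only. Either way, the set is small, and it still contains about a third of all reverted edits.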