Community Wishlist/Wishes/Bots detecting copyvios on Commons (image reverse search etc)/en

Bots detecting copyvios on Commons (image reverse search etc) Open


Description

Wikimedia Commons still hosts countless files that are copyright violations. Most of these aren't used anywhere and are hard to find, and every day many new copyvios are uploaded, like needles in the very large haystack of new uploads.

As far as I can tell, copyright violations there are usually detected when contributors come across a media file by accident and suspect that it isn't freely licensed. They then run a reverse image search, check the link or text provided in the source field, or do a Web search for the file's title/contents to identify the source and verify the license.

The problems with that are:

  • it's quite time-intensive (time that could be used for more impactful things)
  • many copyvio files are not detected (especially uncategorized files)

I think files should be scanned by a bot (possibly using AI tools) that runs a TinEye and/or Google reverse image search, and/or a search with another reverse image search engine, to check whether a file is likely a copyvio, especially for files from new uploaders. The bots/scripts could populate some categories, which people could then quickly review via some tool (marking each file as copyvio, normal DR, or false positive).
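The triage step of such a bot could be sketched roughly as follows. This is only an illustration under stated assumptions: the `Match` record, the queue names, and the rule "a match first seen before the upload date is suspicious" are all hypothetical, and the sketch does not call TinEye, Google, or any other real search API.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class Match:
    """One hit from a reverse image search (hypothetical record shape)."""
    url: str
    first_seen: date  # assumed: the engine reports when it first saw the image


def triage(upload_date: date, matches: List[Match]) -> str:
    """Return the name of a hypothetical review queue for one uploaded file."""
    if not matches:
        # Nothing found elsewhere on the Web: no evidence either way.
        return "no_match"
    # Copies that were online before the Commons upload suggest the file
    # may have been taken from that earlier source.
    earlier = [m for m in matches if m.first_seen < upload_date]
    if earlier:
        return "review_possible_copyvio"
    # All matches are newer than the upload, so they may be reuses of it.
    return "review_low_priority"
```

A human reviewer would still decide whether a flagged file is a copyvio, a normal DR, or a false positive; the function only sorts files into queues.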


This would work similarly to the tool in this new study, which creates suggestions for improving citations using the AI-based system SIDE. In this case, the tool should a) check whether the source link is still online (if not, add a link to an archived version), b) check whether the source supports the file's claimed license & source (if not, flag the file as needing semi-manual review), and c) do a reverse search if no credible source has been provided.
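The three checks a) to c) could be combined into a small decision function, sketched here. The function name, the action labels, and the boolean inputs are hypothetical; how a bot would actually determine whether a link is online or whether a page supports the license claim is left open.

```python
from typing import List, Optional


def plan_source_check(
    source_url: Optional[str],
    is_online: bool,
    supports_claim: bool,
) -> List[str]:
    """Return the follow-up actions for one file (hypothetical action labels).

    is_online and supports_claim are assumed to come from earlier checks;
    for an offline source, supports_claim would refer to an archived copy.
    """
    if not source_url:
        # c) no credible source given: fall back to a reverse search
        return ["reverse_image_search"]
    actions: List[str] = []
    if not is_online:
        # a) dead link: add a link to an archived version
        actions.append("add_archived_link")
    if not supports_claim:
        # b) source does not back the stated license or origin
        actions.append("flag_for_semi_manual_review")
    return actions
```

An empty list means the source is online and supports the claim, so no action is needed for that file.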

The reverse image search would likely be the first and main component. Additional bots could scan videos, for example by reverse searching the included music similar to Shazam (music app), to detect videos with copyvio music and maybe even automatically mute these parts, starting with the files in c:Category:Videos featuring unidentified music.

Instead of scanning every single file, the bot could scan only one or a few files of the same series (e.g. photos with the same/close dates located in the same category), or at first scan only uploads by users who are new, have uploaded only a few files, or have uploaded several copyvios in the past. If this would strain the servers too much, the proposal about physical backups may be useful, as one could analyze that dump rather than retrieve the media from WMC during scans.
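One way to picture this selection is the sketch below. All names and thresholds are hypothetical (`max_uploads=20`, `copyvio_limit=2`), and a "series" is simplified to files by the same uploader in the same category with exactly the same date, whereas the idea above also allows close dates.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List, Tuple


@dataclass
class Upload:
    title: str
    uploader: str
    category: str
    taken_on: date


@dataclass
class UploaderStats:
    upload_count: int
    past_copyvios: int


def is_risky(stats: UploaderStats, max_uploads: int = 20, copyvio_limit: int = 2) -> bool:
    """New or low-volume uploaders, or those with several past copyvios."""
    return stats.upload_count <= max_uploads or stats.past_copyvios >= copyvio_limit


def select_for_scan(uploads: List[Upload], stats: Dict[str, UploaderStats]) -> List[Upload]:
    """Pick one representative file per series, limited to risky uploaders."""
    chosen: Dict[Tuple[str, str, date], Upload] = {}
    for up in uploads:
        # Uploaders without recorded stats are treated as new.
        uploader_stats = stats.get(up.uploader, UploaderStats(0, 0))
        if not is_risky(uploader_stats):
            continue
        key = (up.uploader, up.category, up.taken_on)
        chosen.setdefault(key, up)  # keep the first file seen in each series
    return list(chosen.values())
```

If the representative of a series turns out to be a likely copyvio, the remaining files of that series could then be queued for a full scan.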

See also c:Commons:Abuse filter/Automated copyvio detection

Assigned focus area

Unassigned.

Type of wish

Feature request

Wikimedia Commons

Affected users

Wikimedia Commons contributors/mods/admins

Other details

  • Created: 16:31, 17 August 2024 (UTC)
  • Last updated: 07:49, 21 August 2024 (UTC)
  • Author: Prototyperspective (talk)