Physical Wikimedia Commons media dumps (for backups, AI models, more metadata)

Description

A similar idea was previously proposed in Wikimedia Commons' recent survey about the most urgent or useful technical issues and wishes, Technical needs survey/Media dumps. To quote from it:

There are no Wikimedia Commons dumps that include any media. There's an open Phabricator ticket since 2021 (T298394), but no major advances have been seen.

As a solution to this, I think it would be best if people could come to the WMF with some large hard drives (or send them in, if not ordering them as a bundle) and get a specified subset of the media copied onto them.
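For illustration only, here is a minimal sketch in Python of what the copying step of such a service could look like. The paths, the manifest format, and the directory layout are assumptions made up for this example; no such Wikimedia interface currently exists.

```python
import shutil
from pathlib import Path

# Hypothetical paths; a real service would read from Wikimedia's media storage.
SOURCE_ROOT = Path("/srv/commons-media")    # assumed local mirror of Commons originals
TARGET_DRIVE = Path("/mnt/customer-drive")  # the hard drive supplied by the requester
MANIFEST = Path("requested_files.txt")      # assumed format: one path per line, relative to SOURCE_ROOT

def copy_subset(manifest: Path, source: Path, target: Path) -> None:
    """Copy every file listed in the manifest onto the target drive,
    preserving the relative directory structure."""
    for line in manifest.read_text(encoding="utf-8").splitlines():
        rel = line.strip()
        if not rel:
            continue
        src = source / rel
        dst = target / rel
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)  # copy2 keeps timestamps, useful for later incremental updates

if __name__ == "__main__":
    copy_subset(MANIFEST, SOURCE_ROOT, TARGET_DRIVE)
```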

Currently, backups of the large amount of free media on WMC are apparently stored only in the same data centers as the primary copies, so having additional separate or offline backups would be valuable. A good way to achieve that would be to better enable third parties to copy the data via a service for physical data dumps. All of the WMC data is fairly small in the sense that it would fit on a few 30 TB disks, but it is too much to download over the network.

It could also generate a small amount of revenue, perhaps as part of Wikimedia Enterprise, though the number of people or organizations ordering such dumps would likely be too low for this to yield much.

In addition, it could enable many applications. For example, the datasets could be used for AI training, which would make all the effort that went into organizing the media and its metadata (especially categories) much more useful. This aspect is probably underrated: such datasets could be very useful for training or improving AI models, and some of these models could in turn be used to improve WMC, for example by generating categorization suggestions via machine image classification.
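As a rough sketch of that use case, assuming a local dump laid out as one directory per category (a layout invented here for illustration, not a defined dump format), a labelled training set for an image classifier could be assembled directly from the category names:

```python
from pathlib import Path

# Assumed layout of a local media dump: one sub-directory per Commons category,
# e.g. /mnt/commons-dump/Category:Lighthouses/foo.jpg -- this layout is hypothetical.
DUMP_ROOT = Path("/mnt/commons-dump")
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".tiff", ".webp"}

def build_labelled_dataset(root: Path) -> list[tuple[Path, str]]:
    """Return (image path, category label) pairs usable as a training set,
    using the category directory names as labels."""
    samples = []
    for category_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        label = category_dir.name
        for image_path in category_dir.rglob("*"):
            if image_path.suffix.lower() in IMAGE_SUFFIXES:
                samples.append((image_path, label))
    return samples

if __name__ == "__main__":
    dataset = build_labelled_dataset(DUMP_ROOT)
    print(f"{len(dataset)} labelled images across "
          f"{len({label for _, label in dataset})} categories")
```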

It could also be useful for non-AI bots that analyze all or many files of WMC and add data to them, for example reading the text in images of all files via OCR and then adding this information either to WMC or to a complementary site from which it can be pulled while on WMC, or simply doing what SchlurcherBot is doing.
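A minimal sketch of the OCR example, assuming a local dump directory and the third-party Pillow and pytesseract libraries; the output file and its format are assumptions for illustration, and the extracted text would still need to be uploaded to WMC or a complementary site by a separate step:

```python
import json
from pathlib import Path

from PIL import Image
import pytesseract  # requires the Tesseract OCR engine to be installed locally

DUMP_ROOT = Path("/mnt/commons-dump")  # assumed local media dump
OUTPUT = Path("ocr_results.jsonl")     # hypothetical output: one JSON object per file

def ocr_dump(root: Path, output: Path) -> None:
    """Run OCR over every JPEG in the dump and record the extracted text."""
    with output.open("w", encoding="utf-8") as out:
        for image_path in root.rglob("*.jpg"):
            try:
                text = pytesseract.image_to_string(Image.open(image_path))
            except OSError:
                continue  # skip unreadable or truncated files
            record = {"file": str(image_path.relative_to(root)), "text": text.strip()}
            out.write(json.dumps(record, ensure_ascii=False) + "\n")

if __name__ == "__main__":
    ocr_dump(DUMP_ROOT, OUTPUT)
```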

It could also reduce server load, since people would not need to scrape the site but could get the whole collection, or a selected part of it, as a dump. Ideally there would also be software that merges or upgrades a given dump by writing the changes caused by edits or newly uploaded files, either when two dumps from different points in time are available or when a drive with a dump is attached to a source Wikimedia server: essentially decentrally stored incremental backups. This proposal is specifically about copying files rather than downloading them.
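To make the incremental-update idea concrete, here is a minimal sketch that brings an older dump up to date from a newer one by copying only files that are missing or changed. The paths are hypothetical, this is not an existing Wikimedia tool, and changes are detected only by size and modification time (checksums would be more robust):

```python
import shutil
from pathlib import Path

OLD_DUMP = Path("/mnt/dump-2023")  # existing dump on the attached drive (hypothetical path)
NEW_DUMP = Path("/srv/dump-2024")  # newer dump or live source to merge from (hypothetical path)

def merge_dumps(old: Path, new: Path) -> int:
    """Update the old dump by copying over files that are missing or changed.
    Changes are detected by file size and modification time."""
    copied = 0
    for src in new.rglob("*"):
        if not src.is_file():
            continue
        dst = old / src.relative_to(new)
        if dst.exists():
            s, d = src.stat(), dst.stat()
            if s.st_size == d.st_size and int(s.st_mtime) <= int(d.st_mtime):
                continue  # unchanged, nothing to do
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)
        copied += 1
    return copied

if __name__ == "__main__":
    print(f"updated {merge_dumps(OLD_DUMP, NEW_DUMP)} files")
```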

It could also allow things like easily including the dump in backups placed on the Moon, as was done recently with Wikipedia as part of the IM-1 Lunar Library. More information and related code issues can be found on the new page Commons:Commons:Dumps and backups.

Assigned focus area

Unassigned.

Type of wish

Feature request

Projects

Wikimedia Commons, Wikipedia

Affected users

Wikimedia Commons users 

Phabricator tasks

T298394, T351677

Other details

  • Created: 16:27, 4 August 2024 (UTC)
  • Last updated: 16:50, 17 September 2024 (UTC)
  • Author: Prototyperspective (talk)