Research:Baseline Metrics for Structured Data on Wikimedia Commons

This page documents a completed research project.
Tracked in Phabricator: Task T174519

Wikimedia Commons, a sister project of Wikipedia, is a collection of more than 50 million free media files as of February 2019.[1] The Structured Data on Wikimedia Commons (SDoC) project converts information about these files into a structured, machine-readable format, making the files easier to view, search, edit, organize and re-use in many languages. It is implemented with Wikibase, the same technology used for Wikidata. Wikimedia community members and staff from the Wikimedia Foundation (WMF) and Wikimedia Deutschland (WMDE, the Wikidata team) will work on this project from 2017 until the end of 2019.[2]

To measure the effectiveness of new functionality on Wikimedia Commons, we need to establish relevant, measurable criteria and a 2017 baseline against which future measurements can be compared. After discussion with the SDoC team, we decided to compute metrics for the following aspects:

  • File contribution: How many files are uploaded by bots vs users?
  • File type: Number of files by type and their trend over time
  • File description: Information box usage, categorization, and languages
  • File deletion: Time to deletion and reason for deletion
  • File search: How many search queries happen in what languages? How satisfied are users with their searches on Commons?

File contribution

We are interested in comparing the number of files uploaded by bots and by users.

Methods

On October 12, 2017, we queried the image table of the Wikimedia Commons database and counted the number of files uploaded by bots vs. other users. We then broke down the counts by media type (as defined by the img_media_type field in the image table) and by month. We treated an account as a bot account if it belongs to the bot group in the user_groups or user_former_groups table, or if it belongs to a category whose name matches _(bot_flag|bots)(_|$). Some bots are operated by institutions, while others are automated tools such as the Flickr upload bot.
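
The bot rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name and the three dictionaries are hypothetical in-memory stand-ins for the user_groups table, the user_former_groups table and the user categories. Only the regular expression is taken from the text.

```python
import re

# Pattern quoted in the text, applied to category names written with underscores.
BOT_CATEGORY = re.compile(r"_(bot_flag|bots)(_|$)")


def is_bot(user, current_groups, former_groups, user_categories):
    """Return True if the account counts as a bot under the rules above.

    The three mappings are hypothetical: each maps a user name to a set of
    group names or category names.
    """
    if "bot" in current_groups.get(user, set()):
        return True
    if "bot" in former_groups.get(user, set()):
        return True
    return any(BOT_CATEGORY.search(c) for c in user_categories.get(user, set()))


# Made-up example accounts.
current = {"ExampleBot": {"bot"}}
former = {"RetiredBot": {"bot"}}
categories = {
    "ArchiveUploader": {"Commons_bots"},
    "Alice": {"Commons_users"},
}
bot_accounts = [
    u for u in ("ExampleBot", "RetiredBot", "ArchiveUploader", "Alice")
    if is_bot(u, current, former, categories)
]
```

Note that the pattern requires a leading underscore, so a category named only "Bots" would not match.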

Results

As of October 12, 2017, the number of files uploaded by bots was 9,390,721 (22.03%), and the number of files uploaded by users was 33,241,541 (77.97%). The following table breaks down the counts by media type:

Media Type | User Group | Number of Files | Proportion
bitmap | user | 31355343 | 73.55%
bitmap | bot | 8843447 | 20.74%
drawing | user | 905964 | 2.13%
drawing | bot | 270516 | 0.63%
audio | user | 698566 | 1.64%
audio | bot | 95646 | 0.22%
video | user | 71738 | 0.17%
video | bot | 36329 | 0.09%
multimedia | user | 4 | 0%
office | user | 209926 | 0.49%
office | bot | 144783 | 0.34%
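
As a consistency check, the headline totals and percentages can be recomputed from the table above. The sketch below copies the counts from the table; the helper names are illustrative.

```python
from collections import defaultdict

# (media type, user group, number of files), copied from the table above.
ROWS = [
    ("bitmap", "user", 31355343), ("bitmap", "bot", 8843447),
    ("drawing", "user", 905964), ("drawing", "bot", 270516),
    ("audio", "user", 698566), ("audio", "bot", 95646),
    ("video", "user", 71738), ("video", "bot", 36329),
    ("multimedia", "user", 4),
    ("office", "user", 209926), ("office", "bot", 144783),
]


def totals_by_group(rows):
    """Sum the file counts per user group across all media types."""
    totals = defaultdict(int)
    for _media, group, count in rows:
        totals[group] += count
    return dict(totals)


def share(part, whole):
    """Percentage of `whole` represented by `part`, rounded to two decimals."""
    return round(100 * part / whole, 2)


totals = totals_by_group(ROWS)
grand_total = sum(totals.values())
bot_share = share(totals["bot"], grand_total)    # 22.03
user_share = share(totals["user"], grand_total)  # 77.97
```

The per-row proportions in the table use the same denominator, the grand total of 42,632,262 files.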

The following two graphs break down the counts by month:

File type

Treemap of files uploaded. Note that due to an overwhelming volume of JPG/JPEG files uploaded (37M), that extension has been excluded from this visualization.

We are interested in the distribution of file types and extensions uploaded by users over time, including how many files of each type are uploaded on a monthly basis.

Methods

On October 12, 2017, we queried the image table of the Wikimedia Commons database and counted uploaded files by extension and by day.
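
A minimal sketch of this kind of count, assuming the query returns (file name, upload timestamp) pairs with timestamps in MediaWiki's 14-digit YYYYMMDDHHMMSS form. The function names and the alias table are illustrative, and the actual query in the repository may group differently.

```python
from collections import Counter

# Spelling variants folded into one label, as in the "jpg/jpeg" and
# "tif/tiff" rows of the results.
ALIASES = {
    "jpg": "jpg/jpeg", "jpeg": "jpg/jpeg",
    "tif": "tif/tiff", "tiff": "tif/tiff",
}


def extension_of(file_name):
    """Return the lower-cased, alias-normalized extension ('' if none)."""
    _, dot, ext = file_name.rpartition(".")
    ext = ext.lower() if dot else ""
    return ALIASES.get(ext, ext)


def uploads_per_day(rows):
    """Count uploads keyed by (YYYY-MM-DD, extension).

    `rows` is an iterable of (file name, 14-digit timestamp) pairs.
    """
    counts = Counter()
    for name, ts in rows:
        day = f"{ts[:4]}-{ts[4:6]}-{ts[6:8]}"
        counts[(day, extension_of(name))] += 1
    return counts


# Made-up example rows.
sample = [
    ("A.jpg", "20171012010203"),
    ("B.JPEG", "20171012230000"),
    ("C.svg", "20171011000000"),
]
daily = uploads_per_day(sample)
```

Monthly counts follow by truncating the key to the first seven characters of the day string.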

Results

As of October 12, 2017, the majority of files uploaded to Wikimedia Commons were images (see the treemap), with nearly 37M JPG/JPEG, 2.3M PNG, and 1.2M SVG files in total. These formats also account for the largest upload volumes on a monthly basis (daily upload counts are available in the repository).

The table below shows the full breakdown by media type and extension as of October 12, 2017:

Media | Extension | Uploads
audio | ogg | 773305
audio | oga | 6180
audio | flac | 6140
audio | mid | 4993
audio | wav | 3512
audio | opus | 410
docs | pdf | 354765
docs | djvu | 60524
image | jpg/jpeg | 36918799
image | png | 2268026
image | svg | 1176530
image | tif/tiff | 807921
image | gif | 153959
image | xcf | 1008
image | webp | 95
video | ogv | 66610
video | webm | 41161

File description

Typically, every media file on Commons is accompanied by plain-text descriptions (wikitext, templates) and categories. The findability of files is largely determined by this information. Hence, we need to figure out the information box usage, categorization, and language usage of files on Commons.

Methods

On December 12, 2017, we queried the image table of the Wikimedia Commons database and counted the number of files by the number of categories they belong to. We excluded hidden categories by querying the page_props table, and excluded "needing_category" categories by querying the categorylinks table, from the counts.
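
The bucketing step can be sketched as follows. The two inputs are hypothetical in-memory stand-ins for the database tables: a mapping from file name to its categories, and the set of hidden or "needing_category" categories to ignore.

```python
from collections import Counter


def bucket(n_categories):
    """Map a category count to the labels used in the results: 0 to 4, then 5+."""
    return "5+" if n_categories >= 5 else str(n_categories)


def files_by_category_count(file_categories, excluded):
    """Count files per bucket after dropping excluded categories."""
    counts = Counter()
    for categories in file_categories.values():
        counts[bucket(len(set(categories) - excluded))] += 1
    return counts


# Made-up example data.
files = {
    "A.jpg": {"Cats", "Hidden_tracking"},
    "B.jpg": {"Media_needing_category"},
    "C.jpg": {"c1", "c2", "c3", "c4", "c5", "c6"},
}
hidden_or_maintenance = {"Hidden_tracking", "Media_needing_category"}
by_bucket = files_by_category_count(files, hidden_or_maintenance)
```

A file whose only categories are excluded ones lands in the "0" bucket, which matches the intent of excluding them from the counts.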

To determine the information box and language usage of files on Commons, we parsed the wikitext of all files in the Commons XML data dumps of November 20, 2017 with the mwparserfromhell library and extracted their infobox templates and language templates (e.g. {{en}}, {{LangSwitch}}). It is worth noting that our way of extracting infobox templates differs from that of the CommonsMetadata extension: we parsed the wikitext instead of the HTML, and included not only standard infobox templates but also templates that have a description field.
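
A minimal sketch of template extraction with mwparserfromhell. The two allow-lists are illustrative assumptions rather than the lists used in the study, and the extra rule about templates with a description field is not reproduced here.

```python
# Illustrative allow-lists (assumptions, not the study's actual lists).
INFOBOX_TEMPLATES = {"information", "artwork"}
LANGUAGE_TEMPLATES = {"en", "de", "fr", "es", "langswitch"}


def template_names(wikitext):
    """Return the lower-cased names of all templates in the wikitext."""
    # Third-party parser named in the text; imported here so that
    # summarize() below can be used without the dependency installed.
    import mwparserfromhell

    parsed = mwparserfromhell.parse(wikitext)
    # filter_templates() also returns templates nested inside other templates.
    return [str(t.name).strip().lower() for t in parsed.filter_templates()]


def summarize(names):
    """Report infobox presence and the language templates that were found."""
    found = set(names)
    return {
        "has_infobox": bool(found & INFOBOX_TEMPLATES),
        "language_templates": sorted(found & LANGUAGE_TEMPLATES),
    }


summary = summarize(["information", "en", "de"])
```

With the library installed, summarize(template_names(text)) gives the same kind of summary for a real file page.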

For files without a language template, we used the langdetect package to detect their languages.
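
A sketch of the fallback detection step, assuming the langdetect package's detect function is applied to the description text. The wrapper name and the pre-check for text without letters are illustrative additions.

```python
def detect_language(description):
    """Return a language code for the text, or None if nothing is detected."""
    text = description.strip()
    if not any(ch.isalpha() for ch in text):
        return None  # empty or purely numeric text gives the detector nothing to use

    # Third-party package named in the text, imported lazily.
    from langdetect import DetectorFactory, detect
    from langdetect.lang_detect_exception import LangDetectException

    DetectorFactory.seed = 0  # langdetect is randomized; fix the seed for repeatable results
    try:
        return detect(text)
    except LangDetectException:
        return None


no_language = detect_language("  12345  ")
```

Returning None for undetectable text corresponds to the "no language was detected" case reported in the results.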

Results

There are 1,629,592 (3.73%) files on Commons that don't belong to any category. The following graph shows the number of files by the number of categories:

The following table breaks down the counts further by media type:

Media Type | Categories | Number of Files | Proportion
AUDIO | 0 category | 2007 | 0.25%
AUDIO | 1 category | 404697 | 50.54%
AUDIO | 2 categories | 346496 | 43.27%
AUDIO | 3 categories | 15327 | 1.91%
AUDIO | 4 categories | 6667 | 0.83%
AUDIO | 5+ categories | 25552 | 3.19%
BITMAP | 0 category | 1599973 | 3.89%
BITMAP | 1 category | 21292109 | 51.73%
BITMAP | 2 categories | 9944133 | 24.16%
BITMAP | 3 categories | 4379117 | 10.64%
BITMAP | 4 categories | 1886515 | 4.58%
BITMAP | 5+ categories | 2057464 | 5%
DRAWING | 0 category | 11228 | 0.94%
DRAWING | 1 category | 485009 | 40.5%
DRAWING | 2 categories | 358924 | 29.97%
DRAWING | 3 categories | 149269 | 12.46%
DRAWING | 4 categories | 118525 | 9.9%
DRAWING | 5+ categories | 74735 | 6.24%
MULTIMEDIA | 1 category | 2 | 50%
MULTIMEDIA | 2 categories | 1 | 25%
MULTIMEDIA | 3 categories | 1 | 25%
OFFICE | 0 category | 3869 | 1.06%
OFFICE | 1 category | 285574 | 78.54%
OFFICE | 2 categories | 38990 | 10.72%
OFFICE | 3 categories | 24767 | 6.81%
OFFICE | 4 categories | 5918 | 1.63%
OFFICE | 5+ categories | 4475 | 1.23%
VIDEO | 0 category | 12515 | 11.31%
VIDEO | 1 category | 25489 | 23.04%
VIDEO | 2 categories | 19412 | 17.55%
VIDEO | 3 categories | 13702 | 12.39%
VIDEO | 4 categories | 10028 | 9.06%
VIDEO | 5+ categories | 29483 | 26.65%

Out of the total of 43,268,565 files as of November 20, 2017, 41,796,560 (96.6%) have an infobox and 41,309,028 (95.47%) have some content in their description fields (description, title, depicted people, depicted place, etc.).

There are 14,848,551 (34.32%) files that don't have any language templates, and 23,780,247 (54.96%) files use only 1 language template. 40.1% of all files have English templates, 9.38% of files use German, and 6.2% of files have descriptions in languages which are not in the top 20.

For files without language templates, we detected one language in 7,577,789 files (17.51% of all 43,268,565 files). No language was detected in 556,684 files (1.29%). English was detected in 30.25% of all 43,268,565 files.

The infobox and language templates come in many variations, which makes counting them challenging, so our results are imperfect. For example, some files' descriptions are hidden in customized templates, and some infobox templates use non-standard parameter names. See variations on the {{Information}} template for more examples.

File deletion

Methods

On October 12, 2017, we queried the filearchive table of the Wikimedia Commons database and computed or extracted the following:

  • Number of deleters (users who have deleted at least one file) over time
  • How many files each user has deleted
  • Time to deletion, broken up by file type and reason for deletion (copyright violation vs other)
  • Reasons for deletion
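
Time to deletion can be sketched as the difference between two timestamps per deleted file. The sketch assumes both values are MediaWiki 14-digit timestamps (presumably the filearchive fields fa_timestamp and fa_deleted_timestamp, which is an assumption about the query). The median is chosen here purely for illustration, since the text does not say which summary statistic was used.

```python
from datetime import datetime
from statistics import median

FORMAT = "%Y%m%d%H%M%S"  # MediaWiki 14-digit timestamp


def days_to_deletion(uploaded, deleted):
    """Days elapsed between the upload and deletion timestamps."""
    delta = datetime.strptime(deleted, FORMAT) - datetime.strptime(uploaded, FORMAT)
    return delta.total_seconds() / 86400


def median_days_by_group(rows):
    """Median days to deletion per group label.

    `rows` is an iterable of (label, upload timestamp, deletion timestamp);
    the label could be a file type or a deletion reason.
    """
    groups = {}
    for label, uploaded, deleted in rows:
        groups.setdefault(label, []).append(days_to_deletion(uploaded, deleted))
    return {label: median(values) for label, values in groups.items()}


# Made-up example rows.
sample = [
    ("copyright violation", "20171001000000", "20171002000000"),
    ("copyright violation", "20171001000000", "20171004000000"),
    ("other", "20171001000000", "20171011000000"),
]
medians = median_days_by_group(sample)
```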

Results

Time to deletion

Reasons for deletion

The reasons for deletion were the most difficult metric to extract because of the variability in users' comments (for example, "copyright" vs "copyvio"). The full logic is available for reference in the SQL query in the repository. The numbers below are approximations, as some reasons were bound to be missed by that logic. We recommend creating an interface that lists the most common deletion reasons, so that selecting one automatically inserts a standardized message; this would make tracking these numbers much easier going forward.
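
To illustrate why free-text comments are hard to bucket, here is a small pattern-based classifier. The patterns and labels are hypothetical examples. They are not the logic of the SQL query in the repository, which remains the reference.

```python
import re

# Illustrative patterns only; the first matching pattern wins.
REASON_PATTERNS = [
    ("copyright violation", re.compile(r"copy\s*vio|copyright", re.IGNORECASE)),
    ("duplicate", re.compile(r"\bduplicate\b|\bdupe\b", re.IGNORECASE)),
]


def classify_reason(comment):
    """Map a free-text deletion comment to a coarse reason label."""
    for label, pattern in REASON_PATTERNS:
        if pattern.search(comment):
            return label
    return "other"


labels = [
    classify_reason(c)
    for c in ("Copyvio: taken from a website", "Copyright violation",
              "Exact duplicate of another file", "Per uploader request")
]
```

Any comment that uses wording outside the patterns falls into "other", which is the approximation problem described above.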

File search

We wanted to know which languages users search Wikimedia Commons in, and how satisfied they are.

Methods

We used Google's Compact Language Detector (CLDv3) and MediaWiki's TextCat (an updated version trained on Wikimedia queries and articles) to detect the languages of queries made on Commons in December 2017. These numbers should be taken with a huge pile of salt. Language detection on short text such as search queries is extremely difficult to get right because the algorithms have so few characters on which to base a prediction, unlike, say, detecting the language of an essay or a book. This is especially problematic when the search query is a proper noun. For example, only CLDv3 returned a language prediction for "Wii u" (a Nintendo game console), and that prediction was "German", while the other methods, such as TextCat and CLDv2, returned nothing. In another case, "solar" was detected as Azerbaijani, while in the same session none of the algorithms detected any language for "solar system".

We also computed several desktop search metrics from event logging data for November 2017 and compared them with those of English Wikipedia. Specifically, we computed:

  • Zero results rate: Proportion of searches that did not yield any results. The lower, the better.
  • Click-through rate: Proportion of searches with at least one click on the search results. The higher, the better.
  • Proportion of searches with clicks to see other pages of the search results. The lower, the better.
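
The three metrics above can be sketched as simple proportions over search records. The record layout (keys `results`, `clicked`, `paginated`) is a hypothetical simplification of the event logging data.

```python
def search_metrics(searches):
    """Compute the three rates from a list of search records.

    Each record is a dict with hypothetical keys: `results` (number of hits),
    `clicked` (bool: at least one result was clicked) and `paginated`
    (bool: the user opened another page of results).
    """
    total = len(searches)
    if total == 0:
        raise ValueError("no searches to summarize")
    return {
        "zero_results_rate": sum(s["results"] == 0 for s in searches) / total,
        "click_through_rate": sum(s["clicked"] for s in searches) / total,
        "pagination_rate": sum(s["paginated"] for s in searches) / total,
    }


# Made-up example records.
sample = [
    {"results": 0, "clicked": False, "paginated": False},
    {"results": 12, "clicked": True, "paginated": False},
    {"results": 40, "clicked": True, "paginated": True},
    {"results": 3, "clicked": False, "paginated": False},
]
metrics = search_metrics(sample)
```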

Results

The following table shows the approximate proportion of search queries on Commons for each language:

language approx. %
English 38.4243%
German 5.0856%
French 3.9716%
Latin 3.3742%
Italian 3.3097%
Spanish; Castilian 2.6316%
Portuguese 2.0665%
Norwegian 1.7598%
Dutch; Flemish 1.7113%
Danish 1.6629%
Afrikaans 1.5983%
Chinese 1.3400%
Catalan; Valencian 1.3077%
Japanese 1.2916%
Polish 1.1786%
Luxembourgish; Letzeburgesch 1.0494%
Serbian 1.0494%
Welsh 1.0171%
Galician 0.8557%
Estonian 0.8395%
Hindi 0.8395%
Finnish 0.8234%
Russian 0.7588%
Czech 0.7427%
Indonesian 0.7427%
Swedish 0.7265%
Western Frisian 0.7265%
Bulgarian 0.6942%
Hungarian 0.6619%
Ukrainian 0.6619%
Malagasy 0.6458%
Javanese 0.6296%
Greek, Modern (1453-) 0.5974%
Haitian; Haitian Creole 0.5974%
Romanian; Moldavian; Moldovan 0.5328%
Malay 0.5166%
Basque 0.5005%
Bosnian 0.5005%
Slovenian 0.5005%
Corsican 0.4682%
Esperanto 0.4682%
Sundanese 0.4521%
Turkish 0.4359%
Hausa 0.4198%
Lithuanian 0.3713%
Shona 0.3713%
Arabic 0.3552%
Igbo 0.3552%
Gaelic; Scottish Gaelic 0.3390%
Irish 0.3390%
Latvian 0.3390%
Azerbaijani 0.3229%
Belarusian 0.3229%
Chichewa; Chewa; Nyanja 0.3229%
Kirghiz; Kyrgyz 0.3229%
Korean 0.3229%
Somali 0.3229%
Maltese 0.3067%
Croatian 0.2583%
Sotho, Southern 0.2422%
Icelandic 0.2260%
Persian 0.2260%
Uzbek 0.2260%
Swahili 0.1937%
Xhosa 0.1937%
Samoan 0.1776%
Zulu 0.1614%
Kurdish 0.1453%
Yoruba 0.1453%
Kazakh 0.1292%
Macedonian 0.1130%
Mongolian 0.1130%
Slovak 0.1130%
Maori 0.0969%
Thai 0.0969%
Vietnamese 0.0969%
Albanian 0.0807%
Breton 0.0807%
Georgian 0.0807%
Kinyarwanda 0.0807%
Sindhi 0.0807%
Tagalog 0.0484%
Tajik 0.0484%
Bengali 0.0323%
Marathi 0.0323%
Pushto; Pashto 0.0323%
Armenian 0.0161%
Burmese 0.0161%
Central Khmer 0.0161%
Tamil 0.0161%
Urdu 0.0161%
Yiddish 0.0161%

7.51% of full-text searches on desktop yielded no results on Commons, which is slightly lower than on English Wikipedia.

The click-through rate for full-text search results is 10.42% on Commons, which is much lower than on English Wikipedia. For autocomplete search, the click-through rate on Commons is about 19%.

Users on Commons are much more likely to click through to other pages of search results (10.85%) than users on English Wikipedia (0.31%). Taken together with the low click-through rate, we believe this indicates that users are less likely to find what they want through search on Commons.

References

  1. "Statistics - Wikimedia Commons". commons.wikimedia.org. Retrieved 2019-02-26.
  2. "Commons:Structured data/About". Retrieved 2017-12-14.