Research:Baseline Metrics for Structured Data on Wikimedia Commons

Contact

Chelsy Xie

Wikimedia Foundation

Mikhail Popov

Wikimedia Foundation

Duration: 2017-10 – 2017-12

Open access
via meta.wikimedia.org

Open source
via github.com

This page documents a completed research project.

Tracked in Phabricator:
Task T174519

Wikimedia Commons, a sister project of Wikipedia, is a collection of more than 50 million free media files as of February 2019^[1]. The Structured Data on Wikimedia Commons (SDoC) project converts information about these files to a structured, machine-readable format, making the files easier to view, search, edit, organize and re-use in many languages. This is implemented with Wikibase, the same technology used for Wikidata. Wikimedia community members and staff from the Wikimedia Foundation (WMF) and the Wikidata team at Wikimedia Deutschland (WMDE) will work on this project from 2017 until the end of 2019.^[2]

To measure the effectiveness of new functionality on Wikimedia Commons, we need to establish relevant, measurable criteria and a 2017 baseline against which future measurements can be compared. After discussing with the SDoC team, we decided to compute metrics for the following aspects:

File contribution: How many files are uploaded by bots vs users?
File type: Number of files by type and their trend over time
File description: Information box usage, categorization, and languages
File deletion: Time to deletion and reason for deletion
File search: How many search queries happen in what languages? How satisfied are users with their searches on Commons?

File contribution

Tracked in Phabricator:
Task T177354

We are interested in comparing the number of files uploaded by bots and by users.

Methods

On October 12, 2017, we queried the image table of the Wikimedia Commons database and counted the number of files uploaded by bots versus other users. We then broke down the counts by media type (defined by the img_media_type field in the image table) and by month. We treated a user account as a bot account if it belongs to the bot group in the user_groups table or the user_former_groups table, or if it belongs to a category whose name matches _(bot_flag|bots)(_|$). Some bots are operated by institutions, while others are automated tools such as the Flickr upload bot.
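The bot-identification rule above can be sketched in a few lines of Python. This is only an illustration, not the query used in the study: the function name, its arguments, and the example category names in the usage note are hypothetical, and category names are assumed to use underscores instead of spaces.

```python
import re

# Pattern quoted in the Methods text, applied to category names
# (assumed here to use underscores in place of spaces).
BOT_CATEGORY = re.compile(r"_(bot_flag|bots)(_|$)")


def is_bot_account(groups, former_groups, categories):
    """Return True if any of the three bot signals applies to an account.

    groups / former_groups: group names from user_groups / user_former_groups.
    categories: names of the categories the user account belongs to.
    """
    if "bot" in groups or "bot" in former_groups:
        return True
    return any(BOT_CATEGORY.search(name) for name in categories)
```

For example, `is_bot_account(set(), set(), ["Commons_bots"])` is true because the category name ends in `_bots`, while a made-up category such as `Robots_in_fiction` does not match because `bots` is not preceded by an underscore.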

Results

As of October 12, 2017, the number of files uploaded by bots was 9,390,721 (22.03%), and the number of files uploaded by users was 33,241,541 (77.97%). The following table breaks down the counts by media type:

Media Type	User Group	Number of Files	Proportion
bitmap	user	31355343	73.55%
bitmap	bot	8843447	20.74%
drawing	user	905964	2.13%
drawing	bot	270516	0.63%
audio	user	698566	1.64%
audio	bot	95646	0.22%
video	user	71738	0.17%
video	bot	36329	0.09%
multimedia	user	4	0%
office	user	209926	0.49%
office	bot	144783	0.34%
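As a check on the arithmetic, the overall bot and user shares can be recomputed from the per-type counts in the table above. The following Python sketch hard-codes those counts; the variable names are illustrative only.

```python
# (media type, user group) -> number of files, copied from the table above.
COUNTS = {
    ("bitmap", "user"): 31355343, ("bitmap", "bot"): 8843447,
    ("drawing", "user"): 905964, ("drawing", "bot"): 270516,
    ("audio", "user"): 698566, ("audio", "bot"): 95646,
    ("video", "user"): 71738, ("video", "bot"): 36329,
    ("multimedia", "user"): 4,
    ("office", "user"): 209926, ("office", "bot"): 144783,
}

# Sum the counts for each user group across all media types.
bot_total = sum(n for (_, group), n in COUNTS.items() if group == "bot")
user_total = sum(n for (_, group), n in COUNTS.items() if group == "user")
total = bot_total + user_total

# Shares as percentages, rounded to two decimals as in the text.
bot_share = round(100 * bot_total / total, 2)
user_share = round(100 * user_total / total, 2)
```

The sums reproduce the totals quoted in the text (9,390,721 files from bots and 33,241,541 from users), and the shares come out to 22.03% and 77.97%.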

The following two graphs break down the counts by month:

Number of files uploaded by bots vs by users
Proportion of files uploaded by bots vs by users

File type

Tracked in Phabricator:
Task T177356

Treemap of files uploaded. Note that due to an overwhelming volume of JPG/JPEG files uploaded (37M), that extension has been excluded from this visualization.

We are interested in the distribution of file types and extensions uploaded by users over time, including how many files of each type are uploaded on a monthly basis.

Methods

On October 12, 2017, we queried the image table of the Wikimedia Commons database and counted uploaded files by extension and by day.
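A minimal Python sketch of this per-day count follows. It assumes each row is a (file name, upload timestamp) pair with the timestamp in MediaWiki's YYYYMMDDHHMMSS format. Merging jpeg into jpg and tiff into tif mirrors the grouping in the results table, but the exact normalization used in the study is an assumption here, as are the function names.

```python
from collections import Counter

# Assumed aliases, chosen to match the "jpg/jpeg" and "tif/tiff" table rows.
ALIASES = {"jpeg": "jpg", "tiff": "tif"}


def extension_of(file_name):
    """Return the lower-cased, normalized extension ('' if there is none)."""
    _, dot, ext = file_name.rpartition(".")
    ext = ext.lower() if dot else ""
    return ALIASES.get(ext, ext)


def uploads_per_day(rows):
    """Count uploads keyed by (YYYYMMDD, extension)."""
    counts = Counter()
    for name, timestamp in rows:
        counts[(timestamp[:8], extension_of(name))] += 1
    return counts
```

Monthly counts follow from the same idea by slicing the first six characters of the timestamp instead of the first eight.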

Results

As of October 12, 2017, the majority of files uploaded to Wikimedia Commons were images (see the treemap on the side), with nearly 37M JPG/JPEG, 2.3M PNG, and 1.2M SVG files in total. Those file formats also account for the most uploads on a monthly basis (daily upload counts are available in the repository):

Cumulative upload counts by file extension.
Uploads per month by file extension.

The table below shows the full breakdown by media type and extension as of October 12, 2017:

media	extension	uploads
audio	ogg	773305
audio	oga	6180
audio	flac	6140
audio	mid	4993
audio	wav	3512
audio	opus	410
docs	pdf	354765
docs	djvu	60524
image	jpg/jpeg	36918799
image	png	2268026
image	svg	1176530
image	tif/tiff	807921
image	gif	153959
image	xcf	1008
image	webp	95
video	ogv	66610
video	webm	41161

File description

Tracked in Phabricator:
Task T177353

Tracked in Phabricator:
Task T177358

Typically, every media file on Commons is accompanied by plain-text descriptions (wikitext, templates) and categories. The findability of files is largely determined by this information. We therefore measured information box (infobox) usage, categorization, and language usage for files on Commons.

Methods

On December 12, 2017, we queried the image table of the Wikimedia Commons database and counted the number of files by the number of categories they belong to. We excluded two kinds of categories from the counts: hidden categories, identified by querying the page_props table, and "needing_category" categories, identified by querying the categorylinks table.
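The counting step can be sketched in Python as follows. The input shapes and function names are hypothetical, and the substring test for "needing_category" is an assumption, since the text does not state the exact matching rule.

```python
from collections import Counter


def bucket(n):
    """Label a category count the way the results table does."""
    if n >= 5:
        return "5+ categories"
    return f"{n} category" if n <= 1 else f"{n} categories"


def files_by_category_count(file_categories, hidden_categories):
    """Count files per bucket, ignoring hidden and 'needing_category' categories.

    file_categories: dict mapping file name -> list of category names.
    hidden_categories: set of category names flagged as hidden.
    """
    counts = Counter()
    for categories in file_categories.values():
        visible = [
            c for c in categories
            if c not in hidden_categories
            and "needing_category" not in c.lower()  # assumed matching rule
        ]
        counts[bucket(len(visible))] += 1
    return counts
```

A file whose only categories are hidden or maintenance categories therefore lands in the "0 category" bucket, which is the behavior the exclusions are meant to produce.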

To measure infobox and language usage of files on Commons, we parsed the wikitext of all files in the Commons XML data dumps of November 20, 2017 with the mwparserfromhell library and extracted their infobox templates and language templates (e.g. {{en}}, {{LangSwitch}}). It is worth noting that our method of extracting infobox templates differs from that of the CommonsMetadata extension: we parsed the wikitext instead of the HTML, and we included not only standard infobox templates but also templates that have a description field.

For files without a language template, we used the langdetect package to detect their languages.
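To illustrate what extracting language templates involves, here is a deliberately simplified Python sketch that uses a regular expression from the standard library rather than mwparserfromhell, which the study used. It only recognizes templates of the form {{en|...}}, does not handle {{LangSwitch}}, and relies on a small, hypothetical allowlist of language codes to avoid confusing other short template names with languages.

```python
import re

# Hypothetical allowlist; a real run would need the full set of codes.
KNOWN_CODES = {"en", "de", "fr", "es", "it", "pt", "ru", "ja", "zh"}

# Matches the name of a template that is immediately followed by a pipe,
# e.g. "{{en|" or "{{ pt-br |".
SHORT_TEMPLATE = re.compile(
    r"\{\{\s*([a-z]{2,3}(?:-[a-z]+)?)\s*\|", re.IGNORECASE
)


def language_templates(wikitext):
    """Return the set of known language codes used as template names."""
    found = {m.group(1).lower() for m in SHORT_TEMPLATE.finditer(wikitext)}
    return found & KNOWN_CODES
```

Because a regular expression cannot track nested templates reliably, a full parser is the safer choice on real dumps. The sketch is only meant to show the kind of signal being counted.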

Results

There are 1,629,592 (3.73%) files on Commons that don't belong to any category. The following graph shows the number of files by the number of categories:

Number of files by number of categories

The following table breaks down the counts further by media type:

Media Type	Categories	Number of Files	Proportion
AUDIO	0 category	2007	0.25%
AUDIO	1 category	404697	50.54%
AUDIO	2 categories	346496	43.27%
AUDIO	3 categories	15327	1.91%
AUDIO	4 categories	6667	0.83%
AUDIO	5+ categories	25552	3.19%
BITMAP	0 category	1599973	3.89%
BITMAP	1 category	21292109	51.73%
BITMAP	2 categories	9944133	24.16%
BITMAP	3 categories	4379117	10.64%
BITMAP	4 categories	1886515	4.58%
BITMAP	5+ categories	2057464	5%
DRAWING	0 category	11228	0.94%
DRAWING	1 category	485009	40.5%
DRAWING	2 categories	358924	29.97%
DRAWING	3 categories	149269	12.46%
DRAWING	4 categories	118525	9.9%
DRAWING	5+ categories	74735	6.24%
MULTIMEDIA	1 category	2	50%
MULTIMEDIA	2 categories	1	25%
MULTIMEDIA	3 categories	1	25%
OFFICE	0 category	3869	1.06%
OFFICE	1 category	285574	78.54%
OFFICE	2 categories	38990	10.72%
OFFICE	3 categories	24767	6.81%
OFFICE	4 categories	5918	1.63%
OFFICE	5+ categories	4475	1.23%
VIDEO	0 category	12515	11.31%
VIDEO	1 category	25489	23.04%
VIDEO	2 categories	19412	17.55%
VIDEO	3 categories	13702	12.39%
VIDEO	4 categories	10028	9.06%
VIDEO	5+ categories	29483	26.65%

As of November 20, 2017, out of the total 43,268,565 files, 41,796,560 (96.6%) have an infobox and 41,309,028 (95.47%) have some content in their description fields (description, title, depicted people, depicted place, etc.).

There are 14,848,551 (34.32%) files that don't have any language templates, and 23,780,247 (54.96%) files use only 1 language template. 40.1% of all files have English templates, 9.38% of files use German, and 6.2% of files have descriptions in languages which are not in the top 20.

The number of files by the number of language templates on Commons
The number of files by the top 20 language templates on Commons

Among files without language templates, we detected exactly 1 language for 7,577,789 files (17.51% of all 43,268,565 files) and no language for 556,684 files (1.29%). English was detected in 30.25% of all 43,268,565 files.

The number of Commons files without language templates, by number of detected languages
The number of Commons files without language templates, by top 20 detected languages

There are a lot of variations on the infobox and language templates, which makes counting them challenging, and thus our results are imperfect. For example, some files' descriptions are hidden in customized templates; some infobox templates use non-standard parameter names. See variations on the {{Information}} template for more examples.

File deletion

Tracked in Phabricator:
Task T177356

Methods

On October 12, 2017, we queried the filearchive table of the Wikimedia Commons database and computed or extracted the following:

Number of deleters (users who have deleted at least one file) over time
How many files each user has deleted
Time to deletion, broken up by file type and reason for deletion (copyright violation vs other)
Reasons for deletion
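Time to deletion is the gap between a file's upload timestamp and its deletion timestamp. The Python sketch below assumes both are stored in MediaWiki's 14-digit YYYYMMDDHHMMSS format (in the filearchive table these would be the fa_timestamp and fa_deleted_timestamp fields, which is an assumption about the query, not something stated above). The function name is illustrative.

```python
from datetime import datetime

MW_FORMAT = "%Y%m%d%H%M%S"  # MediaWiki timestamps look like 20171012093000


def days_to_deletion(uploaded, deleted):
    """Return the time between upload and deletion in (fractional) days."""
    delta = datetime.strptime(deleted, MW_FORMAT) - datetime.strptime(
        uploaded, MW_FORMAT
    )
    return delta.total_seconds() / 86400
```

For instance, a file uploaded at 20171001120000 and deleted at 20171003180000 has a time to deletion of 2.25 days.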

Results

Number of users who've deleted at least one file.
Users' file deletion activity.

Time to deletion

Distribution of files' time-to-deletion, by media type and reason for deletion.

Reasons for deletion

Reasons for deletion were the most difficult metric to obtain because of the variability in users' comments (for example, "copyright" vs "copyvio"). The full logic is available for reference via the SQL query in the repository. The numbers below are an approximation, since some reasons were bound to be missed by that logic. We recommend creating an interface that lists the most common deletion reasons, so that when a user selects one, a standardized message is inserted automatically. This would make these numbers much easier to track going forward.
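The kind of matching involved can be illustrated with a small Python sketch. The pattern below is illustrative only and much cruder than the SQL logic in the repository: it covers just the two spellings mentioned above and puts everything else in "other".

```python
import re

# Illustrative pattern: "copyright", "copyvio", or "copy vio", any letter case.
COPYRIGHT = re.compile(r"copy\s*vio|copyright", re.IGNORECASE)


def deletion_reason(comment):
    """Classify a free-text deletion comment as a copyright violation or other."""
    if comment and COPYRIGHT.search(comment):
        return "copyright violation"
    return "other"
```

Any comment phrased differently (for example, one citing only a policy shortcut) would fall into "other" here, which is exactly why free-text reasons yield approximate counts.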

Approximate breakdown of reasons for files deleted in 2017.

File search

Tracked in Phabricator:
Task T177358

Tracked in Phabricator:
Task T177534

We wanted to know which languages users search Wikimedia Commons in, and how satisfied they are.

Methods

We used Google's Compact Language Detector (CLDv3) and MediaWiki's TextCat (an updated version trained on Wikimedia queries and articles) to detect the languages of search queries made on Commons in December 2017. It should be noted that these numbers are to be taken with a HUGE pile of salt. Language detection on short text (such as search queries) is extremely difficult to get right because of how few characters the algorithms have to make a prediction from (unlike, say, detecting the language of an essay or a book). This is especially problematic when the search query is a proper noun. For example, only CLDv3 returned a language prediction for "Wii u" (a Nintendo game console), but that prediction was "German", while the other methods such as TextCat and CLDv2 did not return anything. There are also cases like "solar" being detected as Azerbaijani while, in the same session, none of the algorithms detected any language for "solar system".

We computed several desktop search metrics from event logging data collected in November 2017 and compared them with those of English Wikipedia. Specifically, we computed:

Zero results rate: Proportion of searches that did not yield any results. The lower, the better.
Click-through rate: Proportion of searches with at least one click on the search results. The higher, the better.
Proportion of searches with clicks to see other pages of the search results. The lower, the better.
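The three metrics above are simple proportions over searches. The following Python sketch shows one way to compute them. The record layout (`results`, `clicks`, `paged`) is hypothetical and does not reflect the actual event logging schema, and all searches are assumed to form the denominator for each metric.

```python
def search_metrics(searches):
    """Compute the three proportions from a list of per-search records.

    Each record is a dict with:
      results: number of results returned
      clicks:  number of clicks on results
      paged:   True if the user clicked through to another results page
    """
    total = len(searches)
    if total == 0:
        raise ValueError("at least one search is required")
    zero_results = sum(1 for s in searches if s["results"] == 0)
    clicked = sum(1 for s in searches if s["clicks"] > 0)
    paged = sum(1 for s in searches if s["paged"])
    return {
        "zero_results_rate": zero_results / total,
        "clickthrough_rate": clicked / total,
        "paging_rate": paged / total,
    }
```

A search with several clicks still counts once toward the clickthrough rate, which matches the "at least one click" definition.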

Results

The following table shows the approximate proportion of search queries on Commons for each language:

language	approx. %
English	38.4243%
German	5.0856%
French	3.9716%
Latin	3.3742%
Italian	3.3097%
Spanish; Castilian	2.6316%
Portuguese	2.0665%
Norwegian	1.7598%
Dutch; Flemish	1.7113%
Danish	1.6629%
Afrikaans	1.5983%
Chinese	1.3400%
Catalan; Valencian	1.3077%
Japanese	1.2916%
Polish	1.1786%
Luxembourgish; Letzeburgesch	1.0494%
Serbian	1.0494%
Welsh	1.0171%
Galician	0.8557%
Estonian	0.8395%
Hindi	0.8395%
Finnish	0.8234%
Russian	0.7588%
Czech	0.7427%
Indonesian	0.7427%
Swedish	0.7265%
Western Frisian	0.7265%
Bulgarian	0.6942%
Hungarian	0.6619%
Ukrainian	0.6619%
Malagasy	0.6458%
Javanese	0.6296%
Greek, Modern (1453-)	0.5974%
Haitian; Haitian Creole	0.5974%
Romanian; Moldavian; Moldovan	0.5328%
Malay	0.5166%
Basque	0.5005%
Bosnian	0.5005%
Slovenian	0.5005%
Corsican	0.4682%
Esperanto	0.4682%
Sundanese	0.4521%
Turkish	0.4359%
Hausa	0.4198%
Lithuanian	0.3713%
Shona	0.3713%
Arabic	0.3552%
Igbo	0.3552%
Gaelic; Scottish Gaelic	0.3390%
Irish	0.3390%
Latvian	0.3390%
Azerbaijani	0.3229%
Belarusian	0.3229%
Chichewa; Chewa; Nyanja	0.3229%
Kirghiz; Kyrgyz	0.3229%
Korean	0.3229%
Somali	0.3229%
Maltese	0.3067%
Croatian	0.2583%
Sotho, Southern	0.2422%
Icelandic	0.2260%
Persian	0.2260%
Uzbek	0.2260%
Swahili	0.1937%
Xhosa	0.1937%
Samoan	0.1776%
Zulu	0.1614%
Kurdish	0.1453%
Yoruba	0.1453%
Kazakh	0.1292%
Macedonian	0.1130%
Mongolian	0.1130%
Slovak	0.1130%
Maori	0.0969%
Thai	0.0969%
Vietnamese	0.0969%
Albanian	0.0807%
Breton	0.0807%
Georgian	0.0807%
Kinyarwanda	0.0807%
Sindhi	0.0807%
Tagalog	0.0484%
Tajik	0.0484%
Bengali	0.0323%
Marathi	0.0323%
Pushto; Pashto	0.0323%
Armenian	0.0161%
Burmese	0.0161%
Central Khmer	0.0161%
Tamil	0.0161%
Urdu	0.0161%
Yiddish	0.0161%

7.51% of full-text searches on desktop did not yield any results on Commons, which is slightly lower than English Wikipedia.

Proportion of full-text searches on desktop that did not yield any results on Commons, with 95% confidence intervals.
Daily search-wise full-text zero results rates on desktop on Commons, with 95% credible intervals. Dashed line marks the overall zero results rate.

The full-text search results click-through rate is 10.42% on Commons, which is much lower than that on English Wikipedia. For autocomplete search, Commons' clickthrough rate is about 19%.

Desktop full-text search results clickthrough rates on Commons, with 95% credible intervals.
Daily search-wise full-text clickthrough rates on desktop on Commons, with 95% credible intervals. Dashed line marks the overall clickthrough rate.
Daily desktop autocomplete search clickthrough rates on Commons, with 95% credible intervals.

Users on Commons are much more likely to click through to other pages of search results (10.85%) than users on English Wikipedia (0.31%). Combined with the lower clickthrough rate, this leads us to believe that users are less likely to find what they want through search on Commons.

Proportion of desktop full-text searches with clicks to see other pages of the search results on Commons, with 95% credible intervals.
Daily proportion of desktop full-text searches with clicks to see other pages of the search results on Commons, with 95% credible intervals. Dashed line marks the overall proportion.

References

[1] "Statistics - Wikimedia Commons". commons.wikimedia.org. Retrieved 2019-02-26.
[2] "Commons:Structured data/About". Retrieved 2017-12-14.
