Research:Baseline Metrics for Structured Data on Wikimedia Commons
Wikimedia Commons, a sister project of Wikipedia, is a collection of more than 50 million free media files as of February 2019[1]. The project Structured Data on Wikimedia Commons (SDoC) converts information about these files to a structured and machine-readable format – making them easier to view, search, edit, organize and re-use – in many languages. This is implemented with Wikibase, the same technology as used for Wikidata. Wikimedia community members and staff from the Wikimedia Foundation (WMF) and Wikimedia Deutschland (WMDE) (the Wikidata team) will work on this project from 2017 till the end of 2019.[2]
To measure the effectiveness of new functionality on Wikimedia Commons, we need to establish relevant, measurable criteria and a 2017 baseline against which we can compare in the future. After discussion with the SDoC team, we decided to compute metrics for the following aspects:
- File contribution: How many files are uploaded by bots vs users?
- File type: Number of files by type and their trend over time
- File description: Information box usage, categorization, and languages
- File deletion: Time to deletion and reason for deletion
- File search: How many search queries happen in what languages? How satisfied are users with their searches on Commons?
File contribution
We are interested in comparing the number of files uploaded by bots and by users.
Methods
On October 12, 2017, we queried the image table of the Wikimedia Commons database and counted the number of files uploaded by bots vs. others. We then broke down the counts by media type (defined by the img_media_type field in the image table) and by month. We identified as bot accounts those user accounts that belong to the bot group in the user_groups table or the user_former_groups table, and those that belong to categories whose names match _(bot_flag|bots)(_|$). Some bots are operated by institutions, while others are automatic tools such as Flickr upload bot.
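The bot-identification rule above can be sketched in Python as follows. This is only an illustration: the regular expression is the one quoted in the text, but the function name, the input shapes, the example category names, and the use of an unanchored search are assumptions rather than the actual query logic.

```python
import re

# Pattern quoted in the Methods text. Using an unanchored search on
# underscore-separated category names is an assumption of this sketch.
BOT_CATEGORY_RE = re.compile(r"_(bot_flag|bots)(_|$)")


def is_bot_account(groups, former_groups, categories):
    """Return True if an account looks like a bot under the stated rule.

    groups / former_groups: sets of group names (as in user_groups and
    user_former_groups); categories: names of categories the account is in.
    """
    if "bot" in groups or "bot" in former_groups:
        return True
    return any(BOT_CATEGORY_RE.search(name) for name in categories)


# Hypothetical accounts, for illustration only.
print(is_bot_account({"bot"}, set(), []))                # current bot group
print(is_bot_account(set(), {"bot"}, []))                # former bot group
print(is_bot_account(set(), set(), ["Commons_bots"]))    # category match
print(is_bot_account(set(), set(), ["Commons_users"]))   # no match
```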
Results
As of October 12, 2017, the number of files uploaded by bots was 9,390,721 (22.03%), and the number of files uploaded by users was 33,241,541 (77.97%). The following table breaks down the counts by media type:
Media Type | User Group | Number of Files | Proportion |
---|---|---|---|
bitmap | user | 31355343 | 73.55% |
bitmap | bot | 8843447 | 20.74% |
drawing | user | 905964 | 2.13% |
drawing | bot | 270516 | 0.63% |
audio | user | 698566 | 1.64% |
audio | bot | 95646 | 0.22% |
video | user | 71738 | 0.17% |
video | bot | 36329 | 0.09% |
multimedia | user | 4 | 0% |
office | user | 209926 | 0.49% |
office | bot | 144783 | 0.34% |
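As a quick consistency check, the totals and proportions can be recomputed from the counts in the table. The snippet below is a small sketch; the variable and function names are our own.

```python
# Counts copied from the table above: (media type, user group) -> files.
counts = {
    ("bitmap", "user"): 31355343,
    ("bitmap", "bot"): 8843447,
    ("drawing", "user"): 905964,
    ("drawing", "bot"): 270516,
    ("audio", "user"): 698566,
    ("audio", "bot"): 95646,
    ("video", "user"): 71738,
    ("video", "bot"): 36329,
    ("multimedia", "user"): 4,
    ("office", "user"): 209926,
    ("office", "bot"): 144783,
}

total = sum(counts.values())
bot_total = sum(n for (_, group), n in counts.items() if group == "bot")
user_total = total - bot_total


def share(n):
    """Percentage of all files, rounded to two decimals as in the table."""
    return round(100 * n / total, 2)


print(total, bot_total, user_total)
print(share(bot_total), share(user_total))
```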
The following two graphs break down the counts by month:
- Figure: Number of files uploaded by bots vs by users
- Figure: Proportion of files uploaded by bots vs by users
File type
We are interested in the distribution of file types and extensions uploaded by users over time, including how many files of each type are uploaded on a monthly basis.
Methods
On October 12, 2017, we queried the image table of the Wikimedia Commons database and counted uploaded files by extension per day.
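A per-day count by extension could look roughly like the sketch below. It assumes rows of (img_name, img_timestamp) with MediaWiki-style YYYYMMDDHHMMSS timestamps; the text does not say which columns were used, and the alias mapping (jpeg to jpg, tiff to tif) and the sample rows are illustrative assumptions.

```python
from collections import Counter

# Assumption: extension aliases are merged the way the results table groups them.
ALIASES = {"jpeg": "jpg", "tiff": "tif"}


def extension_of(file_name):
    """Lower-cased extension of a file name, or '' if there is none."""
    if "." not in file_name:
        return ""
    ext = file_name.rsplit(".", 1)[-1].lower()
    return ALIASES.get(ext, ext)


def uploads_per_day(rows):
    """Count uploads keyed by (YYYY-MM-DD, extension).

    rows: iterable of (file name, timestamp string 'YYYYMMDDHHMMSS').
    """
    counts = Counter()
    for name, ts in rows:
        day = f"{ts[0:4]}-{ts[4:6]}-{ts[6:8]}"
        counts[(day, extension_of(name))] += 1
    return counts


# Hypothetical rows, for illustration only.
rows = [
    ("Example_a.JPG", "20171011093000"),
    ("Example_b.jpeg", "20171011120000"),
    ("Example_c.svg", "20171012080000"),
]
daily = uploads_per_day(rows)
for key in sorted(daily):
    print(key, daily[key])
```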
Results
As of October 12, 2017, the majority of files uploaded to Wikimedia Commons were images (see the treemap on the side), with nearly 37M JPG/JPEG, 2.3M PNG, and 1.2M SVG files in total. Those formats also account for the most uploads on a monthly basis (daily upload counts are available in the repository):
- Figure: Cumulative upload counts by file extension.
- Figure: Uploads per month by file extension.
The table below shows the full breakdown by media type and extension as of October 12th, 2017:
media | extension | uploads |
---|---|---|
audio | ogg | 773305 |
audio | oga | 6180 |
audio | flac | 6140 |
audio | mid | 4993 |
audio | wav | 3512 |
audio | opus | 410 |
docs | pdf | 354765 |
docs | djvu | 60524 |
image | jpg/jpeg | 36918799 |
image | png | 2268026 |
image | svg | 1176530 |
image | tif/tiff | 807921 |
image | gif | 153959 |
image | xcf | 1008 |
image | webp | 95 |
video | ogv | 66610 |
video | webm | 41161 |
File description
Typically, every media file on Commons is accompanied by plain-text descriptions (wikitext, templates) and categories. The findability of files is largely determined by this information. Hence, we need to figure out the information box usage, categorization, and language usage of files on Commons.
Methods
On December 12, 2017, we queried the image table of the Wikimedia Commons database and counted the number of files by the number of categories they belong to. We excluded hidden categories (identified by querying the page_props table) and "needing_category" categories (identified by querying the categorylinks table) from the counts.
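The category counting can be illustrated with the following sketch. The bucket labels follow the results table; the substring test for "needing_category", the function names, and the sample data are assumptions and do not reproduce the actual SQL.

```python
from collections import Counter


def visible_category_count(categories, hidden):
    """Count categories that are neither hidden nor 'needing_category' ones.

    The substring test is an assumption about how such categories are named.
    """
    return sum(
        1
        for name in categories
        if name not in hidden and "needing_category" not in name.lower()
    )


def bucket(n):
    """Map a category count to the labels used in the results table."""
    if n == 0:
        return "0 category"
    if n == 1:
        return "1 category"
    if n >= 5:
        return "5+ categories"
    return f"{n} categories"


# Hypothetical files and categories, for illustration only.
hidden = {"CC-BY-SA-4.0"}
files = {
    "Example_a.jpg": ["Cats", "CC-BY-SA-4.0"],
    "Example_b.jpg": ["Files_needing_category_review"],
    "Example_c.jpg": ["Cats", "Pets", "Animals"],
}
histogram = Counter(
    bucket(visible_category_count(cats, hidden)) for cats in files.values()
)
print(histogram)
```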
To determine information box and language usage of files on Commons, we parsed the wikitext of all files in the Commons XML data dumps of November 20, 2017 with the mwparserfromhell library to extract their infobox templates and language templates (e.g. {{en}}, {{LangSwitch}}). Note that our way of extracting infobox templates differs from that of the CommonsMetadata extension: we parsed the wikitext instead of the HTML, and included not only standard infobox templates but also templates that have a description field.
For files without a language template, we used the langdetect package to detect their languages.
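To give a feel for the template extraction step, here is a deliberately simplified sketch that uses a regular expression from the Python standard library as a stand-in for mwparserfromhell. It is not the parser used in the analysis and will miss edge cases that a real wikitext parser handles. The set of language codes, the function names, and the sample wikitext are assumptions.

```python
import re

# Matches the name of each template, including nested ones: the text after
# "{{" up to the first "|" or "}}".
TEMPLATE_NAME_RE = re.compile(r"\{\{\s*([^|{}]+?)\s*(?:\||\}\})")

# Hypothetical subset of language templates; the real list is much longer.
LANGUAGE_CODES = {"en", "de", "fr", "es"}
MULTILINGUAL_TEMPLATES = {"langswitch"}


def template_names(wikitext):
    """Lower-cased names of all templates found in the wikitext."""
    return [name.strip().lower() for name in TEMPLATE_NAME_RE.findall(wikitext)]


def language_templates(wikitext):
    """Names of templates that look like language templates."""
    return [
        name
        for name in template_names(wikitext)
        if name in LANGUAGE_CODES or name in MULTILINGUAL_TEMPLATES
    ]


sample = "{{Information|description={{en|1=A cat}}{{de|1=Eine Katze}}|source=own}}"
print(template_names(sample))
print(language_templates(sample))
```

Files for which `language_templates` returns an empty list would then be the ones passed on to automatic language detection.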
Results
There are 1,629,592 (3.73%) files on Commons that don't belong to any category. The following graph shows the number of files by the number of categories:
- Figure: Number of files by number of categories
The following table breaks down the counts further by media type:
Media Type | Categories | Number of Files | Proportion |
---|---|---|---|
AUDIO | 0 category | 2007 | 0.25% |
AUDIO | 1 category | 404697 | 50.54% |
AUDIO | 2 categories | 346496 | 43.27% |
AUDIO | 3 categories | 15327 | 1.91% |
AUDIO | 4 categories | 6667 | 0.83% |
AUDIO | 5+ categories | 25552 | 3.19% |
BITMAP | 0 category | 1599973 | 3.89% |
BITMAP | 1 category | 21292109 | 51.73% |
BITMAP | 2 categories | 9944133 | 24.16% |
BITMAP | 3 categories | 4379117 | 10.64% |
BITMAP | 4 categories | 1886515 | 4.58% |
BITMAP | 5+ categories | 2057464 | 5% |
DRAWING | 0 category | 11228 | 0.94% |
DRAWING | 1 category | 485009 | 40.5% |
DRAWING | 2 categories | 358924 | 29.97% |
DRAWING | 3 categories | 149269 | 12.46% |
DRAWING | 4 categories | 118525 | 9.9% |
DRAWING | 5+ categories | 74735 | 6.24% |
MULTIMEDIA | 1 category | 2 | 50% |
MULTIMEDIA | 2 categories | 1 | 25% |
MULTIMEDIA | 3 categories | 1 | 25% |
OFFICE | 0 category | 3869 | 1.06% |
OFFICE | 1 category | 285574 | 78.54% |
OFFICE | 2 categories | 38990 | 10.72% |
OFFICE | 3 categories | 24767 | 6.81% |
OFFICE | 4 categories | 5918 | 1.63% |
OFFICE | 5+ categories | 4475 | 1.23% |
VIDEO | 0 category | 12515 | 11.31% |
VIDEO | 1 category | 25489 | 23.04% |
VIDEO | 2 categories | 19412 | 17.55% |
VIDEO | 3 categories | 13702 | 12.39% |
VIDEO | 4 categories | 10028 | 9.06% |
VIDEO | 5+ categories | 29483 | 26.65% |
Out of the total 43,268,565 files as of November 20, 2017, 41,796,560 (96.6%) have an infobox, and 41,309,028 (95.47%) have some content in their description fields (description, title, depicted people, depicted place, etc.).
There are 14,848,551 (34.32%) files that don't have any language templates, and 23,780,247 (54.96%) files use only 1 language template. 40.1% of all files have English templates, 9.38% of files use German, and 6.2% of files have descriptions in languages which are not in the top 20.
- Figure: The number of files by the number of language templates on Commons
- Figure: The number of files by the top 20 language templates on Commons
Among files without language templates, we detected a single language for 7,577,789 files (17.51% of all 43,268,565 files). No language was detected in 556,684 files (1.29%). English was detected in 30.25% of all 43,268,565 files.
- Figure: For Commons files without language templates, the number of files by the number of detected languages
- Figure: For Commons files without language templates, the number of files by top 20 detected languages
There are a lot of variations on the infobox and language templates, which makes counting them challenging, and thus our results are imperfect. For example, some files' descriptions are hidden in customized templates; some infobox templates use non-standard parameter names. See variations on the {{Information}} template for more examples.
File deletion
Methods
On October 12, 2017, we queried the filearchive table of the Wikimedia Commons database and computed/extracted the following:
- Number of deleters (users who have deleted at least one file) over time
- How many files each user has deleted
- Time to deletion, broken up by file type and reason for deletion (copyright violation vs other)
- Reasons for deletion
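Time to deletion can be derived from a file's upload and deletion timestamps. The sketch below assumes MediaWiki-style YYYYMMDDHHMMSS timestamps (as stored in fields such as fa_timestamp and fa_deleted_timestamp of the filearchive table); whether these exact fields were used is an assumption. The median is used here only as an example summary; the rows and function names are hypothetical.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median

TS_FORMAT = "%Y%m%d%H%M%S"  # MediaWiki timestamp layout


def days_to_deletion(uploaded, deleted):
    """Days between upload and deletion, as a float."""
    delta = datetime.strptime(deleted, TS_FORMAT) - datetime.strptime(
        uploaded, TS_FORMAT
    )
    return delta.total_seconds() / 86400


def median_days_by_group(rows):
    """Median time to deletion per (media type, reason group).

    rows: iterable of (media_type, reason_group, uploaded_ts, deleted_ts).
    """
    grouped = defaultdict(list)
    for media_type, reason_group, uploaded, deleted in rows:
        grouped[(media_type, reason_group)].append(
            days_to_deletion(uploaded, deleted)
        )
    return {key: median(values) for key, values in grouped.items()}


# Hypothetical rows, for illustration only.
rows = [
    ("BITMAP", "copyright violation", "20171001000000", "20171001120000"),
    ("BITMAP", "copyright violation", "20171001000000", "20171003120000"),
    ("BITMAP", "other", "20170101000000", "20170111000000"),
]
medians = median_days_by_group(rows)
print(medians)
```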
Results
- Figure: Number of users who've deleted at least one file.
- Figure: Users' file deletion activity.
Time to deletion
- Figure: Distribution of files' time-to-deletion, by media type and reason for deletion.
Reasons for deletion
This was the most difficult metric to compute because of the variability in users' comments (for example, "copyright" vs "copyvio"). The full logic is available for reference via the SQL query in the repository. The numbers below are an approximation, as some reasons were bound to be missed by the logic. We recommend creating an interface that offers the most common deletion reasons, so that when the user selects one, a standardized message is inserted automatically; this would make tracking these numbers much easier going forward.
- Figure: Approximate breakdown of reasons for files deleted in 2017.
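The kind of keyword matching involved in grouping deletion reasons can be sketched as follows. This is a much cruder rule than the full SQL logic in the repository: it only looks for the two spellings mentioned above, and the function name, group labels, and sample comments are assumptions.

```python
import re
from collections import Counter

# Only the two variants named in the text; the real logic covers more cases.
COPYRIGHT_RE = re.compile(r"copyright|copyvio", re.IGNORECASE)


def reason_group(comment):
    """Classify a free-text deletion comment as copyright-related or other."""
    if comment and COPYRIGHT_RE.search(comment):
        return "copyright violation"
    return "other"


# Hypothetical deletion comments, for illustration only.
comments = [
    "Copyright violation: taken from a news site",
    "copyvio",
    "Uploader request",
    None,
]
breakdown = Counter(reason_group(c) for c in comments)
print(breakdown)
```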
File search
We wanted to know which languages users search Wikimedia Commons in, and how satisfied they are.
Methods
We used Google's Compact Language Detector (CLDv3) and MediaWiki's TextCat (an updated version trained on Wikimedia queries and articles) to detect the languages of search queries made on Commons in December 2017. These numbers should be taken with a HUGE pile of salt. Language detection on short text (such as search queries) is extremely difficult to get right because the algorithms have so few characters to base a prediction on (unlike, say, detecting the language of an essay or a book). This is especially problematic when the search query is a proper noun. For example, only CLDv3 returned a language prediction for "Wii u" (a Nintendo game console), but that prediction was "German", while other methods such as TextCat and CLDv2 did not return anything. In another case, "solar" was detected as Azerbaijani, while in the same session none of the algorithms detected any language for "solar system".
We computed several desktop search metrics from event logging data collected in November 2017, and compared them with English Wikipedia. Specifically, we computed:
- Zero results rate: Proportion of searches that did not yield any results. The lower, the better.
- Click-through rate: Proportion of searches with at least one click on the search results. The higher, the better.
- Proportion of searches with clicks to see other pages of the search results. The lower, the better.
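The three metrics above are simple proportions over searches. The sketch below shows one way to compute them from per-search records; the record fields, names, and sample data are hypothetical and do not reflect the actual event logging schema.

```python
from dataclasses import dataclass


@dataclass
class Search:
    """One full-text search; all fields are hypothetical."""
    n_results: int          # number of results returned
    result_clicks: int      # clicks on search results
    other_page_clicks: int  # clicks to see other pages of results


def search_metrics(searches):
    """Return zero results rate, clickthrough rate, and other-pages rate."""
    n = len(searches)
    if n == 0:
        raise ValueError("no searches to summarize")
    return {
        "zero_results_rate": sum(s.n_results == 0 for s in searches) / n,
        "clickthrough_rate": sum(s.result_clicks > 0 for s in searches) / n,
        "other_pages_rate": sum(s.other_page_clicks > 0 for s in searches) / n,
    }


# Hypothetical searches, for illustration only.
searches = [
    Search(n_results=0, result_clicks=0, other_page_clicks=0),
    Search(n_results=20, result_clicks=1, other_page_clicks=0),
    Search(n_results=500, result_clicks=2, other_page_clicks=1),
    Search(n_results=35, result_clicks=0, other_page_clicks=0),
]
metrics = search_metrics(searches)
print(metrics)
```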
Results
The following table shows the approximate proportion of search queries on Commons for each language:
language | approx. % |
---|---|
English | 38.4243% |
German | 5.0856% |
French | 3.9716% |
Latin | 3.3742% |
Italian | 3.3097% |
Spanish; Castilian | 2.6316% |
Portuguese | 2.0665% |
Norwegian | 1.7598% |
Dutch; Flemish | 1.7113% |
Danish | 1.6629% |
Afrikaans | 1.5983% |
Chinese | 1.3400% |
Catalan; Valencian | 1.3077% |
Japanese | 1.2916% |
Polish | 1.1786% |
Luxembourgish; Letzeburgesch | 1.0494% |
Serbian | 1.0494% |
Welsh | 1.0171% |
Galician | 0.8557% |
Estonian | 0.8395% |
Hindi | 0.8395% |
Finnish | 0.8234% |
Russian | 0.7588% |
Czech | 0.7427% |
Indonesian | 0.7427% |
Swedish | 0.7265% |
Western Frisian | 0.7265% |
Bulgarian | 0.6942% |
Hungarian | 0.6619% |
Ukrainian | 0.6619% |
Malagasy | 0.6458% |
Javanese | 0.6296% |
Greek, Modern (1453-) | 0.5974% |
Haitian; Haitian Creole | 0.5974% |
Romanian; Moldavian; Moldovan | 0.5328% |
Malay | 0.5166% |
Basque | 0.5005% |
Bosnian | 0.5005% |
Slovenian | 0.5005% |
Corsican | 0.4682% |
Esperanto | 0.4682% |
Sundanese | 0.4521% |
Turkish | 0.4359% |
Hausa | 0.4198% |
Lithuanian | 0.3713% |
Shona | 0.3713% |
Arabic | 0.3552% |
Igbo | 0.3552% |
Gaelic; Scottish Gaelic | 0.3390% |
Irish | 0.3390% |
Latvian | 0.3390% |
Azerbaijani | 0.3229% |
Belarusian | 0.3229% |
Chichewa; Chewa; Nyanja | 0.3229% |
Kirghiz; Kyrgyz | 0.3229% |
Korean | 0.3229% |
Somali | 0.3229% |
Maltese | 0.3067% |
Croatian | 0.2583% |
Sotho, Southern | 0.2422% |
Icelandic | 0.2260% |
Persian | 0.2260% |
Uzbek | 0.2260% |
Swahili | 0.1937% |
Xhosa | 0.1937% |
Samoan | 0.1776% |
Zulu | 0.1614% |
Kurdish | 0.1453% |
Yoruba | 0.1453% |
Kazakh | 0.1292% |
Macedonian | 0.1130% |
Mongolian | 0.1130% |
Slovak | 0.1130% |
Maori | 0.0969% |
Thai | 0.0969% |
Vietnamese | 0.0969% |
Albanian | 0.0807% |
Breton | 0.0807% |
Georgian | 0.0807% |
Kinyarwanda | 0.0807% |
Sindhi | 0.0807% |
Tagalog | 0.0484% |
Tajik | 0.0484% |
Bengali | 0.0323% |
Marathi | 0.0323% |
Pushto; Pashto | 0.0323% |
Armenian | 0.0161% |
Burmese | 0.0161% |
Central Khmer | 0.0161% |
Tamil | 0.0161% |
Urdu | 0.0161% |
Yiddish | 0.0161% |
7.51% of full-text searches on desktop did not yield any results on Commons, which is slightly lower than on English Wikipedia.
- Figure: Proportion of full-text searches on desktop that did not yield any results on Commons, with 95% confidence intervals.
- Figure: Daily search-wise full-text zero results rates on desktop on Commons, with 95% credible intervals. Dashed line marks the overall zero results rate.
The full-text search results click-through rate is 10.42% on Commons, which is much lower than that on English Wikipedia. For autocomplete search, Commons' clickthrough rate is about 19%.
- Figure: Desktop full-text search results clickthrough rates on Commons, with 95% credible intervals.
- Figure: Daily search-wise full-text clickthrough rates on desktop on Commons, with 95% credible intervals. Dashed line marks the overall clickthrough rate.
- Figure: Daily desktop autocomplete search clickthrough rates on Commons, with 95% credible intervals.
Users are much more likely to click to see other pages of search results on Commons (10.85%) than users on English Wikipedia (0.31%). Together with the lower clickthrough rate, this leads us to believe that users are less likely to find what they want through search on Commons.
- Figure: Proportion of desktop full-text searches with clicks to see other pages of the search results on Commons, with 95% credible intervals.
- Figure: Daily proportion of desktop full-text searches with clicks to see other pages of the search results on Commons, with 95% credible intervals. Dashed line marks the overall proportion.
References
- [1] "Statistics - Wikimedia Commons". commons.wikimedia.org. Retrieved 2019-02-26.
- [2] "Commons:Structured data/About". Retrieved 2017-12-14.