Movement Insights/Wiki comparison

The wiki comparison tool shows a simple, snapshot comparison of our wikis (unlike, say, Wikistats, which is meant to show simple trends within individual wikis or wiki groups).

Suggested statistics

This page lists suggested additional statistics for the wiki comparisons dataset. Please add your own ideas or vote for ideas that are already here!

Statistic Notes Votes
Mobile retention rate Needs some data infrastructure (e.g. editor_day table) Neil, Jan
Uses language variants? Not sure what the best way to get this list is
Fundraising revenue Would help us understand places where major reader-facing changes could potentially impact fundraising
Number of associated countries Probably based on the percentage of editors and readers from the country; needs analysis to decide where to set the threshold; perhaps should be split into major and minor countries Jan, Dana
Top country Jan, Dana
Ratio of top associated country to second associated country Jan
wiki creation date In progress: T336999.
monthly registration
very active editors editors with 100 or more content edits in a month Jan, Denis
uses flagged revisions
has ArbCom Relevant profiles: https://office.wikimedia.org/wiki/Trust_and_Safety/ArbCom_maps; overall: https://meta.wikimedia.org/wiki/Arbitration_Committee Jan, Sydney
has oversighters https://meta.wikimedia.org/wiki/Oversight_policy#Requests_for_oversight Jan
has checkusers https://meta.wikimedia.org/wiki/CheckUser_policy#Access_to_CheckUser Jan
Global South/Emerging Communities traffic percentage Jan, Dana
speakers to editors ratio Would require collecting (above) speaker population, but that will be very difficult to gather for anything more than top 20 or so languages; Ethnologue sells the best dataset Jan
speakers to mobile devices ratio also requires speaker populations Jan
language health Ethnologue's dataset also includes a measure of language vitality: https://www.sil.org/about/endangered-languages/language-vitality Neil
number of user account blocks, by reason The Community health initiative looked at one week of total block counts by wiki: https://docs.google.com/spreadsheets/d/1_4GZ2WUurxaehlNeab5mF7VDgOgHd-PXjFjsLD5tY2Q/edit#gid=1703473757 (good data, but it needs to be separated by reason to be more meaningful) Sydney, Jan
talk page edits proportion
number of talk page editors
median article quality velocity and acceleration would also be interesting if they can be queried directly with ease Adam
volume in important articles velocity and acceleration would also be interesting if they can be queried directly with ease Adam
articles injected from translation velocity and acceleration would also be interesting if they can be queried directly with ease Adam
multimedia (by type) coverage velocity and acceleration would also be interesting if they can be queried directly with ease Adam
mobile Android app edits this is just to be more granular; velocity and acceleration would also be interesting if they can be queried directly with ease Adam
mobile iOS app edits this is just to be more granular; velocity and acceleration would also be interesting if they can be queried directly with ease Adam
mobile web edits this is just to be more granular; velocity and acceleration would also be interesting if they can be queried directly with ease Adam
citation (by type) coverage velocity and acceleration would also be interesting if they can be queried directly with ease Adam
external referer count velocity and acceleration would also be interesting if they can be queried directly with ease Adam
Logged in page views, mobile and desktop distribution of page views between logged in and non-logged in users Margeigh
% of editors who edit other wikis might help distinguish e.g. Meta, MediaWiki.org, Commons from other wikis
new content pages
Link this to country-specific data (like that in https://docs.google.com/spreadsheets/d/1AMUiZ4z3CCSBClmEU8T6JpASJd2_D4vfR674B2c6NCg/edit#gid=0). It won't be very hard to decide which countries are associated with which wikis, but it's not clear how to weigh country data back up into a per-wiki value. Neil, Adam
Median page load time or other connection quality metric as recommended by the Performance team (e.g. like this but per wiki instead of per country: https://commons.wikimedia.org/wiki/File:Median_Wikipedia_page_load_times_by_country_(desktop%2Bmobile,_enwiki,_Dec_2015-Jan_2016).svg ) Tilman, Quiddity
Anon edits broken down by platform (mobile web, mobile app, desktop) Rita
Registered users Ease sorting and looking up.
Community growth % For the last three years (e.g. 2019-2022), the last year (2021-2022) and the year before (2020-2021). Denis
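
Some of the suggestions above come with a concrete definition, which makes them easy to prototype. The following is a minimal sketch, not code from the wiki-comparison repository: it assumes a hypothetical in-memory mapping of editor name to monthly content edit count, and the function names are made up for illustration. It counts "very active editors" (100 or more content edits in a month) and computes a growth percentage between two values, as the "Community growth %" suggestion would need.

```python
VERY_ACTIVE_THRESHOLD = 100  # from the table: 100 or more content edits in a month


def count_very_active_editors(monthly_content_edits):
    """Count editors with at least VERY_ACTIVE_THRESHOLD content edits.

    monthly_content_edits: hypothetical mapping of editor name -> number of
    content edits that editor made on one wiki in one month.
    """
    return sum(
        1 for edits in monthly_content_edits.values()
        if edits >= VERY_ACTIVE_THRESHOLD
    )


def growth_percent(start_value, end_value):
    """Percentage change from start_value to end_value.

    Returns None when start_value is zero, since growth is undefined there.
    """
    if start_value == 0:
        return None
    return (end_value - start_value) / start_value * 100


# Made-up numbers, for illustration only.
sample_edits = {"EditorA": 250, "EditorB": 100, "EditorC": 99, "EditorD": 3}
print(count_very_active_editors(sample_edits))  # 2
print(growth_percent(200, 250))  # 25.0
```

In practice the inputs would come from the same data sources the snapshot notebook already queries; the sketch only pins down the arithmetic.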

Administration

The code generating the data snapshots lives at github.com/wikimedia-research/wiki-comparison.

We update this dataset about once a year, although it can be updated as often as we choose. When we do, we generally keep the old versions accessible (in other tabs of the Google Spreadsheet), so people can perform comparisons if they wish. However, this is a "bonus" feature and we do not guarantee that old snapshots will be kept.

The update process is as follows:

  1. Generate a new snapshot by running data-collection/data-collection.ipynb. Note that you should always select for SNAPSHOT the latest month with a completed mediawiki_history snapshot; a snapshot ending in December is no better than one ending in March.
  2. Put up your changes as a pull request.
  3. Make a copy of the wiki comparison spreadsheet.
  4. Add the new snapshot to the copy by making a copy of the previous sheet and then pasting the new data (Edit > Paste special > Values only) in the columns from B to the end. This preserves all the nice formatting.
  5. Have someone review the data in the new snapshot as well as any changes you made to the generation code.
  6. When satisfied, the reviewer merges the pull request.
  7. Copy the new snapshot to the main spreadsheet. Make sure to protect the whole sheet (Data > Protect sheets and ranges) with the "Can edit (with warnings)" level. This ensures that those with write access to the spreadsheet do not accidentally change the data or filter it for everyone when they are just trying to use the tool for themselves.
  8. Announce the new snapshot to #general on the Wikimedia Foundation Slack. Include some background on what the tool is for, to entice new folks to use it.
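
Step 1 asks for the latest month with a completed mediawiki_history snapshot. As a rough sketch of that selection (an illustration, not code from the notebook): assuming snapshots are labeled as YYYY-MM strings and that you already have a list of the completed ones, the hypothetical helper below picks the most recent. How you obtain that list is outside this sketch.

```python
from datetime import datetime


def latest_completed_snapshot(completed_snapshots):
    """Return the most recent label from a list of 'YYYY-MM' strings.

    Assumes every label in completed_snapshots refers to a snapshot that has
    finished generating; raises ValueError on an empty list or a bad label.
    """
    if not completed_snapshots:
        raise ValueError("no completed snapshots available")
    # Parse each label so malformed values fail loudly instead of sorting oddly.
    return max(completed_snapshots, key=lambda s: datetime.strptime(s, "%Y-%m"))


# Made-up list, for illustration only.
SNAPSHOT = latest_completed_snapshot(["2023-12", "2024-03", "2024-01"])
print(SNAPSHOT)  # 2024-03
```

The point of the rule is recency, not calendar alignment: the helper never prefers a year-end label over a later month.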

Background

The following was written in March 2024 to share background on the wiki comparison tool in a Wikimedia Foundation staff newsletter:

The wiki comparison tool is a public Google Sheet where you can sort and filter the huge list of Wikimedia wikis using 26 different dimensions, from the retention rate of new editors to the percentage of page views that come from mobile apps.

It originated in 2018 as part of an effort to develop standard categories and clusters of wikis to help Foundation staff target programs and drill into metrics. Unfortunately, most of the project was canceled, but not before the Product Analytics team achieved the first step of making a simple tool to allow people to explore the source data it had collected: the wiki comparison tool.

Making the tool was actually a relatively simple task for Product Analytics. Almost all the statistics had already been calculated in one place or another (surprisingly, one of the hardest parts was actually collecting the English names for languages and wikis), and Google Sheets provided a near-perfect, zero-maintenance interface.

Despite the simplicity, many of the statistics it collects aren’t available anywhere else without manual querying by a data analyst. In an average month, about 25 users consult it at least once (the record is 122, in February 2022, when that year’s update was announced). The tool also birthed the canonical wiki dataset, which makes it easy for researchers and analysts to connect different data sets and to translate codes like “euwiki” into names like “Basque Wikipedia”.

Since its creation, the maintainers (previously Product Analytics, now Movement Insights) have updated the data each year and added a handful of new metrics. There aren’t any concrete plans to develop it further, but there are plenty of dimensions that could be added.

The tool serves the same purpose as the many tables of wikis on Meta, which are maintained by volunteers. While wiki comparison has a more useful interface and more metrics (many of which are extremely difficult to calculate without our internal data infrastructure), most of the tables on Meta are bot-updated several times a day, so they’re much faster moving. The project team hopes that in the future, these approaches can be unified so everyone in the movement gets the best of both worlds.