User:MPopov (WMF)/Notes/Incubator test wikis
Motivation: Caroline's question posted in #working-with-data in Slack
Question: How many RTL languages have test wikis in the Incubator?
Answer:
| Language directionality | Languages with 1+ test wiki(s) |
|---|---|
| Cyrillic (LTR?) | 1 (neg) |
| Vertical (Letters: TTB, Lines: LTR) | 3 |
| Left-to-right | 576 |
| Right-to-left | 31 |
Caveat: These counts only include test wikis that satisfy at least one of the following criteria:
- They are substantial (having at least 25 mainspace pages), and/or
- They are active (having had some mainspace page creation since the beginning of 2023).
Methodology
The data comes entirely from https://incubator.wikimedia.org/wiki/Incubator:Wikis, which we will "scrape" using JavaScript and analyze separately.
When the page loads, all of the test wikis are collapsed/hidden. Each one can be expanded by clicking its "[show]" link, which loads that test wiki's information. We can show/expand all of them by running the following JavaScript code in the browser console, which triggers the click event on each "[show]" link:
// Trigger the click handler on every collapsed "[show]" toggle,
// expanding all of the test wiki entries at once
$("td a.att-toggle:contains('[show]')").each(function () {
  $(this).click();
});
It will take a minute or two to load all of the test wikis' information.
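Assuming each link's text changes from "[show]" once its test wiki has been expanded, one way to check progress is to count the remaining collapsed toggles; when this returns 0, everything has loaded:
// Count the "[show]" links that have not been expanded yet;
// 0 means every test wiki's information has loaded
$("td a.att-toggle:contains('[show]')").length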
Now that they have all loaded, we can extract the ISO 639-3 language code and directionality for each test wiki, storing those two pieces of data in the array testwiki_languages:
var testwiki_languages = [];
$(".testwiki-language").each(function () {
  var lang = {
    // ISO 639-3 code from the entry's <kbd> link
    iso_639_3: $(this).find("kbd a").text(),
    // "Directionality: …" list item, with the label stripped off
    directionality: $(this).find("ul li:contains('Directionality')").text().replace('Directionality: ', '')
  };
  testwiki_languages.push(lang);
});
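Before exporting anything, it is worth spot-checking a record in the console. The exact values depend on the page, but each element should be an object along these lines (Moroccan Arabic, a right-to-left language, is used here as a hypothetical example):
// Inspect one extracted record; expect something like
// { iso_639_3: "ary", directionality: "right-to-left" }
testwiki_languages[0]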
To get that data into R or Python, we need to stringify it into a JSON representation:
JSON.stringify(testwiki_languages)
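Alternatively, if the browser's developer console provides the copy() utility (Chrome and Firefox do), the string can be placed on the clipboard without any manual selection:
// copy() is a console-only helper, not standard JavaScript
copy(JSON.stringify(testwiki_languages))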
Otherwise, we can copy the output by right-clicking it in the console and selecting "Copy Message". For data analysis in Python you would use:
import pandas as pd
# paste the full copied JSON string in place of the shortened one below
# (recent pandas versions prefer literal JSON wrapped in io.StringIO)
testwiki_languages = pd.read_json('[{"iso_639_3":…]')
But we are going to do the analysis with R:
library(jsonlite)
testwiki_languages <- fromJSON('[{"iso_639_3":…]')
(In both of these cases the full and rather lengthy string is omitted.) Finally, let's count languages by directionality – keeping in mind that, due to how we compiled our dataset, it will contain duplicate languages wherever multiple projects are incubating for the same language (e.g. Moroccan Arabic Wikibooks, Moroccan Arabic Wikiquote, Moroccan Arabic Wiktionary):
library(tidyverse)
testwiki_languages |>
  # collapse multiple test wikis per language into one row
  distinct(iso_639_3, directionality) |>
  # tally languages by directionality
  count(directionality)