Machine learning models/Proposed/Multilingual readability model card


This model generates scores to assess the readability of Wikipedia articles. The readability score is a rough proxy for how difficult it is for a reader to understand the text of the article.

Model card
This page is an on-wiki machine learning model card.
A model card is a document about a machine learning model that seeks to answer basic questions about the model.
Model Information Hub
Model creator(s): Mykola Trokhymovych and MGerlach_(WMF)
Model owner(s): MGerlach_(WMF)
Code: training and inference
Uses PII: No
In production? No
This model uses article text to predict how hard it is for a reader to understand it.


Specifically, we propose a multilingual model based on the pre-trained xlm-roberta-longformer[1]. It supports about 100 languages rather than all Wikipedia languages.

We fine-tune the model using annotated data of articles available in different readability levels. One of the main challenges is that for most languages there is no ground-truth data available about the reading level of an article, so fine-tuning or re-training in each language is not a scalable option. Therefore, we train the model only in English on a large corpus of Wikipedia articles with two readability levels (Simple English Wikipedia and English Wikipedia). We evaluate the model's performance on small annotated datasets available in a few languages using different children's encyclopedias (such as Vikidia).

Motivation


As part of the program to address knowledge gaps, the Research team at the Wikimedia Foundation has started to develop a taxonomy of knowledge gaps. One of the goals is to identify metrics that quantify the size of these gaps. This model attempts to provide a metric for the readability of articles in Wikimedia projects, with a specific focus on multilingual support.

While there are readily available formulas to calculate readability of articles (such as the Flesch-Kincaid score), these formulas are often developed for a specific language (most commonly English). Usually, these formulas cannot be applied out of the box to other languages. As a result, it is not clear how these approaches can be used to assess readability across the more than 300 language versions of Wikipedia.
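For reference, the Flesch–Kincaid grade level is commonly written as shown below. The constants were calibrated on English text, and counting syllables depends on language-specific rules, which illustrates why the formula does not transfer directly to other languages.

```latex
\mathrm{FKGL} = 0.39 \cdot \frac{\text{total words}}{\text{total sentences}}
              + 11.8 \cdot \frac{\text{total syllables}}{\text{total words}}
              - 15.59
```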

You can find more details about the project here: Research:Multilingual Readability Research

Currently supported languages

['af', 'an', 'ar', 'ast', 'azb', 'az', 'bar', 'ba', 'be', 'bg', 'bn', 'bpy', 'br', 'bs', 'ca', 'ceb', 'ce', 'cs', 'cv', 'cy', 'da', 'de', 'el', 'en', 'es', 'et', 'eu', 'fa', 'fi', 'fr', 'fy', 'ga', 'gl', 'gu', 'he', 'hi', 'hr', 'ht', 'hu', 'hy', 'id', 'io', 'is', 'it', 'ja', 'jv', 'ka', 'kk', 'kn', 'ko', 'ky', 'la', 'lb', 'lmo', 'lt', 'lv', 'mg', 'min', 'mk', 'ml', 'mn', 'mr', 'ms', 'my', 'nds_nl', 'ne', 'new', 'nl', 'nn', 'no', 'oc', 'pa', 'pl', 'pms', 'pnb', 'pt', 'ro', 'ru', 'scn', 'sco', 'sh', 'sk', 'sl', 'sq', 'sr', 'su', 'sv', 'sw', 'ta', 'te', 'tg', 'th', 'tl', 'tr', 'tt', 'uk', 'ur', 'uz', 'vi', 'vo', 'war', 'yo', 'zh', 'simple']

Users and uses

Use this model for
  • Computing the readability score of a Wikipedia article revision
  • Estimating the Flesch–Kincaid score of an article in a multilingual setup
  • Comparing the readability of different revisions of the same article
Don't use this model for
  • Making predictions on language editions of Wikipedia that are not in the listed languages or other Wiki projects (Wiktionary, Wikinews, Wikidata, etc.)
  • Making predictions on namespaces outside of 0, disambiguation pages, and redirects
Current uses

This model is publicly available via LiftWing as of October 2023 but not currently incorporated in any products.

Ethical considerations, caveats, and recommendations


The model only uses publicly available data of the content (i.e. plain text) extracted from the articles.

Nevertheless, there are certain caveats:

  • Multilingual support: The model has only been trained on English data annotated with different readability levels. Our evaluation shows that the resulting model also works for other languages. However, performance varies across languages (see below). While this is a known issue for multilingual transformer models more generally[2], in the context of readability we are unable to systematically evaluate the model for many supported languages due to the lack of ground-truth data. To address this issue, we have started a research project to manually evaluate the model based on readers' perception of readability through surveys (ongoing).

Model


The system is based on the language model xlm-roberta-longformer[3], fine-tuned with a ranking loss, together with a linear regression model[4] that transforms the model's ranking score to the Flesch–Kincaid scoring scale. It follows the paradigm of having one generalized model for all covered languages. The system includes the following steps:

1. Text features preparation:

  • Process wikitext and extract the revision text

2. Masked language model (MLM) output extraction:

  • Pass the text to the pre-trained ranking model (to extract ranking score)

3. Transform the ranking score to the Flesch–Kincaid scale:

  • Apply a linear transformation to the ranking score. The result is a predicted Flesch–Kincaid grade level (a U.S. grade level capturing roughly "the number of years of education generally required to understand this text") that can also be computed for languages other than English. The motivation is to provide a more interpretable score as an alternative to the ranking score obtained from the model.
 
Sketch of the model architecture consisting of two joint readability scoring models trained using a Margin Ranking Loss.
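Step 3 can be illustrated with a minimal sketch, assuming an ordinary least-squares fit from ranking scores to Flesch–Kincaid grade levels. The calibration values are made-up toy data, and `fit_line` and `to_fk_proxy` are hypothetical helper names; the coefficients actually used by the system are not given in this model card.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


def to_fk_proxy(ranking_score, slope, intercept):
    """Map a ranking score onto the grade-level scale."""
    return slope * ranking_score + intercept


# Toy calibration data (assumption, not taken from the model card):
# ranking scores paired with Flesch–Kincaid grade levels.
ranking_scores = [-1.0, 0.0, 1.0, 2.0]
fk_grades = [6.0, 8.0, 10.0, 12.0]

slope, intercept = fit_line(ranking_scores, fk_grades)
fk_proxy = to_fk_proxy(0.5, slope, intercept)
print(f"slope={slope}, intercept={intercept}, fk_proxy={fk_proxy}")
```

Because the mapping is linear and, in this toy fit, increasing, it preserves the ordering given by the ranking score and only changes the scale on which it is reported.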


Performance


We evaluate the model using the Ranking Accuracy metric, which shows how well the model can differentiate the easy and hard text versions.

The testing data consist of pairs of texts that correspond to the simple (easy) and difficult (hard) versions of one article (for example, the same article from English Wikipedia and Simple English Wikipedia). Even though we train the model only on English texts, we evaluate performance in other languages.

Ranking Accuracy (RA) is the fraction of pairs that the model ranks correctly. We also report confidence intervals (CI).
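As a rough illustration of the metric, the sketch below computes RA over (easy, hard) score pairs. It assumes that a higher ranking score means a harder text. The model card does not say how its confidence intervals were computed, so the normal-approximation interval here is only one possible choice; the function names and toy scores are hypothetical.

```python
import math


def ranking_accuracy(pairs):
    """Fraction of (easy_score, hard_score) pairs where the hard text scores higher."""
    pairs = list(pairs)
    correct = sum(1 for easy, hard in pairs if hard > easy)
    return correct / len(pairs)


def normal_ci_half_width(accuracy, n, z=1.96):
    """Half-width of a normal-approximation interval (assumed method, ~95% for z=1.96)."""
    return z * math.sqrt(accuracy * (1.0 - accuracy) / n)


# Toy scores (assumption): each tuple is (score of easy version, score of hard version).
toy_pairs = [(0.1, 0.9), (-0.3, 0.4), (0.8, 0.2), (0.0, 1.5)]

ra = ranking_accuracy(toy_pairs)
ci = normal_ci_half_width(ra, len(toy_pairs))
print(f"RA = {ra:.3f} ± {ci:.3f}")
```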

Model performance metric and confidence interval
Dataset          RA      ±CI
simplewiki-en    0.976   0.002
vikidia-en       0.991   0.004
vikidia-ca       0.962   0.025
vikidia-de       0.938   0.03
vikidia-el       0.923   0.086
vikidia-es       0.911   0.013
vikidia-eu       0.818   0.032
vikidia-fr       0.923   0.005
vikidia-hy       0.802   0.036
vikidia-it       0.958   0.01
vikidia-oc       1.0     0.0
vikidia-pt       0.960   0.014
vikidia-ru       0.880   0.058
vikidia-scn      1.0     0.0
klexikon-de      0.999   0.002
txikipedia-eu    0.81    0.023
wikikids-nl      0.897   0.005


Implementation

Model architecture

xlm-roberta-longformer (Readability scoring) model tuning:

  • Learning rate: 1e-5
  • Weight Decay: 1e-7
  • Epochs: 3
  • Loss: Margin Ranking Loss
  • Margin: 0.5
  • Maximum input length: 1500
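The loss and margin listed above can be sketched in plain Python. This follows the standard margin ranking loss definition, max(0, -y * (x1 - x2) + margin), with the target y fixed to +1 under the assumption that the hard text should receive the higher score. The function name and the toy scores are hypothetical; the actual training code is linked from the model card.

```python
def margin_ranking_loss(score_hard, score_easy, margin=0.5):
    """Zero once the hard text outscores the easy one by at least `margin`."""
    return max(0.0, margin - (score_hard - score_easy))


# Toy batch (assumption): (score of hard version, score of easy version).
batch = [
    (1.2, 0.1),   # well separated: no loss
    (0.4, 0.2),   # correct order, but gap smaller than the margin
    (-0.3, 0.5),  # wrong order: largest loss
]

losses = [margin_ranking_loss(hard, easy) for hard, easy in batch]
mean_loss = sum(losses) / len(losses)
print(losses, mean_loss)
```

With a margin of 0.5, a pair contributes no loss only when the two scores are ranked correctly and differ by at least 0.5, which pushes the model to separate easy and hard versions rather than merely order them.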
Output schema
{
  lang: <language code string>,
  rev_id: <revision_id string>,
  score: {
     score: <Readability ranking score>,
     fk_score_proxy: <Flesch–Kincaid score approximation>
  }
}
Example input and output

See LiftWing API for more details.

Example input:

curl https://api.wikimedia.org/service/lw/inference/v1/models/readability:predict -X POST -d '{"rev_id": 123456, "lang": "en"}' -H "Content-type: application/json"

Example output:

{
"model_name":"readability",
"model_version":"2",
"wiki_db":"enwiki",
"revision_id":1161100049,
"output":{
    "score":1.1111111111,
    "fk_score_proxy":11.1111
    }
}
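The same request can be sketched in Python using only the standard library. The endpoint URL, payload fields, and response keys are taken from the example above; the function names are hypothetical, and `fetch_scores` is defined but not called here because it needs network access. Other response fields and error handling are left out.

```python
import json
import urllib.request

LIFTWING_URL = (
    "https://api.wikimedia.org/service/lw/inference/v1/models/readability:predict"
)


def build_request(rev_id, lang):
    """Build the POST request shown in the curl example."""
    payload = json.dumps({"rev_id": rev_id, "lang": lang}).encode("utf-8")
    return urllib.request.Request(
        LIFTWING_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_scores(response_body):
    """Return (ranking score, Flesch–Kincaid proxy) from a JSON response body."""
    output = json.loads(response_body)["output"]
    return output["score"], output["fk_score_proxy"]


def fetch_scores(rev_id, lang, timeout=30):
    """Send the request and parse the reply (requires network access)."""
    with urllib.request.urlopen(build_request(rev_id, lang), timeout=timeout) as resp:
        return parse_scores(resp.read().decode("utf-8"))


# Parse the example output from this page without touching the network.
example_body = (
    '{"model_name":"readability","model_version":"2","wiki_db":"enwiki",'
    '"revision_id":1161100049,'
    '"output":{"score":1.1111111111,"fk_score_proxy":11.1111}}'
)
score, fk_score_proxy = parse_scores(example_body)
print(score, fk_score_proxy)
```

To compare two revisions of the same article, one could call `fetch_scores` once per revision ID and compare the returned values.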

Data


Training data consist of pairs of texts that correspond to the same article in English Wikipedia and Simple English Wikipedia. We treat one of the texts in a pair as simple (easy) and the other as difficult (hard). We split the data into two parts: train (80%) and validation (20%).

Apart from the holdout dataset, we evaluate model performance in other languages. In particular, we use Vikidia pairs for ca, de, el, en, es, eu, fr, hy, it, oc, pt, ru, and scn; Klexikon for de; Wikikids for nl; and Txikipedia for eu. This data is used only for model testing.


Data pipeline
Training data
  • Number of samples (pairs): 112342
  • Languages: en
Testing data
  • Number of samples (pairs): 58309
  • Languages: en, de, ca, el, es, eu, fr, hy, it, oc, pt, ru, scn, nl

Licenses


Citation


Preprint:

@misc{trokhymovych2024openmultilingualscoringreadability,
      title={An Open Multilingual System for Scoring Readability of Wikipedia}, 
      author={Mykola Trokhymovych and Indira Sen and Martin Gerlach},
      year={2024},
      eprint={2406.01835},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2406.01835}, 
}
  1. https://huggingface.co/Peltarion/xlm-roberta-longformer-base-4096
  2. Wu, S., & Dredze, M. (2020). Are All Languages Created Equal in Multilingual BERT? Proceedings of the 5th Workshop on Representation Learning for NLP, 120–130. https://doi.org/10.18653/v1/2020.repl4nlp-1.16
  3. https://huggingface.co/Peltarion/xlm-roberta-longformer-base-4096
  4. https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html