Machine learning models/Production/Multilingual readability model card
This model generates scores to assess the readability of Wikipedia articles. The readability score is a rough proxy for how difficult it is for a reader to understand the text of an article.
Model card | |
---|---|
This page is an on-wiki machine learning model card. | |
Model Information Hub | |
Model creator(s) | Mykola Trokhymovych and MGerlach_(WMF) |
Model owner(s) | MGerlach_(WMF) |
Code | training and inference |
Uses PII | No |
In production? | No |
This model uses article text to predict how hard it is for a reader to understand it. | |
Specifically, we propose a multilingual model based on the pre-trained xlm-roberta-longformer[1]. It supports about 100 languages, not all of the more than 300 Wikipedia language editions.
We fine-tune the model on annotated data from articles available at different readability levels. One of the main challenges is that for most languages there is no ground-truth data about the reading level of articles, so fine-tuning or re-training in each language is not a scalable option. Therefore, we train the model only on English, using a large corpus of Wikipedia articles with two readability levels (Simple English Wikipedia and English Wikipedia). We evaluate the model's performance on small annotated datasets available in a few languages, drawn from children's encyclopedias such as Vikidia.
Motivation
As part of the program to address knowledge gaps, the Research team at the Wikimedia Foundation has started to develop a taxonomy of knowledge gaps. One of the goals is to identify metrics that quantify the size of these gaps. This model attempts to provide a metric to measure the readability of articles in Wikimedia projects, with a specific focus on multilingual support.
While there are readily available formulas to calculate readability of articles (such as the Flesch-Kincaid score), these formulas are often developed for a specific language (most commonly English). Usually, these formulas cannot be applied out of the box to other languages. As a result, it is not clear how these approaches can be used to assess readability across the more than 300 language versions of Wikipedia.
You can find more details about the project here: Research:Multilingual Readability Research
Users and uses
Use this model for:
- Defining the readability score of a Wikipedia article revision
- Estimating a Flesch–Kincaid-style readability score for an article in a multilingual setup
- Comparing the readability of different revisions of the same article

Don't use this model for:
- Making predictions on language editions of Wikipedia that are not among the supported languages, or on other Wikimedia projects (Wiktionary, Wikinews, Wikidata, etc.)
- Making predictions on namespaces other than 0, disambiguation pages, or redirects
This model is publicly available via LiftWing as of October 2023 but not currently incorporated in any products.
- API endpoint on LiftWing: Get readability prediction
- Functional user interface on toolforge: Wiki-readability
The model currently supports about 100 languages.
Ethical considerations, caveats, and recommendations
The model only uses publicly available data: the content (i.e., plain text) extracted from the articles.
Nevertheless, there are certain caveats:
- Multilingual support: The model has only been trained on English data annotated with different readability levels. Our evaluation shows that the resulting model also works for other languages. However, performance varies across languages (see below). While this is a known issue for multilingual transformer models more generally[2], in the context of readability we are unable to systematically evaluate the model for many supported languages due to the lack of ground-truth data. To address this issue, we have started a research project to manually evaluate the model based on readers' perception of readability through surveys (ongoing).
Model
The presented system is based on the fine-tuned language model xlm-roberta-longformer[3], trained with a ranking loss, combined with a Linear Regression model[4] that transforms the model's ranking score to the Flesch–Kincaid scoring scale. It is built on the paradigm of one generalized model for all covered languages. The system includes the following steps:
1. Text features preparation:
- Process wikitext and extract the revision text
2. Masked Language Models (MLM) outputs extraction:
- Pass the text to the pre-trained ranking model (to extract ranking score)
3. Transform the ranking score to Flesch–Kincaid scale
- Apply a linear transformation to the ranking score. The transformed score corresponds to a predicted Flesch–Kincaid grade level, i.e. a U.S. grade level roughly capturing "the number of years of education generally required to understand this text", extended here so it can be applied to other languages. The motivation is to provide a more interpretable score as an alternative to the raw ranking score produced by the model.
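To make these three steps concrete, here is a minimal Python sketch of the inference pipeline. The base checkpoint name comes from the references below; the actual fine-tuned weights are not loaded here, and the coefficients A and B are hypothetical placeholders rather than the fitted LinearRegression parameters.

# Sketch of the scoring pipeline: ranking score -> Flesch-Kincaid proxy.
# The fine-tuned checkpoint and the coefficients (A, B) are placeholders,
# not the released model artifacts.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "Peltarion/xlm-roberta-longformer-base-4096"  # base model only
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=1)
model.eval()

A, B = 5.0, 8.0  # hypothetical linear-transform coefficients

def readability_scores(text: str) -> dict:
    # Step 1: tokenize the plain text extracted from the revision's wikitext.
    inputs = tokenizer(text, truncation=True, max_length=1500, return_tensors="pt")
    # Step 2: the single-output regression head yields the ranking score.
    with torch.no_grad():
        ranking_score = model(**inputs).logits.item()
    # Step 3: linear transform onto the Flesch-Kincaid grade-level scale.
    return {"score": ranking_score, "fk_score_proxy": A * ranking_score + B}

print(readability_scores("Paris is the capital of France."))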
Performance
We evaluate the model using the Ranking Accuracy (RA) metric, which shows how well the model can differentiate between the easy and hard versions of a text.
The testing data consist of pairs of texts that correspond to the simple (easy) and difficult (hard) versions of one article (for example, the same article from English Wikipedia and Simple English Wikipedia). Even though we train the model only on English texts, we evaluate performance in other languages.
RA is equal to the rate of correctly ranked pairs. We also provide confidence intervals (CI); a sketch of this computation follows the results table below.
Dataset | RA | ±CI |
---|---|---|
simplewiki-en | 0.976 | 0.002 |
vikidia-en | 0.991 | 0.004 |
vikidia-ca | 0.962 | 0.025 |
vikidia-de | 0.938 | 0.030 |
vikidia-el | 0.923 | 0.086 |
vikidia-es | 0.911 | 0.013 |
vikidia-eu | 0.818 | 0.032 |
vikidia-fr | 0.923 | 0.005 |
vikidia-hy | 0.802 | 0.036 |
vikidia-it | 0.958 | 0.010 |
vikidia-oc | 1.000 | 0.000 |
vikidia-pt | 0.960 | 0.014 |
vikidia-ru | 0.880 | 0.058 |
vikidia-scn | 0.900 | 0.191 |
klexikon-de | 0.999 | 0.002 |
txikipedia-eu | 0.810 | 0.023 |
wikikids-nl | 0.897 | 0.006 |
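As a concrete illustration of the metric, here is a small Python sketch of how RA and a CI can be computed from paired model scores. The score direction (lower = easier) and the normal-approximation 95% interval are assumptions for illustration; the card does not specify the exact interval method used.

import math

def ranking_accuracy(easy_scores, hard_scores, z=1.96):
    # Fraction of pairs where the easy version receives the lower
    # (i.e., easier) score, assuming lower score = more readable.
    correct = sum(e < h for e, h in zip(easy_scores, hard_scores))
    n = len(easy_scores)
    ra = correct / n
    ci = z * math.sqrt(ra * (1 - ra) / n)  # normal approximation
    return ra, ci

# Toy example: 4 of 5 pairs ranked correctly.
easy = [1.2, 0.8, 2.0, 1.5, 3.1]
hard = [2.5, 1.9, 1.8, 2.2, 4.0]
print(ranking_accuracy(easy, hard))  # (0.8, ~0.35)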
Implementation
xlm-roberta-longformer (readability scoring) model tuning:
- Learning rate: 1e-5
- Weight Decay: 1e-7
- Epochs: 3
- Loss: Margin Ranking Loss
- Margin: 0.5
- Maximum input length: 1500 tokens
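A minimal sketch of how these hyperparameters could map onto a fine-tuning step with a margin ranking loss. The single-output regression head, batch format, and score direction are simplifying assumptions; see the linked training code for the actual implementation.

# Sketch of one fine-tuning step using the hyperparameters listed above.
import torch
from torch.optim import AdamW
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "Peltarion/xlm-roberta-longformer-base-4096", num_labels=1
)
optimizer = AdamW(model.parameters(), lr=1e-5, weight_decay=1e-7)
loss_fn = torch.nn.MarginRankingLoss(margin=0.5)

def training_step(easy_batch, hard_batch):
    # Each batch is a dict of tokenized inputs (input_ids, attention_mask)
    # for the easy and hard versions of the same articles.
    easy_scores = model(**easy_batch).logits.squeeze(-1)
    hard_scores = model(**hard_batch).logits.squeeze(-1)
    # target = 1 asks the loss to push hard_scores above easy_scores by at
    # least the margin (the score direction is an assumption here).
    target = torch.ones_like(easy_scores)
    loss = loss_fn(hard_scores, easy_scores, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()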
Output schema:
{
  "lang": <language code string>,
  "rev_id": <revision id string>,
  "score": {
    "score": <readability ranking score>,
    "fk_score_proxy": <Flesch–Kincaid score approximation>
  }
}
See LiftWing API for more details.
Example input:
curl https://api.wikimedia.org/service/lw/inference/v1/models/readability:predict -X POST -d '{"rev_id": 123456, "lang": "en"}' -H "Content-type: application/json"
Example output:
{
  "model_name": "readability",
  "model_version": "2",
  "wiki_db": "enwiki",
  "revision_id": 1161100049,
  "output": {
    "score": 1.1111111111,
    "fk_score_proxy": 11.1111
  }
}
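For programmatic access, here is a rough Python equivalent of the curl example above, using the endpoint shown on this page:

import requests

API_URL = "https://api.wikimedia.org/service/lw/inference/v1/models/readability:predict"

def get_readability(rev_id: int, lang: str) -> dict:
    # POST the revision id and language code, mirroring the curl call.
    response = requests.post(
        API_URL,
        json={"rev_id": rev_id, "lang": lang},
        headers={"Content-type": "application/json"},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()

result = get_readability(1161100049, "en")
print(result["output"]["fk_score_proxy"])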
Data
Training data consist of pairs of texts corresponding to the same article in English Wikipedia and Simple English Wikipedia. We treat one text in each pair as simple (easy) and the other as difficult (hard). We split the data into two parts: train (80%) and validation (20%).
Apart from the holdout dataset, we evaluate model performance in other languages. In particular, we use Vikidia pairs for it, oc, el, de, ru, es, en, ca, hy, scn, pt, fr, and eu; Klexikon for de; wikikids for nl; and Txikipedia for eu. This data is used only for model testing.
Train data:
- Number of samples (pairs): 112342
- Languages: en

Test data:
- Number of samples (pairs): 58309
- Languages: en, de, ca, el, es, eu, fr, hy, it, oc, pt, ru, scn, nl
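As a minimal sketch of the 80/20 train/validation split described above, with toy stand-in pairs in place of the real corpus:

from sklearn.model_selection import train_test_split

# Toy stand-in for the corpus: (hard_text, easy_text) per article, e.g.
# (English Wikipedia text, Simple English Wikipedia text).
pairs = [
    ("Photosynthesis is the biochemical process by which ...", "Plants make food from light ..."),
    ("The French Revolution was a period of radical ...", "The French Revolution changed France ..."),
    ("Quantum mechanics describes nature at atomic scales ...", "Quantum mechanics is about very small things ..."),
    ("An algorithm is a finite sequence of instructions ...", "An algorithm is a list of steps ..."),
    ("Mitochondria are membrane-bound organelles ...", "Mitochondria give cells energy ..."),
]

train_pairs, val_pairs = train_test_split(pairs, test_size=0.2, random_state=0)
print(len(train_pairs), "train pairs,", len(val_pairs), "validation pairs")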
Licenses
- Code: GNU General Public License v2.0
- Model: Apache 2.0 License
Citation
Preprint:
@misc{trokhymovych2024openmultilingualscoringreadability,
  title={An Open Multilingual System for Scoring Readability of Wikipedia},
  author={Mykola Trokhymovych and Indira Sen and Martin Gerlach},
  year={2024},
  eprint={2406.01835},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2406.01835},
}
- ↑ https://huggingface.co/Peltarion/xlm-roberta-longformer-base-4096
- ↑ Wu, S., & Dredze, M. (2020). Are All Languages Created Equal in Multilingual BERT? Proceedings of the 5th Workshop on Representation Learning for NLP, 120–130. https://doi.org/10.18653/v1/2020.repl4nlp-1.16
- ↑ https://huggingface.co/Peltarion/xlm-roberta-longformer-base-4096
- ↑ https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html