Research:Multilingual Readability Research/Evaluation multilingual model
An improved multilingual model to measure readability of Wikipedia articles.
Background
The aim of this project is to build a model to predict the readability of Wikipedia articles in different languages. Specifically, we take advantage of multilingual large language models, which support around 100 languages. One of the main challenges is that for most languages there is no ground-truth data about the reading level of an article, so fine-tuning or re-training in each language is not a scalable option. Therefore, we train the model only in English on a large corpus of Wikipedia articles with two readability levels (Simple English Wikipedia and English Wikipedia). We then evaluate how well the model generalizes to other corpora (in English) as well as to other languages, using a smaller multilingual corpus of encyclopedic articles with two readability levels, consisting of Wikipedia articles and their matched counterparts from different children's encyclopedias.
Method
Data
We use the multilingual dataset of encyclopedic articles available in two different readability levels (see Research:Multilingual Readability Research#Generating a multilingual dataset):
- difficult: the Wikipedia article in a given language
- simple: the matched article in the same language from a children's encyclopedia
Language | Dataset | # pairs |
---|---|---|
en | Simplewiki | 109,152 |
en | Vikidia | 1,994 |
ca | Vikidia | 244 |
de | Klexikon | 2,259 |
de | Vikidia | 273 |
el | Vikidia | 41 |
es | Vikidia | 2,450 |
eu | Txikipedia | 2,649 |
eu | Vikidia | 1,059 |
fr | Vikidia | 12,675 |
hy | Vikidia | 550 |
it | Vikidia | 1,704 |
nl | Wikikids | 11,319 |
oc | Vikidia | 7 |
pt | Vikidia | 598 |
ru | Vikidia | 104 |
scn | Vikidia | 11 |
Model
We represent our task as a binary classification problem: the simple version of a text is labeled as 0 (negative), the difficult version as 1 (positive). The model is trained to classify whether an article is simple or difficult, and the output probability can be interpreted as a readability score.
For our task we use the bert-base-multilingual-cased model, which is pretrained with a masked language modeling (MLM) objective on the 104 languages with the largest Wikipedias. We fine-tune this model for the binary classification task using the training part of the dataset described above. It is important to note that fine-tuning of the multilingual model is performed using only English texts.
Instead of taking the whole text as input, we work at the sentence level: each sentence is treated as an independent sample with the label of the corresponding text. During inference, we obtain a prediction for each sentence independently and use mean pooling to get the prediction for the whole text. There are a few reasons for this decision: (i) the full text can be long and, as a result, may not fit into memory during fine-tuning; (ii) text length would be unwanted leakage to the model, as difficult and simple texts have different length distributions. A minimal sketch of the fine-tuning setup is shown below.
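The following is a minimal sketch of the fine-tuning setup, assuming the Hugging Face transformers and datasets libraries; the hyperparameters, the column names (sentence, label), and the train_dataset/val_dataset objects are illustrative assumptions, not values taken from our training code.

```python
# Sketch: fine-tuning bert-base-multilingual-cased for binary readability
# classification (simple = 0, difficult = 1) on per-sentence samples.
# Hyperparameters below are illustrative, not the values used in our runs.
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=2
)

def tokenize(batch):
    # each sample is a single sentence carrying the label of its source text
    return tokenizer(batch["sentence"], truncation=True, max_length=128)

# train_dataset / val_dataset: datasets.Dataset objects with "sentence"
# and "label" columns (assumed to be prepared beforehand)
args = TrainingArguments(
    output_dir="readability-mbert",
    per_device_train_batch_size=32,
    num_train_epochs=3,
    evaluation_strategy="epoch",
)
trainer = Trainer(
    model=model,
    args=args,
    tokenizer=tokenizer,  # enables dynamic padding via the default collator
    train_dataset=train_dataset.map(tokenize, batched=True),
    eval_dataset=val_dataset.map(tokenize, batched=True),
)
trainer.train()
```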
Even though we train the model for sentence-level prediction, we evaluate it at the text level, as this reflects the model's usage scenario. We use simple accuracy as the evaluation metric: it is easy to interpret, and it is applicable because our data is perfectly balanced, with each difficult sample having a corresponding simple text. A sketch of the text-level scoring is shown below.
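A sketch of text-level scoring via mean pooling over per-sentence probabilities; model and tokenizer are the fine-tuned classifier from the sketch above, and sentence splitting is assumed to have happened upstream.

```python
# Sketch: text-level readability score via mean pooling over per-sentence
# predictions. Returns P(difficult) in [0, 1] for a list of sentences.
import torch

def readability_score(sentences, model, tokenizer):
    enc = tokenizer(
        sentences,
        truncation=True,
        max_length=128,
        padding=True,
        return_tensors="pt",
    )
    with torch.no_grad():
        logits = model(**enc).logits             # shape: (n_sentences, 2)
    probs = torch.softmax(logits, dim=-1)[:, 1]  # P(label = 1) per sentence
    return probs.mean().item()                   # mean pooling over sentences
```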
Training and test split
The training data consist of pairs of texts that correspond to articles in Simple English Wikipedia and English Wikipedia. We treat one text in a pair as simple (easy) and the other as difficult (hard). Each text is represented as a list of sentences.
We split the data into three parts: train (80%), validation (10%), and test (10%). An important detail is that both versions of an article are always assigned to the same part (train, validation, or test).
Apart from the holdout dataset, we evaluate model performance on the other languages (see above). However, we train only on English text, as there are not enough samples to use the other languages for training. A sketch of the pair-level split is shown below.
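A minimal sketch of the pair-level split, assuming pairs is a list of (simple_text, difficult_text) tuples; splitting at the pair level keeps both versions of an article in the same part.

```python
# Sketch: 80/10/10 split at the pair level, so that the simple and
# difficult versions of an article never end up in different parts.
import random

random.seed(0)          # illustrative seed for reproducibility
random.shuffle(pairs)   # pairs: list of (simple_text, difficult_text)
n = len(pairs)
train = pairs[: int(0.8 * n)]
val = pairs[int(0.8 * n) : int(0.9 * n)]
test = pairs[int(0.9 * n) :]
```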
Evaluation metric
Since the dataset is balanced (the same number of difficult and simple samples, as they come in pairs), we use accuracy as the main evaluation metric. We also report AUC as an additional metric. Both can be computed as sketched below.
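A sketch of the text-level evaluation with scikit-learn, assuming y_true holds the 0/1 labels and y_score the pooled P(difficult) per text.

```python
# Sketch: accuracy (with a 0.5 threshold) and AUC over text-level scores.
from sklearn.metrics import accuracy_score, roc_auc_score

y_pred = [int(score >= 0.5) for score in y_score]
accuracy = accuracy_score(y_true, y_pred)
auc = roc_auc_score(y_true, y_score)
```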
Results
Model evaluation
We evaluate the model using accuracy and AUC for each dataset in the different languages (the model was trained only on the 80% training split of Simplewiki in English).
The model performs best on English, the language used for training. However, thanks to the multilingual base model, the knowledge generalizes: the model performs better than random on other languages as well. For example, accuracy is 0.73 for fr, 0.72 for nl, and 0.76 for it. The worst performance is for eu (Basque) and el (Greek).
Language | Dataset | accuracy | AUC |
---|---|---|---|
en | Simplewiki (test set) | 0.891352 | 0.955451 |
en | Vikidia | 0.921013 | 0.982656 |
ca | Vikidia | 0.860656 | 0.914270 |
de | Klexikon | 0.757636 | 0.948942 |
de | Vikidia | 0.690476 | 0.872446 |
el | Vikidia | 0.524390 | 0.761154 |
es | Vikidia | 0.702041 | 0.822553 |
eu | Txikipedia | 0.425975 | 0.386073 |
eu | Vikidia | 0.579792 | 0.611134 |
fr | Vikidia | 0.731558 | 0.826539 |
hy | Vikidia | 0.535455 | 0.695755 |
it | Vikidia | 0.763791 | 0.856777 |
nl | Wikikids | 0.715346 | 0.788743 |
oc | Vikidia | 0.571429 | 0.795918 |
pt | Vikidia | 0.811037 | 0.908483 |
ru | Vikidia | 0.701923 | 0.837555 |
scn | Vikidia | 0.636364 | 0.752066 |
One more observation is that model performance differs for samples of different lengths (number of sentences). The table reports accuracy for texts of different lengths, measured in number of sentences: the longer the text, the more accurate our predictions.
We also observe that the readability scores from the multilingual BERT model separate articles of the simple (0) and difficult (1) classes. In fact, the separation is more pronounced than with the standard Flesch-Kincaid readability formula for English.
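For reference, such a Flesch-Kincaid baseline can be computed as sketched below; the use of the third-party textstat package is an assumption for illustration and may differ from how the formula was computed in the original analysis.

```python
# Sketch: Flesch-Kincaid grade level as a readability baseline
# (higher grade = more difficult text).
import textstat

fk_grade = textstat.flesch_kincaid_grade(text)  # text: plain English string
```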
Comparison to language-agnostic model
We compare against the language-agnostic model (we re-ran the language-agnostic model with the same train-test splits and evaluation data):
- The multilingual BERT model substantially improves on the accuracy of the language-agnostic model; the only exceptions are the two datasets in de (German).
- The multilingual BERT model supports more languages. Some languages (el: Greek, eu: Basque, hy: Armenian, oc: Occitan, scn: Sicilian) are currently not supported by the language-agnostic model because we do not have a working entity-linking model to generate the language-agnostic representations.
Language | Dataset | Accuracy (multilingual BERT) | Accuracy (language-agnostic) |
---|---|---|---|
en | Simplewiki (test set) | 0.891 | 0.763 |
en | Vikidia | 0.921 | 0.863 |
ca | Vikidia | 0.860 | 0.809 |
de | Klexikon | 0.757 | 0.823 |
de | Vikidia | 0.690 | 0.793 |
el | Vikidia | 0.524 | - |
es | Vikidia | 0.702 | 0.680 |
eu | Txikipedia | 0.425 | - |
eu | Vikidia | 0.579 | - |
fr | Vikidia | 0.731 | 0.637 |
hy | Vikidia | 0.535 | - |
it | Vikidia | 0.763 | 0.741 |
nl | Wikikids | 0.715 | 0.562 |
oc | Vikidia | 0.571 | - |
pt | Vikidia | 0.811 | 0.809 |
ru | Vikidia | 0.701 | 0.610 |
scn | Vikidia | 0.636 | - |
Additional experiments
We also performed various experiments aiming to improve the performance of the model reported above; we summarize them here but do not show detailed results.
- We tried fine-tuning an MLM with the Longformer architecture, which allows passing the whole text as input instead of predicting per sentence. It showed comparable performance to the sentence-based approach. However, we decided not to proceed with it, as the model is more difficult to maintain, needs more memory, and did not show a boost in performance.
- Another experiment attempted to extend the training dataset with translated texts. As we do not have enough data for most languages other than English, the idea was to artificially create such a dataset, fine-tune the model on it, and evaluate the results on a test dataset of real texts. The results showed that adding translated texts decreases performance on original (English) texts, with no reliable increase in performance on the languages of the translated texts. We used facebook/m2m100_418M for translation (see the sketch after this list).
- One more experiment added a small amount of original texts from other languages to the training set. It showed a slight increase in performance for the added languages. However, we decided not to proceed with this, as we do not really have enough data for such a training procedure.
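A minimal sketch of how translated training texts can be generated with facebook/m2m100_418M, following the model's documented usage in transformers; the en-to-fr direction and the helper name translate are illustrative.

```python
# Sketch: translating English training texts with facebook/m2m100_418M.
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")
model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")

def translate(text, src_lang="en", tgt_lang="fr"):
    tokenizer.src_lang = src_lang
    encoded = tokenizer(text, return_tensors="pt", truncation=True)
    generated = model.generate(
        **encoded, forced_bos_token_id=tokenizer.get_lang_id(tgt_lang)
    )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)[0]
```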
Resources
- Table with predicted readability scores for (almost) all articles of the 2023-06 snapshot of all supported Wikipedias: link to download (this is a one-off dataset)
- Code: https://gitlab.wikimedia.org/repos/research/readability
- Code for the model on LiftWing: https://gitlab.wikimedia.org/trokhymovych/readability-liftwing
- Model card: Machine learning models/Proposed/Multilingual readability model card