Research:Multilingual Readability Research/Evaluation multilingual model

An improved multilingual model to measure the readability of Wikipedia articles.

Background

The aim of this project is to build a model to predict readability scores of Wikipedia articles in different languages. Specifically, we take advantage of multilingual large language models, which support around 100 languages. One of the main challenges is that for most languages there is no ground-truth data available about the reading level of an article, so fine-tuning or re-training in each language is not a scalable option. Therefore, we train the model only in English on a large corpus of Wikipedia articles with two readability levels (Simple English Wikipedia and English Wikipedia). We then evaluate how well the model generalizes to other corpora (in English) as well as to other languages, using a smaller multilingual corpus of encyclopedic articles with two readability levels, consisting of Wikipedia articles and their respective counterparts from different children's encyclopedias.

Method

Data

We use the multilingual dataset of encyclopedic articles available in two different readability levels (see Research:Multilingual Readability Research#Generating a multilingual dataset):

  • difficult: the Wikipedia article in a given language
  • simple: the matched article in the same language from a children's encyclopedia
Language Dataset # pairs
en Simplewiki 109,152
en Vikidia 1,994
ca Vikidia 244
de Klexikon 2,259
de Vikidia 273
el Vikidia 41
es Vikidia 2,450
eu Txikipedia 2,649
eu Vikidia 1,059
fr Vikidia 12,675
hy Vikidia 550
it Vikidia 1,704
nl Wikikids 11,319
oc Vikidia 7
pt Vikidia 598
ru Vikidia 104
scn Vikidia 11

Model

We represent our task as a binary classification problem: the simple version of a text is labeled as 0 (negative), the difficult version as 1 (positive). The model is trained to classify whether an article carries the simple or the difficult label. The output probability can be interpreted as a readability score.

 
Distributions of text length for easy and hard texts (limited to 3000 characters)

For our task we use the bert-base-multilingual-cased model. It was pretrained on the 104 languages with the largest Wikipedias using a masked language modeling (MLM) objective. We fine-tune this model for the binary classification task using the training part of the dataset described above. It is important to note that the fine-tuning of the multilingual model is performed using only English texts.
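
The sketch below illustrates how such a fine-tuning setup could look with the Hugging Face transformers and datasets libraries; the toy dataset, hyperparameters, and output path are illustrative assumptions rather than the exact values used here. As described in the next paragraph, each training sample is a single sentence carrying the label of its source article.

# Sketch: fine-tuning bert-base-multilingual-cased as a binary readability
# classifier. Dataset, hyperparameters and output path are illustrative.
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=2)  # 0 = simple, 1 = difficult

# toy training data: one sentence per sample, labeled with its article's class
train_data = Dataset.from_dict({
    "sentence": ["The cat sat on the mat.",
                 "Quantum entanglement defies classical intuition."],
    "label": [0, 1],
})

def tokenize(batch):
    return tokenizer(batch["sentence"], truncation=True, max_length=128)

train_data = train_data.map(tokenize, batched=True)

args = TrainingArguments(output_dir="readability-mbert",
                         per_device_train_batch_size=32,
                         num_train_epochs=3)
Trainer(model=model, args=args, train_dataset=train_data,
        tokenizer=tokenizer).train()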

We fine-tune the masked language model (MLM) for the binary classification task. Instead of taking the whole text as input, we work at the sentence level: we treat each sentence as an independent sample with the label of the corresponding text. During inference, we obtain predictions for each individual sentence and use mean pooling to get the prediction for the whole text. There are a few reasons for this decision: (i) the full text can be long and, as a result, may not fit into memory during fine-tuning; (ii) the length of the text represents unwanted leakage to the model, as hard and easy texts have different length distributions.
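
A sketch of this inference step: the article is split into sentences, each sentence is scored independently, and the per-sentence probabilities of the difficult class are averaged (the sentence splitter here is a naive placeholder):

# Sketch: score a whole article by mean-pooling per-sentence probabilities
# of the "difficult" class, using the fine-tuned model and tokenizer above.
import torch

def readability_score(text, model, tokenizer):
    # naive sentence splitting for illustration; a proper sentence
    # tokenizer would be used in practice
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    inputs = tokenizer(sentences, padding=True, truncation=True,
                       max_length=128, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1)[:, 1]  # P(difficult) per sentence
    return probs.mean().item()                   # mean pooling over sentences

# scores close to 1 indicate difficult text, scores close to 0 simple text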


Even though we train the model on sentence-level samples, we evaluate at the text level, as this corresponds to the model's usage scenario. We use simple accuracy as our evaluation metric: it is easy to interpret, and it is appropriate here because the data is perfectly balanced, with each hard text having a corresponding easy text.

Training and test split

Example pair for the same article with different readability levels (easy, hard)

The training data consists of pairs of texts that correspond to articles in Simple English Wikipedia and English Wikipedia. We treat one of the texts in a pair as simple (easy) and the other as difficult (hard). Each text is represented as a list of sentences.

We split the data into three parts: train (80%), validation (10%), and test (10%). An important detail is that the different versions of an article are included in only one of the parts (train, validation, or test).
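
A sketch of such a group-wise split, assuming scikit-learn's GroupShuffleSplit and an article_id shared by the two versions of a pair (both are assumptions for illustration; the actual split code may differ):

# Sketch: 80/10/10 split where both versions of an article (simple and
# difficult) always end up in the same part.
from sklearn.model_selection import GroupShuffleSplit

def split_by_article(samples, article_ids, seed=0):
    # first split off 80% of the article groups for training
    gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=seed)
    train_idx, rest_idx = next(gss.split(samples, groups=article_ids))
    # then split the remaining 20% of groups evenly into validation and test
    rest_groups = [article_ids[i] for i in rest_idx]
    gss2 = GroupShuffleSplit(n_splits=1, test_size=0.5, random_state=seed)
    val_rel, test_rel = next(gss2.split(rest_idx, groups=rest_groups))
    return train_idx, rest_idx[val_rel], rest_idx[test_rel]

# example: 20 texts belonging to 10 articles (two versions per article)
article_ids = [f"article_{i}" for i in range(10) for _ in range(2)]
texts = [f"text {j}" for j in range(len(article_ids))]
train, val, test = split_by_article(texts, article_ids)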

Apart from the holdout dataset, we evaluate the model's performance on other languages (see above). However, we train only on English text, as there are not enough samples to use all the mentioned languages for training.

 
Train and Test split for readability model keeping pairs of sentences either in training or in test set

Evaluation metric

Since the dataset is balanced (the same number of hard and easy samples, as they come in pairs), we use accuracy as the main evaluation metric. We also use AUC as an additional metric.
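
A minimal sketch of this text-level evaluation, assuming y_true holds the 0/1 labels and y_score the pooled model scores (both names and values are illustrative):

# Sketch: text-level evaluation with accuracy and ROC AUC.
from sklearn.metrics import accuracy_score, roc_auc_score

y_true = [0, 1, 0, 1]            # illustrative labels (simple=0, difficult=1)
y_score = [0.2, 0.9, 0.4, 0.6]   # illustrative readability scores

accuracy = accuracy_score(y_true, [int(s >= 0.5) for s in y_score])
auc = roc_auc_score(y_true, y_score)
print(f"accuracy={accuracy:.3f}  AUC={auc:.3f}")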

Results

Model evaluation

We evaluate the model using accuracy and AUC for each dataset in the different languages (the model was trained only on the 80% training split of the English Simplewiki dataset).

The model performs best on English, the language used for training. However, thanks to the multilingual base model, the learned knowledge generalizes, so the model performs better than random on other languages; for example, accuracy is 0.73 for fr, 0.72 for nl, and 0.76 for it. The worst performance is for eu (Basque) and el (Greek).

Language Dataset accuracy AUC
en Simplewiki (test set) 0.891352 0.955451
en Vikidia 0.921013 0.982656
ca Vikidia 0.860656 0.914270
de Klexikon 0.757636 0.948942
de Vikidia 0.690476 0.872446
el Vikidia 0.524390 0.761154
es Vikidia 0.702041 0.822553
eu Txikipedia 0.425975 0.386073
eu Vikidia 0.579792 0.611134
fr Vikidia 0.731558 0.826539
hy Vikidia 0.535455 0.695755
it Vikidia 0.763791 0.856777
nl Wikikids 0.715346 0.788743
oc Vikidia 0.571429 0.795918
pt Vikidia 0.811037 0.908483
ru Vikidia 0.701923 0.837555
scn Vikidia 0.636364 0.752066


One more observation is that model performance differs for samples of different lengths (number of sentences): the longer the text, the more accurate our predictions.

We also observe that the readability scores from the multilingual BERT model separate articles of the simple (0) and difficult (1) classes. In fact, the separation is more pronounced than when using the standard Flesch-Kincaid readability formula for English.

 
Distribution of Flesch-Kincaid scores for the articles with different readability levels: simple (0), difficult (1).
 
Distribution of multilingual BERT scores for the articles with different readability levels: simple (0), difficult (1).
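
For reference, a small sketch of how the Flesch-Kincaid baseline can be computed, assuming the textstat library (the original analysis may have used a different implementation of the same formula):

# Sketch: Flesch-Kincaid grade level for the English baseline comparison.
# The grade-level formula is:
#   0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
import textstat

simple_text = "The cat sat on the mat. It was happy."
difficult_text = ("The feline positioned itself upon the woven floor covering, "
                  "exhibiting considerable contentment.")

print(textstat.flesch_kincaid_grade(simple_text))     # lower grade = easier
print(textstat.flesch_kincaid_grade(difficult_text))  # higher grade = harder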

Comparison to language-agnostic model

We compare the multilingual BERT model to the language-agnostic model (we re-ran the language-agnostic model with the same train-test splits and evaluation data):

  • The multilingual BERT model substantially improves on the accuracy of the language-agnostic model; the only exceptions are the two datasets in de (German).
  • The multilingual BERT model supports more languages. Some languages (el: Greek, eu: Basque, hy: Armenian, oc: Occitan, scn: Sicilian) are currently not supported by the language-agnostic model because we do not have a working entity-linking model to generate the language-agnostic representations.
Language Dataset accuracy (multilingual BERT) accuracy (language-agnostic)
en Simplewiki (test set) 0.891 0.763
en Vikidia 0.921 0.863
ca Vikidia 0.860 0.809
de Klexikon 0.757 0.823
de Vikidia 0.690 0.793
el Vikidia 0.524 -
es Vikidia 0.702 0.680
eu Txikipedia 0.425 -
eu Vikidia 0.579 -
fr Vikidia 0.731 0.637
hy Vikidia 0.535 -
it Vikidia 0.763 0.741
nl Wikikids 0.715 0.562
oc Vikidia 0.571 -
pt Vikidia 0.811 0.809
ru Vikidia 0.701 0.610
scn Vikidia 0.636 -

Additional experiments

We also performed various experiments aiming to improve the performance of the model described here, but we do not show detailed results.

  • We tried fine-tuning an MLM with the Longformer architecture, which allows passing the whole text as input instead of making per-sentence predictions. It showed performance comparable to the sentence-based approach. However, we decided not to proceed with it, as the model is more difficult to maintain, needs more memory, and does not show a boost in performance.
  • Another experiment attempted to extend the training dataset with translated texts. As we do not have enough data for most languages other than English, the idea was to artificially create such a dataset, fine-tune the model on it, and evaluate the results on a test dataset of real texts. The results showed that adding translated texts decreases performance on the original (English) texts, with no reliable increase in performance for the languages of the translated texts. We used facebook/m2m100_418M for translating the texts (see the sketch after this list).
  • One more experiment added a small amount of original texts from other languages to the training set. It showed a slight increase in performance for the added languages. However, we decided not to proceed with this, as we do not really have enough data for such a training procedure.
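
A sketch of the translation step with facebook/m2m100_418M, following the usage documented for that model; the target language and example sentence are illustrative:

# Sketch: translating English training sentences with facebook/m2m100_418M
# to create artificial training data in another language.
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")
model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")

def translate(sentence, src_lang="en", tgt_lang="fr"):
    tokenizer.src_lang = src_lang
    encoded = tokenizer(sentence, return_tensors="pt")
    generated = model.generate(
        **encoded, forced_bos_token_id=tokenizer.get_lang_id(tgt_lang))
    return tokenizer.batch_decode(generated, skip_special_tokens=True)[0]

print(translate("The cat sat on the mat."))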

Resources
