Research:Multilingual Readability Research/Evaluation language agnostic
We develop and evaluate a language-agnostic model for assessing the readability of Wikipedia articles:
- While there are many different ways to operationalize readability, we start from standard readability formulas, which are known to capture at least some aspects of readability (such as the Flesch reading ease for articles in English Wikipedia[1]). Our focus is to adapt these approaches to more languages by using language-agnostic features.
- We build a binary classification model that predicts the annotated level of readability of an article (easy or difficult). The model’s prediction score (between 0 and 1) can be interpreted as a normalized readability score.
- We train the model only in English, where we have sufficient ground-truth data from Simple Wikipedia and English Wikipedia. We test the model in other languages using annotated ground-truth data from Wikipedia and corresponding children’s encyclopedias. This reveals how well the model generalizes to other languages without re-training or fine-tuning, i.e., in the absence of ground-truth data (the situation for the majority of languages in Wikipedia).
In summary, we find that the language-agnostic model constitutes a promising approach for obtaining readability scores of articles in different languages without language-specific fine-tuning or customization.
- The language-agnostic approach is less precise than the standard readability formulas for English.
- The language-agnostic approach generalizes better to other languages than the (non-customized) standard readability formulas.
- The language-agnostic approach performs similarly to, or almost as well as, the customized versions of the standard readability formulas for most languages (noting that such customizations exist only for very few languages).
Datasets
We compile datasets of documents annotated with different readability levels in different languages. Specifically, each dataset consists of pairs of aligned articles with two assigned readability levels, which we denote “easy” and “difficult”.
Simple English Wikipedia (SEW)
This dataset consists of articles from Simple Wikipedia (easy) and English Wikipedia (difficult), following similar previous approaches[2]. While this dataset is monolingual (a “simple” Wikipedia exists only in English), it is a relatively large aligned corpus with over 200k articles in simplewiki (overlapping considerably with the more than 6M articles in enwiki).
Using the 2021-11 snapshot, we match pairs of common articles from simplewiki and enwiki via their Wikidata item (main namespace, no redirects). We remove pairs where either article is a disambiguation page or a list page. We extract the text of an article from its wikitext, removing any markup and non-text elements. We keep only the lead section of each article so that the two articles of a pair do not differ too much in length. We split the text into sentences, keeping only pairs where both articles consist of at least 3 sentences. The final dataset consists of 100,454 pairs of articles.
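A minimal sketch of this text-extraction step, assuming the wikitext is already at hand (mwparserfromhell is one library for stripping wiki markup; the helper names and the simple regex-based sentence heuristic are illustrative, not the exact pipeline):

```python
import re
import mwparserfromhell  # pip install mwparserfromhell

def lead_section_plaintext(wikitext: str) -> str:
    # The lead section is everything before the first section heading ("== ... ==").
    lead = re.split(r"\n==[^=]", wikitext, maxsplit=1)[0]
    # strip_code() drops templates, link markup, refs and similar non-text elements.
    return mwparserfromhell.parse(lead).strip_code().strip()

def count_sentences(text: str) -> int:
    # Crude split on terminal punctuation; a stand-in for a proper
    # language-aware sentence splitter.
    return len([s for s in re.split(r"(?<=[.!?])\s+", text) if s])

def keep_pair(easy_wikitext: str, difficult_wikitext: str, min_sentences: int = 3) -> bool:
    # Keep a pair only if both lead sections have at least 3 sentences.
    return all(
        count_sentences(lead_section_plaintext(w)) >= min_sentences
        for w in (easy_wikitext, difficult_wikitext)
    )
```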
Vikidia Wikipedia (VW)
This dataset consists of articles from Vikidia (easy), a children’s encyclopedia, and Wikipedia (difficult), following similar previous approaches[3]. Vikidia exists in 12 languages, with the number of articles varying from 35k in French to fewer than 100 in Greek: French (fr), Italian (it), Spanish (es), English (en), Basque (eu), Armenian (hy), Portuguese (pt), German (de), Catalan (ca), Sicilian (scn), Russian (ru), and Greek (el).
For each language, we identify the common articles between Vikidia and Wikipedia in the following way. We first extract all article titles from Vikidia via the public API (e.g., for English Vikidia; see the sketch below). Second, we match the article titles from Vikidia with article titles from the 2021-11 snapshot of Wikipedia. We also take into account matches between titles of redirect pages and their corresponding articles, since the spelling of article titles sometimes differs slightly, e.g., UEFA_Euro_2016 (enwiki) vs. Euro_2016 (English Vikidia). We only keep a pair if the two articles can be matched unambiguously (the title and its redirects cannot be matched to any other article with its corresponding redirects).
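For illustration, the title-collection step can be done with the standard MediaWiki API, which Vikidia also exposes. This is a sketch; the endpoint layout and parameters are the usual MediaWiki ones and are assumed, not verified against Vikidia specifically:

```python
import requests

def vikidia_titles(lang: str = "en"):
    """Yield all main-namespace, non-redirect article titles from a
    Vikidia language edition via the MediaWiki API."""
    api = f"https://{lang}.vikidia.org/w/api.php"
    params = {
        "action": "query", "list": "allpages", "apnamespace": 0,
        "apfilterredir": "nonredirects", "aplimit": "max", "format": "json",
    }
    while True:
        data = requests.get(api, params=params, timeout=30).json()
        for page in data["query"]["allpages"]:
            yield page["title"]
        if "continue" not in data:  # no more result pages
            break
        params.update(data["continue"])
```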
We extract the text of an article from its wikitext, removing any markup and non-text elements. We keep only the lead section of each article so that the two articles of a pair do not differ too much in length. We split the text into sentences, keeping only pairs where both articles consist of at least 3 sentences.
The final datasets have the following number of pairs of articles:
- French (fr): 12,153
- Italian (it): 1,650
- Spanish (es): 2,337
- English (en): 1,789
- Basque (eu): 1,045
- Armenian (hy): 10
- German (de): 263
- Catalan (ca): 236
- Sicilian (scn): 12
- Russian (ru): 98
- Portuguese (pt): 41
- Greek (el): 40
Klexikon Wikipedia (KW)
This dataset consists of articles from Klexikon (easy), an encyclopedia for children aged 6 to 12 years in German, and German Wikipedia (difficult). We use an existing publicly available dataset[4]. It contains 2,898 aligned documents in German and is thus substantially larger than the corresponding Vikidia data. We only keep the lead section of each article.
Readability features
Standard readability formulas
There are several formulas for calculating the readability or reading level of a text. We consider the following readability formulas[5]:
- Flesch reading ease (FRE) is one of the most widely used formulas, with higher scores indicating that the text is easier to read.
- Flesch-Kincaid Grade represents the number of years of education generally required to understand a given text.
- Dale-Chall Readability Score is based on a manually curated list of difficult words.
- Gunning-Fog Index leverages the notion of complex words, defined as words with at least three syllables.
- SMOG Index is a modification of the Gunning-Fog Index and also uses complex or polysyllabic words in its formula.
- Automated Readability Index is the only formula that takes characters into account rather than syllables.
Unlike FRE, for all other metrics, the higher the score, the harder the text is to read.
The formulas were designed specifically for English texts. There have been attempts to adapt these formulas to other languages, but such customizations exist only for a few cases. We use the customizations of FRE for 5 languages described here:
| Language | Name | Base | Sentence length | Syllables per word |
|---|---|---|---|---|
| EN | Flesch Reading Ease | 206.835 | 1.015 | 84.6 |
| FR | | 207 | 1.015 | 73.6 |
| DE | Toni Amstad | 180 | 1 | 58.5 |
| ES | Fernandez Huerta Readability Formula | 206.84 | 1.02 | 0.6 |
| IT | Flesch-Vacca | 217 | 1.3 | 0.6 |
| RU | | 206.835 | 1.3 | 60.1 |
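All customizations share the functional form of the original Flesch formula, with language-specific coefficients taken from the table (note that the small syllable coefficients for ES and IT stem from formulas historically stated per 100 words):

$$\mathrm{FRE} = \mathrm{Base} \;-\; c_{\mathrm{sent}} \cdot \frac{\#\,\text{words}}{\#\,\text{sentences}} \;-\; c_{\mathrm{syll}} \cdot \frac{\#\,\text{syllables}}{\#\,\text{words}}$$

For example, for English: $\mathrm{FRE} = 206.835 - 1.015 \cdot \mathrm{ASL} - 84.6 \cdot \mathrm{ASW}$, where ASL is the average sentence length in words and ASW the average number of syllables per word.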
We calculate all formulas with the Python package textstat.
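A minimal example with textstat (after set_lang, textstat switches to language-specific syllable counting and, where implemented, the customized FRE coefficients; the sample text is illustrative):

```python
import textstat  # pip install textstat

text = "Attention-deficit hyperactivity disorder is a neurodevelopmental disorder."

textstat.set_lang("en")  # e.g. "de", "es", "fr", "it", "ru" for the customized variants
scores = {
    "flesch_reading_ease": textstat.flesch_reading_ease(text),
    "flesch_kincaid_grade": textstat.flesch_kincaid_grade(text),
    "dale_chall": textstat.dale_chall_readability_score(text),
    "gunning_fog": textstat.gunning_fog(text),
    "smog_index": textstat.smog_index(text),
    "automated_readability_index": textstat.automated_readability_index(text),
}
print(scores)
```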
Language-agnostic features
In order to extract language-agnostic features, we first represent each sentence as a sequence of entities. For this, we need an entity linker to annotate each sentence with its corresponding entities; each entity consists of an entity-mention (the text) and an entity-id (e.g., its Wikidata item). In practice, we use DBpedia Spotlight (via the Python library spacy-dbpedia-spotlight), which provides entity-linking models for 17 languages. The model also provides a confidence score for each annotated entity; in order to increase the recall of the entities, we choose the lowest possible threshold for the confidence score.
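A minimal usage sketch (the "confidence" config key and the kb_id_ attribute follow the spacy-dbpedia-spotlight documentation as we recall it; the returned entity-ids are DBpedia URIs, and exact names should be checked against the installed version):

```python
import spacy
import spacy_dbpedia_spotlight  # pip install spacy-dbpedia-spotlight; registers the pipe

nlp = spacy.blank("en")
# The lowest possible confidence threshold maximizes entity recall.
nlp.add_pipe("dbpedia_spotlight", config={"confidence": 0.0})

doc = nlp("ADHD is a neurodevelopmental disorder.")
for ent in doc.ents:
    # ent.text is the entity-mention, ent.kb_id_ the linked entity-id.
    print(ent.text, ent.kb_id_)
```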
Given the representation of a sentence as a sequence of entities, we define the following quantities:
- Number of tokens: the total number of entities in a sentence
- Number of types: the number of unique entities in a sentence, in terms of either their entity-mentions (types_str) or their entity-ids (types_ids)
As an example, consider the following sentences (adapted from [6]); the annotated entities are listed in the table below:
1. Attention-deficit hyperactivity disorder (ADHD), or attention deficit disorder (ADD), is a neurodevelopmental disorder.
2. People with ADHD may be too active.
3. ADHD is called a neurological developmental disorder because it affects how people's nervous systems develop.
4. ADHD is most common in children: fewer adults have ADHD.
| Sentence | Total # of entities (tokens) | # of different entities (types_ids) | # of different mentions (types_str) | Type-token ratio (ids) | Type-token ratio (str) |
|---|---|---|---|---|---|
| 1 | 5 | 2 (ADHD, neuro) | 5 (ADHD, ADD, attention-deficit hyperactivity, attention.., neuro…) | 2/5 | 5/5 |
| 2 | 1 | 1 | 1 | 1/1 | 1/1 |
| 3 | 3 | 3 | 3 | 3/3 | 3/3 |
| 4 | 2 | 1 | 1 | 1/2 | 1/2 |
From this, we extract the following language-agnostic features (a computation sketch follows the list):
- Average sentence length in tokens
- Average sentence length in types (types_ids)
- Average sentence length in types (types_str)
- Average sentence type-token ratio (types_ids)
- Average sentence type-token ratio (types_str)
- Document type-token ratio (types_ids)
- Document type-token ratio (types_str)
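A sketch of how these seven features can be computed for one document, assuming the entity linker's output has already been collected into per-sentence lists of (mention, entity_id) tuples (the data layout and function name are illustrative):

```python
from statistics import mean

def language_agnostic_features(sentences):
    """`sentences`: list of sentences, each a list of (mention, entity_id)
    tuples. Sentences without any annotated entity are skipped here to
    avoid division by zero."""
    sentences = [s for s in sentences if s]
    n_tokens = [len(s) for s in sentences]
    n_types_ids = [len({eid for _, eid in s}) for s in sentences]
    n_types_str = [len({m for m, _ in s}) for s in sentences]
    all_ids = [eid for s in sentences for _, eid in s]
    all_strs = [m for s in sentences for m, _ in s]
    return {
        "avg_sent_length_tokens": mean(n_tokens),
        "avg_sent_length_types_ids": mean(n_types_ids),
        "avg_sent_length_types_str": mean(n_types_str),
        "avg_sent_type_token_ratio_ids": mean(t / n for t, n in zip(n_types_ids, n_tokens)),
        "avg_sent_type_token_ratio_str": mean(t / n for t, n in zip(n_types_str, n_tokens)),
        "doc_type_token_ratio_ids": len(set(all_ids)) / len(all_ids),
        "doc_type_token_ratio_str": len(set(all_strs)) / len(all_strs),
    }
```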
Results
Exploratory Analysis
Before we develop the model, we want to gain some intuition about the different readability features across the different languages.
What are the scores of the different features for easy and difficult texts?
We plot the distributions of the individual features comparing easy and difficult articles.
Observations
- for the standard readability features, the distributions for easy (simple) and difficult (en) texts are systematically shifted (the differences are statistically significant)
- for the language-agnostic features, we also observe systematic differences between easy and difficult. Most notably, the average sentence length is higher for en than for simple.
- Similar observations hold across all considered languages.
Can the individual features distinguish between an individual pair of an easy/hard article?
This is different from plotting the distributions; instead, we check whether a feature can distinguish the easy and the difficult version of an individual pair of articles. This corresponds to an “unsupervised classification” (a sketch of the procedure follows the table below). Specifically, for each pair of corresponding articles across the different languages and datasets, and for each readability and language-agnostic metric, we assign the ‘correct’, i.e., the more readable, label to the article that scores as easier according to that metric. Ties are broken randomly. In the following table, we report the average correct score across all articles of the VW dataset, disaggregated by language. The higher the score, the better that metric is at differentiating readability for paired articles. Observations:
- The customized Flesch reading ease formula has the highest score in correctly assigning the articles of a pair to the easy and difficult class, respectively.
- The non-customized version of the FRE distinguishes the two classes with substantially lower accuracy in some languages.
- The language-agnostic features perform consistently across languages. They are worse than the customized FRE; for some languages they are comparable to or better than the non-customized FRE.
| Feature | ru | es | fr | en | de | it | de (KW) |
|---|---|---|---|---|---|---|---|
| flesch_reading_ease | 0.755 | 0.842 | 0.839 | 0.921 | 0.829 | 0.785 | 0.987 |
| flesch_reading_ease [non-custom] | 0.724 | 0.757 | 0.835 | 0.92 | 0.817 | 0.588 | 0.987 |
| avg_sent_length_tokens_lang_agn | 0.633 | 0.746 | 0.721 | 0.791 | 0.734 | 0.819 | 0.948 |
| avg_sent_length_entity_types | 0.643 | 0.734 | 0.72 | 0.781 | 0.707 | 0.817 | 0.939 |
| avg_sent_length_mention_types | 0.643 | 0.746 | 0.721 | 0.789 | 0.703 | 0.816 | 0.948 |
| sent_type_token_ratio_entity | 0.531 | 0.64 | 0.567 | 0.663 | 0.726 | 0.628 | 0.744 |
| sent_type_token_ratio_mention | 0.551 | 0.62 | 0.565 | 0.641 | 0.586 | 0.604 | 0.578 |
| doc_type_token_ratio_entity | 0.622 | 0.645 | 0.545 | 0.679 | 0.51 | 0.679 | 0.807 |
| doc_type_token_ratio_mention | 0.592 | 0.641 | 0.539 | 0.696 | 0.589 | 0.659 | 0.89 |
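A minimal sketch of this pairwise evaluation for a single feature, assuming the per-pair scores have already been computed (function and variable names are illustrative):

```python
import random

def pairwise_accuracy(score_pairs, higher_is_easier=True):
    """Fraction of (easy_score, difficult_score) pairs that a single
    feature orders correctly; ties are broken randomly."""
    correct = 0
    for easy_score, difficult_score in score_pairs:
        if easy_score == difficult_score:
            correct += random.random() < 0.5  # random tie-breaking
        elif (easy_score > difficult_score) == higher_is_easier:
            correct += 1
    return correct / len(score_pairs)

# e.g. Flesch reading ease (higher = easier) on three aligned pairs
print(pairwise_accuracy([(78.2, 45.1), (63.0, 70.4), (55.5, 30.2)]))
```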
How similar are the different features?
We calculate the Spearman rank correlation between the scores of different features across all articles in a dataset.
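For reference, this is a one-liner with scipy, assuming the per-article scores of two features have been collected into arrays (the values below are illustrative):

```python
import numpy as np
from scipy.stats import spearmanr

# Per-article scores of two features over one dataset (illustrative values).
flesch_scores = np.array([71.2, 54.8, 63.0, 80.1])
avg_sent_len = np.array([12.0, 21.5, 17.3, 9.8])

rho, pvalue = spearmanr(flesch_scores, avg_sent_len)
```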
Observations:
- Standard readability formulas are all strongly correlated (0.53…0.74)
- Standard readability formulas are moderately correlated with the language-agnostic features (0.24…0.52). The reason is that standard readability formulas correlate with simplistic features such as the average sentence length in word-tokens (0.32…0.52), which in turn correlates with the average sentence length in entity-tokens (0.4…0.42).
Classification model
The aim is to develop a binary classifier that predicts whether an individual text belongs to the easy or the difficult class. The advantage over the individual formulas described above is that we can combine different features to distinguish between easy and difficult texts. In fact, the output of the model is a score between 0 (easy) and 1 (difficult) that can be interpreted as a normalized readability score.
We train the model using the English SEW dataset and split the dataset into a 70% training and 30% test set (specifically, we split pairs of articles such that the easy and the difficult version of an article both appear in the training or in the test set). Despite being available only in English, the SEW dataset provides by far the largest number of samples for training the model.
We evaluate the model on the other languages (VW and KW) without further training or fine-tuning for the specific language. The reason is that, for the vast majority of the 300+ languages in Wikipedia, there is no ground-truth data of articles annotated with readability levels, so in practice it is not feasible to fine-tune the model for each language. Evaluating on the multilingual datasets (VW and KW) therefore indicates how well the trained model generalizes to predicting readability levels in languages for which there is no training data (the case for almost every other language). In addition, the availability of the VW dataset in English shows how well the trained model generalizes to another corpus. As evaluation metric we use the accuracy score (fraction of correctly predicted samples), since the two classes are perfectly balanced. Results report the mean and standard error over 5 predictions of each sample in the test set.
We train three different models: Logistic Regression (LR), Linear Support Vector Machine (SVM), and Random Forest (RF). We select hyperparameters via grid search, picking the model that performs best on the SEW test data, averaged over 5 different 70-30 splits. A sketch of this setup is shown below.
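A scikit-learn sketch of the setup, shown for the Random Forest (the hyperparameter grid and variable names are illustrative; the pair-aware split uses groups so that both versions of an article fall on the same side):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, GroupShuffleSplit

def train_and_evaluate(X, y, pair_ids):
    """X: feature matrix (two rows per pair: easy and difficult version),
    y: 0 = easy, 1 = difficult; pair_ids: one id per row, shared by the
    two versions of an article."""
    splitter = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
    train_idx, test_idx = next(splitter.split(X, y, groups=pair_ids))

    grid = GridSearchCV(
        RandomForestClassifier(random_state=0),
        param_grid={"n_estimators": [100, 500], "max_depth": [None, 10]},
        scoring="accuracy",
    )
    grid.fit(X[train_idx], y[train_idx])

    accuracy = grid.score(X[test_idx], y[test_idx])
    # predict_proba[:, 1] is the score in [0, 1] that can be read as a
    # normalized readability score (0 = easy, 1 = difficult).
    readability_scores = grid.predict_proba(X[test_idx])[:, 1]
    return grid.best_estimator_, accuracy, readability_scores
```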
Below we report results for all datasets (SEW, VW, KW) comparing different sets of features:
- Language-agnostic features: only the entity-based features, without any language-specific parsing.
- Readability formula (FRE for English): the standard readability formula, which we can easily apply to other languages too. It therefore constitutes a good comparison for the language-agnostic model because it does not require any language-specific tuning.
- Customized readability formula (FRE for the specific language): readability formulas whose coefficients are fine-tuned to the corresponding language. This is, in general, not applicable for most languages, as fine-tuned formulas do not exist.
Summary:
- The language-agnostic approach works: it provides a slightly less precise measure than the standard readability formula used for English but generalizes much better to other languages without any fine-tuning.
- For the English datasets (both SEW and VW), the language-agnostic model performs worse than the standard readability formula.
- For the non-English datasets, the language-agnostic model performs better than the non-customized readability formula (except for French, where the latter is slightly better). This is the typical situation we compare against in the context of Wikipedia, because for the majority of Wikipedia’s languages no such customization exists.
- For some of the non-English datasets, the language-agnostic model performs similarly to or even better than the customized readability formulas. However, given the lack of annotated training data for most of Wikipedia’s languages, we cannot expect such customizations to exist.
- For the language-agnostic model, the Random Forest is the best model in most cases. Its accuracy ranges between 61% (VW-ru) and 76% (VW-en).
| dataset | lang | features | LR | Linear SVM | RF |
|---|---|---|---|---|---|
| SEW | en | language agnostic | 0.66 ± 0.0 | 0.66 ± 0.0 | 0.661 ± 0.0 |
| SEW | en | readability | 0.712 ± 0.0 | 0.712 ± 0.0 | 0.763 ± 0.0 |
| VW | en | language agnostic | 0.678 ± 0.0 | 0.678 ± 0.0 | 0.765 ± 0.002 |
| VW | en | readability | 0.779 ± 0.0 | 0.75 ± 0.057 | 0.756 ± 0.1 |
| VW | es | language agnostic | 0.654 ± 0.0 | 0.654 ± 0.0 | 0.685 ± 0.001 |
| VW | es | readability | 0.5 ± 0.0 | 0.5 ± 0.0 | 0.5 ± 0.0 |
| VW | es | readability_custom | 0.665 ± 0.0 | 0.667 ± 0.004 | 0.639 ± 0.005 |
| VW | fr | language agnostic | 0.62 ± 0.0 | 0.618 ± 0.0 | 0.63 ± 0.0 |
| VW | fr | readability | 0.67 ± 0.0 | 0.679 ± 0.017 | 0.61 ± 0.06 |
| VW | fr | readability_custom | 0.649 ± 0.0 | 0.651 ± 0.004 | 0.641 ± 0.03 |
| VW | it | language agnostic | 0.687 ± 0.0 | 0.683 ± 0.0 | 0.684 ± 0.001 |
| VW | it | readability | 0.5 ± 0.0 | 0.5 ± 0.0 | 0.5 ± 0.0 |
| VW | it | readability_custom | 0.698 ± 0.0 | 0.697 ± 0.001 | 0.632 ± 0.006 |
| VW | ru | language agnostic | 0.536 ± 0.0 | 0.536 ± 0.0 | 0.606 ± 0.011 |
| VW | ru | readability | 0.515 ± 0.0 | 0.515 ± 0.0 | 0.431 ± 0.014 |
| VW | ru | readability_custom | 0.582 ± 0.0 | 0.584 ± 0.003 | 0.547 ± 0.015 |
| VW | de | language agnostic | 0.593 ± 0.0 | 0.587 ± 0.0 | 0.678 ± 0.003 |
| VW | de | readability | 0.538 ± 0.0 | 0.546 ± 0.016 | 0.56 ± 0.066 |
| VW | de | readability_custom | 0.676 ± 0.002 | 0.671 ± 0.008 | 0.593 ± 0.031 |
| KW | de | language agnostic | 0.747 ± 0.0 | 0.745 ± 0.0 | 0.713 ± 0.002 |
| KW | de | readability | 0.512 ± 0.0 | 0.532 ± 0.04 | 0.607 ± 0.131 |
| KW | de | readability_custom | 0.91 ± 0.0 | 0.907 ± 0.006 | 0.646 ± 0.015 |
The same results can also be shown more compactly as a bar plot.
References
1. Lucassen, Teun; Dijkstra, Roald; Schraagen, Jan Maarten (2012-09-03). "Readability of Wikipedia". First Monday 17 (9). doi:10.5210/fm.v0i0.3916.
2. Napoles, Courtney; Dredze, Mark (2010-06-06). "Learning simple Wikipedia: a cogitation in ascertaining abecedarian language". Proceedings of the NAACL HLT 2010 Workshop on Computational Linguistics and Writing: Writing Processes and Authoring Aids, CL&W '10. Association for Computational Linguistics: 42–50. doi:10.5555/1860657.1860663.
3. Madrazo Azpiazu, Ion; Pera, Maria Soledad (2020). "Is cross-lingual readability assessment possible?". Journal of the Association for Information Science and Technology 71 (6): 644–656. doi:10.1002/asi.24293.
4. Aumiller, Dennis; Gertz, Michael (2022). "Klexikon: A German Dataset for Joint Summarization and Simplification". arXiv:2201.07198 [cs.CL]. http://arxiv.org/abs/2201.07198.
5. Martinc, Matej; Pollak, Senja; Robnik-Šikonja, Marko (2021-04-21). "Supervised and Unsupervised Neural Approaches to Text Readability". Computational Linguistics 47 (1): 141–179. ISSN 0891-2017. doi:10.1162/coli_a_00398.
6. Štajner, Sanja; Hulpuș, Ioana (2020). "When shallow is good enough: Automatic assessment of conceptual text complexity using shallow semantic features". Proceedings of the 12th Language Resources and Evaluation Conference: 1414–1422. https://www.aclweb.org/anthology/2020.lrec-1.177/.