User:Trokhymovych/drafts/Multilingual readability model card
This model card page currently has draft status. It is a piece of model documentation that is still being written. Once the model card is completed, this template should be removed.
Model card | |
---|---|
This page is an on-wiki machine learning model card. | |
Model Information Hub | |
Model creator(s) | Mykola Trokhymovych and Martin Gerlach |
Model owner(s) | Martin Gerlach |
Code | training and inference |
Uses PII | No |
In production? | No |
This model uses article text to predict how difficult it is for a reader to understand.
This model generates scores to assess the readability of Wikipedia articles. The readability score is a rough proxy for how difficult it is for a reader to understand the text of the article.
Specifically, we propose a multilingual model based on pre-trained mBERT[1]. It does not support all languages, but covers about 100 languages with the largest Wikipedias.
We fine-tune the model using annotated data of articles available at different readability levels. One of the main challenges is that for most languages there is no ground-truth data available about the reading level of an article, so fine-tuning or re-training in each language is not a scalable option. Therefore, we train the model only in English, on a large corpus of Wikipedia articles with two readability levels (Simple English Wikipedia and English Wikipedia). We evaluate the model's performance on small annotated datasets available in a few languages, using different children's encyclopedias (such as Vikidia).
Motivation
As part of the program to address knowledge gaps, the Research team at the Wikimedia Foundation has started to develop a taxonomy of knowledge gaps. One of the goals is to identify metrics that quantify the size of these gaps. This model attempts to provide a metric for measuring the readability of articles in Wikimedia projects, with a specific focus on providing multilingual support.
While there are readily available formulas to calculate readability of articles (such as the Flesch-Kincaid score), these formulas are often developed for a specific language (most commonly English). Usually, these formulas cannot be applied out of the box to other languages. As a result, it is not clear how these approaches can be used to assess readability across the more than 300 language versions of Wikipedia.
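For reference, the standard Flesch–Kincaid grade-level formula combines only two surface statistics, average sentence length and average syllables per word, which is precisely why it does not transfer across languages: the coefficients were fitted for English, and syllable counting is itself language-specific. A minimal sketch:

```python
def flesch_kincaid_grade(total_words, total_sentences, total_syllables):
    """Classic Flesch-Kincaid grade level (coefficients fitted for English).

    The syllable count must come from a language-specific counter,
    which is one reason this formula does not generalize across languages.
    """
    avg_sentence_length = total_words / total_sentences
    avg_syllables_per_word = total_syllables / total_words
    return 0.39 * avg_sentence_length + 11.8 * avg_syllables_per_word - 15.59
```

For example, a text with 100 words, 10 sentences, and 150 syllables scores roughly grade 6.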
You can find more details about the project here: Research:Multilingual Readability Research
Users and uses
Intended uses:

- Compute the readability score of a Wikipedia article revision
- Estimate the Flesch–Kincaid score of an article in a multilingual setup
- Compare the readability of different revisions of the same article

Out-of-scope uses:

- Making predictions on language editions of Wikipedia that are not among the listed languages, or on other Wikimedia projects (Wiktionary, Wikinews, Wikidata, etc.)
- Making predictions on namespaces other than 0, on disambiguation pages, or on redirects
Ethical considerations, caveats, and recommendations
The model only uses publicly available data, i.e. the plain-text content extracted from the articles.
Nevertheless, there are certain caveats:
- Multilingual support: The model has only been trained on English data annotated with different readability levels. Our evaluation shows that the resulting model also works for other languages. However, performance varies across languages (see below). While this is a known issue for mBERT more generally [2], in the context of readability we are unable to systematically evaluate the model for many supported languages due to the lack of ground-truth data. In order to address these issues, we have started a research project to manually evaluate the model based on readers' perception of readability through surveys (ongoing).
Model
The presented system is based on a fine-tuned mBERT language model[3], together with a CatBoost regressor [4] serving as a Flesch–Kincaid scoring model. It follows the paradigm of one generalized model for all covered languages. The system includes the following steps:
1. Text features preparation:
- Process wikitext and extract the revision text
- Split text into sentences.
2. Masked Language Models (MLM) outputs extraction:
- Pass each of the sentences to the pre-trained classification model
3. Final score extraction:
- Apply mean pooling to the list of sentence scores to obtain the final unified readability score. This corresponds to a binary classification score of whether the article should be annotated with one of the two readability levels (easy or difficult).
- Apply the Flesch–Kincaid scoring model on top of sentence scores. This score corresponds to a predicted Flesch-Kincaid grade level, i.e. a U.S. grade level capturing roughly "the number of years of education generally required to understand this text", that can be applied to other languages. The motivation is to provide a more interpretable score as an alternative to the binary classification score.
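The aggregation in step 3 can be sketched as follows. This is an illustration only: `sentence_scores` stands in for the per-sentence outputs of the fine-tuned mBERT classifier, and the CatBoost Flesch–Kincaid regression step is omitted.

```python
from statistics import mean

def aggregate_readability(sentence_scores, threshold=0.5):
    """Mean-pool per-sentence difficulty scores into article-level outputs.

    sentence_scores: probabilities (one per sentence) that the sentence
    comes from the 'difficult' version of an article.
    """
    probability = mean(sentence_scores)    # step 3: mean pooling
    prediction = probability >= threshold  # binary easy/difficult decision
    return {"prediction": prediction, "probability": probability}
```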
Performance
We evaluate the model on a binary classification task. We use the mean of the sentence-level MLM scores as the model's probability output, and apply a threshold of 0.5 to obtain the binary label.
The testing data consists of pairs of texts that correspond to the simple (easy) and difficult (hard) versions of one article (for example, the same article from English Wikipedia and Simple English Wikipedia). Even though we train the model only on English texts, we also evaluate performance in other languages. We evaluate model performance using the AUC and accuracy metrics.
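For concreteness, both metrics can be computed from the pooled article-level scores as follows. This is a minimal sketch, not the project's evaluation code; AUC is computed here via its pairwise-ranking interpretation.

```python
def accuracy(labels, probs, threshold=0.5):
    """Fraction of articles whose thresholded score matches the true label."""
    predictions = [p >= threshold for p in probs]
    return sum(pred == y for pred, y in zip(predictions, labels)) / len(labels)

def auc(labels, probs):
    """Probability that a 'hard' article is scored above an 'easy' one."""
    pos = [p for p, y in zip(probs, labels) if y]       # scores of hard texts
    neg = [p for p, y in zip(probs, labels) if not y]   # scores of easy texts
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```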
Testing set | Accuracy | AUC |
---|---|---|
simple-en-test | 0.891352 | 0.955451 |
simple-en-validation | 0.893358 | 0.955407 |
klexikon-de | 0.757636 | 0.948942 |
vikidia-ca | 0.860656 | 0.914270 |
vikidia-de | 0.690476 | 0.872446 |
vikidia-el | 0.524390 | 0.761154 |
vikidia-en | 0.921013 | 0.982656 |
vikidia-es | 0.702041 | 0.822553 |
vikidia-eu | 0.579792 | 0.611134 |
vikidia-fr | 0.731558 | 0.826539 |
vikidia-hy | 0.535455 | 0.695755 |
vikidia-it | 0.763791 | 0.856777 |
vikidia-oc | 0.571429 | 0.795918 |
vikidia-pt | 0.811037 | 0.908483 |
vikidia-ru | 0.701923 | 0.837555 |
vikidia-scn | 0.636364 | 0.752066 |
wikikids-nl | 0.715346 | 0.788743 |
txikipedia | 0.425975 | 0.386073 |
Implementation
mBERT model tuning:
- Learning rate: 2e-5
- Weight Decay: 0.01
- Epochs: 5
- Maximum input length: 512
- Number of transformer encoder layers: 12
- Number of attention heads: 12
- Hidden size (encoder embedding length): 768
CatBoost:
- Iterations: 5000
- Learning Rate: 0.01
- Loss: RMSE
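Assuming the Hugging Face `transformers` and `catboost` Python APIs, the hyperparameters above roughly correspond to a configuration like the following. This is a sketch for orientation, not the project's actual training code (see the linked training repository for that); the output directory name is hypothetical.

```python
from transformers import TrainingArguments
from catboost import CatBoostRegressor

# mBERT fine-tuning hyperparameters listed above (sketch)
training_args = TrainingArguments(
    output_dir="readability-mbert",  # hypothetical path
    learning_rate=2e-5,
    weight_decay=0.01,
    num_train_epochs=5,
)
# The encoder architecture (12 layers, 12 heads, hidden size 768, maximum
# input length 512) comes from the pre-trained
# bert-base-multilingual-cased checkpoint itself.

# CatBoost regressor for the Flesch-Kincaid scoring model (sketch)
fk_regressor = CatBoostRegressor(
    iterations=5000,
    learning_rate=0.01,
    loss_function="RMSE",
)
```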
Output schema:
{
lang: <language code string>,
rev_id: <revision_id string>,
score: {
prediction: <boolean decision result>
probability: <probability of being hard to read>
fk_score: <Flesch–Kincaid score approximation>
}
}
Example input:
curl "https://<endpoint>/v1/models/readability:predict" -X POST -d '{"lang": "en", "rev_id":1161100049}' -H "Host: readability.experimental.wikimedia.org" --http1.1
Experimental endpoint (internal use only): inference-staging.svc.codfw.wmnet:30443
Example output:
{
"model_name":"readability",
"model_version":"2",
"wiki_db":"enwiki",
"revision_id":1161100049,
"output":{
"prediction":true,
"probabilities":{
"true":0.8169194640857833,
"false":0.1830805359142167
},
"fk_score":11.953445079550391
}
}
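A response like the one above can be consumed with standard JSON tooling. For example, in Python (using the example response body shown here):

```python
import json

# Example response body from the readability endpoint (copied from above)
response_body = """
{
  "model_name": "readability",
  "model_version": "2",
  "wiki_db": "enwiki",
  "revision_id": 1161100049,
  "output": {
    "prediction": true,
    "probabilities": {"true": 0.8169194640857833, "false": 0.1830805359142167},
    "fk_score": 11.953445079550391
  }
}
"""

result = json.loads(response_body)
is_hard = result["output"]["prediction"]            # True: predicted difficult
p_hard = result["output"]["probabilities"]["true"]  # probability of being hard
fk = result["output"]["fk_score"]                   # approximate FK grade level
```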
Data
Training data consists of pairs of texts that correspond to the same article in English Wikipedia and Simple English Wikipedia. We treat one text in each pair as simple (easy) and the other as difficult (hard). Each text is represented as a list of sentences.
We split the data into three parts: train (80%), validation (10%), and test (10%). An important detail is that all versions of a given article are assigned to only one data part (train, validation, or test).
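One way to implement such an article-disjoint split is to hash a stable article identifier, so that the easy and hard versions of the same article always land in the same part. This is an illustrative sketch, not necessarily the splitting code used for this model:

```python
import hashlib

def assign_split(article_id, train=0.8, validation=0.1):
    """Deterministically assign an article (and hence both versions of
    its easy/hard pair) to exactly one of train/validation/test."""
    digest = hashlib.sha256(article_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 1000 / 1000  # stable value in [0, 1)
    if bucket < train:
        return "train"
    if bucket < train + validation:
        return "validation"
    return "test"
```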
Apart from the holdout dataset, we evaluate model performance in other languages. In particular, we use Vikidia pairs for it, oc, el, de, ru, es, en, ca, hy, scn, pt, and fr; Klexikon for de; Wikikids for nl; and Txikipedia for eu.
Training data:

- Number of samples: 174,642
- Balance of classes: 1:1
- Languages: en

Evaluation data:

- Number of samples: 119,536
- Balance of classes: 1:1
- Languages: en, de, ca, el, es, eu, fr, hy, it, oc, pt, ru, scn, nl
Licenses
- Code: Apache 2.0 License
- Model: Apache 2.0 License
Citation
To be added soon.
- ↑ https://huggingface.co/bert-base-multilingual-cased
- ↑ Wu, S., & Dredze, M. (2020). Are All Languages Created Equal in Multilingual BERT? Proceedings of the 5th Workshop on Representation Learning for NLP, 120–130. https://doi.org/10.18653/v1/2020.repl4nlp-1.16
- ↑ https://huggingface.co/bert-base-multilingual-cased
- ↑ https://catboost.ai/en/docs/concepts/python-reference_catboostregressor