Research:Multilingual Readability Research/Background Research

From Meta, a Wikimedia project coordination wiki

This page aims to provide an overview of existing approaches to measuring readability.

What is readability?

There are several excellent reviews on the topic [1] [2] [3] [4].

Readability aims to capture how hard it is for a reader to understand a written text. Text readability has been more formally defined as the sum of all elements in textual material that affect a reader’s understanding, reading speed, and level of interest in the material [5]. Here we focus on characteristics related to the complexity of the text's content, such as vocabulary, syntax, or sentence length.

One of the main motivations for assessing the readability of a text is to match texts to the reading abilities of a specific audience, for example schoolchildren (or level of education more generally), learners of a foreign language, or readers with intellectual disabilities. Readability assessment thus helps to meet people’s information needs, for example when choosing appropriate reading materials for language teaching, supporting readers with learning disabilities, enabling self-directed learning, or in information retrieval. A common practical application is medical research, where readability measures have been used, for example, to assess patient education materials and consent forms.

Some common frameworks here are:

In Wikipedia, the issue of readability has been discussed extensively, for example:

  • Simple Wikipedia was launched in 2001 and is primarily written in Basic English and Learning English. The site has the stated aim of providing an encyclopedia for "people with different needs, such as students, children, adults with learning difficulties, and people who are trying to learn English".
  • Lucassen et al. (2012) [6] found that the overall readability of English Wikipedia is poor. In contrast, Simple Wikipedia has better readability, but it is still insufficient for its target audience.
  • Brezar & Heilman (2019) [7] found that most health information remains written at a level above the reading ability of average adults.

Some other general pointers:

How to measure readability?

Features

Different features capture different aspects of readability. One possible hierarchical organization, in order of increasing complexity, is the following [1]:

  • Text legibility, e.g. font, formatting, spacing
  • Lexico-semantic:
    • Difficulty or unfamiliarity of the vocabulary (e.g. presence in a word list or relative frequency)
    • Type-token ratio
    • Statistical language model: can be considered a lexical feature
    • Cognitive/psycholinguistics-based: lexical features such as the degree of polysemy of a word
  • Morphological: rare or more complex morphological articles
  • Syntax: grammatical structure. More complex structures are associated with longer processing times in comprehension. Such features can be extracted with a language parser that captures properties of the parse tree.
  • Discourse: cohesion and organization. These features aim to capture organizational aspects of the text such as coherence (e.g. how similar the sentences are in a semantic space) or “discourse cohesion”. One tool to calculate such features is Coh-Metrix (the original code is not open, though there seem to be some open re-implementations in Python). “Although semantic and discourse predictors appear very appealing, they do not seem to make much difference in some cases, or might even be redundant with more basic lexical variables.” [2]
  • Higher-level semantics: domain or world knowledge required to understand
  • Pragmatics: e.g. sarcasm
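
As an illustration of the lexico-semantic level, the following minimal Python sketch computes a type-token ratio and the share of words missing from a list of familiar words. The regex tokenizer, the function names, and the tiny `FAMILIAR` word list are assumptions made for this example only; a real setup would use an established word list and a proper tokenizer for the language at hand.

```python
import re

# Matches runs of letters (Unicode-aware); a crude stand-in for a real tokenizer.
WORD_RE = re.compile(r"[^\W\d_]+")


def tokenize(text):
    """Return lowercased word tokens."""
    return [w.lower() for w in WORD_RE.findall(text)]


def type_token_ratio(tokens):
    """Number of distinct words (types) divided by the number of words (tokens)."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def unfamiliar_word_share(tokens, familiar_words):
    """Share of tokens that do not appear in a list of familiar words."""
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t not in familiar_words) / len(tokens)


# Hypothetical, tiny word list for illustration only.
FAMILIAR = {"the", "cat", "sat", "on", "mat", "a", "dog"}

tokens = tokenize("The cat sat on the mat. The feline reclined.")
ttr = type_token_ratio(tokens)
unfamiliar = unfamiliar_word_share(tokens, FAMILIAR)
```

Note that the type-token ratio depends strongly on text length, so values are only comparable between texts of similar size.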

Readability formulas

Development of readability formulas goes back to the 1920s. There are hundreds of these formulas. Some examples include:

Most of them are based on simple lexical, lexico-semantic, or syntactic features:

  • Vocabulary such as type-token-ratio
  • Word length, e.g. number of syllables per word
  • Sentence length, e.g. number of words in a sentence
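
To make this concrete, here is a minimal Python sketch of two widely used formulas, Flesch Reading Ease (FRE) and Flesch-Kincaid Grade Level (FKGL), which combine exactly these word-length and sentence-length features. The coefficients are the standard published ones; the sentence splitter and the vowel-group syllable counter are rough assumptions that only make sense for English, and the function names are chosen for this example.

```python
import re


def count_syllables(word):
    """Rough English heuristic: count groups of consecutive vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def text_stats(text):
    """Return (number of sentences, number of words, number of syllables)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word")
    syllables = sum(count_syllables(w) for w in words)
    return len(sentences), len(words), syllables


def flesch_reading_ease(text):
    """Higher scores indicate easier text."""
    n_sent, n_words, n_syll = text_stats(text)
    return 206.835 - 1.015 * (n_words / n_sent) - 84.6 * (n_syll / n_words)


def flesch_kincaid_grade(text):
    """Approximate US school grade level; higher means harder."""
    n_sent, n_words, n_syll = text_stats(text)
    return 0.39 * (n_words / n_sent) + 11.8 * (n_syll / n_words) - 15.59


easy = "The cat sat on the mat."
hard = ("Comprehensive institutional evaluations necessitate "
        "considerable interdisciplinary collaboration.")
easy_fre = flesch_reading_ease(easy)
hard_fre = flesch_reading_ease(hard)
```

The syllable heuristic is the weakest part of such a sketch: it miscounts silent vowels in English and has no meaningful counterpart in many other languages.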

One advantage of these formulas is that they are relatively easy to implement. However, they capture only surface characteristics of the text and ignore deeper features. While heavily used, they have also been criticized for i) not capturing more complex aspects such as cohesion, coherence, or macro-structural organization, ii) ignoring the interactive dimension of the reading process, and iii) being over-generalized to populations different from those for which they were designed [2]. As a result of these limitations, the validity of traditional readability formula predictions of text comprehensibility is often suspect.

The formulas are often designed specifically for English. For other languages they often need to be adapted, or new formulas have to be built (some pointers on readability formulas for other languages are contained in Ref. [1]). Moreover, some studies have pointed out that cross-lingual readability prediction with shallow readability indicators is problematic [4]: “For example, if we compare the Newsela corpus and Slovenian SB corpus, which both cover roughly the same age group, we can see that for some readability indicators (FRE, FKGL, DCRF, and ASL) the values are on entirely different scales”.

Multilingual approaches

Neural networks and language models

This approach falls into the language-dependent class of NLP modeling approaches. These models do not try to model individual words to derive features and thus do not require language-specific parsing, but rather treat the text as a stream of characters. The main reference is Martinc et al. (2021) [4], but several other studies [8][9][10][11][12] show that BERT-based (or similar) features, extracted from sentences or documents, perform similarly to or better than hand-crafted linguistic features (e.g. those used in the readability formulas). Some of the models used, such as multilingual BERT, support a growing number of languages. There is evidence that transfer learning yields good results both when applying a trained model to a different corpus and when applying it to a different language for which no labeled data is available.

Entity-based

This approach falls into the language-agnostic class of NLP modeling approaches. We represent each sentence as a sequence of entities using an entity linker. In this way we can avoid parsing the text altogether and end up with a language-agnostic representation. This requires a good entity linker, but there are many existing open models such as dbpedia-spotlight (which is open/free). The appeal is that we could train a supervised classifier on labeled data obtained in one language and apply the model to other languages without additional training, since the data is represented in a language-agnostic way. The main idea is sketched in Štajner & Hulpuș (2020) [13], which shows that shallow features extracted from a text represented as entities are able to capture many aspects of readability. The original idea was to capture the conceptual complexity of the text by taking into account the relationships between entities in a knowledge graph [14]. There is even an open tool called CoCo (conceptual complexity) [15] (code on GitHub), which uses dbpedia-spotlight as an entity linker. Interestingly, scores from CoCo using deeper semantic features cannot differentiate well between texts from English and Simple Wikipedia, which the authors claim is expected. In contrast, shallower features such as average sentence length differ substantially across the two datasets.

Therefore, the main advantage of the entity-based approach is that it yields a language-agnostic representation of the text, avoiding the need to parse individual words or sub-components such as syllables, which is not easily applicable to many languages. This representation can then be used to obtain shallow features (such as the number of unique entities per sentence) that capture some aspects of readability in a language-agnostic way.
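
A minimal Python sketch of such shallow entity features is given below. It assumes that an entity linker has already been run and that each sentence is available as a list of entity identifiers; the function name, the feature names, and the placeholder IDs (`E1`, `E2`, ...) are hypothetical and do not correspond to the feature set of any of the cited papers or tools.

```python
from statistics import mean


def entity_features(sentences):
    """Compute shallow features from sentences given as lists of entity IDs.

    `sentences` is the (assumed) output of an entity linker: one inner list
    of entity identifiers per sentence, in order of appearance.
    """
    if not sentences:
        raise ValueError("need at least one sentence")
    all_entities = [e for sentence in sentences for e in sentence]
    return {
        # Average number of entity mentions per sentence.
        "entities_per_sentence": mean(len(s) for s in sentences),
        # Average number of distinct entities per sentence.
        "unique_entities_per_sentence": mean(len(set(s)) for s in sentences),
        # Distinct entities divided by all mentions in the document.
        "unique_entity_ratio": (
            len(set(all_entities)) / len(all_entities) if all_entities else 0.0
        ),
    }


# Hypothetical linker output for a three-sentence document.
doc = [["E1", "E2", "E1"], ["E2", "E3"], []]
features = entity_features(doc)
```

Because the input contains only entity identifiers, the same function can be applied unchanged to documents in any language the entity linker supports.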

Translation

In this approach one applies traditional readability formulas (developed for English) to texts from other languages that have been translated into English [16][17]. It seems that the readability level predictions for translations of the same text are rarely consistent when using these formulas. Therefore this approach is not recommended.

Datasets

What are available datasets that contain texts with labels of readability that we can use for training models?

Simple Wikipedia

One common approach is to compare the same articles in English and Simple Wikipedia. This yields a set of parallel texts with two labels: simple (supposedly easier to read) and english (supposedly harder to read). Several papers have built such corpora [18][19][20]. Some of them can be readily downloaded at the level of both documents and sentences. The main advantage is the large size of the corpus in comparison to other corpora (~60k documents and 176k sentences). However, this corpus only contains English texts.

Links to download a corpus:

VikiWiki

This corpus compares articles from Wikipedia and Vikidia (Vikidia User Group). Vikidia is a children's encyclopedia created by Wikipedians from all around the world to address the lack of content adapted for children in the Wikimedia movement. At the moment there are 12 language versions: English, French, Italian, Spanish, Portuguese, German, Russian, Catalan, Greek, Armenian, Sicilian, and Basque. Therefore, a main advantage is that the corpus covers several languages. A corpus was developed for 6 languages [17] (English, Spanish, French, Italian, Catalan, and Basque), with 448 articles for each language and each reading level (viki and wiki, respectively).

Links to download the corpus:

Other

  • OneStopEnglish [21] is one of the most-used corpora for automatic readability assessment in English. It contains 189 texts in English, each written at 3 different reading levels (567 texts in total). The advantage is that the corpus is openly available (GitHub).
  • Newsela [22] contains around 2000 articles with 4 levels of simplification (i.e. 5 readability levels including the original, with about 10k articles in total). The data is not openly available, but can be requested at https://newsela.com/data/
  • WeeBit [23] contains 625 articles for each of the 5 classes capturing different age groups (7-8, 8-9, 9-10, 10-14, 14-16). The data is not openly available.

Readability in Wikipedia

Simple Wikipedia provides specific instructions on how to write articles in Simple English. An episode of dataskeptic describes how this leads to differences when comparing articles between English and Simple Wikipedia using Flesch-Kincaid.

Phabricator mentions readability:

  • Create tool to help improve readability for a given wiki page T91338.

Templates mention readability:

WikiProjects mention readability:

  • WikiProject Climate change suggests as one task to improve the readability: "Did you know many readers only read the first paragraph (the 'lede' or 'lead' section) of Wikipedia articles? Getting the lead right is thus an important task. Check out this list of articles where the article introduction needs to be rewritten. The lead section should provide a concise overview of the topic, and summarize the different parts of it. There are some detailed style guidelines that cover formatting. A common problem with lead sections is that they don't mention all the topics that are discussed in the body of the article, or that they are not very readable. Improve readability or grammar of articles - if you find an article that is hard to understand because of the writing, or has poor grammar, this is an improvement you can make without knowing much about the subject."

Papers investigate readability of articles on Wikipedia:

  • Brezar & Heilman (2019) [7] compare readability scores of health-related articles in English Wikipedia using the Gunning fog, Flesch-Kincaid, SMOG, and Flesch reading ease tests. They find that most of the health information remains written at a level above the reading ability of average adults.
  • Lucassen et al. (2012) [6] compare English and Simple Wikipedia using the Flesch reading ease test. They compare only articles that exist in both editions and have at least 5 sentences (the readability score stabilizes only for articles with more than 5 sentences). They find that overall readability is poor in English Wikipedia; Simple Wikipedia has a better readability score, but it is still insufficient for its target audience.
  • Napoles & Dredze (2010) [24] use articles from Simple Wikipedia and ordinary Wikipedia to evaluate different classifiers and feature sets and to identify the most discriminative features of simple English for use across domains (features: tokens, types, sentences, average sentence length, type-token ratio, % simple words, etc.).
  • Park et al. (2015) [25] evaluate the complexity of edits by multilingual editors (before vs. after) through, e.g., #chars, #words, #sentences, #unique-words, average sentence length, and word-frequency entropy. The majority of edits in a non-primary language are short and simple, though many of them are just as long and complex as primary-language edits.
  • Reavley et al. (2012) [26] use the Flesch-Kincaid grade level index to assess the readability of mental-health-related articles. "Across all topics, Wikipedia was the most highly rated in all domains except readability."
  • Yasseri et al. (2012) [27] compare Simple and English Wikipedia using the Gunning fog index and vocabulary richness (Herdan's C). They find less frequent use of complex words and the use of shorter sentences in Simple Wikipedia, which lead to a reduction in the Gunning fog index.
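
Several of the measures used in these studies are simple to compute. The following minimal Python sketch shows the Gunning fog index, Herdan's C, and word-frequency entropy in their textbook forms. It is not the code of any of the cited papers: the function names are chosen for this example, the tokenizer and syllable counter are rough English-only assumptions, and the fog index here counts every word with three or more (heuristic) syllables as complex, without the usual exclusions for proper nouns and common suffixes.

```python
import math
import re
from collections import Counter


def words_of(text):
    """Lowercased alphabetic tokens; a crude stand-in for a real tokenizer."""
    return [w.lower() for w in re.findall(r"[A-Za-z]+", text)]


def count_syllables(word):
    """Rough English heuristic: count groups of consecutive vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def gunning_fog(text):
    """0.4 * (average sentence length + percentage of complex words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = words_of(text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word")
    n_complex = sum(1 for w in words if count_syllables(w) >= 3)
    return 0.4 * (len(words) / len(sentences) + 100 * n_complex / len(words))


def herdan_c(words):
    """Vocabulary richness: log(number of types) / log(number of tokens)."""
    if len(words) < 2:
        raise ValueError("need at least two tokens")
    return math.log(len(set(words))) / math.log(len(words))


def word_entropy(words):
    """Shannon entropy (in bits) of the word-frequency distribution."""
    if not words:
        raise ValueError("need at least one token")
    n = len(words)
    return -sum((c / n) * math.log2(c / n) for c in Counter(words).values())


fog = gunning_fog("The cat sat on the mat.")
richness = herdan_c(["a", "b", "a", "b"])
entropy = word_entropy(["a", "b", "a", "b"])
```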

Existing tools

A non-exhaustive list of tools

References

  1. Collins-Thompson, K. (2014). Computational assessment of text readability: A survey of current and future research. ITL-International Journal of Applied Linguistics, 165(2), 97–135. https://doi.org/10.1075/itl.165.2.01col
  2. François, T. (2015). When readability meets computational linguistics: a new paradigm in readability. Revue Francaise de Linguistique Appliquee, Vol. XX(2), 79–97. https://www.cairn.info/revue-francaise-de-linguistique-appliquee-2015-2-page-79.htm?ref=doi
  3. Vajjala, S. (2021). Trends, Limitations and Open Challenges in Automatic Readability Assessment Research. In arXiv [cs.CL]. arXiv. http://arxiv.org/abs/2105.00973
  4. Martinc, M., Pollak, S., & Robnik-Šikonja, M. (2021). Supervised and Unsupervised Neural Approaches to Text Readability. Computational Linguistics, 47(1), 141–179. https://doi.org/10.1162/coli_a_00398
  5. Dale, E., & Chall, J. S. (1949). The Concept of Readability. Elementary English, 26(1), 19–26. http://www.jstor.org/stable/41383594
  6. Lucassen, T., Dijkstra, R., & Schraagen, J. M. (2012). Readability of Wikipedia. First Monday. https://doi.org/10.5210/fm.v0i0.3916
  7. Brezar, A., & Heilman, J. (2019). Readability of English Wikipedia’s health information over time. In WikiJournal of Medicine (Vol. 6, Issue 1, p. 7). https://doi.org/10.15347/wjm/2019.007
  8. Deutsch, T., Jasbi, M., & Shieber, S. (2020). Linguistic Features for Readability Assessment. In arXiv [cs.CL]. arXiv. http://arxiv.org/abs/2006.00377
  9. Imperial, J. M. (2021). BERT Embeddings for Automatic Readability Assessment. In arXiv [cs.CL]. arXiv. http://arxiv.org/abs/2106.07935
  10. Madrazo Azpiazu, I., & Pera, M. S. (2020). An Analysis of Transfer Learning Methods for Multilingual Readability Assessment. Adjunct Publication of the 28th ACM Conference on User Modeling, Adaptation and Personalization, 95–100. https://doi.org/10.1145/3386392.3397605
  11. Meng, C., Chen, M., Mao, J., & Neville, J. (2020). ReadNet: A Hierarchical Transformer Framework for Web Article Readability Analysis. Advances in Information Retrieval, 12035, 33. https://doi.org/10.1007/978-3-030-45439-5_3
  12. Mohammadi, H., & Khasteh, S. H. (2019). Text as Environment: A Deep Reinforcement Learning Text Readability Assessment Model. In arXiv [cs.CL]. arXiv. http://arxiv.org/abs/1912.05957
  13. Štajner, S., & Hulpuș, I. (2020). When shallow is good enough: Automatic assessment of conceptual text complexity using shallow semantic features. Proceedings of the 12th Language Resources and Evaluation Conference, 1414–1422. https://www.aclweb.org/anthology/2020.lrec-1.177/
  14. Štajner, S., & Hulpuş, I. (2018). Automatic assessment of conceptual text complexity using knowledge graphs. Proceedings of the 27th International Conference on Computational Linguistics, 318–330. https://www.aclweb.org/anthology/C18-1027.pdf
  15. Štajner, S., Nisioi, S., & Hulpus, I. (2020). CoCo: A tool for automatically assessing conceptual complexity of texts. In N. Calzolari (Ed.), LREC 2020 Marseille : Twelfth International Conference on Language Resources and Evaluation : May 11-16, 2020, Palais du Pharo, Marseille, France : conference proceedings (pp. 7179–7186). European Language Resources Association. https://madoc.bib.uni-mannheim.de/55731?rs=true&
  16. Ciobanu, A. M., Dinu, L. P., & Pepelea, F. (2015). Readability assessment of translated texts. Proceedings of the International Conference Recent Advances in Natural Language Processing, 97–103. https://www.aclweb.org/anthology/R15-1014.pdf
  17. Madrazo Azpiazu, I., & Pera, M. S. (2020). Is cross‐lingual readability assessment possible? Journal of the Association for Information Science and Technology, 71(6), 644–656. https://doi.org/10.1002/asi.24293
  18. Napoles, C., & Dredze, M. (2010). Learning simple Wikipedia: A cogitation in ascertaining abecedarian language. Proceedings of the NAACL HLT 2010 Workshop on Computational Linguistics and Writing: Writing Processes and Authoring Aids, 42–50. https://www.aclweb.org/anthology/W10-0406.pdf
  19. Hwang, W., Hajishirzi, H., Ostendorf, M., & Wu, W. (2015). Aligning sentences from standard wikipedia to simple wikipedia. Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 211–217. https://www.aclweb.org/anthology/N15-1022.pdf
  20. Kauchak, D. (2013). Improving text simplification language modeling using unsimplified text data. Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (volume 1: Long Papers), 1537–1546. https://www.aclweb.org/anthology/P13-1151.pdf
  21. Vajjala, S., & Lucic, I. (2018). OneStopEnglish corpus: A new corpus for automatic readability assessment and text simplification. In Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications. https://doi.org/10.18653/v1/w18-0535
  22. Xu, W., Callison-Burch, C., & Napoles, C. (2015). Problems in current text simplification research: New data can help. Transactions of the Association for Computational Linguistics, 3, 283–297. https://direct.mit.edu/tacl/article-abstract/doi/10.1162/tacl_a_00139/43283
  23. Vajjala, S., & Meurers, D. (2012). On improving the accuracy of readability classification using insights from second language acquisition. Proceedings of the Seventh Workshop on Building Educational Applications Using NLP, 163–173. https://www.aclweb.org/anthology/W12-2019.pdf
  24. Napoles, C., & Dredze, M. (2010). Learning simple Wikipedia: A cogitation in ascertaining abecedarian language. Proceedings of the NAACL HLT 2010 Workshop on Computational Linguistics and Writing: Writing Processes and Authoring Aids, 42–50. https://www.aclweb.org/anthology/W10-0406.pdf
  25. Park, S., Kim, S., Hale, S., Kim, S., Byun, J., & Oh, A. (2015). MultilingualWikipedia: Editors of Primary Language Contribute to More Complex Articles. Ninth International AAAI Conference on Web and Social Media. https://www.aaai.org/ocs/index.php/ICWSM/ICWSM15/paper/view/10648/10567
  26. Reavley, N. J., Mackinnon, A. J., Morgan, A. J., Alvarez-Jimenez, M., Hetrick, S. E., Killackey, E., Nelson, B., Purcell, R., Yap, M. B. H., & Jorm, A. F. (2012). Quality of information sources about mental disorders: a comparison of Wikipedia with centrally controlled web and printed sources. Psychological Medicine, 42(8), 1753–1762. https://doi.org/10.1017/S003329171100287X
  27. Yasseri, T., Kornai, A., & Kertész, J. (2012). A Practical Approach to Language Complexity: A Wikipedia Case Study [Computation and Language; Computers and Society; Data Analysis, Statistics and Probability; Physics and Society]. PloS One, 7(11), e48386. https://doi.org/10.1371/journal.pone.0048386