Machine learning models/Production/Language-agnostic Wikipedia article quality
Model card | |
---|---|
This page is an on-wiki machine learning model card. | |
Model Information Hub | |
Model creator(s) | Isaac Johnson |
Model owner(s) | Isaac Johnson |
Model interface | English Wikipedia example |
Past performance | documentation |
Code | Gitlab |
Uses PII | No |
This model uses the structure and size of an article to predict quality scores for Wikipedia articles. | |
This model card describes a model for predicting the quality of Wikipedia articles. It uses structural features extracted from the article, a simple set of weights, and wiki-specific normalization criteria to label Wikipedia articles in any language with a score between 0 and 1 (which can then be mapped to more widely recognized article quality classes such as Stub). These scores are relative to a given language edition (not directly comparable across languages). The weights and feature selection were trained on editor assessments from Arabic, English, and French Wikipedia. This model is a prototype and may still be substantially updated.
Motivation
Wikipedia articles range in quality from rich, well-illustrated, fully-referenced articles that fully cover their topic and are easy to read to single sentence stubs that define the topic of the article but do not offer much more information. It is very useful to be able to reliably distinguish between these extremes and the various stages of quality along this spectrum. Wikipedia editors have developed rich rubrics for how to evaluate the quality of Wikipedia articles and are constantly assessing article quality to assist in coordinating work on the wikis (English Wikipedia example). Editors use these quality scores to evaluate and prioritize their work. Researchers use these quality scores to understand content dynamics. Developers use these quality scores as filters when building recommender systems or other tools.
Wikipedia is ever-changing though, which makes it time-consuming (and largely impossible) for editors to keep these quality assessments complete and up-to-date. An automatic quality model can help fill these gaps by evaluating the quality for articles that are unassessed or have changed substantially since they were last assessed. In doing so, it can provide researchers and tool developers with more consistent data and even potentially help editors identify articles that would benefit from a human assessment. Initial models were language-specific, which allowed them to be finely-tuned to the dynamics and existing quality classes of a particular language edition. This model approach is language-agnostic (works for all Wikipedia language editions). The model may require further fine-tuning for a given community to better align its scores with existing quality classes, but this approach ensures that all language editions, even those lacking their own quality assessment schema, can benefit from these labels.
Users and uses
Use this model for
- high-level analyses of article quality trends (example)
- filtering / ranking articles in tools – e.g., only show low-quality articles in a recommender system
- identifying potential ways to improve articles – e.g., using the lowest-value feature from the model as a recommendation
Don't use this model for
- projects outside of Wikipedia – e.g., Wiktionary, Wikinews, etc.
- namespaces outside of 0, disambiguation pages, and redirects
- directly comparing article quality across language editions – the scores are mostly relative to a given Wikipedia, so e.g., an article that received a 0.5 score on English Wikipedia would get a much higher score if it had been on Simple English Wikipedia instead (because high-quality articles on English Wikipedia generally have more content than high-quality articles on Simple English Wikipedia)
Ethical considerations, caveats, and recommendations
- The weights used in this quality model were derived from a groundtruth dataset based on quality assessments made by editors on Arabic, English, and French Wikipedia (using the PageAssessments Extension). The model therefore reflects how editors in these communities weight the value of different aspects of an article, which may or may not extend to other language editions.
- The model does not currently take into account the quality of the specific writing, so a long article with many fake words would register as high quality. It does take into account structure though, so a long article would be penalized if it did not have many sections or was poorly referenced.
- The scores are relative for a given wiki – i.e., the feature scores needed to achieve a high quality prediction vary by wiki. For instance, a high-quality article on English Wikipedia is expected to have at least 3 images while only 2 are required on Swedish Wikipedia (and in fact, fewer than 5% of articles on Swedish Wikipedia have 2 or more images).
- The predicted scores in many wikis skew higher than the groundtruth assessments provided by Wikipedians. Some of this can be tempered by calibrating the thresholds used for mapping the predictions to classes (see the model evaluation for recommended thresholds), but even so, the model seems to be more optimistic than Wikipedians (likely capturing the many articles that have a lot of content but would perhaps benefit from improved readability, background, etc.).
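The mapping from scores to classes mentioned in the last point can be sketched as a simple threshold lookup. This is only an illustration: the function name and the evenly spaced cut-offs below are placeholders chosen for this example, not the recommended thresholds from the model evaluation, and real thresholds would need to be calibrated per wiki.

```python
from bisect import bisect_right

# Quality classes ordered from lowest to highest (English Wikipedia names).
CLASSES = ["Stub", "Start", "C", "B", "GA", "FA"]

# PLACEHOLDER cut-offs, evenly spaced purely for illustration.
# They are NOT the recommended thresholds from the model evaluation.
PLACEHOLDER_THRESHOLDS = [1 / 6, 2 / 6, 3 / 6, 4 / 6, 5 / 6]


def score_to_class(score, thresholds=PLACEHOLDER_THRESHOLDS):
    """Map a quality score in [0, 1] to a class label.

    `thresholds` must be sorted and contain one fewer entry than CLASSES;
    a score equal to a cut-off falls into the higher class.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    if len(thresholds) != len(CLASSES) - 1:
        raise ValueError("need exactly one threshold between each pair of classes")
    return CLASSES[bisect_right(thresholds, score)]


print(score_to_class(0.9019721559418881))
# Stricter (still hypothetical) cut-offs push the same score into a lower class.
print(score_to_class(0.9019721559418881, [0.3, 0.5, 0.7, 0.85, 0.95]))
```

Raising the cut-offs is one way a community could temper the model's optimism without changing the underlying scores.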
Model
Performance
Implementation
{
lang: <language-code string>,
title: <title string>,
quality: <float [0-1]>,
features: {
normalized: {
<feature 1>: <float [0-1]>,
...
<feature n>: <float [0-1]>
},
raw: {
<feature 1>: <int [0-]>,
...
<feature n>: <int [0-]>
}
}
}
Input
GET /api/v1/quality-article-features?lang=en&title=Frida_Kahlo
Output
{
"lang": "en",
"title": "Frida_Kahlo",
"quality": 0.9019721559418881,
"features": {
"normalized": {
"categories": 1,
"headings": 0.4688479565973076,
"length (bytes)": 1,
"media": 1,
"references": 0.8304080512730345,
"wikilinks": 0.4578991720468065
},
"raw": {
"categories": 28,
"headings": 19,
"length (bytes)": 123748,
"media": 20,
"references": 86,
"wikilinks": 351
}
}
}
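A minimal sketch of how a client might work with this endpoint is shown below. It builds the path and query string from the example request and then reads a response shaped like the sample output to find the lowest normalized feature, which the uses above suggest as a possible improvement recommendation. The helper names are hypothetical, and no host is included because none is given here.

```python
import json
from urllib.parse import urlencode

API_PATH = "/api/v1/quality-article-features"  # path taken from the example request


def build_request_path(lang: str, title: str) -> str:
    """Return the path plus query string for one article (host not included)."""
    query = urlencode({"lang": lang, "title": title})
    return f"{API_PATH}?{query}"


def lowest_normalized_feature(response: dict) -> tuple[str, float]:
    """Return the (name, value) of the smallest normalized feature score."""
    normalized = response["features"]["normalized"]
    name = min(normalized, key=normalized.get)
    return name, normalized[name]


# The sample output from above, parsed as JSON.
SAMPLE_RESPONSE = json.loads("""
{
  "lang": "en",
  "title": "Frida_Kahlo",
  "quality": 0.9019721559418881,
  "features": {
    "normalized": {
      "categories": 1, "headings": 0.4688479565973076, "length (bytes)": 1,
      "media": 1, "references": 0.8304080512730345, "wikilinks": 0.4578991720468065
    },
    "raw": {
      "categories": 28, "headings": 19, "length (bytes)": 123748,
      "media": 20, "references": 86, "wikilinks": 351
    }
  }
}
""")

print(build_request_path("en", "Frida_Kahlo"))
print(lowest_normalized_feature(SAMPLE_RESPONSE))
```

For the sample article, the lowest normalized feature is wikilinks, closely followed by headings, so either could be offered as a suggestion.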
Feature | Weight | Pre-processing | Minimum threshold for top quality[1] |
---|---|---|---|
Page length | 0.395 | Square-root of number of bytes in wikitext | 10000 characters |
References | 0.181 | # ref tags / normalized-page-length | 0.15 (roughly equivalent to 2 refs / section) |
Sections | 0.123 | Number of level 2 and 3 headings / normalized-page-length | 0.1 (1 heading at 100 chars, 2 headings at 400 chars, etc.) |
Wikilinks | 0.115 | Square root of # of wikilinks (ns=0) / normalized-page-length | 0.1 (~1 link per sentence) |
Media | 0.114 | raw count of number of media files – e.g., image, video, audio – in wikitext | 2 |
Categories | 0.070 | raw count of categories in wikitext | 5 |
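The table can be read as a two-step recipe: pre-process the raw counts, then combine the normalized features with the weights. The sketch below assumes that "normalized-page-length" means the square root of the byte length (which matches the heading examples in the table) and that the final score is a plain weighted sum. Both are assumptions, the feature keys are hypothetical, and the deployed API may apply further steps, so this is not expected to reproduce the quality value in the sample output.

```python
import math

# Weights copied from the table; the dictionary keys are hypothetical names.
WEIGHTS = {
    "length": 0.395,
    "references": 0.181,
    "headings": 0.123,
    "wikilinks": 0.115,
    "media": 0.114,
    "categories": 0.070,
}


def preprocess(raw):
    """Apply the table's pre-processing to raw counts.

    ASSUMPTION: normalized-page-length = sqrt(length in bytes).
    """
    if raw["length"] <= 0:
        raise ValueError("page length must be positive")
    norm_len = math.sqrt(raw["length"])
    return {
        "length": norm_len,
        "references": raw["references"] / norm_len,
        "headings": raw["headings"] / norm_len,
        "wikilinks": math.sqrt(raw["wikilinks"]) / norm_len,
        "media": raw["media"],
        "categories": raw["categories"],
    }


def weighted_quality(normalized):
    """ASSUMPTION: quality is the weighted sum of features already scaled to [0, 1]."""
    return sum(WEIGHTS[name] * normalized[name] for name in WEIGHTS)


# An article sitting exactly at the table's minimum thresholds for top quality.
at_minimum = preprocess({
    "length": 10000, "references": 15, "headings": 10,
    "wikilinks": 100, "media": 2, "categories": 5,
})
print(at_minimum)
print(weighted_quality({name: 1.0 for name in WEIGHTS}))
```

Because the weights in the table are rounded, they sum to 0.998 rather than exactly 1, so an article that maxes out every feature scores just under 1 in this sketch.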
Data
The model weights are based on 19,173 articles whose quality was assessed by editors in December 2021 across English (7,937), French (7,554), and Arabic (3,682) Wikipedia. The breakdown by quality class and language edition is as follows:
Quality class (based on English Wikipedia) | Language | Number articles |
---|---|---|
Stub | Arabic | 3448 |
Stub | English | 1726 |
Stub | French | 3811 |
Start | Arabic | 166 |
Start | English | 2056 |
Start | French | 2965 |
C | Arabic | 17 |
C | English | 2809 |
C | French | 601 |
B | Arabic | 15 |
B | English | 867 |
B | French | 76 |
GA | Arabic | 19 |
GA | English | 415 |
GA | French | 55 |
FA | Arabic | 17 |
FA | English | 64 |
FA | French | 46 |
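As a quick consistency check, the counts in the table can be tallied to confirm the per-language totals quoted above and to see how unevenly the classes are distributed. The function names here are illustrative only.

```python
# Article counts copied from the table, keyed by quality class then language.
COUNTS = {
    "Stub": {"Arabic": 3448, "English": 1726, "French": 3811},
    "Start": {"Arabic": 166, "English": 2056, "French": 2965},
    "C": {"Arabic": 17, "English": 2809, "French": 601},
    "B": {"Arabic": 15, "English": 867, "French": 76},
    "GA": {"Arabic": 19, "English": 415, "French": 55},
    "FA": {"Arabic": 17, "English": 64, "French": 46},
}


def totals_by_language(counts):
    """Sum the article counts over all quality classes for each language."""
    totals = {}
    for per_language in counts.values():
        for language, n in per_language.items():
            totals[language] = totals.get(language, 0) + n
    return totals


def class_share(counts, quality_class, language):
    """Fraction of a language's assessed articles that fall in one class."""
    return counts[quality_class][language] / totals_by_language(counts)[language]


totals = totals_by_language(COUNTS)
print(totals)                # {'Arabic': 3682, 'English': 7937, 'French': 7554}
print(sum(totals.values()))  # 19173
print(round(class_share(COUNTS, "Stub", "Arabic"), 3))
```

The tally matches the stated totals. It also shows that stubs account for 3,448 of the 3,682 Arabic articles, whereas the English sample is spread more evenly across classes.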
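For example, if the wiki-specific threshold for categories is 14, an article with 5 categories would have a score of min(1, 5/14) for that feature, while an article with 20 categories would have a score of 1 (min(1, 20/14)). Certain global minimum thresholds are also set at this stage, based on eye-balling the data.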
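The capping behaviour in the categories example can be sketched as follows. The function and parameter names are hypothetical, and treating the global minimum as a floor on the wiki-specific threshold (via `max`) is an assumption about how the two interact.

```python
def normalize_feature(value, wiki_threshold, global_minimum=0.0):
    """Scale a pre-processed feature value to [0, 1].

    ASSUMPTION: the effective threshold is the larger of the wiki-specific
    threshold and the global minimum threshold.
    """
    threshold = max(wiki_threshold, global_minimum)
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    if value < 0:
        raise ValueError("feature value must be non-negative")
    return min(1.0, value / threshold)


# The categories example from the text, with a threshold of 14.
print(normalize_feature(5, 14))   # 5/14
print(normalize_feature(20, 14))  # capped at 1.0

# With a global minimum of 2, a wiki-specific threshold of 1 is raised to 2.
print(normalize_feature(1, wiki_threshold=1, global_minimum=2))  # 0.5
```

Capping at 1 means that content beyond the threshold earns no extra credit for that feature.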
Licenses
- Code: MIT License
- Model: CC0 License
Citation
Cite this model as:
@misc{johnson2022quality,
title={Language-agnostic Wikipedia article quality model card},
author={Johnson, Isaac},
year={2022},
url = {https://meta.wikimedia.org/wiki/Machine_learning_models/Proposed/Language-agnostic_Wikipedia_article_quality_model_card},
}
References
[1] Most wikis have thresholds that are higher than this minimum.