Machine learning models/Production/Language-agnostic Wikipedia article quality

Model card
This page is an on-wiki machine learning model card.
	A model card is a document about a machine learning model that seeks to answer basic questions about the model.
Model Information Hub
Model creator(s)	Isaac Johnson
Model owner(s)	Isaac Johnson
Model interface	English Wikipedia example
Past performance	documentation
Code	Gitlab
Uses PII	No
	This model uses the structure and size of an article to predict quality scores for Wikipedia articles.

This model card describes a model for predicting the quality of Wikipedia articles. It uses structural features extracted from the article, together with a simple set of weights and wiki-specific normalization criteria, to label Wikipedia articles in any language with a score between 0 and 1 (which can then be mapped to more widely recognized article quality classes such as Stub). These scores are relative to a given language edition and are not directly comparable across languages. The weights and feature selection were trained on editor assessments from Arabic, English, and French Wikipedia. This model is a prototype and may still be substantially updated.

Motivation

Wikipedia articles range in quality from rich, well-illustrated, fully-referenced articles that fully cover their topic and are easy to read to single sentence stubs that define the topic of the article but do not offer much more information. It is very useful to be able to reliably distinguish between these extremes and the various stages of quality along this spectrum. Wikipedia editors have developed rich rubrics for how to evaluate the quality of Wikipedia articles and are constantly assessing article quality to assist in coordinating work on the wikis (English Wikipedia example). Editors use these quality scores to evaluate and prioritize their work. Researchers use these quality scores to understand content dynamics. Developers use these quality scores as filters when building recommender systems or other tools.

Wikipedia is ever-changing though, which makes it time-consuming (and largely impossible) for editors to keep these quality assessments complete and up-to-date. An automatic quality model can help fill these gaps by evaluating the quality for articles that are unassessed or have changed substantially since they were last assessed. In doing so, it can provide researchers and tool developers with more consistent data and even potentially help editors identify articles that would benefit from a human assessment. Initial models were language-specific, which allowed them to be finely-tuned to the dynamics and existing quality classes of a particular language edition. This model approach is language-agnostic (works for all Wikipedia language editions). The model may require further fine-tuning for a given community to better align its scores with existing quality classes, but this approach ensures that all language editions, even those lacking their own quality assessment schema, can benefit from these labels.

Users and uses

Use this model for

high-level analyses of article quality trends (example)
filtering / ranking articles in tools – e.g., only show low-quality articles in a recommender system
identifying potential ways to improve articles – e.g., using the lowest-value feature from the model as a recommendation
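
As a rough illustration of the last two uses, the sketch below filters a list of scored articles and picks each article's weakest normalized feature. The `records` list, the helper names, and the 0.4 cutoff are hypothetical; the dictionary layout follows the output schema described later on this page.

```python
def weakest_feature(normalized):
    """Return the name of the feature with the lowest normalized score."""
    return min(normalized, key=normalized.get)


def low_quality(records, cutoff):
    """Keep records whose overall quality score is below `cutoff`."""
    return [r for r in records if r["quality"] < cutoff]


# Hypothetical records shaped like the model output (values are made up).
records = [
    {"title": "A", "quality": 0.25,
     "features": {"normalized": {"references": 0.1, "media": 0.5}}},
    {"title": "B", "quality": 0.80,
     "features": {"normalized": {"references": 0.9, "media": 1.0}}},
]

for record in low_quality(records, cutoff=0.4):  # 0.4 is an arbitrary cutoff
    feature = weakest_feature(record["features"]["normalized"])
    print(record["title"], "could improve:", feature)
```

A real tool would likely choose the cutoff per wiki, since scores are relative to each language edition.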

Don't use this model for

projects outside of Wikipedia – e.g., Wiktionary, Wikinews, etc.
pages outside the article namespace (namespace 0), disambiguation pages, or redirects
directly comparing article quality across language editions – the scores are mostly relative to a given Wikipedia so e.g., an article that received a 0.5 score on English Wikipedia would get a much higher score if it had been on Simple English Wikipedia instead (because high-quality articles on English Wikipedia generally have more content than high-quality articles on Simple English Wikipedia)

Current uses

Ethical considerations, caveats, and recommendations

The weights used in this quality model were derived from a groundtruth dataset based on quality assessments made by editors on Arabic, English, and French Wikipedia (using the PageAssessments Extension). The model therefore reflects how editors in these communities weight the value of different aspects of an article, which may or may not extend to other language editions.
The model does not currently take into account the quality of the specific writing, so a long article with many fake words would register as high quality. It does take into account structure though, so a long article would be penalized if it did not have many sections or was poorly referenced.
The scores are relative to a given wiki – i.e., the feature values needed to achieve a high quality prediction vary by wiki. For instance, a high-quality article on English Wikipedia is expected to have at least 3 images while only 2 are required on Swedish Wikipedia (and in fact, less than 5% of articles on Swedish Wikipedia have 2 or more images).
The predicted scores in many wikis skew higher than the groundtruth assessments provided by Wikipedians. Some of this can be tempered by calibrating the thresholds used for mapping the predictions to classes (see the model evaluation for recommended thresholds), but even so, the model seems to be more optimistic than Wikipedians (likely because it rewards the many articles that have a lot of content but would benefit from improved readability, background, etc.).
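
Mapping a score to a quality class amounts to a threshold lookup. The sketch below assumes a list of ascending lower bounds, one per class above Stub. The cutoff values shown are evenly spaced placeholders, not the recommended thresholds from the model evaluation, and would need to be replaced with values calibrated for the wiki in question.

```python
from bisect import bisect_right

# Quality classes in ascending order (English Wikipedia naming).
CLASSES = ["Stub", "Start", "C", "B", "GA", "FA"]

# PLACEHOLDER lower bounds for Start, C, B, GA, FA (evenly spaced).
# These are NOT the calibrated thresholds from the model evaluation.
PLACEHOLDER_CUTOFFS = [1 / 6, 2 / 6, 3 / 6, 4 / 6, 5 / 6]


def score_to_class(score, cutoffs=PLACEHOLDER_CUTOFFS):
    """Map a 0-1 quality score to a class label using ascending cutoffs."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    # bisect_right counts how many cutoffs the score has reached.
    return CLASSES[bisect_right(cutoffs, score)]
```

Raising the cutoffs for a wiki is one way to temper the optimistic skew described above.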

Model

Performance

Implementation

Model architecture

The model is a linear regression without an intercept whose output is mapped to the range 0 to 1. It can also be thought of as a weighted average of the features, with the weights derived from the linear regression.
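
A minimal sketch of the weighted-average view, assuming the weights listed in the feature table on this page and features already normalized to the 0 to 1 range. The key names are this sketch's own labels, and dividing by the sum of the weights is an assumption made so that an article at the maximum on every feature scores exactly 1. The deployed model may use different weights or transformations, so this is not expected to reproduce API output exactly.

```python
# Weights copied from the feature table on this page; key names are
# hypothetical labels chosen for this sketch.
WEIGHTS = {
    "length": 0.395,
    "references": 0.181,
    "headings": 0.123,
    "wikilinks": 0.115,
    "media": 0.114,
    "categories": 0.070,
}


def quality_score(normalized):
    """Weighted average of normalized (0-1) features; no intercept term."""
    total = sum(WEIGHTS.values())  # assumption: rescale so the maximum is 1
    score = sum(WEIGHTS[name] * normalized[name] for name in WEIGHTS) / total
    return min(1.0, max(0.0, score))  # keep the result inside [0, 1]
```

Because there is no intercept, an article with zero on every feature scores 0, and each feature's contribution can be read directly from its weight.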

Output schema

{
  lang: <language-code string>,
  title: <title string>,
  quality: <float [0-1]>,
  features: {
    normalized: {
      <feature 1>: <float [0-1]>,
      ...
      <feature n>: <float [0-1]>
    },
    raw: {
      <feature 1>: <int (>= 0)>,
      ...
      <feature n>: <int (>= 0)>
    }
  }
}
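
For consumers of the output, the schema could be expressed as Python type hints with a small range check. The class and function names below are hypothetical and not part of the model's code; they only restate the ranges given in the schema.

```python
from typing import Dict, TypedDict


class Features(TypedDict):
    normalized: Dict[str, float]  # each value in [0, 1]
    raw: Dict[str, int]           # each value >= 0


class QualityResponse(TypedDict):
    lang: str
    title: str
    quality: float  # in [0, 1]
    features: Features


def check_response(resp):
    """Raise ValueError if `resp` violates the ranges stated in the schema."""
    if not 0 <= resp["quality"] <= 1:
        raise ValueError("quality out of range")
    for name, value in resp["features"]["normalized"].items():
        if not 0 <= value <= 1:
            raise ValueError(f"normalized {name} out of range")
    for name, value in resp["features"]["raw"].items():
        if value < 0:
            raise ValueError(f"raw {name} is negative")
    return True
```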

Example input and output

Input

GET /api/v1/quality-article-features?lang=en&title=Frida_Kahlo
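
The request above can be assembled programmatically. This sketch only builds the URL; the host is not stated on this page, so `https://example.org` is a placeholder and `build_request_url` is a hypothetical helper.

```python
from urllib.parse import urlencode


def build_request_url(base_url, lang, title):
    """Build the quality-article-features URL for one article."""
    query = urlencode({"lang": lang, "title": title})
    return f"{base_url.rstrip('/')}/api/v1/quality-article-features?{query}"


# "https://example.org" is a placeholder host, not the real service address.
print(build_request_url("https://example.org", "en", "Frida_Kahlo"))
```

Using `urlencode` keeps titles with spaces or non-ASCII characters safely escaped.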

Output

{
  "lang": "en",
  "title": "Frida_Kahlo",
  "quality": 0.9019721559418881,
  "features": {
    "normalized": {
      "categories": 1,
      "headings": 0.4688479565973076,
      "length (bytes)": 1,
      "media": 1,
      "references": 0.8304080512730345,
      "wikilinks": 0.4578991720468065
    },
    "raw": {
      "categories": 28,
      "headings": 19,
      "length (bytes)": 123748,
      "media": 20,
      "references": 86,
      "wikilinks": 351
    }
  }
}

Feature	Weight	Pre-processing	Minimum threshold for top quality^[1]
Page length	0.395	Square-root of number of bytes in wikitext	10000 characters
References	0.181	# ref tags / normalized-page-length	0.15 (roughly equivalent to 2 refs / section)
Sections	0.123	Number of level 2 and 3 headings / normalized-page-length	0.1 (1 heading at 100 chars, 2 headings at 400 chars, etc.)
Wikilinks	0.115	Square root of # of wikilinks (ns=0) / normalized-page-length	0.1 (~1 link per sentence)
Media	0.114	raw count of number of media files – e.g., image, video, audio – in wikitext	2
Categories	0.070	raw count of categories in wikitext	5
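
The pre-processing column can be sketched as follows. This assumes that "normalized-page-length" means the square root of the page length in bytes (which matches the heading example of 1 heading at 100 characters and 2 at 400), and that the wikilinks entry means the square root of the link count divided by that normalized length. The function and key names are hypothetical.

```python
import math


def preprocess(raw):
    """Apply the per-feature pre-processing listed in the table.

    `raw` holds counts: length in bytes, plus references, headings,
    wikilinks, media, and categories. Key names are this sketch's own.
    """
    if raw["length"] <= 0:
        raise ValueError("page length must be positive")
    norm_length = math.sqrt(raw["length"])  # assumed "normalized-page-length"
    return {
        "length": norm_length,
        "references": raw["references"] / norm_length,
        "headings": raw["headings"] / norm_length,
        # Assumed reading: sqrt(links) / normalized length.
        "wikilinks": math.sqrt(raw["wikilinks"]) / norm_length,
        "media": raw["media"],            # raw count, no transformation
        "categories": raw["categories"],  # raw count, no transformation
    }
```

Each pre-processed value would then be compared against its wiki-specific threshold to obtain the normalized 0 to 1 feature.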

Data

The model weights are based on 19,173 articles whose quality was assessed by editors in December 2021 across English (7,937), French (7,554), and Arabic (3,682) Wikipedia. The breakdown by quality class and language edition is as follows:

Groundtruth data distribution
Quality class (based on English Wikipedia)	Language	Number articles
Stub	Arabic	3448
Stub	English	1726
Stub	French	3811
Start	Arabic	166
Start	English	2056
Start	French	2965
C	Arabic	17
C	English	2809
C	French	601
B	Arabic	15
B	English	867
B	French	76
GA	Arabic	19
GA	English	415
GA	French	55
FA	Arabic	17
FA	English	64
FA	French	46

Data pipeline

The pipeline has two stages: 1) learning feature weights, and 2) deriving pre-processing thresholds. In the first stage, a small sample of data is used to learn the relative weight of each of the model features – e.g., categories, text, etc. This stage is also used for testing different feature transformations such as log-normalization. In the second stage, features for every Wikipedia article are computed, and the top 5% of articles for each wiki and feature are used to determine what a "high-quality" article should attain in a given wiki and therefore how to compute feature scores – e.g., if the top 5% of articles in English Wikipedia have 14 categories, then an article with 5 categories will have a score of 0.36 (min(1, 5/14)) for that feature while an article with 20 categories would have a score of 1 (min(1, 20/14)). Certain global minimum thresholds are also set at this stage based on eye-balling the data.
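
The second stage can be sketched in a few lines. The `normalize` function mirrors the min(1, value/threshold) formula in the paragraph above; `wiki_threshold` is a hypothetical helper, and its exact way of picking the top-5% cutoff is an assumption rather than the pipeline's actual method.

```python
def normalize(value, threshold):
    """Scale a feature against its wiki-specific threshold, capped at 1."""
    return min(1.0, value / threshold)


def wiki_threshold(values, global_minimum, top_share=0.05):
    """Pick the value that the top `top_share` of articles reach or exceed,
    but never go below the global minimum threshold."""
    if not values:
        raise ValueError("need at least one article")
    ranked = sorted(values, reverse=True)
    cutoff_index = max(0, int(len(ranked) * top_share) - 1)
    return max(global_minimum, ranked[cutoff_index])


print(round(normalize(5, 14), 2))  # 0.36, the categories example above
print(normalize(20, 14))           # 1.0, capped
```

The global minimum matters for wikis where even the top 5% of articles have very low values for a feature, since it prevents a trivially low bar for "high quality".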

Training data

A small sample of recently-assessed Wikipedia articles from the wikis described above was used to derive the model weights.

Test data

See this PAWS notebook for a detailed model evaluation. Testing the model on a sample of data from Arabic, French, and English from several months after the training shows a high correlation between model predictions and Wikipedian assessments. Note: this evaluation does not yet include any language editions not also included in the training data.

Licenses

Code: MIT License
Model: CC0 License

Citation

Cite this model as:

@misc{johnson2022quality,
   title={Language-agnostic Wikipedia article quality model card},
   author={Johnson, Isaac},
   year={2022},
   url = {https://meta.wikimedia.org/wiki/Machine_learning_models/Proposed/Language-agnostic_Wikipedia_article_quality_model_card},
}

References

[1] Most wikis have thresholds that are higher than this minimum.
