Machine learning models/Proposed/Language-agnostic reference risk

Model card
This page is an on-wiki machine learning model card.
A model card is a document about a machine learning model that seeks to answer basic questions about the model.
Model Information Hub
Model creator(s): Aitolkyn Baigutanova, Pablo Aragón, Muniza Aslam, and Diego Saez-Trumper
Model owner(s): Pablo Aragón and Diego Saez-Trumper
Code: Inference
Uses PII: No
In production?: TBA
This model uses edit history metadata to predict the likelihood that a reference will survive on a Wikipedia article.


This model card describes a model for estimating the likelihood that a reference will survive on a Wikipedia article, based on the historical editorial activity of revisions involving web domains used as references. We use this probability as a proxy for source reliability, which we refer to as reference risk. The features are relative to a given language edition. This model is a prototype and may still be substantially updated.

Motivation

The credibility of the content presented in Wikipedia articles largely depends on the reliability of its references. URL-based references, in particular, can vary widely in quality, ranging from highly credible sources to unreliable, spam, or misleading websites. The burden of verifying the reliability of a source lies with editors. This model therefore aims to provide insights into reference reliability based on the historical editorial activity around a source, supporting Wikipedia users in assessing the quality of cited sources. The model can guide editors in avoiding the addition of potentially low-quality sources and assist readers in critically evaluating the information they consume. The model is language-agnostic and hence works for all language editions of Wikipedia.

Users and uses

Use this model for
  • Overall reference risk assessment of Wikipedia articles that cite web domains as references
  • Reference-level risk assessment of individual URL-based references
Don't use this model for
  • Projects outside of Wikipedia
  • Namespaces outside of 0
  • Evaluating a domain that has no previous usage history in the target language Wikipedia
Current uses

Ethical considerations, caveats, and recommendations

  • The features were derived from the historical editorial activity of a web domain in the target Wikipedia edition. Therefore, the model does not explicitly predict domain reliability but presents a set of characteristics for human evaluation.
  • The model does not consider references that are not accompanied by a URL.
  • The same domain can have different feature values across wikis.
  • Some domains are assigned multiple labels by the community in the perennial sources list. In such cases, reference risk outputs the "worst" category. Note that widely used sources such as forbes.com and guardian.com fall into this category in the English Wikipedia [1] [2].

Model

The model outputs the following characteristics per page (a sketch of how the aggregate statistics relate to the per-reference features appears after the list):

  • Snapshot of the data from which the features were computed
  • Wiki code
  • Number of URL-based references
  • Proportion of deprecated/blacklisted domains
  • Minimum survival edit ratio of cited domains
  • Mean survival edit ratio of cited domains
  • Median survival edit ratio of cited domains
  • List of URL-based references. The model outputs the following features per such reference:
    • Parsed URL
    • Domain extracted from the URL
    • Classification of the domain in the local perennial sources list (if any)
    • Classification of the domain in the English perennial sources list (if any)
    • Survival edit ratio (survival), i.e., the proportion of survived edits since domain addition
    • Page count, i.e., the number of pages in the corresponding wiki where the domain was added
    • Editor count, i.e., the number of distinct editors in the corresponding wiki who cited the domain
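
For illustration, the page-level aggregates can be read as simple statistics over the per-reference features above. The following is a minimal sketch, assuming the per-reference fields shown in the example output (survival, psl_local, psl_enwiki) and an illustrative rule for counting deprecated/blacklisted domains; it is not the model's actual implementation.

# Illustrative aggregation of per-reference features into page-level characteristics
from statistics import mean, median

def aggregate_page_features(references: list[dict]) -> dict:
    survivals = [r["survival"] for r in references]
    # Assumption: a reference counts toward rr_score if either perennial
    # sources label marks its domain as Deprecated or Blacklisted.
    risky = [
        r for r in references
        if any(label in ("Deprecated", "Blacklisted")
               for label in (r.get("psl_local"), r.get("psl_enwiki")) if label)
    ]
    return {
        "n_refs": len(references),
        "rr_score": len(risky) / len(references) if references else 0.0,
        "survival_min": min(survivals) if survivals else None,
        "survival_mean": mean(survivals) if survivals else None,
        "survival_median": median(survivals) if survivals else None,
    }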

Performance

Implementation

Model architecture
The model does not employ machine learning algorithms; rather, it provides a set of precomputed features per reference.
Output schema
{
  snapshot: <string>, 
  wiki_db: <string>,
  n_refs: <int [0-]>, 
  rr_score: <float [0-1]>,
  survival_min: <float [0-1]>, 
  survival_mean: <float [0-1]>, 
  survival_median: <float [0-1]>, 
  references: <list of dicts>
}
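
The references field holds one dictionary per URL-based reference. As an illustration only, the schema above and the per-reference fields seen in the example below could be mirrored with Python dataclasses; the field names follow the feature list and the example output, not an official interface definition.

from dataclasses import dataclass
from typing import Optional

@dataclass
class Reference:
    url: str                   # parsed URL of the citation
    domain: str                # domain extracted from the URL
    psl_local: Optional[str]   # status in the local perennial sources list, if any
    psl_enwiki: Optional[str]  # status in the English perennial sources list, if any
    survival: float            # survival edit ratio, in [0, 1]
    page_count: int            # pages in the wiki where the domain was cited
    editors_count: int         # distinct editors who cited the domain

@dataclass
class ReferenceRiskOutput:
    snapshot: str              # data snapshot, e.g. "2024-06"
    wiki_db: str               # wiki code, e.g. "ruwiki"
    n_refs: int                # number of URL-based references
    rr_score: float            # proportion of deprecated/blacklisted domains
    survival_min: float
    survival_mean: float
    survival_median: float
    references: list[Reference]
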
Example input and output

Input

{"rev_id": 138858016, "lang": "ru"}

Output

{
  snapshot: "2024-06", 
  wiki_db: "ruwiki",
  n_refs: 11, 
  rr_score: 0.0,
  survival_min: 0.8883834999235511, 
  survival_mean: 0.9229938197364226, 
  survival_median: 0.922618437634958, 
  references: [
    Reference(
      url: "https://www.boxofficemojo.com/title/tt0287467/? 
      ref_=bo_se_r_1", 
      domain: "boxofficemojo.com', 
      psl_local: None, 
      psl_enwiki: None, 
      survival: 0.93153055037733, 
      page_count: 4101, 
      editors_count: 859
    ), 
    Reference(
      url: "http://www.guardian.co.uk/film/2002/jul/31/features.pedroalmodovar", 
      domain: "guardian.co.uk', 
      psl_local: None, 
      psl_enwiki: "No consensus", 
      survival: 0.9060755798909812, 
      page_count: 3813, 
      editors_count: 1397
    ), 
    ...
  ]
}

Data

The features the model computes for a given domain are collected from all historical appearances of that domain in a given wiki. The data is extracted from two tables available in the Wikimedia Data Lake: MediaWiki History and Wikitext History. The features are updated based on a monthly snapshot of these tables.

Additionally, we collect the reliability classification from the perennial sources list page. Specifically, we retrieve each domain and its corresponding status. We do not consider a source if either the domain or the status is missing from the current snapshot of the perennial sources list. The status of a source can be one of the following: Blacklisted, Deprecated, Generally Unreliable, No Consensus, or Generally Reliable. For frwiki, explicit labels are not assigned; in that special case, the model retrieves a brief description of the current consensus on source reliability.
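
When a domain carries more than one community label, reference risk reports the "worst" one (see the caveat above). A minimal sketch of that resolution, assuming a severity ordering that follows the status list above (the exact ordering and label capitalization used by the model are assumptions):

# Assumed severity ordering, from worst to best
STATUS_ORDER = [
    "Blacklisted",
    "Deprecated",
    "Generally Unreliable",
    "No Consensus",
    "Generally Reliable",
]

def worst_status(labels: list[str]) -> str | None:
    """Return the most severe status among a domain's community labels."""
    known = [s for s in labels if s in STATUS_ORDER]
    return min(known, key=STATUS_ORDER.index) if known else None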

Data pipeline

The data was collected using the Wikimedia Data Lake and the Wikimedia Analytics cluster. The data collection pipeline is as follows (a sketch of the core survival computation appears after the steps):

  1. Extract revisions with URL references.
  2. Extract domain from URL.
  3. Identify domain survival on pages. Aggregate for each (wiki_db, page_id, domain) to find the domain's first and last appearance on a given page.
  4. Compute the domain's survival edit ratio as the proportion of edits during which the domain stayed on the page out of the total number of edits since its addition.
  5. Compute domain-level features. Aggregate the survival edit ratio for each (wiki_db, domain) pair. Identify the number of pages the domain appeared on and the number of distinct editors who cited the domain.
  6. Extract domain status from the perennial sources list. Retrieve reliability status in the local and English wikis and add them to the final set of features.
  7. Merge computed features with domain status such that we have one set of features per domain in a given wiki.
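
A minimal sketch of steps 3-5, assuming each page's revision history is available as an ordered list of revisions with the set of cited domains per revision (a simplification of the actual Data Lake tables); distinct-editor counting is omitted for brevity:

from statistics import mean

def page_survival_ratio(revisions: list[dict], domain: str) -> float | None:
    # Step 4: proportion of edits, since the domain was first added to the
    # page, in which the domain was still present.
    first = next((i for i, r in enumerate(revisions) if domain in r["domains"]), None)
    if first is None:
        return None  # the domain never appeared on this page
    since_addition = revisions[first:]
    present = sum(1 for r in since_addition if domain in r["domains"])
    return present / len(since_addition)

def domain_level_survival(pages: dict[int, list[dict]], domain: str) -> dict:
    # Step 5: aggregate per-page ratios into domain-level features.
    ratios = [r for r in (page_survival_ratio(revs, domain) for revs in pages.values())
              if r is not None]
    return {
        "survival": mean(ratios) if ratios else None,  # aggregation rule is an assumption
        "page_count": len(ratios),                     # pages where the domain appeared
    }
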
Training data
  • Latest monthly snapshot of the mediawiki_history and mediawiki_wikitext_history tables
  • Perennial sources list classification extracted once per month
Test data

Licenses

  • Code:
  • Model:

Citation

Cite this model as:

@misc{name_year_modeltype,
   title={Model card title},
   author={Lastname, Firstname (and Lastname, Firstname and...)},
   year={year},
   url={this URL}
}