Machine learning models/Proposed/Language-agnostic reference risk

Model card
This page is an on-wiki machine learning model card.
A model card is a document about a machine learning model that seeks to answer basic questions about the model.
Model Information Hub
Model creator(s): Aitolkyn Baigutanova, Pablo Aragón, Muniza Aslam, and Diego Saez-Trumper
Model owner(s): Pablo Aragón and Diego Saez-Trumper
Code: Inference
Uses PII: No
In production?: TBA
This model uses edit history metadata to predict the likelihood that a reference will survive on a Wikipedia article.


This model card describes a model for estimating the likelihood that a reference will survive on a Wikipedia article, based on the historical editorial activity of revisions involving web domains used as references. We use this probability as a proxy for source reliability, which we refer to as reference risk. The features are relative to a given language edition. This model is a prototype and may still be substantially updated.

Motivation

The credibility of the content presented in Wikipedia articles largely depends on the reliability of its references. URL-based references, in particular, can vary widely in quality, ranging from highly credible sources to unreliable, spam, or misleading websites. The burden of verifying the reliability of a source lies with editors. This model therefore aims to provide insights into reference reliability based on the historical editorial activity around a source, supporting Wikipedia users in assessing the quality of cited sources. The model can guide editors in avoiding the addition of potentially low-quality sources and assist readers in critically evaluating the information they consume. The model is language-agnostic and hence works for all language editions of Wikipedia.

Users and uses

Use this model for
  • Overall reference risk assessment of Wikipedia articles that cite web domains as references
  • Reference-level risk assessment of individual URL-based references
Don't use this model for
  • Projects outside of Wikipedia
  • Namespaces outside of 0
  • Evaluating a domain that has no previous usage history in the target language Wikipedia
Current uses

Ethical considerations, caveats, and recommendations

  • The features were derived from the historical editorial activity of a web domain in the target Wikipedia edition. Therefore, the model does not explicitly predict domain reliability but presents a set of characteristics for human evaluation.
  • The model does not consider references that are not accompanied by a URL.
  • The same domain can have different feature values across wikis.
  • Some domains are assigned multiple labels by the community in the perennial sources list. In such cases, reference risk outputs the "worst" category. Note that widely used sources such as forbes.com and guardian.com fall into this category in the English Wikipedia [1] [2].

Model

The model outputs the following characteristics per page (a sketch of how the aggregate statistics relate to the per-reference features appears after the list):

  • Snapshot of the data from which the features were computed
  • Wiki code
  • Number of URL-based references
  • Proportion of deprecated/blacklisted domains
  • Minimum survival edit ratio of cited domains
  • Mean survival edit ratio of cited domains
  • Median survival edit ratio of cited domains
  • List of URL-based references. The model outputs the following features per such reference:
    • Parsed URL
    • Domain extracted from the URL
    • Classification of the domain in the local perennial sources list (if any)
    • Classification of the domain in the English perennial sources list (if any)
    • Survival edit ratio (survival), i.e., the proportion of survived edits since domain addition
    • Page count, i.e., the number of pages in the corresponding wiki where the domain was added
    • Editor count, i.e., the number of distinct editors in the corresponding wiki who cited the domain
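
For illustration, the page-level aggregates can be read as simple statistics over the per-reference features above. The following is a minimal sketch, assuming the per-reference fields shown in the example output (survival, psl_local, psl_enwiki) and an illustrative rule for counting deprecated/blacklisted domains; it is not the model's actual implementation.

# Illustrative aggregation of per-reference features into page-level characteristics
from statistics import mean, median

def aggregate_page_features(references: list[dict]) -> dict:
    survivals = [r["survival"] for r in references]
    # Assumption: a reference counts toward rr_score if either perennial
    # sources label marks its domain as Deprecated or Blacklisted.
    risky = [
        r for r in references
        if any(label in ("Deprecated", "Blacklisted")
               for label in (r.get("psl_local"), r.get("psl_enwiki")) if label)
    ]
    return {
        "n_refs": len(references),
        "rr_score": len(risky) / len(references) if references else 0.0,
        "survival_min": min(survivals) if survivals else None,
        "survival_mean": mean(survivals) if survivals else None,
        "survival_median": median(survivals) if survivals else None,
    }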

Performance

Implementation

Model architecture
The model does not employ machine learning algorithms; rather, it provides a set of precomputed features per reference.
Output schema
{
  snapshot: <string>, 
  wiki_db: <string>,
  n_refs: <int [0-]>, 
  rr_score: <float [0-1]>,
  survival_min: <float [0-1]>, 
  survival_mean: <float [0-1]>, 
  survival_median: <float [0-1]>, 
  references: <list of dicts>
}
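
The references field holds one dictionary per URL-based reference. As an illustration only, the schema above and the per-reference fields seen in the example below could be mirrored with Python dataclasses; the field names follow the feature list and the example output, not an official interface definition.

from dataclasses import dataclass
from typing import Optional

@dataclass
class Reference:
    url: str                   # parsed URL of the citation
    domain: str                # domain extracted from the URL
    psl_local: Optional[str]   # status in the local perennial sources list, if any
    psl_enwiki: Optional[str]  # status in the English perennial sources list, if any
    survival: float            # survival edit ratio, in [0, 1]
    page_count: int            # pages in the wiki where the domain was cited
    editors_count: int         # distinct editors who cited the domain

@dataclass
class ReferenceRiskOutput:
    snapshot: str              # data snapshot, e.g. "2024-06"
    wiki_db: str               # wiki code, e.g. "ruwiki"
    n_refs: int                # number of URL-based references
    rr_score: float            # proportion of deprecated/blacklisted domains
    survival_min: float
    survival_mean: float
    survival_median: float
    references: list[Reference]
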
Example input and output

Input

{"rev_id": 138858016, "lang": "ru"}

Output

{
  snapshot: "2024-06", 
  wiki_db: "ruwiki",
  n_refs: 11, 
  rr_score: 0.0,
  survival_min: 0.8883834999235511, 
  survival_mean: 0.9229938197364226, 
  survival_median: 0.922618437634958, 
  references: [
    Reference(
      url: "https://www.boxofficemojo.com/title/tt0287467/? 
      ref_=bo_se_r_1", 
      domain: "boxofficemojo.com', 
      psl_local: None, 
      psl_enwiki: None, 
      survival: 0.93153055037733, 
      page_count: 4101, 
      editors_count: 859
    ), 
    Reference(
      url: "http://www.guardian.co.uk/film/2002/jul/31/features.pedroalmodovar", 
      domain: "guardian.co.uk', 
      psl_local: None, 
      psl_enwiki: "No consensus", 
      survival: 0.9060755798909812, 
      page_count: 3813, 
      editors_count: 1397
    ), 
    ...
  ]
}

Data

The features the model computes for a given domain are collected from all historical appearances of that domain in a given wiki. The data is extracted from two tables available in the Wikimedia Data Lake: MediaWiki History and Wikitext History. The features are updated based on a monthly snapshot of these tables.

Additionally, we collect the reliability classification from the perennial sources list page. Specifically, we retrieve each domain and its corresponding status. We do not consider a source if either the domain or the status is missing from the current snapshot of the perennial sources list. The status of a source can be one of the following: Blacklisted, Deprecated, Generally Unreliable, No Consensus, or Generally Reliable. For frwiki, explicit labels are not assigned; in that special case, the model retrieves a brief description of the current consensus on source reliability.
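
When a domain carries more than one community label, reference risk reports the "worst" one (see the caveat above). A minimal sketch of that resolution, assuming a severity ordering that follows the status list above (the exact ordering and label capitalization used by the model are assumptions):

# Assumed severity ordering, from worst to best
STATUS_ORDER = [
    "Blacklisted",
    "Deprecated",
    "Generally Unreliable",
    "No Consensus",
    "Generally Reliable",
]

def worst_status(labels: list[str]) -> str | None:
    """Return the most severe status among a domain's community labels."""
    known = [s for s in labels if s in STATUS_ORDER]
    return min(known, key=STATUS_ORDER.index) if known else None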

Data pipeline

The data was collected using the Wikimedia Data Lake and the Wikimedia Analytics cluster. The data collection pipeline is as follows (a sketch of the core survival computation appears after the steps):

  1. Extract revisions with URL references.
  2. Extract domain from URL.
  3. Identify domain survival on pages. Aggregate for each (wiki_db, page_id, domain) to find the domain's first and last appearance on a given page.
  4. Compute the domain's survival edit ratio as the proportion of edits during which the domain stayed on the page out of the total number of edits since its addition.
  5. Compute domain-level features. Aggregate the survival edit ratio for each (wiki_db, domain) pair. Identify the number of pages the domain appeared on and the number of distinct editors who cited the domain.
  6. Extract domain status from the perennial sources list. Retrieve reliability status in the local and English wikis and add them to the final set of features.
  7. Merge computed features with domain status such that we have one set of features per domain in a given wiki.
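
A minimal sketch of steps 3-5, assuming each page's revision history is available as an ordered list of revisions with the set of cited domains per revision (a simplification of the actual Data Lake tables); distinct-editor counting is omitted for brevity:

from statistics import mean

def page_survival_ratio(revisions: list[dict], domain: str) -> float | None:
    # Step 4: proportion of edits, since the domain was first added to the
    # page, in which the domain was still present.
    first = next((i for i, r in enumerate(revisions) if domain in r["domains"]), None)
    if first is None:
        return None  # the domain never appeared on this page
    since_addition = revisions[first:]
    present = sum(1 for r in since_addition if domain in r["domains"])
    return present / len(since_addition)

def domain_level_survival(pages: dict[int, list[dict]], domain: str) -> dict:
    # Step 5: aggregate per-page ratios into domain-level features.
    ratios = [r for r in (page_survival_ratio(revs, domain) for revs in pages.values())
              if r is not None]
    return {
        "survival": mean(ratios) if ratios else None,  # aggregation rule is an assumption
        "page_count": len(ratios),                     # pages where the domain appeared
    }
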
Training data
  • Latest monthly snapshot of the mediawiki_history and mediawiki_wikitext_history tables
  • Perennial sources list classification extracted once per month
Test data

Licenses

  • Code:
  • Model:

Citation

Cite this model as:

@misc{name_year_modeltype,
   title={Model card title},
   author={Lastname, Firstname (and Lastname, Firstname and...)},
   year={year},
   url={this URL}
}