
Research:ReferenceRisk

Created: 15:22, 22 February 2024 (UTC)
Duration: 2024-February – 2024-September
Topics: References, Knowledge Integrity, Disinformation

This page documents a research project in progress.
Information may be incomplete and change as the project progresses.
Please contact the project lead before formally citing or reusing results from this page.

What is this project?


A typical Wikipedia article is built from three atomic units that combine to craft the claims we read: 1) the editor who creates the edit, 2) the edit itself, and 3) the reference that informs the edit. This project focuses on the last of the three.

Wikipedia's verifiability principle expects all editors to be responsible for the content they add, ruling that the "burden to demonstrate verifiability lies with the editor who adds or restores material". Were this edict followed to the letter, every claim across Wikipedia would be dutifully cited inline. Of course, life falls short of perfection, and it is exactly the inherently imperfect participation of the human editor that leads to change, debate and flux, creating “quality” claims and articles, by any standard, in the long term.[citation needed]

Then there is the additional task of understanding the reference itself. What is in the reference? Where does it come from? Who made it? Wikipedia communities have made various efforts to lessen that task, most notably the reliable sources list.

Yet, there is no silver-bullet solution to understanding how our communities, across languages and projects, manage citation quality.

A basic visualization of this ML model

This project is a collaboration between Wikimedia Enterprise and Research with the set goal of refining and productionizing the Research team's citation quality ML model from the paper "Longitudinal Assessment of Reference Quality on Wikipedia" [1]. We seek to lessen the burden of understanding the quality of a single reference. The result will cater to everyone from individual volunteer editors to high-volume third-party reusers.

Both Research and Enterprise understand that a broad range of actors in the online knowledge environment stand to benefit from the ability to evaluate citations at scale and in near real time.

Because manually inspecting sources or developing external algorithmic methods is costly and time-consuming, we would like to host a scoring model that customers and the community can leverage to automatically identify low- and high-quality citation data.

Components


We originally operationalized reference quality with two metrics: 1) reference need, which measures the proportion of claims in the article content that are missing citations, and 2) reference risk, which evaluates the proportion of risky references among the ones cited in an article [1]. Here, we elaborate on how the two scores are modified for production. The two models are developed separately and can be used independently of each other.

Reference Need


Our first score is reference need. We fine-tune the multilingual BERT (mBERT) language model to predict the probability that a sentence in an article requires a citation. With the predicted label for each sentence, we compute the overall reference need score for the article.

The original definition of reference need is the proportion of uncited sentences that need a citation. We make a slight modification and compute that proportion among uncited sentences only: any sentence to which an editor has already added a reference is considered to need a citation regardless of the model output. Hence, the model prediction is run only on uncited claims.
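The modified score can be illustrated with a minimal sketch. This is not the production implementation: `predict_citation_need` is a hypothetical stand-in for the fine-tuned mBERT classifier, and the 0.5 decision threshold is an assumption.

```python
from typing import List, Tuple


def predict_citation_need(sentence: str) -> float:
    """Hypothetical stand-in for the fine-tuned mBERT classifier.

    Returns the predicted probability that the sentence needs a citation.
    """
    raise NotImplementedError  # served by the production model in practice


def reference_need_score(sentences: List[Tuple[str, bool]],
                         threshold: float = 0.5) -> float:
    """Proportion of uncited sentences flagged as needing a citation.

    `sentences` holds (text, already_cited) pairs for one revision.
    Sentences that already carry a reference are treated as needing a
    citation by definition, so the model is only run on uncited ones.
    """
    uncited = [text for text, cited in sentences if not cited]
    if not uncited:
        return 0.0  # every sentence is already cited; nothing to score
    flagged = sum(predict_citation_need(text) >= threshold for text in uncited)
    return flagged / len(uncited)
```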

Reference Risk


Our second score tries to evaluate the quality of the cited sources themselves. However, since predicting reliability is inherently challenging, we instead focus on providing features that assist the user in making a self-assessment, ultimately leaving the decision on reliability to the user. Thus, the reference risk score evaluates the likelihood that an added reference survives on the page, which is inferred from the edit history metadata by source.

Findings


Reference Need


In this work, we fine-tune a multilingual BERT model for the reference need detection task. Our model takes a wiki_db and revision ID as input and computes the reference need score for the given revision. Per sentence, the model input includes the language code, the section title, the sentence itself, and the preceding and subsequent sentences in the paragraph. We trained on a sample of 20,000 sentences from featured articles of five wikis: English, Spanish, French, German, and Russian. Due to the trade-off between the accuracy and latency of the model, we limit the input context size to 128 tokens, although the maximum BERT accepts is 512. More details on the model can be found in the model card. The test data includes 3,000 sentences sampled from a holdout set of pages in our dataset. Performance of the model on the test set is reported below:

Accuracy: 0.706
ROC-AUC: 0.781
PR-AUC: 0.783
Precision: 0.705
F1-score: 0.707
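As a rough illustration of how the per-sentence input described above could be assembled and capped at 128 tokens, the sketch below uses the Hugging Face transformers tokenizer with a generic multilingual BERT checkpoint. The checkpoint name, field order, and separators are assumptions for this sketch and may differ from the production model.

```python
from transformers import AutoTokenizer

# Placeholder checkpoint; the production model is a fine-tuned mBERT classifier.
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")


def build_model_input(lang, section_title, sentence, previous_sentence, next_sentence):
    """Tokenize one sentence together with its context, truncated to 128 tokens."""
    text = " ".join([lang, section_title, previous_sentence, sentence, next_sentence])
    return tokenizer(text, truncation=True, max_length=128, return_tensors="pt")


# Illustrative call with made-up context.
encoded = build_model_input(
    lang="en",
    section_title="Early life",
    sentence="She moved to Paris in 1921.",
    previous_sentence="Her family settled in Lyon.",
    next_sentence="There she began publishing poetry.",
)
print(encoded["input_ids"].shape)  # at most 128 token ids
```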

Reference Risk


We examine historical occurrences of domains in Wikipedia articles up to the year 2024 to identify informative features. The feature we found meaningful as a reference risk indicator is the survival edit ratio, which is the proportion of edits a domain survives since its first addition to a page. For example, if page A has a total of 100 revisions so far, and ‘bbc.com’ was added to page A in its 10th revision and still remains, then the survival edit ratio of ‘bbc.com’ is 90/100.
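The worked example above translates directly into a small sketch of the survival edit ratio, assuming we already know the revision at which a domain was first added and the page's current revision count; the inputs and the function name are illustrative.

```python
def survival_edit_ratio(first_added_revision: int,
                        last_surviving_revision: int,
                        total_revisions: int) -> float:
    """Proportion of a page's edits that a domain has survived.

    `first_added_revision` is the revision where the domain first appears;
    `last_surviving_revision` is the latest revision in which it is still
    present (the current revision if it was never removed).
    """
    survived_edits = last_surviving_revision - first_added_revision
    return survived_edits / total_revisions


# Example from the text: 'bbc.com' added at revision 10 of a page with
# 100 revisions and still present -> 90/100.
assert survival_edit_ratio(10, 100, 100) == 0.9
```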

We utilize the community-maintained perennial sources list as our ground-truth labeling. The list includes five categories: blacklisted (B), deprecated (D), generally unreliable (GU), no consensus (NC), and generally reliable (GR). We merge the first two categories as undesirable to use and the last two as no risk, leaving three groups for the comparison. The distribution of four aggregations (mean, median, 25th and 75th percentiles) of the target feature within the three groups is shown in the plots. We observe that sources in the no-risk category tend to survive more edits on the article.
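The per-group aggregations could be computed along the following lines with pandas. The category-to-group mapping mirrors the description above, but the column names and the example rows are made up for illustration; the real analysis runs over the full historical feature table.

```python
import pandas as pd

# Assumed mapping from perennial sources list categories to the three groups.
GROUPS = {
    "B": "undesirable",            # blacklisted
    "D": "undesirable",            # deprecated
    "GU": "generally unreliable",
    "NC": "no risk",               # no consensus
    "GR": "no risk",               # generally reliable
}

# Hypothetical per-domain feature rows, for illustration only.
df = pd.DataFrame({
    "domain": ["bbc.com", "example-blog.net", "tabloid.example"],
    "category": ["GR", "GU", "D"],
    "survival_edit_ratio": [0.90, 0.45, 0.20],
})
df["group"] = df["category"].map(GROUPS)

summary = df.groupby("group")["survival_edit_ratio"].agg(
    mean="mean",
    median="median",
    p25=lambda s: s.quantile(0.25),
    p75=lambda s: s.quantile(0.75),
)
print(summary)
```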

What’s next?

  • Post quarterly updates
  • Build community-centered performance testing strategy

Model Cards


References

  1. Aitolkyn Baigutanova, Jaehyeon Myung, Diego Saez-Trumper, Ai-Jou Chou, Miriam Redi, Changwook Jung, and Meeyoung Cha. 2023. Longitudinal Assessment of Reference Quality on Wikipedia. In Proceedings of the ACM Web Conference 2023 (WWW '23). Association for Computing Machinery, New York, NY, USA, 2831–2839. https://doi.org/10.1145/3543507.3583218