Machine learning models/Proposed/Reference Verification for Wikidata
This model card page currently has draft status: it is a piece of model documentation that is still being written. Once the model card is completed, this template should be removed.
| Model card | |
| --- | --- |
| This page is an on-wiki machine learning model card. | |
| Model Information Hub | |
| Model creator(s) | Gabriel Amaral, Odinaldo Rodrigues, and Elena Simperl |
| Model owner(s) | King's KG Lab |
| Model interface | https://www.wikidata.org/wiki/Wikidata:ProVe |
| Past performance | https://app.swaggerhub.com/apis/YihangZhao/pro-ve_api/1.0.0 |
| Publications | https://www.semantic-web-journal.net/content/prove-pipeline-automated-provenance-verification-knowledge-graphs-against-textual-sources |
| Code | https://github.com/King-s-Knowledge-Graph-Lab/RQV |
| Uses PII | No |
| In production? | Yes; hosted by the King's KG team on a King's VM. |
| Which projects? | https://www.wikidata.org/wiki/Wikidata:WikiProject_Reference_Verification#cite_note-1 |
This model takes a pair of sentences, a claim sentence and a provenance sentence drawn from the claim's reference URL, and predicts whether the provenance sentence supports the claim.
Most Wikidata claims are derived from external resources, so Wikidata contains numerous reference URLs that point to the original provenance of its claims. However, these references need to be verified to ensure their quality: the web documents behind the URLs change over time, and some claims may become outdated, e.g., a claim about someone's occupation. To verify a claim's references, a human must visit the web documents via their URLs, read each document in full, and then determine whether the page contains sentences that support the claim. This is a non-trivial and labor-intensive task for Wikidata editors, and it is becoming increasingly challenging due to the rapid growth of Wikidata.
With advances in machine learning and language models, we can avoid exhaustive manual inspection of large amounts of text, such as verifying a claim against its reference URLs. We have trained three language models, based on BERT and T5, on our own training dataset built from a crowdsourced labeled set. The first, T5-based, model verbalizes a Wikidata claim into a natural-language sentence so that it can be compared with the sentences extracted from the web document behind the claim's reference URL. The second, BERT-based, model finds the sentences most relevant to the verbalized claim among the sentences extracted from the web document. The third, BERT-based, model is fine-tuned on the crowdsourced training dataset to determine whether a sentence extracted from the web document supports the verbalized sentence.
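The sketch below illustrates how such a three-stage pipeline could be wired together with the Hugging Face transformers and sentence-transformers libraries. The checkpoint names and the verbalization prompt are placeholders for illustration, not the fine-tuned models actually deployed by ProVe.

# Illustrative sketch of a ProVe-style pipeline; checkpoint names and the
# verbalization prompt are placeholders, not the deployed fine-tuned models.
from transformers import T5ForConditionalGeneration, T5Tokenizer, pipeline
from sentence_transformers import SentenceTransformer, util

# 1. Verbalize a Wikidata claim (subject, property, object) into a sentence.
t5_tok = T5Tokenizer.from_pretrained("t5-base")            # placeholder checkpoint
t5 = T5ForConditionalGeneration.from_pretrained("t5-base")

def verbalize(subject, prop, obj):
    inputs = t5_tok(f"translate triple to text: {subject} | {prop} | {obj}",
                    return_tensors="pt")
    out = t5.generate(**inputs, max_new_tokens=40)
    return t5_tok.decode(out[0], skip_special_tokens=True)

# 2. Rank sentences from the referenced web page by relevance to the claim.
encoder = SentenceTransformer("all-MiniLM-L6-v2")          # placeholder checkpoint

def top_sentences(claim_sentence, page_sentences, k=5):
    scores = util.cos_sim(encoder.encode(claim_sentence, convert_to_tensor=True),
                          encoder.encode(page_sentences, convert_to_tensor=True))[0]
    ranked = sorted(zip(page_sentences, scores.tolist()), key=lambda x: -x[1])
    return [s for s, _ in ranked[:k]]

# 3. Classify each (claim, evidence) pair as SUPPORTS / REFUTES / NOT ENOUGH INFO.
classifier = pipeline("text-classification",
                      model="bert-base-uncased")           # placeholder for the fine-tuned model

def verdicts(claim_sentence, evidence_sentences):
    return [classifier({"text": claim_sentence, "text_pair": ev})
            for ev in evidence_sentences]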
Motivation
The motivation behind these models is to help Wikidata editors determine whether reference URLs support the corresponding Wikidata claims. By adopting these language models, an editor can avoid reading the entire web document behind a reference URL to check whether it supports the claim.
These models are deployed on a King's VM and are running live. You can try the tool as a Wikidata gadget on a Wikidata item page to see their results. Further details are available here: https://www.wikidata.org/wiki/Wikidata:WikiProject_Reference_Verification
Users and uses
Intended uses:

- Checking the supportiveness of a pair of claim and provenance sentences (see the usage sketch below)
- Finding the sentences most relevant to a given sentence within a long text
- Fact-checking for general purposes

Out-of-scope uses:

- Checking the supportiveness of overly long sentences; the model is designed for simple sentences such as Wikidata claims.
- Number-centric sentences such as birth and death dates; the model is designed to understand language-centric rather than numerical content.
- High-level logical inference, such as unseen fact checks or transitive fact checks.
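As an illustration of the first use case, a single pair of sentences could be scored along the following lines; the checkpoint name is a placeholder for the fine-tuned three-class classifier, so the label names shown in the comment are only indicative.

from transformers import pipeline

# Placeholder checkpoint; the deployed classifier is a fine-tuned 3-class BERT model.
clf = pipeline("text-classification", model="bert-base-uncased", top_k=None)

claim = "Douglas Adams was educated at St John's College."
evidence = "Adams attended St John's College, Cambridge, from 1971 to 1974."
print(clf({"text": claim, "text_pair": evidence}))
# With the fine-tuned model this would yield scores for SUPPORTS, REFUTES,
# and NOT ENOUGH INFO; with the placeholder checkpoint the labels are generic.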
Ethical considerations, caveats, and recommendations
Model
Performance
Implementation
- Window size (maximum sequence length): 512
- Embedding dimension: 768
- Vocabulary size: 30,522
- Total number of embedding parameters: 23,440,896
- Model architecture: BertForSequenceClassification
- Number of hidden layers: 12
- Number of attention heads: 12
- Intermediate size: 3,072
- Problem type: single-label classification (3 classes)
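These figures match a standard BERT-base configuration. Below is a minimal sketch of instantiating an equivalent three-class sequence classifier with the transformers library; the label mapping is assumed from the output format shown afterwards.

from transformers import BertConfig, BertForSequenceClassification

# BERT-base configuration matching the figures listed above; the label mapping
# is an assumption based on the output format shown below.
config = BertConfig(
    vocab_size=30522,
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
    max_position_embeddings=512,
    num_labels=3,
    problem_type="single_label_classification",
    id2label={0: "SUPPORTS", 1: "REFUTES", 2: "NOT ENOUGH INFO"},
    label2id={"SUPPORTS": 0, "REFUTES": 1, "NOT ENOUGH INFO": 2},
)
model = BertForSequenceClassification(config)  # randomly initialized; for illustration only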
For a given pair of sentences, the model returns a score in (0, 1) for each of the three classes:

{
  sentences: [<sentence-1>, <sentence-2>],
  results: {
    SUPPORTS: <score in (0, 1)>,
    REFUTES: <score in (0, 1)>,
    NOT ENOUGH INFO: <score in (0, 1)>
  }
}
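A small sketch of consuming a response in this format and turning it into a single verdict; the field names follow the schema above, and the example sentences and scores are invented.

import json

# Example response following the format above; the scores are made up.
raw = '''
{
  "sentences": ["Douglas Adams was educated at St John's College.",
                "Adams attended St John's College, Cambridge, from 1971 to 1974."],
  "results": {"SUPPORTS": 0.91, "REFUTES": 0.03, "NOT ENOUGH INFO": 0.06}
}
'''
response = json.loads(raw)

# The verdict for the pair is the class with the highest score.
verdict = max(response["results"], key=response["results"].get)
print(verdict)  # SUPPORTS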
Data
Training dataset information on GitHub: https://github.com/gabrielmaia7/RSP?tab=readme-ov-file
Licenses
- Code:
- Model:
Citation
Cite this model as:
@misc{name_year_modeltype,
title={Model card title},
author={Lastname, Firstname (and Lastname, Firstname and...)},
year={year},
url={this URL}
}