Machine learning models/Proposed/Wikidata item completeness
This model card page currently has a draft status. It is a piece of model documentation that is in the process of being written. Once the model card is completed, this template should be removed.
Model card | |
---|---|
This page is an on-wiki machine learning model card. | |
Model Information Hub | |
Model creator(s) | Isaac Johnson |
Model owner(s) | Isaac Johnson |
Model interface | https://wikidata-quality.wmcloud.org/api/item-scores |
Past performance | task T321224 |
Code | https://gitlab.wikimedia.org/isaacj/miscellaneous-wikimedia/-/tree/master/annotation-gap |
Uses PII | No |
In production? | No |
Which projects? | Wikidata |
This model uses Wikidata items to predict missing claims, references, and labels for Wikidata items. | |
This model aims to determine the completeness of a given Wikidata item by predicting which properties, labels, and references the item is still missing. It is similar to, but distinct from, the approach taken by prior Wikidata item quality models, whose scores correlate much more strongly with how many claims are present in an item regardless of that item's type (instance-of). It is inspired by approaches like Recoin that recommend missing claims for items, and it adds support for labels (based on sitelinks) and references (based on Amaral et al.[1]).
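To make the idea concrete, the following is a minimal sketch of a Recoin-style completeness estimate, not the model's actual implementation. It assumes that the properties expected for an item are those carried by at least half of the items sharing its type, and that a label is expected in every language for which the item has a sitelink. The function names, the `min_share` threshold, and the toy data are all hypothetical.

```python
from collections import Counter


def expected_properties(peer_items, min_share=0.5):
    """Map each property to the share of peer items that carry it.

    peer_items: list of sets of property IDs, one set per item of the same type.
    Only properties present on at least `min_share` of peers are kept
    (the threshold is an assumption made for this sketch).
    """
    counts = Counter(prop for props in peer_items for prop in props)
    total = len(peer_items)
    return {p: c / total for p, c in counts.items() if c / total >= min_share}


def claim_completeness(item_props, expected):
    """Frequency-weighted share of expected properties present on the item."""
    if not expected:
        return 1.0
    present = sum(w for p, w in expected.items() if p in item_props)
    return present / sum(expected.values())


def label_completeness(label_langs, sitelink_langs):
    """Share of sitelink languages for which the item also has a label."""
    sitelink_langs = set(sitelink_langs)
    if not sitelink_langs:
        return 1.0
    return len(set(label_langs) & sitelink_langs) / len(sitelink_langs)


# Toy data: three items of the same type (P31 = instance of,
# P21 = sex or gender, P569 = date of birth, P106 = occupation).
peers = [
    {"P31", "P21", "P569", "P106"},
    {"P31", "P21", "P569"},
    {"P31", "P21"},
]
expected = expected_properties(peers)
item = {"P31", "P21"}

# Expected properties that the item does not yet have.
missing = sorted(p for p in expected if p not in item)
print(missing)
print(round(claim_completeness(item, expected), 3))
print(label_completeness({"en", "fr"}, {"en", "fr", "de", "es"}))
```

Because expectations are derived per type, an item with few claims can still score as complete if items of its type rarely carry more, which is the distinction from claim-count-driven quality scores drawn above.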
Motivation
This model can serve a few important purposes:
- Estimating item completeness to help understand gaps in the Wikimedia projects. This is not just about Wikidata given that many Wikipedia articles draw important information from their corresponding Wikidata items.
- Helping editors on Wikidata to identify tasks: an estimate of completeness combined with some metric for priority (e.g., total number of sitelinks) would enable better recommender systems for Wikidata.
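As an illustration of the second point, the sketch below ranks items by combining a completeness estimate with a sitelink count. The scoring formula (completeness gap multiplied by the log of the sitelink count), the function name `task_priority`, and the item data are all assumptions made for this example and are not part of the model.

```python
import math


def task_priority(completeness, num_sitelinks):
    """Higher score = less complete item that is linked from more wikis."""
    gap = 1.0 - completeness
    # log1p dampens the effect of very large sitelink counts.
    return gap * math.log1p(num_sitelinks)


# Hypothetical items: (name, completeness in [0, 1], number of sitelinks).
items = [
    ("item_a", 0.90, 200),
    ("item_b", 0.40, 50),
    ("item_c", 0.20, 0),
]

ranked = sorted(items, key=lambda it: task_priority(it[1], it[2]), reverse=True)
print([name for name, _, _ in ranked])
```

With this particular formula an item with no sitelinks always scores zero, however incomplete it is. Whether that is desirable depends on the recommender's goals.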
Users and uses
- Suggesting properties/labels/references for editors to add (task recommender)
- Estimating Wikidata completeness (analysis)
- Automatically adding properties to Wikidata items
Ethical considerations, caveats, and recommendations
This model is based on the properties and references that currently exist on Wikidata. As such, to the degree that Wikidata is incomplete, the estimates of completeness made by the model will also be underestimates. It should be treated as a starting point, and its estimates will change as Wikidata evolves.
Model
Performance
Implementation
{
"item": "https://www.wikidata.org/wiki/Q20909",
"predicted-completeness": "D",
"predicted-quality": "A",
"features": {
"claim-completeness": 0.8063980787390058,
"label-desc-completeness": 0.75,
"num-claims": 79,
"ref-completeness": 0.4174470862702438
}
}
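A consumer of this output might parse it along the following lines. This is a minimal sketch that works on the sample response above rather than calling the API, since the request parameters are not documented here. It assumes that every feature ending in `-completeness` is a ratio where lower means less complete; the `summarize` helper is hypothetical.

```python
import json

# The sample response shown above, embedded as a string.
SAMPLE_RESPONSE = """
{
  "item": "https://www.wikidata.org/wiki/Q20909",
  "predicted-completeness": "D",
  "predicted-quality": "A",
  "features": {
    "claim-completeness": 0.8063980787390058,
    "label-desc-completeness": 0.75,
    "num-claims": 79,
    "ref-completeness": 0.4174470862702438
  }
}
"""


def summarize(response_text):
    """Extract the QID, predicted classes, and the weakest completeness feature."""
    data = json.loads(response_text)
    features = data["features"]
    # Keep only the "*-completeness" features; "num-claims" is a raw count.
    ratios = {k: v for k, v in features.items() if k.endswith("-completeness")}
    weakest = min(ratios, key=ratios.get)
    return {
        "qid": data["item"].rsplit("/", 1)[-1],
        "completeness": data["predicted-completeness"],
        "quality": data["predicted-quality"],
        "weakest_feature": weakest,
    }


summary = summarize(SAMPLE_RESPONSE)
print(summary)
```

For the sample item, the lowest of the three completeness features is `ref-completeness`, which suggests references are where this item has the most room for improvement.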
Data
Licenses
- Code: MIT License
- Model: CC0 License
Citation
Cite this model as:
@misc{name_year_modeltype,
title={Wikidata Item Completeness},
author={Johnson, Isaac},
year={2024},
url={https://meta.wikimedia.org/wiki/Machine_learning_models/Proposed/Wikidata_item_completeness}
}
References
- [1] Amaral, Gabriel; Piscopo, Alessandro; Kaffee, Lucie-Aimée; Rodrigues, Odinaldo; Simperl, Elena (2021-10-15). "Assessing the Quality of Sources in Wikidata Across Languages: A Hybrid Approach". J. Data and Information Quality 13 (4): 23:1–23:35. ISSN 1936-1955. doi:10.1145/3484828.