Machine learning models/Proposed/Wikidata item completeness

Model card
This page is an on-wiki machine learning model card.
A model card is a document about a machine learning model that seeks to answer basic questions about the model.
Model Information Hub
Model creator(s): Isaac Johnson
Model owner(s): Isaac Johnson
Model interface: https://wikidata-quality.wmcloud.org/api/item-scores
Past performance: task T321224
Code: https://gitlab.wikimedia.org/isaacj/miscellaneous-wikimedia/-/tree/master/annotation-gap
Uses PII: No
In production? No
Which projects? Wikidata
This model uses existing Wikidata items to predict missing claims, references, and labels for a given Wikidata item.


This model aims to determine the completeness of a given Wikidata item by predicting which remaining properties, labels, and references should be added. It is similar to but distinct from the approach taken by prior Wikidata item quality models, which correlate much more strongly with how many claims are present in an item regardless of that item's type (instance-of). It is inspired by approaches like Recoin, which recommend missing claims for items, and adds support for labels (based on sitelinks) and references (based on Amaral et al.[1]).

Motivation


This model can serve a few important purposes:

  • Estimating item completeness to help understand gaps in the Wikimedia projects. This is not just about Wikidata, given that many Wikipedia articles draw important information from their corresponding Wikidata items.
  • Helping editors on Wikidata identify tasks: an estimate of completeness combined with some metric of priority (e.g., total number of sitelinks) would enable better recommender systems for Wikidata.

Users and uses

Use this model for
  • Suggesting properties/labels/references for editors to add (task recommender)
  • Estimating Wikidata completeness (analysis)
Don't use this model for
  • Automatically adding properties to Wikidata items
Current uses

Ethical considerations, caveats, and recommendations


This model is based on the properties and references that currently exist on Wikidata. As such, to the degree that Wikidata itself is incomplete, the model's estimates of completeness will be underestimates as well. The model should be treated as a starting point and will evolve as Wikidata evolves.

Model


Performance


Implementation

Model architecture
The model predicts which properties, references, and labels are expected for a given item based on its instance-of value (and, for labels, its sitelinks). It does this by determining the expected coverage of each, based on other items with the same instance-of value. The corresponding class labels, from E (the worst) through A (the best), are then predicted using ordinal logistic regression (training code) on a dataset of Wikidata items whose quality was previously evaluated (details).
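A minimal sketch of this two-step idea, assuming hypothetical helper names, toy data, and a 0.5 coverage threshold (the actual pipeline is in the linked repository): derive a claim-completeness feature from peer coverage for an item's instance-of class, then fit an ordinal logistic regression from such features to the E-A grades.
# Illustrative sketch only, not the production code. Helper names, the toy
# data, and the 0.5 coverage threshold are assumptions for illustration.
from collections import Counter

import numpy as np
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

def peer_property_coverage(peer_items):
    """Fraction of peer items (same instance-of / P31 value) that carry each property."""
    counts = Counter(p for props in peer_items for p in set(props))
    return {p: n / len(peer_items) for p, n in counts.items()}

def claim_completeness(item_properties, coverage, min_coverage=0.5):
    """Fraction of 'expected' properties (present on at least `min_coverage`
    of the peers) that this item actually has."""
    expected = [p for p, cov in coverage.items() if cov >= min_coverage]
    if not expected:
        return 1.0
    return sum(p in item_properties for p in expected) / len(expected)

# Toy peers sharing the item's instance-of value:
peers = [{"P569", "P21", "P106"}, {"P569", "P21", "P27"}, {"P569", "P21", "P106", "P19"}]
coverage = peer_property_coverage(peers)
print(claim_completeness({"P569", "P106"}, coverage))  # ~0.67: P21 is expected but missing

# Such features are then mapped to grades E (worst) through A (best) with
# ordinal logistic regression; a hypothetical fit on made-up labelled data:
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "claim_completeness": rng.random(200),
    "ref_completeness": rng.random(200),
    "label_desc_completeness": rng.random(200),
})
y = pd.Series(pd.Categorical(rng.choice(list("EDCBA"), 200),
                             categories=list("EDCBA"), ordered=True))
res = OrderedModel(y, X, distr="logit").fit(method="bfgs", disp=False)
print(res.params)
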
Output schema
{
  "item": "https://www.wikidata.org/wiki/Q20909",
  "predicted-completeness": "D",
  "predicted-quality": "A",
  "features": {
    "claim-completeness": 0.8063980787390058,
    "label-desc-completeness": 0.75,
    "num-claims": 79,
    "ref-completeness": 0.4174470862702438
  }
}
Example input and output
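A hedged sketch of querying the model interface for Q20909 and reading the fields from the output schema above; the qid query parameter name is an assumption, so consult the API itself for the exact request format.
# Hedged sketch: the endpoint is the model interface listed above; the
# `qid` parameter name is an assumption, and the fields read from the
# response follow the documented output schema.
import requests

API = "https://wikidata-quality.wmcloud.org/api/item-scores"
resp = requests.get(API, params={"qid": "Q20909"}, timeout=30)
resp.raise_for_status()
data = resp.json()

print(data["item"], data["predicted-completeness"], data["predicted-quality"])
for name, value in data["features"].items():
    print(f"  {name}: {value}")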

Data

Data pipeline
Training data
Test data


Licenses


Citation


Cite this model as:

@misc{name_year_modeltype,
   title={Wikidata Item Completeness},
   author={Johnson, Isaac},
   year={2024},
   url={https://meta.wikimedia.org/wiki/Machine_learning_models/Proposed/Wikidata_item_completeness}
}

References

  1. Amaral, Gabriel; Piscopo, Alessandro; Kaffee, Lucie-Aimée; Rodrigues, Odinaldo; Simperl, Elena (2021-10-15). "Assessing the Quality of Sources in Wikidata Across Languages: A Hybrid Approach". J. Data and Information Quality 13 (4): 23:1–23:35. ISSN 1936-1955. doi:10.1145/3484828.