Machine learning models/Proposed/Wikidata item completeness

Model card
This page is an on-wiki machine learning model card.
A model card is a document about a machine learning model that seeks to answer basic questions about the model.
Model Information Hub
Model creator(s): Isaac Johnson
Model owner(s): Isaac Johnson
Model interface: https://wikidata-quality.wmcloud.org/api/item-scores
Past performance: task T321224
Code: https://gitlab.wikimedia.org/isaacj/miscellaneous-wikimedia/-/tree/master/annotation-gap
Uses PII: No
In production? No
Which projects? Wikidata
This model uses existing Wikidata items to predict missing claims, references, and labels for a given Wikidata item.


This model aims to determine the completeness of a given Wikidata item by predicting which remaining properties, labels, and references should be added. It is similar to but distinct from the approach taken by prior Wikidata item quality models, which correlate much more strongly with how many claims are present in an item regardless of that item's type (instance-of). It is inspired by approaches like Recoin, which recommend missing claims for items, and adds support for labels (based on sitelinks) and references (based on Amaral et al.[1]).

Motivation


This model can serve a few important purposes:

  • Estimating item completeness to help understand gaps in the Wikimedia projects. This is not just about Wikidata, given that many Wikipedia articles draw important information from their corresponding Wikidata items.
  • Helping editors on Wikidata identify tasks: an estimate of completeness combined with some metric of priority (e.g., total number of sitelinks) would enable better recommender systems for Wikidata.

Users and uses

Use this model for
  • Suggesting properties/labels/references for editors to add (task recommender)
  • Estimating Wikidata completeness (analysis)
Don't use this model for
  • Automatically adding properties to Wikidata items
Current uses

Ethical considerations, caveats, and recommendations


This model is based on the properties and references that currently exist on Wikidata. As such, to the degree that Wikidata itself is incomplete, the model's estimates of completeness will be underestimates as well. The model should be treated as a starting point and will evolve as Wikidata evolves.

Model


Performance


Implementation

Model architecture
The model predicts which properties, references, and labels are expected for a given item based on its instance-of value (and, for labels, its sitelinks). It does this by determining the expected coverage of each, based on other items with the same instance-of value. The corresponding class labels, from E (the worst) through A (the best), are then predicted using ordinal logistic regression (training code) on a dataset of Wikidata items whose quality was previously evaluated (details).
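A minimal sketch of this two-step idea, assuming hypothetical helper names, toy data, and a 0.5 coverage threshold (the actual pipeline is in the linked repository): derive a claim-completeness feature from peer coverage for an item's instance-of class, then fit an ordinal logistic regression from such features to the E-A grades.
# Illustrative sketch only, not the production code. Helper names, the toy
# data, and the 0.5 coverage threshold are assumptions for illustration.
from collections import Counter

import numpy as np
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

def peer_property_coverage(peer_items):
    """Fraction of peer items (same instance-of / P31 value) that carry each property."""
    counts = Counter(p for props in peer_items for p in set(props))
    return {p: n / len(peer_items) for p, n in counts.items()}

def claim_completeness(item_properties, coverage, min_coverage=0.5):
    """Fraction of 'expected' properties (present on at least `min_coverage`
    of the peers) that this item actually has."""
    expected = [p for p, cov in coverage.items() if cov >= min_coverage]
    if not expected:
        return 1.0
    return sum(p in item_properties for p in expected) / len(expected)

# Toy peers sharing the item's instance-of value:
peers = [{"P569", "P21", "P106"}, {"P569", "P21", "P27"}, {"P569", "P21", "P106", "P19"}]
coverage = peer_property_coverage(peers)
print(claim_completeness({"P569", "P106"}, coverage))  # ~0.67: P21 is expected but missing

# Such features are then mapped to grades E (worst) through A (best) with
# ordinal logistic regression; a hypothetical fit on made-up labelled data:
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "claim_completeness": rng.random(200),
    "ref_completeness": rng.random(200),
    "label_desc_completeness": rng.random(200),
})
y = pd.Series(pd.Categorical(rng.choice(list("EDCBA"), 200),
                             categories=list("EDCBA"), ordered=True))
res = OrderedModel(y, X, distr="logit").fit(method="bfgs", disp=False)
print(res.params)
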
Output schema
{
  "item": "https://www.wikidata.org/wiki/Q20909",
  "predicted-completeness": "D",
  "predicted-quality": "A",
  "features": {
    "claim-completeness": 0.8063980787390058,
    "label-desc-completeness": 0.75,
    "num-claims": 79,
    "ref-completeness": 0.4174470862702438
  }
}
Example input and output
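A hedged sketch of querying the model interface for Q20909 and reading the fields from the output schema above; the qid query parameter name is an assumption, so consult the API itself for the exact request format.
# Hedged sketch: the endpoint is the model interface listed above; the
# `qid` parameter name is an assumption, and the fields read from the
# response follow the documented output schema.
import requests

API = "https://wikidata-quality.wmcloud.org/api/item-scores"
resp = requests.get(API, params={"qid": "Q20909"}, timeout=30)
resp.raise_for_status()
data = resp.json()

print(data["item"], data["predicted-completeness"], data["predicted-quality"])
for name, value in data["features"].items():
    print(f"  {name}: {value}")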

Data

Data pipeline
Training data
Test data


Licenses


Citation


Cite this model as:

@misc{name_year_modeltype,
   title={Wikidata Item Completeness},
   author={Johnson, Isaac},
   year={2024},
   url={https://meta.wikimedia.org/wiki/Machine_learning_models/Proposed/Wikidata_item_completeness}
}

References

  1. Amaral, Gabriel; Piscopo, Alessandro; Kaffee, Lucie-Aimée; Rodrigues, Odinaldo; Simperl, Elena (2021-10-15). "Assessing the Quality of Sources in Wikidata Across Languages: A Hybrid Approach". J. Data and Information Quality 13 (4): 23:1–23:35. ISSN 1936-1955. doi:10.1145/3484828.