Jump to content

Template:Model card wikidata item topic

From Meta, a Wikimedia project coordination wiki
Model card
This page is an on-wiki machine learning model card.
A diagram of a neural network
A model card is a document about a machine learning model that seeks to answer basic questions about the model.
Model Information Hub
Model creator(s)Aaron Halfaker (User:EpochFail) and Amir Sarabadani
Model owner(s)WMF Machine Learning Team (ml@wikimediafoundation.org)
Model interfaceOres homepage
Codedrafttopic Github, ORES training data, and ORES model binaries
Uses PIINo
In production?Yes
Which projects?Wikidata
This model uses item features to predict the likelihood that the item belongs to a set of topics.


Motivation

[edit]

How can we predict what general topic an item is in? Answering this question is useful for various analyses of Wikidata dynamics. However, it is difficult to group a very diverse range of Wikidata items into coherent, consistent topics manually.

This model, part of the ORES suite of models, analyzes an item to predict its likelihood of belonging to a set of topics. Similar models (though not necessarily with the same performance level or topics, are deployed across about a dozen other projects. There is also a language agnostic article topic model.

This model may be useful for high-level analyses of Wikidata dynamics (pageviews, item quality, edit trends) and filtering items.

Users and uses

[edit]
Use this model for
  • high-level analyses of Wikidata dynamics such as pageview, item quality, or edit trends — e.g. How are pageview dynamics different between the physics and biology categories?
  • filtering to relevant items — e.g. filter items only to those in the music category.
Don't use this model for
  • definitively establishing what topic an items pertains to
  • automated editing of items or topics without a human in the loop
Current uses

This model is a part of ORES, and generally accessible via API. It is used for high-level analysis of Wikidata, platform research, and other on-wiki tasks.

Example API call:
{{{model_input}}}

Ethical considerations, caveats, and recommendations

[edit]
  • This model was trained on data that is now several years old (from mid-2020). Underlying data drift may skew model outputs.
  • This model uses word2vec as a training feature. Word2vec, like other natural language embeddings, encodes the linguistic biases of underlying datasets — along the lines of gender, race, ethnicity, religion etc. Since Wikidata has known biases in its text, this model may encode and at times reproduce those biases.
  • This model has highly variable performance across different topics — consult the test statistics below to get a sense of inter-topic performance.

Model

[edit]

Performance

[edit]

Test data confusion matrix: {{{confusion_matrix}}}

Test data sample rates: {{{sample_rates}}}

Test data performance: {{{performance}}}

Implementation

[edit]
Model architecture
{{{model_architecture}}}
Output schema
{{{model_output_schema}}}
Example input and output
Input:
{{{model_input}}}

Output:

{{{model_output}}}

Data

[edit]
Data pipeline
The data to train was fetched from a set of revision IDs. Then various pieces of information about the revision were extracted using automated processes, and the revision text was fed into word2vec to get an item embedding. Finally, labels are derived from the mid-level WikiProject categories that the item is associated with.
Training data
Training data was automatically and randomly separated from test data during training using the drafttopic git repository (which trains both drafttopic and articletopic models).
Test data
Test data was automatically and randomly split off from train data using the drafttopic git repository (which trains both drafttopic and articletopic models). The model then makes a prediction on that data, which is compared to the underlying ground truth to calculate performance statistics.

Licenses

[edit]

Citation

[edit]

Cite this model card as:

@misc{
  Triedman_Bazira_2023_Wikidata_item_topic,
  title={ Wikidata item topic model card },
  author={ Triedman, Harold and Bazira, Kevin },
  year={ 2023 },
  url={ https://meta.wikimedia.org/wiki/Model_card_wikidata_item_topic }
}