Template:Model card ORES article topic
Model card: This page is an on-wiki machine learning model card.

| Model Information Hub | |
|---|---|
| Model creator(s) | Aaron Halfaker (User:EpochFail) and Amir Sarabadani |
| Model owner(s) | WMF Machine Learning Team (ml@wikimediafoundation.org) |
| Model interface | ORES homepage |
| Code | drafttopic GitHub, ORES training data, and ORES model binaries |
| Uses PII | No |
| In production? | Yes |
| Which projects? | {{{language}}} {{{project}}} |

This model uses article text to predict the likelihood that the article belongs to a set of topics.
Motivation
How can we predict what general topic an article is in? Answering this question is useful for various analyses of {{{project}}} dynamics. However, it is difficult to manually group the very diverse range of {{{project}}} articles into coherent, consistent topics.
This model, part of the ORES suite of models, analyzes an article to predict its likelihood of belonging to a set of topics. Similar models (though not necessarily with the same performance level or topics) are deployed across about a dozen other projects. There is also a language-agnostic article topic model.
This model may be useful for high-level analyses of {{{project}}} dynamics (pageviews, article quality, edit trends) and filtering articles.
Users and uses
Use this model for:

- high-level analyses of {{{project}}} dynamics such as pageview, article quality, or edit trends — e.g. How are pageview dynamics different between the physics and biology categories?
- filtering to relevant articles — e.g. filtering articles to only those in the music category (see the sketch after this list)

Don't use this model for:

- definitively establishing what topic an article pertains to
- automated editing of articles or topics without a human in the loop
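To make the filtering use case concrete, here is a minimal sketch of thresholding per-topic probabilities. The scores, topic label, and threshold below are hypothetical stand-ins rather than values produced by this template; real scores would come from an API call like the one shown under "Users and uses".

```python
# A minimal filtering sketch with made-up scores. Keys are revision IDs and
# values are per-topic probabilities of the kind this model returns.
scores = {
    "1001": {"Culture.Media.Music": 0.91, "STEM.Physics": 0.04},
    "1002": {"Culture.Media.Music": 0.12, "STEM.Physics": 0.88},
    "1003": {"Culture.Media.Music": 0.67, "STEM.Physics": 0.02},
}

TOPIC = "Culture.Media.Music"  # one label from the articletopic taxonomy
THRESHOLD = 0.5                # hypothetical cutoff; tune for precision vs. recall

# Keep only the articles whose probability for the topic clears the cutoff.
music_articles = [rev for rev, probs in scores.items()
                  if probs.get(TOPIC, 0.0) >= THRESHOLD]
print(music_articles)  # ['1001', '1003']
```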
This model is part of ORES and is generally accessible via API. It is used for high-level analysis of {{{project}}}, platform research, and other on-wiki tasks.

Example API call:

{{{model_input}}}
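Because {{{model_input}}} is a template placeholder, the sketch below only illustrates the general shape of such a call. It assumes the public ORES v3 scoring endpoint and the articletopic model name; the wiki database name and revision ID are hypothetical, and the endpoint may differ for wikis that have migrated off ORES.

```python
import requests

WIKI = "enwiki"        # hypothetical wiki database name
REV_ID = "123456789"   # hypothetical revision ID
MODEL = "articletopic"

# ORES v3 scoring endpoint: /v3/scores/{context}/{revid}/{model}
url = f"https://ores.wikimedia.org/v3/scores/{WIKI}/{REV_ID}/{MODEL}"
resp = requests.get(url, timeout=30)
resp.raise_for_status()

# The response nests the score under wiki -> scores -> revision -> model.
score = resp.json()[WIKI]["scores"][REV_ID][MODEL]["score"]
print("Predicted topics:", score["prediction"])

# Per-topic probabilities; show the three most likely.
top = sorted(score["probability"].items(), key=lambda kv: kv[1], reverse=True)
for topic, prob in top[:3]:
    print(f"  {topic}: {prob:.3f}")
```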
Ethical considerations, caveats, and recommendations
- This model was trained on data that is now several years old (from mid-2020). Underlying data drift may skew model outputs.
- This model uses word2vec as a training feature. Word2vec, like other natural-language embeddings, encodes the linguistic biases of its underlying datasets — along the lines of gender, race, ethnicity, religion, etc. Since Wikipedia has known biases in its text, this model may encode and at times reproduce those biases (see the toy sketch after this list).
- This model has highly variable performance across different topics — consult the test statistics below to get a sense of inter-topic performance.
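To make the embedding caveat concrete, bias in word embeddings is commonly probed by comparing cosine similarities between occupation words and gendered words. The toy vectors below are made up for illustration; they are not the actual word2vec features used by this model.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Tiny made-up vectors standing in for real word2vec embeddings.
vectors = {
    "engineer": np.array([0.9, 0.1, 0.3]),
    "nurse":    np.array([0.2, 0.8, 0.4]),
    "he":       np.array([0.8, 0.2, 0.1]),
    "she":      np.array([0.1, 0.9, 0.2]),
}

# A biased embedding places occupation words closer to one gendered word
# than the other; a downstream model can inherit such associations.
for word in ("engineer", "nurse"):
    he = cosine(vectors[word], vectors["he"])
    she = cosine(vectors[word], vectors["she"])
    print(f"{word}: sim(he)={he:.3f} sim(she)={she:.3f}")
```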
Model

Performance

Test data confusion matrix: {{{confusion_matrix}}}
Test data sample rates: {{{sample_rates}}}
Test data performance: {{{performance}}}
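The placeholders above are filled in per wiki when this template is instantiated. As an illustration only (an assumption about the evaluation, whose actual code lives in the drafttopic repository), the sketch below shows how per-topic confusion matrices and precision/recall are typically computed for a multilabel classifier like this one, which is where the variable inter-topic performance shows up:

```python
import numpy as np
from sklearn.metrics import (multilabel_confusion_matrix,
                             precision_recall_fscore_support)

# Hypothetical labels and test data: rows are articles, columns are topics,
# 1 meaning the article has (or is predicted to have) that topic.
LABELS = ["Culture.Media.Music", "STEM.Physics", "Geography.Regions.Africa"]
y_true = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 1], [0, 1, 1]])
y_pred = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]])

# One 2x2 confusion matrix per topic label (tn, fp, fn, tp).
for label, cm in zip(LABELS, multilabel_confusion_matrix(y_true, y_pred)):
    print(label, cm.ravel())

# Per-topic precision, recall, and F1.
prec, rec, f1, _ = precision_recall_fscore_support(y_true, y_pred,
                                                   zero_division=0)
for label, p, r, f in zip(LABELS, prec, rec, f1):
    print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```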
Implementation

{{{model_input}}}

Output:

{{{model_output}}}

Data

Licenses

- Code: MIT license
- Model: MIT license
Citation

Cite this model card as:

@misc{Triedman_Bazira_2023_{{{language}}}_{{{project}}}_article_topic,
  title={ {{{language}}} {{{project}}} article topic model card },
  author={ Triedman, Harold and Bazira, Kevin },
  year={ 2023 },
  url={ https://meta.wikimedia.org/wiki/Model_card_ORES_article_topic }
}