Machine learning models/Production/English Wikipedia draft topic

From Meta, a Wikimedia project coordination wiki
Model card
This page is an on-wiki machine learning model card.
A model card is a document about a machine learning model that seeks to answer basic questions about the model.
Model Information Hub
Model creator(s): Aaron Halfaker (User:EpochFail) and Amir Sarabadani
Model owner(s): WMF Machine Learning Team (ml@wikimediafoundation.org)
Model interface: ORES homepage
Code: ORES GitHub, ORES training data, and ORES model binaries
Uses PII: No
In production? Yes
Which projects? English Wikipedia
This model uses article text to predict the likelihood that the article belongs to a set of topics.


Motivation

How can we predict which general topic a draft belongs to? Answering this question is useful for allocating editor resources and for various analyses of Wikipedia dynamics. However, manually grouping a very diverse range of Wikipedia articles into coherent, consistent topics is difficult.

This model, part of the ORES suite of models, analyzes a draft to predict its likelihood of belonging to each of a set of topics. Similar models (though not necessarily with the same performance level or topics) are deployed to predict the topic of articles (not drafts) across about a dozen other projects. There is also a language-agnostic article topic model, which is meant to be used only for pages in namespace 0. This model is explicitly designed to work in the draft namespace (namespace 118); please use it only there.

This model may be useful for high-level analyses of article creation dynamics, as well as for helping to coordinate the allocation of senior editor time (for example, if an editor in a WikiProject wants to keep track of all drafts in a given category to provide feedback).

Users and uses

Use this model for
  • high-level analyses of Wikipedia dynamics such as pageviews, article quality, or edit trends
  • filtering to relevant drafts, e.g. showing only drafts in the music category
  • helping to coordinate the allocation of senior editor time within a given topic
Don't use this model for
  • definitively establishing what topic a draft pertains to
  • automated editing of articles or topics without a human in the loop
Current uses

This model is part of ORES and is generally accessible via the ORES API. It is used for high-level analysis of Wikipedia, platform research, and other on-wiki tasks.

Example API call: https://ores.wikimedia.org/v3/scores/enwiki/1152472770/drafttopic
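
A minimal sketch of building and issuing this request from Python, using only the standard library. The function names (`drafttopic_url`, `extract_score`, `fetch_drafttopic_score`) and the User-Agent string are hypothetical; the URL layout is copied from the example call above, and the response nesting follows the example output further down this page. Error handling and rate limiting are left out.

```python
import json
import urllib.request

ORES_BASE = "https://ores.wikimedia.org/v3/scores"  # taken from the example call


def drafttopic_url(rev_id, wiki="enwiki"):
    """Build the scoring URL for one revision ID."""
    return f"{ORES_BASE}/{wiki}/{int(rev_id)}/drafttopic"


def extract_score(payload, rev_id, wiki="enwiki"):
    """Pull the score object out of a decoded response."""
    return payload[wiki]["scores"][str(rev_id)]["drafttopic"]["score"]


def fetch_drafttopic_score(rev_id, wiki="enwiki", timeout=10):
    """Request a score over HTTP (requires network access)."""
    request = urllib.request.Request(
        drafttopic_url(rev_id, wiki),
        headers={"User-Agent": "drafttopic-example/0.1"},  # hypothetical value
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    return extract_score(payload, rev_id, wiki)
```

Calling `fetch_drafttopic_score(1152472770)` would return a dictionary with `prediction` and `probability` keys, assuming the service responds in the shape shown in the example output.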

Ethical considerations, caveats, and recommendations

  • This model was trained on data that is now several years old (from mid-2020). Underlying data drift may skew model outputs.
  • This model uses word2vec embeddings as training features. Word2vec, like other natural language embeddings, encodes the linguistic biases of its underlying datasets along lines such as gender, race, ethnicity, and religion. Since Wikipedia has known biases in its text, this model may encode and at times reproduce those biases.
  • This model has highly variable performance across different topics; consult the test statistics below to get a sense of inter-topic performance.

Model

Performance

[Figure: Test data confusion matrix]

[Figure: Test data sample rates]

[Figure: Test data performance]


Implementation

Model architecture
{
    "type": "GradientBoosting",
    "params": {
        "presort": "deprecated",
        "min_weight_fraction_leaf": 0.0,
        "center": false,
        "validation_fraction": 0.1,
        "max_depth": 5,
        "max_features": "log2",
        "learning_rate": 0.1,
        "label_weights": {},
        "subsample": 1.0,
        "criterion": "friedman_mse",
        "warm_start": false,
        "random_state": null,
        "loss": "deviance",
        "init": null,
        "n_iter_no_change": null,
        "labels": [
            "Culture.Biography.Biography*",
            "Culture.Biography.Women",
            "Culture.Food and drink",
            "Culture.Internet culture",
            "Culture.Linguistics",
            "Culture.Literature",
            "Culture.Media.Books",
            "Culture.Media.Entertainment",
            "Culture.Media.Films",
            "Culture.Media.Media*",
            "Culture.Media.Music",
            "Culture.Media.Radio",
            "Culture.Media.Software",
            "Culture.Media.Television",
            "Culture.Media.Video games",
            "Culture.Performing arts",
            "Culture.Philosophy and religion",
            "Culture.Sports",
            "Culture.Visual arts.Architecture",
            "Culture.Visual arts.Comics and Anime",
            "Culture.Visual arts.Fashion",
            "Culture.Visual arts.Visual arts*",
            "Geography.Geographical",
            "Geography.Regions.Africa.Africa*",
            "Geography.Regions.Africa.Central Africa",
            "Geography.Regions.Africa.Eastern Africa",
            "Geography.Regions.Africa.Northern Africa",
            "Geography.Regions.Africa.Southern Africa",
            "Geography.Regions.Africa.Western Africa",
            "Geography.Regions.Americas.Central America",
            "Geography.Regions.Americas.North America",
            "Geography.Regions.Americas.South America",
            "Geography.Regions.Asia.Asia*",
            "Geography.Regions.Asia.Central Asia",
            "Geography.Regions.Asia.East Asia",
            "Geography.Regions.Asia.North Asia",
            "Geography.Regions.Asia.South Asia",
            "Geography.Regions.Asia.Southeast Asia",
            "Geography.Regions.Asia.West Asia",
            "Geography.Regions.Europe.Eastern Europe",
            "Geography.Regions.Europe.Europe*",
            "Geography.Regions.Europe.Northern Europe",
            "Geography.Regions.Europe.Southern Europe",
            "Geography.Regions.Europe.Western Europe",
            "Geography.Regions.Oceania",
            "History and Society.Business and economics",
            "History and Society.Education",
            "History and Society.History",
            "History and Society.Military and warfare",
            "History and Society.Politics and government",
            "History and Society.Society",
            "History and Society.Transportation",
            "STEM.Biology",
            "STEM.Chemistry",
            "STEM.Computing",
            "STEM.Earth and environment",
            "STEM.Engineering",
            "STEM.Libraries & Information",
            "STEM.Mathematics",
            "STEM.Medicine & Health",
            "STEM.Physics",
            "STEM.STEM*",
            "STEM.Space",
            "STEM.Technology"
        ],
        "ccp_alpha": 0.0,
        "tol": 0.0001,
        "scale": false,
        "verbose": 0,
        "min_impurity_decrease": 0.0,
        "min_samples_leaf": 1,
        "population_rates": null,
        "max_leaf_nodes": null,
        "n_estimators": 150,
        "min_impurity_split": null,
        "min_samples_split": 2,
        "multilabel": true
    }
}
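
The label names in the configuration above are dot-separated paths whose first component is one of four top-level groups (Culture, Geography, History and Society, STEM). As a minimal sketch, the snippet below groups labels by that first component. The helper name `group_by_top_level` is hypothetical, and the trailing `*` on some labels is kept as-is because this page does not define it.

```python
from collections import defaultdict


def group_by_top_level(labels):
    """Group dot-separated label paths by their first path component."""
    groups = defaultdict(list)
    for label in labels:
        top_level, _, rest = label.partition(".")
        groups[top_level].append(rest)
    return dict(groups)


# A small sample of the labels listed above; the full list has 64 entries.
sample_labels = [
    "Culture.Media.Music",
    "Culture.Sports",
    "Geography.Regions.Oceania",
    "History and Society.History",
    "STEM.Physics",
    "STEM.STEM*",
]

for top_level, members in group_by_top_level(sample_labels).items():
    print(f"{top_level}: {members}")
```

Grouping this way can help when reporting per-topic performance, since results can then be summarized by top-level group as well as by individual label.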
Output schema
{
    "title": "Scikit learn-based classifier score with probability",
    "properties": {
        "probability": {
            "properties": {
                "STEM.Medicine & Health": {
                    "type": "number"
                },
                "Geography.Regions.Americas.South America": {
                    "type": "number"
                },
                "Geography.Geographical": {
                    "type": "number"
                },
                "Culture.Food and drink": {
                    "type": "number"
                },
                "Geography.Regions.Asia.West Asia": {
                    "type": "number"
                },
                "Geography.Regions.Asia.Central Asia": {
                    "type": "number"
                },
                "Culture.Visual arts.Architecture": {
                    "type": "number"
                },
                "STEM.Chemistry": {
                    "type": "number"
                },
                "Geography.Regions.Asia.North Asia": {
                    "type": "number"
                },
                "History and Society.Business and economics": {
                    "type": "number"
                },
                "STEM.Space": {
                    "type": "number"
                },
                "Geography.Regions.Africa.Northern Africa": {
                    "type": "number"
                },
                "Geography.Regions.Africa.Central Africa": {
                    "type": "number"
                },
                "Geography.Regions.Europe.Eastern Europe": {
                    "type": "number"
                },
                "Culture.Internet culture": {
                    "type": "number"
                },
                "History and Society.Military and warfare": {
                    "type": "number"
                },
                "History and Society.Society": {
                    "type": "number"
                },
                "Culture.Performing arts": {
                    "type": "number"
                },
                "STEM.Libraries & Information": {
                    "type": "number"
                },
                "Geography.Regions.Oceania": {
                    "type": "number"
                },
                "Geography.Regions.Asia.Asia*": {
                    "type": "number"
                },
                "Culture.Visual arts.Visual arts*": {
                    "type": "number"
                },
                "STEM.Technology": {
                    "type": "number"
                },
                "Geography.Regions.Africa.Western Africa": {
                    "type": "number"
                },
                "Culture.Media.Video games": {
                    "type": "number"
                },
                "Geography.Regions.Africa.Southern Africa": {
                    "type": "number"
                },
                "STEM.Biology": {
                    "type": "number"
                },
                "STEM.Earth and environment": {
                    "type": "number"
                },
                "Culture.Biography.Women": {
                    "type": "number"
                },
                "Geography.Regions.Asia.Southeast Asia": {
                    "type": "number"
                },
                "STEM.Computing": {
                    "type": "number"
                },
                "Geography.Regions.Asia.East Asia": {
                    "type": "number"
                },
                "Culture.Philosophy and religion": {
                    "type": "number"
                },
                "History and Society.Transportation": {
                    "type": "number"
                },
                "Geography.Regions.Europe.Western Europe": {
                    "type": "number"
                },
                "History and Society.Education": {
                    "type": "number"
                },
                "Geography.Regions.Americas.Central America": {
                    "type": "number"
                },
                "Culture.Media.Entertainment": {
                    "type": "number"
                },
                "STEM.Engineering": {
                    "type": "number"
                },
                "Culture.Media.Software": {
                    "type": "number"
                },
                "Culture.Visual arts.Fashion": {
                    "type": "number"
                },
                "Culture.Visual arts.Comics and Anime": {
                    "type": "number"
                },
                "Geography.Regions.Europe.Europe*": {
                    "type": "number"
                },
                "Culture.Literature": {
                    "type": "number"
                },
                "Geography.Regions.Europe.Southern Europe": {
                    "type": "number"
                },
                "Culture.Media.Television": {
                    "type": "number"
                },
                "STEM.Physics": {
                    "type": "number"
                },
                "Geography.Regions.Asia.South Asia": {
                    "type": "number"
                },
                "Geography.Regions.Africa.Eastern Africa": {
                    "type": "number"
                },
                "Culture.Biography.Biography*": {
                    "type": "number"
                },
                "History and Society.Politics and government": {
                    "type": "number"
                },
                "Culture.Media.Media*": {
                    "type": "number"
                },
                "Culture.Sports": {
                    "type": "number"
                },
                "Culture.Media.Books": {
                    "type": "number"
                },
                "History and Society.History": {
                    "type": "number"
                },
                "Culture.Media.Music": {
                    "type": "number"
                },
                "Culture.Linguistics": {
                    "type": "number"
                },
                "Culture.Media.Films": {
                    "type": "number"
                },
                "STEM.Mathematics": {
                    "type": "number"
                },
                "Geography.Regions.Europe.Northern Europe": {
                    "type": "number"
                },
                "Geography.Regions.Africa.Africa*": {
                    "type": "number"
                },
                "STEM.STEM*": {
                    "type": "number"
                },
                "Geography.Regions.Americas.North America": {
                    "type": "number"
                },
                "Culture.Media.Radio": {
                    "type": "number"
                }
            },
            "type": "object",
            "description": "A mapping of probabilities onto each of the potential output labels"
        },
        "prediction": {
            "items": {
                "type": "string"
            },
            "type": "array",
            "description": "The most likely labels predicted by the estimator"
        }
    },
    "type": "object"
}
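
The schema above can be checked in client code before a score is used. The following is a minimal sketch, written without a JSON Schema library; the function name `validate_score` is hypothetical. The optional `expected_labels` check is an added assumption that is stricter than the schema, which does not mark any property as required.

```python
from numbers import Real


def validate_score(score, expected_labels=None):
    """Return a list of problems found; an empty list means the score looks valid."""
    problems = []
    probability = score.get("probability")
    prediction = score.get("prediction")

    if not isinstance(probability, dict):
        problems.append("'probability' must be an object")
    else:
        for label, value in probability.items():
            # bool is a subclass of int in Python, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, Real):
                problems.append(f"probability[{label!r}] is not a number")
        if expected_labels is not None:
            missing = sorted(set(expected_labels) - set(probability))
            if missing:
                problems.append(f"missing labels: {missing}")

    if not isinstance(prediction, list) or not all(
        isinstance(item, str) for item in prediction
    ):
        problems.append("'prediction' must be an array of strings")

    return problems
```

Returning a list of problems rather than raising on the first one makes it easier to log everything that is wrong with a malformed response in a single pass.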
Example input and output
Example input
GET https://ores.wikimedia.org/v3/scores/enwiki/1152472770/drafttopic
Example output
{
  "enwiki": {
    "models": {
      "drafttopic": {
        "version": "1.3.0"
      }
    },
    "scores": {
      "1152472770": {
        "drafttopic": {
          "score": {
            "prediction": [
              "STEM.STEM*"
            ],
            "probability": {
              "Culture.Biography.Biography*": 0.06452236665740299,
              "Culture.Biography.Women": 0.006796060518028034,
              "Culture.Food and drink": 0.0034858262488036413,
              "Culture.Internet culture": 0.01316980405300498,
              "Culture.Linguistics": 0.035812004854272474,
              "Culture.Literature": 0.10115882192713645,
              "Culture.Media.Books": 0.03692282820672749,
              "Culture.Media.Entertainment": 0.010711530036756643,
              "Culture.Media.Films": 0.0018885814226450477,
              "Culture.Media.Media*": 0.060425961704107765,
              "Culture.Media.Music": 0.0005378419722095707,
              "Culture.Media.Radio": 0.00018413020403263772,
              "Culture.Media.Software": 0.01752211431542162,
              "Culture.Media.Television": 0.00014328795512380692,
              "Culture.Media.Video games": 0.0009752589178008641,
              "Culture.Performing arts": 0.0013685116679310173,
              "Culture.Philosophy and religion": 0.00846981320538092,
              "Culture.Sports": 0.003389583061610164,
              "Culture.Visual arts.Architecture": 0.0010183172994406827,
              "Culture.Visual arts.Comics and Anime": 0.0006059058820116823,
              "Culture.Visual arts.Fashion": 0.0006165224577984327,
              "Culture.Visual arts.Visual arts*": 0.0026821398497912726,
              "Geography.Geographical": 0.002990758243166088,
              "Geography.Regions.Africa.Africa*": 0.024533447773077168,
              "Geography.Regions.Africa.Central Africa": 0.0005372129002643086,
              "Geography.Regions.Africa.Eastern Africa": 0.00010226239258595537,
              "Geography.Regions.Africa.Northern Africa": 0.0019899126298877174,
              "Geography.Regions.Africa.Southern Africa": 0.0005447025091373446,
              "Geography.Regions.Africa.Western Africa": 0.0004794410419489356,
              "Geography.Regions.Americas.Central America": 0.00012002646395075807,
              "Geography.Regions.Americas.North America": 0.014892613371859462,
              "Geography.Regions.Americas.South America": 8.654232339890368e-05,
              "Geography.Regions.Asia.Asia*": 0.06598737942971031,
              "Geography.Regions.Asia.Central Asia": 0.0010281638982847462,
              "Geography.Regions.Asia.East Asia": 0.057227777418167265,
              "Geography.Regions.Asia.North Asia": 0.003630396633774084,
              "Geography.Regions.Asia.South Asia": 0.003770692176326443,
              "Geography.Regions.Asia.Southeast Asia": 0.0014236857900671192,
              "Geography.Regions.Asia.West Asia": 0.0025550618998285823,
              "Geography.Regions.Europe.Eastern Europe": 0.011534806088388073,
              "Geography.Regions.Europe.Europe*": 0.03968287435557793,
              "Geography.Regions.Europe.Northern Europe": 0.007221978589819678,
              "Geography.Regions.Europe.Southern Europe": 0.003195096135472935,
              "Geography.Regions.Europe.Western Europe": 0.033237338362704344,
              "Geography.Regions.Oceania": 0.013285832354418657,
              "History and Society.Business and economics": 0.0049991531676727725,
              "History and Society.Education": 0.0016651028696874242,
              "History and Society.History": 0.018926782363533286,
              "History and Society.Military and warfare": 0.13824543680295232,
              "History and Society.Politics and government": 0.07383956357812226,
              "History and Society.Society": 0.03989083348214337,
              "History and Society.Transportation": 0.002495466492861629,
              "STEM.Biology": 0.008636873191791305,
              "STEM.Chemistry": 0.00770361972201387,
              "STEM.Computing": 0.15437336403388235,
              "STEM.Earth and environment": 0.0015758212652492313,
              "STEM.Engineering": 0.0030028383273395993,
              "STEM.Libraries & Information": 0.24354593114533452,
              "STEM.Mathematics": 0.028947136516221175,
              "STEM.Medicine & Health": 0.004470949173618504,
              "STEM.Physics": 0.007280836697859179,
              "STEM.STEM*": 0.9164947800360729,
              "STEM.Space": 0.0008405497649641965,
              "STEM.Technology": 0.07558409273303593
            }
          }
        }
      }
    }
  }
}
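
Because the model is multilabel, each label receives its own probability and the values need not sum to 1 (in the example above, the two largest values alone exceed 1). A minimal sketch of ranking and filtering these probabilities follows. The helper name `top_topics` is hypothetical, and the 0.5 cutoff is an illustrative assumption, not a documented decision threshold for this model.

```python
def top_topics(probability, k=3, threshold=0.0):
    """Return up to k (label, probability) pairs at or above threshold, highest first."""
    ranked = sorted(probability.items(), key=lambda item: item[1], reverse=True)
    return [(label, p) for label, p in ranked if p >= threshold][:k]


# A few values copied from the example output above.
probability = {
    "STEM.STEM*": 0.9164947800360729,
    "STEM.Libraries & Information": 0.24354593114533452,
    "STEM.Computing": 0.15437336403388235,
    "History and Society.Military and warfare": 0.13824543680295232,
    "Culture.Media.Music": 0.0005378419722095707,
}

print(top_topics(probability, k=3))            # three highest-scoring labels
print(top_topics(probability, threshold=0.5))  # only labels scoring at least 0.5
```

For filtering tasks such as finding drafts in one topic, a per-topic cutoff chosen from the test statistics is likely to work better than a single global value, given the variable performance across topics noted earlier.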

Data

Data pipeline
The training data was fetched from a set of revision IDs. Various pieces of information about each revision were then extracted using automated processes, and the revision text was fed into word2vec to get an article embedding. Finally, labels were derived from the mid-level WikiProject categories that the article is associated with.
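
This page does not say how per-word vectors are combined into a single article embedding. One common approach, assumed here purely for illustration, is to average the vectors of the words that appear in the text. The sketch below uses invented 3-dimensional toy vectors and a hypothetical helper `mean_embedding`; it is not the pipeline's actual code.

```python
def mean_embedding(tokens, word_vectors, dimensions):
    """Average the vectors of known tokens; return a zero vector if none are known."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return [0.0] * dimensions
    return [sum(component) / len(known) for component in zip(*known)]


# Toy 3-dimensional vectors, invented for illustration only.
toy_vectors = {
    "library": [1.0, 0.0, 2.0],
    "catalog": [3.0, 2.0, 0.0],
}

tokens = "the library catalog".split()
print(mean_embedding(tokens, toy_vectors, dimensions=3))  # [2.0, 1.0, 1.0]
```

Words missing from the vocabulary ("the" in this toy case) are skipped, so the result depends only on words the embedding knows about.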
Training data
Training data was automatically separated from test data during training using the drafttopic git repository (which trains both drafttopic and articletopic models).
Test data
Test data was automatically separated from training data during training using the drafttopic git repository (which trains both drafttopic and articletopic models).

Licenses

Citation

Cite this model as:

@misc{2020_enwiki_articletopic,
   title={English Wikipedia draft topic},
   author={Hal Triedman},
   year={2023},
   url={https://meta.wikimedia.org/Machine_learning_models/Production/English_Wikipedia_draft_topic}
}