
Machine learning models/Production/Vietnamese Wikipedia article topic

From Meta, a Wikimedia project coordination wiki


Model card
This page is an on-wiki machine learning model card.
A model card is a document about a machine learning model that seeks to answer basic questions about the model.
Model Information Hub
Model creator(s): Aaron Halfaker (User:EpochFail) and Amir Sarabadani
Model owner(s): WMF Machine Learning Team (ml@wikimediafoundation.org)
Model interface: ORES homepage
Code: drafttopic GitHub, ORES training data, and ORES model binaries
Uses PII: No
In production? Yes
Which projects? Vietnamese Wikipedia
This model uses article text to predict the likelihood that the article belongs to a set of topics.


Motivation


How can we predict what general topic an article is in? Answering this question is useful for various analyses of Wikipedia dynamics. However, it is difficult to group a very diverse range of Wikipedia articles into coherent, consistent topics manually.

This model, part of the ORES suite of models, analyzes an article to predict its likelihood of belonging to each of a set of topics. Similar models (though not necessarily with the same performance level or topics) are deployed across about a dozen other projects. There is also a language-agnostic article topic model.

This model may be useful for high-level analyses of Wikipedia dynamics (pageviews, article quality, edit trends) and filtering articles.

Users and uses

Use this model for
  • high-level analyses of Wikipedia dynamics such as pageviews, article quality, or edit trends, e.g. how do pageview dynamics differ between the physics and biology categories?
  • filtering to relevant articles, e.g. keeping only articles in the music category.
Don't use this model for
  • definitively establishing what topic an article pertains to
  • automated editing of articles or topics without a human in the loop
Current uses

This model is part of ORES and is generally accessible via the ORES API. It is used for high-level analysis of Wikipedia, platform research, and other on-wiki tasks.

Example API call:
https://ores.wikimedia.org/v3/scores/viwiki/1234/articletopic
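A minimal Python sketch of issuing the call above using only the standard library. The helper names `build_score_url` and `fetch_score` are hypothetical, the revision ID 1234 is the placeholder from the example call, and the response is assumed to be JSON shaped like the example output on this page.

```python
import json
import urllib.request

# Base of the example API call shown above.
ORES_BASE = "https://ores.wikimedia.org/v3/scores"


def build_score_url(wiki: str, rev_id: int, model: str = "articletopic") -> str:
    """Build a scoring URL with the same shape as the example API call."""
    return f"{ORES_BASE}/{wiki}/{rev_id}/{model}"


def fetch_score(wiki: str, rev_id: int, model: str = "articletopic",
                timeout: float = 10.0) -> dict:
    """Fetch and decode the JSON response for one revision.

    This performs a network request; error handling is left to the caller.
    """
    url = build_score_url(wiki, rev_id, model)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


# Building the URL does not touch the network.
print(build_score_url("viwiki", 1234))
```

Keeping URL construction separate from the request makes it possible to check the URL without contacting the service.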

Ethical considerations, caveats, and recommendations

  • This model was trained on data that is now several years old (from mid-2020). Underlying data drift may skew model outputs.
  • This model uses word2vec embeddings as a training feature. Word2vec, like other natural language embeddings, encodes the linguistic biases of its underlying datasets along lines of gender, race, ethnicity, religion, and so on. Since Wikipedia has known biases in its text, this model may encode and at times reproduce those biases.
  • This model has highly variable performance across different topics — consult the test statistics below to get a sense of inter-topic performance.

Model


Performance


Test data confusion matrix:

Label n True positive False negative False positive True negative
Culture.Biography.Biography* 14139 12651 1488 789 45836
Culture.Biography.Women 5153 4127 1026 774 54837
Culture.Food and drink 1399 1015 384 122 59243
Culture.Internet culture 3531 2917 614 233 57000
Culture.Linguistics 1509 1165 344 69 59186
Culture.Literature 5659 4444 1215 505 54600
Culture.Media.Books 1467 1089 378 111 59186
Culture.Media.Entertainment 1984 1041 943 194 58586
Culture.Media.Films 2612 2073 539 173 57979
Culture.Media.Media* 13270 11633 1637 1282 46212
Culture.Media.Music 2908 2509 399 163 57693
Culture.Media.Radio 285 178 107 32 60447
Culture.Media.Software 2291 1910 381 288 58185
Culture.Media.Television 1983 1438 545 133 58648
Culture.Media.Video games 2139 1969 170 54 58571
Culture.Performing arts 1374 960 414 105 59285
Culture.Philosophy and religion 3040 1731 1309 295 57429
Culture.Sports 3871 3411 460 104 56789
Culture.Visual arts.Architecture 1854 1334 520 167 58743
Culture.Visual arts.Comics and Anime 2239 2029 210 78 58447
Culture.Visual arts.Fashion 1501 1213 288 105 59158
Culture.Visual arts.Visual arts* 6119 4741 1378 400 54245
Geography.Geographical 3991 2584 1407 552 56221
Geography.Regions.Africa.Africa* 5744 4467 1277 485 54535
Geography.Regions.Africa.Central Africa 1214 811 403 123 59427
Geography.Regions.Africa.Eastern Africa 451 216 235 40 60273
Geography.Regions.Africa.Northern Africa 1474 1036 438 95 59195
Geography.Regions.Africa.Southern Africa 1178 798 380 78 59508
Geography.Regions.Africa.Western Africa 672 537 135 41 60051
Geography.Regions.Americas.Central America 1588 992 596 116 59060
Geography.Regions.Americas.North America 5438 3677 1761 651 54675
Geography.Regions.Americas.South America 2210 1706 504 183 58371
Geography.Regions.Asia.Asia* 13867 11650 2217 1216 45681
Geography.Regions.Asia.Central Asia 1211 914 297 84 59469
Geography.Regions.Asia.East Asia 5571 4489 1082 445 54748
Geography.Regions.Asia.North Asia 1693 1283 410 235 58836
Geography.Regions.Asia.South Asia 2038 1624 414 108 58618
Geography.Regions.Asia.Southeast Asia 2632 1945 687 190 57942
Geography.Regions.Asia.West Asia 2178 1753 425 105 58481
Geography.Regions.Europe.Eastern Europe 3590 2868 722 277 56897
Geography.Regions.Europe.Europe* 12682 10215 2467 1220 46862
Geography.Regions.Europe.Northern Europe 2896 1842 1054 227 57641
Geography.Regions.Europe.Southern Europe 2854 2126 728 203 57707
Geography.Regions.Europe.Western Europe 4043 3153 890 237 56484
Geography.Regions.Oceania 2238 1660 578 126 58400
History and Society.Business and economics 3404 2099 1305 350 57010
History and Society.Education 1589 926 663 132 59043
History and Society.History 4595 2468 2127 569 55600
History and Society.Military and warfare 5048 3942 1106 446 55270
History and Society.Politics and government 4595 2624 1971 485 55684
History and Society.Society 6148 2979 3169 513 54103
History and Society.Transportation 3573 3290 283 82 57109
STEM.Biology 7137 6573 564 165 53462
STEM.Chemistry 1506 1175 331 170 59088
STEM.Computing 2452 2053 399 305 58007
STEM.Earth and environment 1649 1118 531 123 58992
STEM.Engineering 3027 2529 498 129 57608
STEM.Libraries & Information 489 359 130 32 60243
STEM.Mathematics 942 782 160 56 59766
STEM.Medicine & Health 1774 1261 513 144 58846
STEM.Physics 1374 1025 349 139 59251
STEM.STEM* 20595 18875 1720 892 39277
STEM.Space 1637 1485 152 39 59088
STEM.Technology 4233 3133 1100 585 55946

Test data sample rates:

Label Sample Population
Culture.Biography.Biography* 0.233 0.123
Culture.Biography.Women 0.085 0.015
Culture.Food and drink 0.023 0.002
Culture.Internet culture 0.058 0.003
Culture.Linguistics 0.025 0.007
Culture.Literature 0.093 0.015
Culture.Media.Books 0.024 0.004
Culture.Media.Entertainment 0.033 0.004
Culture.Media.Films 0.043 0.011
Culture.Media.Media* 0.218 0.058
Culture.Media.Music 0.048 0.024
Culture.Media.Radio 0.005 0.002
Culture.Media.Software 0.038 0.001
Culture.Media.Television 0.033 0.009
Culture.Media.Video games 0.035 0.003
Culture.Performing arts 0.023 0.003
Culture.Philosophy and religion 0.05 0.011
Culture.Sports 0.064 0.071
Culture.Visual arts.Architecture 0.031 0.011
Culture.Visual arts.Comics and Anime 0.037 0.002
Culture.Visual arts.Fashion 0.025 0.001
Culture.Visual arts.Visual arts* 0.101 0.018
Geography.Geographical 0.066 0.024
Geography.Regions.Africa.Africa* 0.095 0.008
Geography.Regions.Africa.Central Africa 0.02 0.001
Geography.Regions.Africa.Eastern Africa 0.007 0
Geography.Regions.Africa.Northern Africa 0.024 0.001
Geography.Regions.Africa.Southern Africa 0.019 0.001
Geography.Regions.Africa.Western Africa 0.011 0.001
Geography.Regions.Americas.Central America 0.026 0.003
Geography.Regions.Americas.North America 0.089 0.064
Geography.Regions.Americas.South America 0.036 0.006
Geography.Regions.Asia.Asia* 0.228 0.045
Geography.Regions.Asia.Central Asia 0.02 0.001
Geography.Regions.Asia.East Asia 0.092 0.011
Geography.Regions.Asia.North Asia 0.028 0.001
Geography.Regions.Asia.South Asia 0.034 0.015
Geography.Regions.Asia.Southeast Asia 0.043 0.006
Geography.Regions.Asia.West Asia 0.036 0.011
Geography.Regions.Europe.Eastern Europe 0.059 0.013
Geography.Regions.Europe.Europe* 0.209 0.076
Geography.Regions.Europe.Northern Europe 0.048 0.031
Geography.Regions.Europe.Southern Europe 0.047 0.013
Geography.Regions.Europe.Western Europe 0.067 0.019
Geography.Regions.Oceania 0.037 0.015
History and Society.Business and economics 0.056 0.01
History and Society.Education 0.026 0.007
History and Society.History 0.076 0.011
History and Society.Military and warfare 0.083 0.014
History and Society.Politics and government 0.076 0.028
History and Society.Society 0.101 0.013
History and Society.Transportation 0.059 0.015
STEM.Biology 0.117 0.034
STEM.Chemistry 0.025 0.002
STEM.Computing 0.04 0.003
STEM.Earth and environment 0.027 0.005
STEM.Engineering 0.05 0.005
STEM.Libraries & Information 0.008 0.001
STEM.Mathematics 0.016 0
STEM.Medicine & Health 0.029 0.006
STEM.Physics 0.023 0.001
STEM.STEM* 0.339 0.069
STEM.Space 0.027 0.006
STEM.Technology 0.07 0.005

Test data performance:

Label Match rate Filter rate Recall Precision f1 Accuracy ROC AUC PR AUC
Culture.Biography.Biography* 0.125 0.875 0.895 0.881 0.888 0.972 0.981 0.945
Culture.Biography.Women 0.025 0.975 0.801 0.459 0.584 0.983 0.983 0.57
Culture.Food and drink 0.004 0.996 0.726 0.466 0.567 0.997 0.982 0.594
Culture.Internet culture 0.007 0.993 0.826 0.416 0.553 0.995 0.987 0.736
Culture.Linguistics 0.007 0.993 0.772 0.83 0.8 0.997 0.978 0.821
Culture.Literature 0.021 0.979 0.785 0.574 0.663 0.988 0.98 0.727
Culture.Media.Books 0.005 0.995 0.742 0.616 0.673 0.997 0.985 0.708
Culture.Media.Entertainment 0.005 0.995 0.525 0.364 0.43 0.995 0.97 0.415
Culture.Media.Films 0.011 0.989 0.794 0.739 0.765 0.995 0.984 0.812
Culture.Media.Media* 0.077 0.923 0.877 0.669 0.759 0.967 0.981 0.85
Culture.Media.Music 0.023 0.977 0.863 0.882 0.872 0.994 0.986 0.913
Culture.Media.Radio 0.002 0.998 0.625 0.718 0.668 0.999 0.946 0.506
Culture.Media.Software 0.006 0.994 0.834 0.184 0.301 0.995 0.987 0.359
Culture.Media.Television 0.009 0.991 0.725 0.74 0.733 0.995 0.982 0.737
Culture.Media.Video games 0.003 0.997 0.921 0.723 0.81 0.999 0.992 0.913
Culture.Performing arts 0.004 0.996 0.699 0.534 0.605 0.997 0.98 0.594
Culture.Philosophy and religion 0.011 0.989 0.569 0.545 0.557 0.99 0.962 0.561
Culture.Sports 0.064 0.936 0.881 0.974 0.925 0.99 0.982 0.959
Culture.Visual arts.Architecture 0.01 0.99 0.72 0.73 0.725 0.994 0.981 0.717
Culture.Visual arts.Comics and Anime 0.003 0.997 0.906 0.599 0.722 0.998 0.989 0.805
Culture.Visual arts.Fashion 0.002 0.998 0.808 0.27 0.404 0.998 0.987 0.346
Culture.Visual arts.Visual arts* 0.021 0.979 0.775 0.664 0.715 0.989 0.977 0.764
Geography.Geographical 0.025 0.975 0.647 0.617 0.632 0.982 0.97 0.679
Geography.Regions.Africa.Africa* 0.015 0.985 0.778 0.409 0.536 0.99 0.978 0.611
Geography.Regions.Africa.Central Africa 0.002 0.998 0.668 0.17 0.271 0.998 0.983 0.243
Geography.Regions.Africa.Eastern Africa 0.001 0.999 0.479 0.247 0.326 0.999 0.966 0.139
Geography.Regions.Africa.Northern Africa 0.002 0.998 0.703 0.35 0.467 0.998 0.98 0.37
Geography.Regions.Africa.Southern Africa 0.002 0.998 0.677 0.378 0.486 0.998 0.974 0.354
Geography.Regions.Africa.Western Africa 0.001 0.999 0.799 0.445 0.571 0.999 0.982 0.49
Geography.Regions.Americas.Central America 0.004 0.996 0.625 0.513 0.564 0.997 0.975 0.549
Geography.Regions.Americas.North America 0.054 0.946 0.676 0.798 0.732 0.968 0.968 0.807
Geography.Regions.Americas.South America 0.008 0.992 0.772 0.611 0.682 0.995 0.982 0.64
Geography.Regions.Asia.Asia* 0.063 0.937 0.84 0.607 0.705 0.968 0.974 0.79
Geography.Regions.Asia.Central Asia 0.002 0.998 0.755 0.317 0.446 0.998 0.985 0.405
Geography.Regions.Asia.East Asia 0.017 0.983 0.806 0.536 0.644 0.99 0.981 0.668
Geography.Regions.Asia.North Asia 0.005 0.995 0.758 0.15 0.25 0.996 0.983 0.23
Geography.Regions.Asia.South Asia 0.014 0.986 0.797 0.87 0.832 0.995 0.982 0.869
Geography.Regions.Asia.Southeast Asia 0.008 0.992 0.739 0.577 0.648 0.995 0.979 0.688
Geography.Regions.Asia.West Asia 0.011 0.989 0.805 0.832 0.818 0.996 0.981 0.839
Geography.Regions.Europe.Eastern Europe 0.015 0.985 0.799 0.682 0.736 0.993 0.983 0.815
Geography.Regions.Europe.Europe* 0.085 0.915 0.805 0.724 0.762 0.962 0.969 0.832
Geography.Regions.Europe.Northern Europe 0.023 0.977 0.636 0.836 0.723 0.985 0.971 0.782
Geography.Regions.Europe.Southern Europe 0.013 0.987 0.745 0.737 0.741 0.993 0.978 0.801
Geography.Regions.Europe.Western Europe 0.019 0.981 0.78 0.785 0.782 0.992 0.98 0.817
Geography.Regions.Oceania 0.013 0.987 0.742 0.841 0.788 0.994 0.98 0.826
History and Society.Business and economics 0.012 0.988 0.617 0.507 0.557 0.99 0.971 0.56
History and Society.Education 0.007 0.993 0.583 0.659 0.619 0.995 0.973 0.613
History and Society.History 0.016 0.984 0.537 0.367 0.436 0.985 0.957 0.443
History and Society.Military and warfare 0.019 0.981 0.781 0.582 0.667 0.989 0.98 0.759
History and Society.Politics and government 0.024 0.976 0.571 0.657 0.611 0.98 0.956 0.647
History and Society.Society 0.015 0.985 0.485 0.397 0.437 0.984 0.94 0.408
History and Society.Transportation 0.015 0.985 0.921 0.908 0.914 0.997 0.989 0.943
STEM.Biology 0.034 0.966 0.921 0.912 0.917 0.994 0.987 0.959
STEM.Chemistry 0.004 0.996 0.78 0.298 0.431 0.997 0.987 0.48
STEM.Computing 0.007 0.993 0.837 0.302 0.443 0.994 0.988 0.475
STEM.Earth and environment 0.005 0.995 0.678 0.598 0.635 0.996 0.977 0.623
STEM.Engineering 0.007 0.993 0.835 0.663 0.739 0.997 0.985 0.847
STEM.Libraries & Information 0.001 0.999 0.734 0.462 0.567 0.999 0.972 0.53
STEM.Mathematics 0.001 0.999 0.83 0.27 0.407 0.999 0.984 0.51
STEM.Medicine & Health 0.007 0.993 0.711 0.652 0.68 0.996 0.974 0.718
STEM.Physics 0.003 0.997 0.746 0.213 0.332 0.997 0.986 0.322
STEM.STEM* 0.084 0.916 0.916 0.754 0.827 0.974 0.981 0.908
STEM.Space 0.006 0.994 0.907 0.893 0.9 0.999 0.994 0.952
STEM.Technology 0.014 0.986 0.74 0.269 0.395 0.988 0.977 0.446
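The performance figures above appear to be scaled to the population rates rather than computed directly on the sample. The Python sketch below works through that arithmetic for the Culture.Biography.Biography* row. It is an inference from the published numbers, not a statement of how the training code computes them. It treats the third count column of the confusion matrix (1488) as false negatives, because that column plus the true positives equals n, and the fourth (789) as false positives. The function name `population_scaled_metrics` is hypothetical.

```python
def population_scaled_metrics(tp, fn, fp, tn, population_rate):
    """Derive summary statistics after re-weighting to an assumed population rate."""
    recall = tp / (tp + fn)   # true positive rate among labeled positives
    fpr = fp / (fp + tn)      # false positive rate among labeled negatives
    pos = population_rate     # assumed share of positives in the population
    neg = 1.0 - population_rate

    match_rate = recall * pos + fpr * neg        # share of articles flagged
    precision = (recall * pos) / match_rate      # flagged articles that are correct
    accuracy = recall * pos + (1.0 - fpr) * neg  # correct decisions overall
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "match_rate": match_rate,
        "filter_rate": 1.0 - match_rate,
        "recall": recall,
        "precision": precision,
        "f1": f1,
        "accuracy": accuracy,
    }


# Counts from the confusion matrix and population rate from the sample rates table.
row = population_scaled_metrics(tp=12651, fn=1488, fp=789, tn=45836,
                                population_rate=0.123)
for name, value in row.items():
    print(f"{name}: {value:.3f}")
```

Rounded to three decimals, this gives a match rate of 0.125, filter rate of 0.875, recall of 0.895, precision of 0.881, F1 of 0.888, and accuracy of 0.972, which agrees with the reported row.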

Implementation

Model architecture
{
    "type": "GradientBoosting",
    "params": {
        "scale": false,
        "center": false,
        "labels": [
            "Culture.Biography.Biography*",
            "Culture.Biography.Women",
            "Culture.Food and drink",
            "Culture.Internet culture",
            "Culture.Linguistics",
            "Culture.Literature",
            "Culture.Media.Books",
            "Culture.Media.Entertainment",
            "Culture.Media.Films",
            "Culture.Media.Media*",
            "Culture.Media.Music",
            "Culture.Media.Radio",
            "Culture.Media.Software",
            "Culture.Media.Television",
            "Culture.Media.Video games",
            "Culture.Performing arts",
            "Culture.Philosophy and religion",
            "Culture.Sports",
            "Culture.Visual arts.Architecture",
            "Culture.Visual arts.Comics and Anime",
            "Culture.Visual arts.Fashion",
            "Culture.Visual arts.Visual arts*",
            "Geography.Geographical",
            "Geography.Regions.Africa.Africa*",
            "Geography.Regions.Africa.Central Africa",
            "Geography.Regions.Africa.Eastern Africa",
            "Geography.Regions.Africa.Northern Africa",
            "Geography.Regions.Africa.Southern Africa",
            "Geography.Regions.Africa.Western Africa",
            "Geography.Regions.Americas.Central America",
            "Geography.Regions.Americas.North America",
            "Geography.Regions.Americas.South America",
            "Geography.Regions.Asia.Asia*",
            "Geography.Regions.Asia.Central Asia",
            "Geography.Regions.Asia.East Asia",
            "Geography.Regions.Asia.North Asia",
            "Geography.Regions.Asia.South Asia",
            "Geography.Regions.Asia.Southeast Asia",
            "Geography.Regions.Asia.West Asia",
            "Geography.Regions.Europe.Eastern Europe",
            "Geography.Regions.Europe.Europe*",
            "Geography.Regions.Europe.Northern Europe",
            "Geography.Regions.Europe.Southern Europe",
            "Geography.Regions.Europe.Western Europe",
            "Geography.Regions.Oceania",
            "History and Society.Business and economics",
            "History and Society.Education",
            "History and Society.History",
            "History and Society.Military and warfare",
            "History and Society.Politics and government",
            "History and Society.Society",
            "History and Society.Transportation",
            "STEM.Biology",
            "STEM.Chemistry",
            "STEM.Computing",
            "STEM.Earth and environment",
            "STEM.Engineering",
            "STEM.Libraries & Information",
            "STEM.Mathematics",
            "STEM.Medicine & Health",
            "STEM.Physics",
            "STEM.STEM*",
            "STEM.Space",
            "STEM.Technology"
        ],
        "multilabel": true,
        "population_rates": null,
        "ccp_alpha": 0.0,
        "criterion": "friedman_mse",
        "init": null,
        "learning_rate": 0.1,
        "loss": "deviance",
        "max_depth": 5,
        "max_features": "log2",
        "max_leaf_nodes": null,
        "min_impurity_decrease": 0.0,
        "min_impurity_split": null,
        "min_samples_leaf": 1,
        "min_samples_split": 2,
        "min_weight_fraction_leaf": 0.0,
        "n_estimators": 150,
        "n_iter_no_change": null,
        "presort": "deprecated",
        "random_state": null,
        "subsample": 1.0,
        "tol": 0.0001,
        "validation_fraction": 0.1,
        "verbose": 0,
        "warm_start": false,
        "label_weights": {}
    }
}
Output schema
{
    "title": "Scikit learn-based classifier score with probability",
    "type": "object",
    "properties": {
        "prediction": {
            "description": "The most likely labels predicted by the estimator",
            "type": "array",
            "items": {
                "type": "string"
            }
        },
        "probability": {
            "description": "A mapping of probabilities onto each of the potential output labels",
            "type": "object",
            "properties": {
                "Culture.Biography.Biography*": {
                    "type": "number"
                },
                "Culture.Biography.Women": {
                    "type": "number"
                },
                "Culture.Food and drink": {
                    "type": "number"
                },
                "Culture.Internet culture": {
                    "type": "number"
                },
                "Culture.Linguistics": {
                    "type": "number"
                },
                "Culture.Literature": {
                    "type": "number"
                },
                "Culture.Media.Books": {
                    "type": "number"
                },
                "Culture.Media.Entertainment": {
                    "type": "number"
                },
                "Culture.Media.Films": {
                    "type": "number"
                },
                "Culture.Media.Media*": {
                    "type": "number"
                },
                "Culture.Media.Music": {
                    "type": "number"
                },
                "Culture.Media.Radio": {
                    "type": "number"
                },
                "Culture.Media.Software": {
                    "type": "number"
                },
                "Culture.Media.Television": {
                    "type": "number"
                },
                "Culture.Media.Video games": {
                    "type": "number"
                },
                "Culture.Performing arts": {
                    "type": "number"
                },
                "Culture.Philosophy and religion": {
                    "type": "number"
                },
                "Culture.Sports": {
                    "type": "number"
                },
                "Culture.Visual arts.Architecture": {
                    "type": "number"
                },
                "Culture.Visual arts.Comics and Anime": {
                    "type": "number"
                },
                "Culture.Visual arts.Fashion": {
                    "type": "number"
                },
                "Culture.Visual arts.Visual arts*": {
                    "type": "number"
                },
                "Geography.Geographical": {
                    "type": "number"
                },
                "Geography.Regions.Africa.Africa*": {
                    "type": "number"
                },
                "Geography.Regions.Africa.Central Africa": {
                    "type": "number"
                },
                "Geography.Regions.Africa.Eastern Africa": {
                    "type": "number"
                },
                "Geography.Regions.Africa.Northern Africa": {
                    "type": "number"
                },
                "Geography.Regions.Africa.Southern Africa": {
                    "type": "number"
                },
                "Geography.Regions.Africa.Western Africa": {
                    "type": "number"
                },
                "Geography.Regions.Americas.Central America": {
                    "type": "number"
                },
                "Geography.Regions.Americas.North America": {
                    "type": "number"
                },
                "Geography.Regions.Americas.South America": {
                    "type": "number"
                },
                "Geography.Regions.Asia.Asia*": {
                    "type": "number"
                },
                "Geography.Regions.Asia.Central Asia": {
                    "type": "number"
                },
                "Geography.Regions.Asia.East Asia": {
                    "type": "number"
                },
                "Geography.Regions.Asia.North Asia": {
                    "type": "number"
                },
                "Geography.Regions.Asia.South Asia": {
                    "type": "number"
                },
                "Geography.Regions.Asia.Southeast Asia": {
                    "type": "number"
                },
                "Geography.Regions.Asia.West Asia": {
                    "type": "number"
                },
                "Geography.Regions.Europe.Eastern Europe": {
                    "type": "number"
                },
                "Geography.Regions.Europe.Europe*": {
                    "type": "number"
                },
                "Geography.Regions.Europe.Northern Europe": {
                    "type": "number"
                },
                "Geography.Regions.Europe.Southern Europe": {
                    "type": "number"
                },
                "Geography.Regions.Europe.Western Europe": {
                    "type": "number"
                },
                "Geography.Regions.Oceania": {
                    "type": "number"
                },
                "History and Society.Business and economics": {
                    "type": "number"
                },
                "History and Society.Education": {
                    "type": "number"
                },
                "History and Society.History": {
                    "type": "number"
                },
                "History and Society.Military and warfare": {
                    "type": "number"
                },
                "History and Society.Politics and government": {
                    "type": "number"
                },
                "History and Society.Society": {
                    "type": "number"
                },
                "History and Society.Transportation": {
                    "type": "number"
                },
                "STEM.Biology": {
                    "type": "number"
                },
                "STEM.Chemistry": {
                    "type": "number"
                },
                "STEM.Computing": {
                    "type": "number"
                },
                "STEM.Earth and environment": {
                    "type": "number"
                },
                "STEM.Engineering": {
                    "type": "number"
                },
                "STEM.Libraries & Information": {
                    "type": "number"
                },
                "STEM.Mathematics": {
                    "type": "number"
                },
                "STEM.Medicine & Health": {
                    "type": "number"
                },
                "STEM.Physics": {
                    "type": "number"
                },
                "STEM.STEM*": {
                    "type": "number"
                },
                "STEM.Space": {
                    "type": "number"
                },
                "STEM.Technology": {
                    "type": "number"
                }
            }
        }
    }
}
Example input and output
Input:
https://ores.wikimedia.org/v3/scores/viwiki/1234/articletopic

Output:

{
    "viwiki": {
        "models": {
            "articletopic": {
                "version": "1.4.0"
            }
        },
        "scores": {
            "1234": {
                "articletopic": {
                    "score": {
                        "prediction": [
                            "Culture.Linguistics",
                            "Geography.Geographical",
                            "Geography.Regions.Africa.Africa*",
                            "STEM.STEM*"
                        ],
                        "probability": {
                            "Culture.Biography.Biography*": 0.010688680396213181,
                            "Culture.Biography.Women": 0.008195720460488087,
                            "Culture.Food and drink": 0.00012907532555688536,
                            "Culture.Internet culture": 0.013800940060468312,
                            "Culture.Linguistics": 0.5191168929941612,
                            "Culture.Literature": 0.0025178108493059317,
                            "Culture.Media.Books": 0.000812132490359537,
                            "Culture.Media.Entertainment": 0.0006821999958547265,
                            "Culture.Media.Films": 0.030271891185779207,
                            "Culture.Media.Media*": 0.008323998748143888,
                            "Culture.Media.Music": 0.0002453251096654214,
                            "Culture.Media.Radio": 0.0022820146632477724,
                            "Culture.Media.Software": 0.004399225981860488,
                            "Culture.Media.Television": 0.0006043194680847222,
                            "Culture.Media.Video games": 0.00012009680849040888,
                            "Culture.Performing arts": 0.0004276073083262229,
                            "Culture.Philosophy and religion": 0.0006390607811047581,
                            "Culture.Sports": 0.16329437697514143,
                            "Culture.Visual arts.Architecture": 0.0001462376454522797,
                            "Culture.Visual arts.Comics and Anime": 5.139459619313559e-05,
                            "Culture.Visual arts.Fashion": 0.00010666140674559313,
                            "Culture.Visual arts.Visual arts*": 0.004025427327690172,
                            "Geography.Geographical": 0.9997938301645891,
                            "Geography.Regions.Africa.Africa*": 0.9745448423349992,
                            "Geography.Regions.Africa.Central Africa": 0.0027669019697055064,
                            "Geography.Regions.Africa.Eastern Africa": 0.0012268282092543052,
                            "Geography.Regions.Africa.Northern Africa": 0.0009715033768551983,
                            "Geography.Regions.Africa.Southern Africa": 0.08328049473179015,
                            "Geography.Regions.Africa.Western Africa": 0.010382244078188742,
                            "Geography.Regions.Americas.Central America": 0.0008457881065380281,
                            "Geography.Regions.Americas.North America": 0.0022683028023005257,
                            "Geography.Regions.Americas.South America": 0.004731186807992782,
                            "Geography.Regions.Asia.Asia*": 0.053922826964153146,
                            "Geography.Regions.Asia.Central Asia": 0.0010540191978565911,
                            "Geography.Regions.Asia.East Asia": 0.00015859255536353935,
                            "Geography.Regions.Asia.North Asia": 0.001639170606177736,
                            "Geography.Regions.Asia.South Asia": 0.00037981554072521193,
                            "Geography.Regions.Asia.Southeast Asia": 0.1295969844969788,
                            "Geography.Regions.Asia.West Asia": 0.0015758566417359654,
                            "Geography.Regions.Europe.Eastern Europe": 0.006411703844818901,
                            "Geography.Regions.Europe.Europe*": 0.04052637386454308,
                            "Geography.Regions.Europe.Northern Europe": 0.009986087621959145,
                            "Geography.Regions.Europe.Southern Europe": 0.009068248925919975,
                            "Geography.Regions.Europe.Western Europe": 0.011069800202305278,
                            "Geography.Regions.Oceania": 0.015890609972399248,
                            "History and Society.Business and economics": 0.0993963928524104,
                            "History and Society.Education": 0.06430872652420767,
                            "History and Society.History": 0.03888750510751568,
                            "History and Society.Military and warfare": 0.0011597465887236648,
                            "History and Society.Politics and government": 0.3817495049698054,
                            "History and Society.Society": 0.06829869812261144,
                            "History and Society.Transportation": 0.0014330186028741118,
                            "STEM.Biology": 0.004779911679802516,
                            "STEM.Chemistry": 0.0005351128777525615,
                            "STEM.Computing": 0.1260141460713239,
                            "STEM.Earth and environment": 0.01341887313175579,
                            "STEM.Engineering": 0.021532699737598353,
                            "STEM.Libraries & Information": 0.000143195569809031,
                            "STEM.Mathematics": 0.028134438670921486,
                            "STEM.Medicine & Health": 0.016469415539396397,
                            "STEM.Physics": 0.015386187697739985,
                            "STEM.STEM*": 0.9754797739642432,
                            "STEM.Space": 0.0026360704668903583,
                            "STEM.Technology": 0.05185008787196212
                        }
                    }
                }
            }
        }
    }
}
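A short Python sketch of reading a response like the one above. In this example the `prediction` list coincides with the labels whose probability exceeds 0.5; whether the service applies exactly that rule is an assumption. The helper `topics_above` is hypothetical, and the `example` dictionary is an abbreviated copy of the output above with only a few labels kept.

```python
def topics_above(response, wiki, rev_id, threshold=0.5, model="articletopic"):
    """Return (label, probability) pairs at or above threshold, most likely first."""
    score = response[wiki]["scores"][str(rev_id)][model]["score"]
    hits = [(label, p) for label, p in score["probability"].items() if p >= threshold]
    return sorted(hits, key=lambda item: item[1], reverse=True)


# Abbreviated copy of the example output (most labels omitted for brevity).
example = {
    "viwiki": {
        "scores": {
            "1234": {
                "articletopic": {
                    "score": {
                        "prediction": [
                            "Culture.Linguistics",
                            "Geography.Geographical",
                            "Geography.Regions.Africa.Africa*",
                            "STEM.STEM*",
                        ],
                        "probability": {
                            "Culture.Linguistics": 0.5191168929941612,
                            "Geography.Geographical": 0.9997938301645891,
                            "Geography.Regions.Africa.Africa*": 0.9745448423349992,
                            "History and Society.Politics and government": 0.3817495049698054,
                            "STEM.STEM*": 0.9754797739642432,
                        },
                    }
                }
            }
        }
    }
}

# A stricter threshold trades recall for precision when filtering articles.
for label, p in topics_above(example, "viwiki", 1234, threshold=0.9):
    print(f"{label}: {p:.3f}")
```

Because per-topic precision varies widely, a caller filtering on a weak topic may prefer a higher threshold than a caller filtering on a strong one.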

Data

Data pipeline
The training data was fetched from a set of revision IDs. Various pieces of information about each revision were then extracted using automated processes, and the revision text was fed into word2vec to get an article embedding. Finally, labels were derived from the mid-level WikiProject categories that the article is associated with.
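This card does not say how individual word vectors are combined into a single article embedding. The Python sketch below assumes one common approach: averaging the vectors of the tokens found in the vocabulary. The function name, the toy two-dimensional vectors, and the lowercasing step are illustrative assumptions, not the pipeline's actual implementation.

```python
def article_embedding(tokens, word_vectors, dim):
    """Average the vectors of known tokens; return a zero vector if none are known."""
    total = [0.0] * dim
    known = 0
    for token in tokens:
        vec = word_vectors.get(token.lower())
        if vec is None:
            continue  # out-of-vocabulary tokens are skipped
        known += 1
        for i, x in enumerate(vec):
            total[i] += x
    if known == 0:
        return total
    return [x / known for x in total]


# Toy vocabulary with made-up vectors (real word2vec vectors have far more dimensions).
toy_vectors = {"sông": [1.0, 0.0], "núi": [0.0, 1.0]}
print(article_embedding(["Sông", "và", "núi"], toy_vectors, dim=2))  # [0.5, 0.5]
```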
Training data
Training data was automatically and randomly separated from test data during training using the drafttopic git repository (which trains both drafttopic and articletopic models).
Test data
Test data was automatically and randomly split off from train data using the drafttopic git repository (which trains both drafttopic and articletopic models). The model then makes a prediction on that data, which is compared to the underlying ground truth to calculate performance statistics.
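The random separation of training and test data can be sketched in Python as follows. The 20% test fraction, the fixed seed, and the function name `split_train_test` are assumptions for illustration; the procedure in the drafttopic repository may differ.

```python
import random


def split_train_test(items, test_fraction=0.2, seed=0):
    """Shuffle a copy of items and split it into (train, test) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # local RNG keeps the split reproducible
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical revision IDs standing in for labeled observations.
train, test = split_train_test(range(1000, 1010))
print(len(train), len(test))  # 8 2
```

Every observation lands in exactly one of the two lists, so statistics computed on the test list are not influenced by examples the model saw during training.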

Licenses


Citation


Cite this model card as:

@misc{
  Triedman_Bazira_2023_Vietnamese_Wikipedia_article_topic,
  title={ Vietnamese Wikipedia article topic model card },
  author={ Triedman, Harold and Bazira, Kevin },
  year={ 2023 },
  url={ https://meta.wikimedia.org/wiki/Machine_learning_models/Production/Vietnamese_Wikipedia_article_topic }
}