Machine learning models/Production/Ukrainian Wikipedia article topic

From Meta, a Wikimedia project coordination wiki


Model card
This page is an on-wiki machine learning model card.
A model card is a document about a machine learning model that seeks to answer basic questions about the model.
Model Information Hub
Model creator(s): Aaron Halfaker (User:EpochFail) and Amir Sarabadani
Model owner(s): WMF Machine Learning Team (ml@wikimediafoundation.org)
Model interface: ORES homepage
Code: drafttopic GitHub, ORES training data, and ORES model binaries
Uses PII: No
In production? Yes
Which projects? Ukrainian Wikipedia
This model uses article text to predict the likelihood that the article belongs to a set of topics.


Motivation

How can we predict what general topic an article is in? Answering this question is useful for various analyses of Wikipedia dynamics. However, it is difficult to group a very diverse range of Wikipedia articles into coherent, consistent topics manually.

This model, part of the ORES suite of models, analyzes an article to predict its likelihood of belonging to a set of topics. Similar models (though not necessarily with the same performance level or topics) are deployed across about a dozen other projects. There is also a language-agnostic article topic model.

This model may be useful for high-level analyses of Wikipedia dynamics (pageviews, article quality, edit trends) and filtering articles.

Users and uses

Use this model for
  • high-level analyses of Wikipedia dynamics such as pageviews, article quality, or edit trends, e.g. how do pageview dynamics differ between the physics and biology categories?
  • filtering to relevant articles, e.g. keeping only the articles in the music category.
Don't use this model for
  • definitively establishing what topic an article pertains to
  • automated editing of articles or topics without a human in the loop
Current uses

This model is part of ORES and is generally accessible via the ORES API. It is used for high-level analysis of Wikipedia, platform research, and other on-wiki tasks.

Example API call:
https://ores.wikimedia.org/v3/scores/ukwiki/1234/articletopic
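
The call above can be sketched in Python as follows. This is a minimal illustration using only the standard library; the helper names `build_score_url` and `fetch_score` are hypothetical, and the sketch assumes the endpoint shown above is still reachable and that `1234` stands for a revision ID.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Host and path prefix copied from the example call above.
ORES_BASE = "https://ores.wikimedia.org/v3/scores"


def build_score_url(wiki: str, rev_id: int, model: str = "articletopic") -> str:
    """Build the scoring URL for one revision, following the pattern shown above."""
    return f"{ORES_BASE}/{quote(wiki)}/{int(rev_id)}/{quote(model)}"


def fetch_score(wiki: str, rev_id: int, model: str = "articletopic",
                timeout: float = 10.0) -> dict:
    """Fetch and decode the JSON response. This performs a network request."""
    with urlopen(build_score_url(wiki, rev_id, model), timeout=timeout) as resp:
        return json.load(resp)
```

Calling `fetch_score("ukwiki", 1234)` would request the same URL as the example call.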

Ethical considerations, caveats, and recommendations

  • This model was trained on data that is now several years old (from mid-2020). Underlying data drift may skew model outputs.
  • This model uses word2vec as a training feature. Word2vec, like other natural language embeddings, encodes the linguistic biases of underlying datasets — along the lines of gender, race, ethnicity, religion etc. Since Wikipedia has known biases in its text, this model may encode and at times reproduce those biases.
  • This model has highly variable performance across different topics — consult the test statistics below to get a sense of inter-topic performance.

Model

Performance

Test data confusion matrix:

Label n True positive False negative False positive True negative
Culture.Biography.Biography* 15166 13791 1375 861 45800
Culture.Biography.Women 3789 2889 900 359 57679
Culture.Food and drink 1445 1030 415 86 60296
Culture.Internet culture 3148 2586 562 213 58466
Culture.Linguistics 1672 1192 480 107 60048
Culture.Literature 5363 4083 1280 516 55948
Culture.Media.Books 1832 1430 402 147 59848
Culture.Media.Entertainment 2162 930 1232 283 59382
Culture.Media.Films 2933 2522 411 121 58773
Culture.Media.Media* 13712 11856 1856 1274 46841
Culture.Media.Music 2739 2204 535 207 58881
Culture.Media.Radio 468 186 282 50 61309
Culture.Media.Software 2119 1568 551 357 59351
Culture.Media.Television 1940 1427 513 128 59759
Culture.Media.Video games 2071 1868 203 55 59701
Culture.Performing arts 1433 809 624 131 60263
Culture.Philosophy and religion 3515 1802 1713 370 57942
Culture.Sports 5614 4968 646 182 56031
Culture.Visual arts.Architecture 2163 1497 666 246 59418
Culture.Visual arts.Comics and Anime 1640 1351 289 74 60113
Culture.Visual arts.Fashion 1229 884 345 87 60511
Culture.Visual arts.Visual arts* 5742 4125 1617 466 55619
Geography.Geographical 4540 3096 1444 605 56682
Geography.Regions.Africa.Africa* 5072 3795 1277 443 56312
Geography.Regions.Africa.Central Africa 1164 835 329 86 60577
Geography.Regions.Africa.Eastern Africa 727 541 186 44 61056
Geography.Regions.Africa.Northern Africa 1422 882 540 128 60277
Geography.Regions.Africa.Southern Africa 905 541 364 57 60865
Geography.Regions.Africa.Western Africa 370 215 155 46 61411
Geography.Regions.Americas.Central America 1329 788 541 77 60421
Geography.Regions.Americas.North America 5433 3333 2100 774 55620
Geography.Regions.Americas.South America 1517 998 519 91 60219
Geography.Regions.Asia.Asia* 12648 10108 2540 1153 48026
Geography.Regions.Asia.Central Asia 1310 880 430 98 60419
Geography.Regions.Asia.East Asia 2929 2187 742 242 58656
Geography.Regions.Asia.North Asia 3840 2607 1233 493 57494
Geography.Regions.Asia.South Asia 1742 1282 460 68 60017
Geography.Regions.Asia.Southeast Asia 1528 1029 499 79 60220
Geography.Regions.Asia.West Asia 2514 1823 691 176 59137
Geography.Regions.Europe.Eastern Europe 6703 5086 1617 706 54418
Geography.Regions.Europe.Europe* 17567 13864 3703 2331 41929
Geography.Regions.Europe.Northern Europe 3575 2180 1395 387 57865
Geography.Regions.Europe.Southern Europe 3636 2526 1110 335 57856
Geography.Regions.Europe.Western Europe 4437 3195 1242 473 56917
Geography.Regions.Oceania 1810 1267 543 119 59898
History and Society.Business and economics 3639 2090 1549 466 57722
History and Society.Education 1730 892 838 151 59946
History and Society.History 5140 2835 2305 718 55969
History and Society.Military and warfare 4576 3054 1522 543 56708
History and Society.Politics and government 4178 2243 1935 473 57176
History and Society.Society 3965 1044 2921 253 57609
History and Society.Transportation 3659 3233 426 159 58009
STEM.Biology 3859 3183 676 189 57779
STEM.Chemistry 1465 1023 442 190 60172
STEM.Computing 2683 2099 584 433 58711
STEM.Earth and environment 1966 1214 752 170 59691
STEM.Engineering 3036 2298 738 226 58565
STEM.Libraries & Information 1052 700 352 69 60706
STEM.Mathematics 1217 897 320 91 60519
STEM.Medicine & Health 1842 1218 624 173 59812
STEM.Physics 1449 937 512 205 60173
STEM.STEM* 18828 16748 2080 1108 41891
STEM.Space 1841 1627 214 62 59924
STEM.Technology 4780 3320 1460 779 56268
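
As a reading aid, the sketch below recomputes recall for three labels from the table above. It assumes that `n` counts the test articles that actually carry the label, so that recall is the true-positive count divided by `n`. That interpretation is inferred from the numbers rather than stated on this page, and the function name is hypothetical.

```python
# Counts copied from the confusion matrix above: label -> (n, true positives).
ROWS = {
    "Culture.Biography.Biography*": (15166, 13791),
    "Culture.Media.Entertainment": (2162, 930),
    "History and Society.Society": (3965, 1044),
}


def recall(n: int, true_positives: int) -> float:
    """Share of the articles carrying a label that the model also assigned it to."""
    return true_positives / n


for label, (n, tp) in ROWS.items():
    print(f"{label}: recall = {recall(n, tp):.3f}")
```

Under that assumption the three results round to the Recall values reported for the same labels in the performance table.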

Test data sample rates:

Label Sample Population
Culture.Biography.Biography* 0.245 0.12
Culture.Biography.Women 0.061 0.015
Culture.Food and drink 0.023 0.003
Culture.Internet culture 0.051 0.004
Culture.Linguistics 0.027 0.008
Culture.Literature 0.087 0.015
Culture.Media.Books 0.03 0.004
Culture.Media.Entertainment 0.035 0.004
Culture.Media.Films 0.047 0.012
Culture.Media.Media* 0.222 0.055
Culture.Media.Music 0.044 0.021
Culture.Media.Radio 0.008 0.002
Culture.Media.Software 0.034 0.001
Culture.Media.Television 0.031 0.009
Culture.Media.Video games 0.033 0.003
Culture.Performing arts 0.023 0.003
Culture.Philosophy and religion 0.057 0.01
Culture.Sports 0.091 0.06
Culture.Visual arts.Architecture 0.035 0.011
Culture.Visual arts.Comics and Anime 0.027 0.002
Culture.Visual arts.Fashion 0.02 0.001
Culture.Visual arts.Visual arts* 0.093 0.018
Geography.Geographical 0.073 0.021
Geography.Regions.Africa.Africa* 0.082 0.008
Geography.Regions.Africa.Central Africa 0.019 0.001
Geography.Regions.Africa.Eastern Africa 0.012 0.001
Geography.Regions.Africa.Northern Africa 0.023 0.001
Geography.Regions.Africa.Southern Africa 0.015 0.001
Geography.Regions.Africa.Western Africa 0.006 0.001
Geography.Regions.Americas.Central America 0.021 0.003
Geography.Regions.Americas.North America 0.088 0.063
Geography.Regions.Americas.South America 0.025 0.007
Geography.Regions.Asia.Asia* 0.205 0.052
Geography.Regions.Asia.Central Asia 0.021 0.001
Geography.Regions.Asia.East Asia 0.047 0.012
Geography.Regions.Asia.North Asia 0.062 0.006
Geography.Regions.Asia.South Asia 0.028 0.016
Geography.Regions.Asia.Southeast Asia 0.025 0.006
Geography.Regions.Asia.West Asia 0.041 0.012
Geography.Regions.Europe.Eastern Europe 0.108 0.018
Geography.Regions.Europe.Europe* 0.284 0.081
Geography.Regions.Europe.Northern Europe 0.058 0.029
Geography.Regions.Europe.Southern Europe 0.059 0.014
Geography.Regions.Europe.Western Europe 0.072 0.02
Geography.Regions.Oceania 0.029 0.016
History and Society.Business and economics 0.059 0.01
History and Society.Education 0.028 0.008
History and Society.History 0.083 0.011
History and Society.Military and warfare 0.074 0.015
History and Society.Politics and government 0.068 0.028
History and Society.Society 0.064 0.008
History and Society.Transportation 0.059 0.016
STEM.Biology 0.062 0.034
STEM.Chemistry 0.024 0.002
STEM.Computing 0.043 0.003
STEM.Earth and environment 0.032 0.005
STEM.Engineering 0.049 0.006
STEM.Libraries & Information 0.017 0.001
STEM.Mathematics 0.02 0
STEM.Medicine & Health 0.03 0.006
STEM.Physics 0.023 0.001
STEM.STEM* 0.305 0.065
STEM.Space 0.03 0.004
STEM.Technology 0.077 0.005
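
The Sample column can be related back to the confusion matrix with a short calculation. The sketch assumes the sample rate is a label's `n` divided by the size of the test set, and that the test set size equals the sum of the four counts in any confusion-matrix row (spot-checked on a few rows only). Both are inferences from the published numbers, and the names are hypothetical.

```python
# The four counts of the first confusion-matrix row (Culture.Biography.Biography*).
FIRST_ROW_COUNTS = (13791, 1375, 861, 45800)

# Assumed test set size: the four counts of a row added together.
TEST_SET_SIZE = sum(FIRST_ROW_COUNTS)

# n values copied from the confusion matrix.
N_BY_LABEL = {
    "Culture.Biography.Biography*": 15166,
    "Culture.Media.Radio": 468,
    "STEM.STEM*": 18828,
}


def sample_rate(n: int, total: int = TEST_SET_SIZE) -> float:
    """Fraction of test articles that carry a given label."""
    return n / total


for label, n in N_BY_LABEL.items():
    print(f"{label}: sample rate = {sample_rate(n):.3f}")
```

Comparing a label's Sample and Population values then gives a rough sense of how much more often the label appears in the test sample than in the population figure.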

Test data performance:

Label Match rate Filter rate Recall Precision f1 Accuracy ROC AUC PR AUC
Culture.Biography.Biography* 0.125 0.875 0.909 0.871 0.89 0.973 0.981 0.944
Culture.Biography.Women 0.017 0.983 0.762 0.651 0.703 0.99 0.982 0.715
Culture.Food and drink 0.003 0.997 0.713 0.561 0.628 0.998 0.981 0.636
Culture.Internet culture 0.007 0.993 0.821 0.458 0.588 0.996 0.984 0.749
Culture.Linguistics 0.007 0.993 0.713 0.764 0.738 0.996 0.976 0.759
Culture.Literature 0.02 0.98 0.761 0.556 0.643 0.987 0.977 0.707
Culture.Media.Books 0.006 0.994 0.781 0.58 0.665 0.997 0.986 0.733
Culture.Media.Entertainment 0.006 0.994 0.43 0.264 0.327 0.993 0.966 0.256
Culture.Media.Films 0.012 0.988 0.86 0.83 0.845 0.996 0.985 0.888
Culture.Media.Media* 0.072 0.928 0.865 0.654 0.745 0.968 0.979 0.854
Culture.Media.Music 0.02 0.98 0.805 0.831 0.818 0.992 0.982 0.841
Culture.Media.Radio 0.002 0.998 0.397 0.531 0.455 0.998 0.959 0.284
Culture.Media.Software 0.007 0.993 0.74 0.139 0.234 0.994 0.985 0.243
Culture.Media.Television 0.009 0.991 0.736 0.755 0.745 0.996 0.981 0.772
Culture.Media.Video games 0.004 0.996 0.902 0.74 0.813 0.999 0.991 0.876
Culture.Performing arts 0.004 0.996 0.565 0.443 0.496 0.997 0.973 0.473
Culture.Philosophy and religion 0.012 0.988 0.513 0.458 0.484 0.989 0.952 0.427
Culture.Sports 0.056 0.944 0.885 0.946 0.914 0.99 0.981 0.948
Culture.Visual arts.Architecture 0.012 0.988 0.692 0.646 0.668 0.993 0.976 0.665
Culture.Visual arts.Comics and Anime 0.003 0.997 0.824 0.616 0.705 0.998 0.987 0.7
Culture.Visual arts.Fashion 0.002 0.998 0.719 0.31 0.433 0.998 0.979 0.372
Culture.Visual arts.Visual arts* 0.021 0.979 0.718 0.614 0.662 0.987 0.97 0.703
Geography.Geographical 0.025 0.975 0.682 0.582 0.628 0.983 0.971 0.669
Geography.Regions.Africa.Africa* 0.014 0.986 0.748 0.45 0.562 0.99 0.974 0.614
Geography.Regions.Africa.Central Africa 0.002 0.998 0.717 0.262 0.384 0.998 0.984 0.315
Geography.Regions.Africa.Eastern Africa 0.001 0.999 0.744 0.341 0.468 0.999 0.976 0.487
Geography.Regions.Africa.Northern Africa 0.003 0.997 0.62 0.283 0.389 0.997 0.975 0.327
Geography.Regions.Africa.Southern Africa 0.002 0.998 0.598 0.455 0.517 0.999 0.967 0.39
Geography.Regions.Africa.Western Africa 0.001 0.999 0.581 0.366 0.449 0.999 0.964 0.281
Geography.Regions.Americas.Central America 0.003 0.997 0.593 0.617 0.605 0.997 0.974 0.592
Geography.Regions.Americas.North America 0.051 0.949 0.613 0.75 0.675 0.963 0.962 0.744
Geography.Regions.Americas.South America 0.006 0.994 0.658 0.75 0.701 0.996 0.974 0.73
Geography.Regions.Asia.Asia* 0.064 0.936 0.799 0.654 0.719 0.967 0.966 0.794
Geography.Regions.Asia.Central Asia 0.002 0.998 0.672 0.248 0.362 0.998 0.982 0.33
Geography.Regions.Asia.East Asia 0.013 0.987 0.747 0.691 0.718 0.993 0.978 0.771
Geography.Regions.Asia.North Asia 0.012 0.988 0.679 0.31 0.426 0.99 0.971 0.436
Geography.Regions.Asia.South Asia 0.013 0.987 0.736 0.916 0.816 0.995 0.981 0.872
Geography.Regions.Asia.Southeast Asia 0.005 0.995 0.673 0.763 0.715 0.997 0.974 0.705
Geography.Regions.Asia.West Asia 0.011 0.989 0.725 0.743 0.734 0.994 0.979 0.769
Geography.Regions.Europe.Eastern Europe 0.026 0.974 0.759 0.524 0.62 0.983 0.971 0.677
Geography.Regions.Europe.Europe* 0.112 0.888 0.789 0.568 0.661 0.935 0.952 0.758
Geography.Regions.Europe.Northern Europe 0.024 0.976 0.61 0.732 0.665 0.982 0.964 0.714
Geography.Regions.Europe.Southern Europe 0.015 0.985 0.695 0.628 0.659 0.99 0.976 0.708
Geography.Regions.Europe.Western Europe 0.023 0.977 0.72 0.646 0.681 0.986 0.973 0.723
Geography.Regions.Oceania 0.013 0.987 0.7 0.855 0.77 0.993 0.979 0.82
History and Society.Business and economics 0.014 0.986 0.574 0.413 0.48 0.988 0.963 0.458
History and Society.Education 0.007 0.993 0.516 0.621 0.563 0.994 0.965 0.558
History and Society.History 0.019 0.981 0.552 0.324 0.408 0.983 0.951 0.375
History and Society.Military and warfare 0.02 0.98 0.667 0.522 0.586 0.986 0.968 0.626
History and Society.Politics and government 0.023 0.977 0.537 0.653 0.589 0.979 0.951 0.61
History and Society.Society 0.007 0.993 0.263 0.334 0.294 0.99 0.9 0.238
History and Society.Transportation 0.017 0.983 0.884 0.842 0.862 0.995 0.986 0.898
STEM.Biology 0.032 0.968 0.825 0.9 0.861 0.991 0.979 0.9
STEM.Chemistry 0.004 0.996 0.698 0.273 0.392 0.996 0.982 0.429
STEM.Computing 0.01 0.99 0.782 0.232 0.358 0.992 0.987 0.357
STEM.Earth and environment 0.006 0.994 0.617 0.508 0.557 0.995 0.969 0.545
STEM.Engineering 0.008 0.992 0.757 0.532 0.625 0.995 0.98 0.713
STEM.Libraries & Information 0.002 0.998 0.665 0.286 0.4 0.999 0.978 0.343
STEM.Mathematics 0.002 0.998 0.737 0.185 0.295 0.998 0.986 0.338
STEM.Medicine & Health 0.007 0.993 0.661 0.597 0.627 0.995 0.972 0.596
STEM.Physics 0.004 0.996 0.647 0.151 0.245 0.996 0.982 0.182
STEM.STEM* 0.082 0.918 0.89 0.705 0.787 0.969 0.977 0.882
STEM.Space 0.005 0.995 0.884 0.785 0.831 0.998 0.991 0.904
STEM.Technology 0.017 0.983 0.695 0.208 0.32 0.985 0.972 0.379
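
The precision, match rate, and accuracy figures do not follow directly from the raw confusion-matrix counts. They appear to be rescaled by each label's Population rate. The sketch below illustrates one such rescaling for Culture.Biography.Biography*. It rests on several assumptions that this page does not state: that the count which sums with the true positives to `n` (1375) is the false negatives, that the remaining off-diagonal count (861) is the false positives, and that statistics are weighted by the rounded population rate of 0.12. The function name is hypothetical.

```python
def population_scaled_stats(tp: int, fn: int, fp: int, tn: int,
                            population_rate: float) -> dict:
    """Re-weight confusion-matrix counts so positives occur at `population_rate`."""
    tpr = tp / (tp + fn)   # recall on articles that carry the label
    fpr = fp / (fp + tn)   # false-alarm rate on articles that do not
    true_matches = population_rate * tpr
    false_matches = (1 - population_rate) * fpr
    match_rate = true_matches + false_matches
    precision = true_matches / match_rate
    return {
        "match_rate": match_rate,
        "filter_rate": 1 - match_rate,
        "recall": tpr,
        "precision": precision,
        "f1": 2 * precision * tpr / (precision + tpr),
        "accuracy": true_matches + (1 - population_rate) * (1 - fpr),
    }


stats = population_scaled_stats(tp=13791, fn=1375, fp=861, tn=45800,
                                population_rate=0.12)
for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```

With these inputs the results land within about 0.001 of the reported row; the small gaps are plausibly due to the population rate being published to only two or three decimals.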

Implementation

Model architecture
{
    "type": "GradientBoosting",
    "params": {
        "verbose": 0,
        "presort": "deprecated",
        "n_iter_no_change": null,
        "multilabel": true,
        "ccp_alpha": 0.0,
        "criterion": "friedman_mse",
        "center": false,
        "warm_start": false,
        "min_samples_leaf": 1,
        "loss": "deviance",
        "max_leaf_nodes": null,
        "learning_rate": 0.1,
        "random_state": null,
        "label_weights": {},
        "subsample": 1.0,
        "validation_fraction": 0.1,
        "scale": false,
        "labels": [
            "Culture.Biography.Biography*",
            "Culture.Biography.Women",
            "Culture.Food and drink",
            "Culture.Internet culture",
            "Culture.Linguistics",
            "Culture.Literature",
            "Culture.Media.Books",
            "Culture.Media.Entertainment",
            "Culture.Media.Films",
            "Culture.Media.Media*",
            "Culture.Media.Music",
            "Culture.Media.Radio",
            "Culture.Media.Software",
            "Culture.Media.Television",
            "Culture.Media.Video games",
            "Culture.Performing arts",
            "Culture.Philosophy and religion",
            "Culture.Sports",
            "Culture.Visual arts.Architecture",
            "Culture.Visual arts.Comics and Anime",
            "Culture.Visual arts.Fashion",
            "Culture.Visual arts.Visual arts*",
            "Geography.Geographical",
            "Geography.Regions.Africa.Africa*",
            "Geography.Regions.Africa.Central Africa",
            "Geography.Regions.Africa.Eastern Africa",
            "Geography.Regions.Africa.Northern Africa",
            "Geography.Regions.Africa.Southern Africa",
            "Geography.Regions.Africa.Western Africa",
            "Geography.Regions.Americas.Central America",
            "Geography.Regions.Americas.North America",
            "Geography.Regions.Americas.South America",
            "Geography.Regions.Asia.Asia*",
            "Geography.Regions.Asia.Central Asia",
            "Geography.Regions.Asia.East Asia",
            "Geography.Regions.Asia.North Asia",
            "Geography.Regions.Asia.South Asia",
            "Geography.Regions.Asia.Southeast Asia",
            "Geography.Regions.Asia.West Asia",
            "Geography.Regions.Europe.Eastern Europe",
            "Geography.Regions.Europe.Europe*",
            "Geography.Regions.Europe.Northern Europe",
            "Geography.Regions.Europe.Southern Europe",
            "Geography.Regions.Europe.Western Europe",
            "Geography.Regions.Oceania",
            "History and Society.Business and economics",
            "History and Society.Education",
            "History and Society.History",
            "History and Society.Military and warfare",
            "History and Society.Politics and government",
            "History and Society.Society",
            "History and Society.Transportation",
            "STEM.Biology",
            "STEM.Chemistry",
            "STEM.Computing",
            "STEM.Earth and environment",
            "STEM.Engineering",
            "STEM.Libraries & Information",
            "STEM.Mathematics",
            "STEM.Medicine & Health",
            "STEM.Physics",
            "STEM.STEM*",
            "STEM.Space",
            "STEM.Technology"
        ],
        "max_depth": 5,
        "min_weight_fraction_leaf": 0.0,
        "min_samples_split": 2,
        "tol": 0.0001,
        "min_impurity_decrease": 0.0,
        "n_estimators": 150,
        "max_features": "log2",
        "population_rates": null,
        "init": null,
        "min_impurity_split": null
    }
}
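
One way to read the parameter list above is to separate tree-boosting hyperparameters from settings that look like they belong to the surrounding scoring wrapper. The sketch below does this on an abbreviated copy of the configuration. The grouping in `WRAPPER_KEYS` is an assumption about which keys are wrapper-level; it is not stated on this page, and the function name is hypothetical.

```python
import json

# Assumed wrapper-level keys (label handling and feature scaling), as opposed to
# hyperparameters of the gradient-boosting estimator itself.
WRAPPER_KEYS = {"multilabel", "labels", "label_weights",
                "population_rates", "center", "scale"}

# Abbreviated copy of the configuration above; the label list is truncated here.
MODEL_INFO = json.loads("""
{
    "type": "GradientBoosting",
    "params": {
        "multilabel": true,
        "learning_rate": 0.1,
        "labels": ["Culture.Biography.Biography*", "Culture.Biography.Women"],
        "max_depth": 5,
        "n_estimators": 150,
        "max_features": "log2",
        "population_rates": null
    }
}
""")


def split_params(model_info: dict) -> tuple:
    """Return (estimator_params, wrapper_params) from a model description."""
    params = model_info["params"]
    estimator = {k: v for k, v in params.items() if k not in WRAPPER_KEYS}
    wrapper = {k: v for k, v in params.items() if k in WRAPPER_KEYS}
    return estimator, wrapper


estimator_params, wrapper_params = split_params(MODEL_INFO)
print(estimator_params)
```

Read this way, the configuration describes 150 boosted trees of depth at most 5 with a learning rate of 0.1, applied in a multilabel setting.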
Output schema
{
    "type": "object",
    "properties": {
        "prediction": {
            "type": "array",
            "description": "The most likely labels predicted by the estimator",
            "items": {
                "type": "string"
            }
        },
        "probability": {
            "type": "object",
            "description": "A mapping of probabilities onto each of the potential output labels",
            "properties": {
                "Geography.Regions.Africa.Central Africa": {
                    "type": "number"
                },
                "Culture.Media.Books": {
                    "type": "number"
                },
                "Geography.Regions.Oceania": {
                    "type": "number"
                },
                "Culture.Media.Entertainment": {
                    "type": "number"
                },
                "Culture.Media.Radio": {
                    "type": "number"
                },
                "Geography.Regions.Africa.Eastern Africa": {
                    "type": "number"
                },
                "STEM.STEM*": {
                    "type": "number"
                },
                "History and Society.Transportation": {
                    "type": "number"
                },
                "Culture.Media.Media*": {
                    "type": "number"
                },
                "Geography.Regions.Asia.Southeast Asia": {
                    "type": "number"
                },
                "History and Society.Society": {
                    "type": "number"
                },
                "History and Society.Education": {
                    "type": "number"
                },
                "Culture.Media.Television": {
                    "type": "number"
                },
                "Geography.Regions.Asia.West Asia": {
                    "type": "number"
                },
                "Culture.Visual arts.Comics and Anime": {
                    "type": "number"
                },
                "Geography.Regions.Asia.East Asia": {
                    "type": "number"
                },
                "Geography.Regions.Asia.South Asia": {
                    "type": "number"
                },
                "Geography.Regions.Europe.Northern Europe": {
                    "type": "number"
                },
                "Culture.Performing arts": {
                    "type": "number"
                },
                "STEM.Technology": {
                    "type": "number"
                },
                "Geography.Regions.Asia.North Asia": {
                    "type": "number"
                },
                "STEM.Physics": {
                    "type": "number"
                },
                "Geography.Regions.Americas.South America": {
                    "type": "number"
                },
                "Geography.Regions.Africa.Northern Africa": {
                    "type": "number"
                },
                "STEM.Computing": {
                    "type": "number"
                },
                "Culture.Biography.Women": {
                    "type": "number"
                },
                "Geography.Regions.Americas.North America": {
                    "type": "number"
                },
                "Geography.Regions.Africa.Africa*": {
                    "type": "number"
                },
                "Culture.Internet culture": {
                    "type": "number"
                },
                "Culture.Media.Music": {
                    "type": "number"
                },
                "Culture.Visual arts.Fashion": {
                    "type": "number"
                },
                "STEM.Mathematics": {
                    "type": "number"
                },
                "STEM.Medicine & Health": {
                    "type": "number"
                },
                "Geography.Geographical": {
                    "type": "number"
                },
                "History and Society.Military and warfare": {
                    "type": "number"
                },
                "STEM.Biology": {
                    "type": "number"
                },
                "Culture.Media.Films": {
                    "type": "number"
                },
                "STEM.Space": {
                    "type": "number"
                },
                "Culture.Philosophy and religion": {
                    "type": "number"
                },
                "Geography.Regions.Africa.Western Africa": {
                    "type": "number"
                },
                "History and Society.Business and economics": {
                    "type": "number"
                },
                "Geography.Regions.Africa.Southern Africa": {
                    "type": "number"
                },
                "Culture.Visual arts.Visual arts*": {
                    "type": "number"
                },
                "History and Society.Politics and government": {
                    "type": "number"
                },
                "Culture.Media.Video games": {
                    "type": "number"
                },
                "Culture.Media.Software": {
                    "type": "number"
                },
                "Culture.Food and drink": {
                    "type": "number"
                },
                "Culture.Biography.Biography*": {
                    "type": "number"
                },
                "Culture.Linguistics": {
                    "type": "number"
                },
                "STEM.Earth and environment": {
                    "type": "number"
                },
                "History and Society.History": {
                    "type": "number"
                },
                "Geography.Regions.Europe.Southern Europe": {
                    "type": "number"
                },
                "Geography.Regions.Europe.Europe*": {
                    "type": "number"
                },
                "Geography.Regions.Americas.Central America": {
                    "type": "number"
                },
                "Geography.Regions.Asia.Asia*": {
                    "type": "number"
                },
                "Geography.Regions.Europe.Eastern Europe": {
                    "type": "number"
                },
                "STEM.Chemistry": {
                    "type": "number"
                },
                "Geography.Regions.Asia.Central Asia": {
                    "type": "number"
                },
                "Culture.Literature": {
                    "type": "number"
                },
                "Geography.Regions.Europe.Western Europe": {
                    "type": "number"
                },
                "Culture.Sports": {
                    "type": "number"
                },
                "STEM.Engineering": {
                    "type": "number"
                },
                "STEM.Libraries & Information": {
                    "type": "number"
                },
                "Culture.Visual arts.Architecture": {
                    "type": "number"
                }
            }
        }
    },
    "title": "Scikit learn-based classifier score with probability"
}
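
A client could perform a light structural check on a score object before using it. The sketch below is a hand-written check in the spirit of the schema above, not a full JSON Schema validator. It is stricter than the schema in one respect, which is an assumption of this sketch: it requires both `prediction` and `probability` to be present. The function name is hypothetical.

```python
from numbers import Real


def looks_like_score(score: object) -> bool:
    """Check that `score` has a list of string labels and a label-to-number mapping."""
    if not isinstance(score, dict):
        return False
    prediction = score.get("prediction")
    probability = score.get("probability")
    if not isinstance(prediction, list):
        return False
    if not all(isinstance(label, str) for label in prediction):
        return False
    if not isinstance(probability, dict):
        return False
    # bool is a subclass of int in Python, so exclude it explicitly.
    return all(
        isinstance(label, str) and isinstance(p, Real) and not isinstance(p, bool)
        for label, p in probability.items()
    )
```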
Example input and output
Input:
https://ores.wikimedia.org/v3/scores/ukwiki/1234/articletopic

Output:

{
    "ukwiki": {
        "models": {
            "articletopic": {
                "version": "1.3.0"
            }
        },
        "scores": {
            "1234": {
                "articletopic": {
                    "score": {
                        "prediction": [
                            "Culture.Media.Entertainment",
                            "Culture.Media.Films",
                            "Culture.Media.Media*",
                            "STEM.STEM*"
                        ],
                        "probability": {
                            "Culture.Biography.Biography*": 0.06498980629164289,
                            "Culture.Biography.Women": 0.09735683417742311,
                            "Culture.Food and drink": 0.000523699699397982,
                            "Culture.Internet culture": 0.0020378622899029177,
                            "Culture.Linguistics": 0.0016311998268999513,
                            "Culture.Literature": 0.012105139018247661,
                            "Culture.Media.Books": 0.003154971196247378,
                            "Culture.Media.Entertainment": 0.5970285574411643,
                            "Culture.Media.Films": 0.6098774286112809,
                            "Culture.Media.Media*": 0.6813965808444392,
                            "Culture.Media.Music": 0.00819183087138225,
                            "Culture.Media.Radio": 0.033328047128739664,
                            "Culture.Media.Software": 0.0036038682836095704,
                            "Culture.Media.Television": 0.015975115428127188,
                            "Culture.Media.Video games": 0.0002729242274025432,
                            "Culture.Performing arts": 0.010021068609200184,
                            "Culture.Philosophy and religion": 0.08107947551534797,
                            "Culture.Sports": 0.01126554934508554,
                            "Culture.Visual arts.Architecture": 0.007503543006049268,
                            "Culture.Visual arts.Comics and Anime": 0.010089193188944407,
                            "Culture.Visual arts.Fashion": 0.0003908147416162821,
                            "Culture.Visual arts.Visual arts*": 0.06032534559763305,
                            "Geography.Geographical": 0.012344109368572971,
                            "Geography.Regions.Africa.Africa*": 0.012560789546965578,
                            "Geography.Regions.Africa.Central Africa": 0.0010803105638953752,
                            "Geography.Regions.Africa.Eastern Africa": 0.00011246776562541031,
                            "Geography.Regions.Africa.Northern Africa": 0.009624298316990852,
                            "Geography.Regions.Africa.Southern Africa": 0.0004266187714493931,
                            "Geography.Regions.Africa.Western Africa": 9.969668607552113e-06,
                            "Geography.Regions.Americas.Central America": 0.06118223784802867,
                            "Geography.Regions.Americas.North America": 0.03925429350174892,
                            "Geography.Regions.Americas.South America": 0.05219719485708542,
                            "Geography.Regions.Asia.Asia*": 0.008542578261190403,
                            "Geography.Regions.Asia.Central Asia": 0.00026849369255479776,
                            "Geography.Regions.Asia.East Asia": 0.0002861807464576771,
                            "Geography.Regions.Asia.North Asia": 0.0011599087617587605,
                            "Geography.Regions.Asia.South Asia": 0.0015128573492686713,
                            "Geography.Regions.Asia.Southeast Asia": 0.002553083994766201,
                            "Geography.Regions.Asia.West Asia": 0.0005711962419297393,
                            "Geography.Regions.Europe.Eastern Europe": 0.0037089892492006117,
                            "Geography.Regions.Europe.Europe*": 0.27058500017040055,
                            "Geography.Regions.Europe.Northern Europe": 0.008797312001829935,
                            "Geography.Regions.Europe.Southern Europe": 0.0068022456360417195,
                            "Geography.Regions.Europe.Western Europe": 0.08573292886089152,
                            "Geography.Regions.Oceania": 0.010721904050560125,
                            "History and Society.Business and economics": 0.0022774846472204424,
                            "History and Society.Education": 0.023723447768514965,
                            "History and Society.History": 0.016452406006402392,
                            "History and Society.Military and warfare": 0.003573186866768048,
                            "History and Society.Politics and government": 0.007725262244322175,
                            "History and Society.Society": 0.01170722296370807,
                            "History and Society.Transportation": 0.0008191284679530242,
                            "STEM.Biology": 0.020667912361522288,
                            "STEM.Chemistry": 0.000310701475054592,
                            "STEM.Computing": 0.0026236938213711316,
                            "STEM.Earth and environment": 0.16856390174271477,
                            "STEM.Engineering": 0.0014354375119114869,
                            "STEM.Libraries & Information": 0.0040652121528858476,
                            "STEM.Mathematics": 0.0007447907192789808,
                            "STEM.Medicine & Health": 0.0025102595896690613,
                            "STEM.Physics": 0.001325294456067261,
                            "STEM.STEM*": 0.8558854466981729,
                            "STEM.Space": 0.0004964062174690077,
                            "STEM.Technology": 0.008370450540204109
                        }
                    }
                }
            }
        }
    }
}
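
To pull the likely topics out of a response like the one above, a client might sort the probabilities and keep those above a cutoff. The sketch below uses an abbreviated copy of the example output. The 0.5 threshold is an assumption: it happens to reproduce the `prediction` list in this one example, but the page does not say how that list is derived. The function name is hypothetical.

```python
def topics_above(response: dict, wiki: str, rev_id: int,
                 threshold: float = 0.5, model: str = "articletopic") -> list:
    """Return (label, probability) pairs at or above `threshold`, highest first."""
    score = response[wiki]["scores"][str(rev_id)][model]["score"]
    ranked = sorted(score["probability"].items(),
                    key=lambda item: item[1], reverse=True)
    return [(label, p) for label, p in ranked if p >= threshold]


# Abbreviated copy of the example output above (most labels omitted).
EXAMPLE = {
    "ukwiki": {
        "scores": {
            "1234": {
                "articletopic": {
                    "score": {
                        "prediction": [
                            "Culture.Media.Entertainment",
                            "Culture.Media.Films",
                            "Culture.Media.Media*",
                            "STEM.STEM*",
                        ],
                        "probability": {
                            "Culture.Media.Entertainment": 0.5970285574411643,
                            "Culture.Media.Films": 0.6098774286112809,
                            "Culture.Media.Media*": 0.6813965808444392,
                            "Geography.Regions.Europe.Europe*": 0.27058500017040055,
                            "STEM.Earth and environment": 0.16856390174271477,
                            "STEM.STEM*": 0.8558854466981729,
                        },
                    }
                }
            }
        }
    }
}

for label, p in topics_above(EXAMPLE, "ukwiki", 1234):
    print(f"{p:.3f}  {label}")
```

Raising the threshold trades recall for precision, which matters given how much per-topic performance varies.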

Data

Data pipeline
Training data was fetched for a set of revision IDs. Automated processes then extracted various pieces of information about each revision, and the revision text was fed into word2vec to produce an article embedding. Finally, labels were derived from the mid-level WikiProject categories associated with the article.
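
The embedding step can be illustrated with a toy example. The sketch assumes an article embedding is the element-wise mean of the word vectors of its tokens, which is a common way to pool word2vec vectors. This page does not say how the vectors are actually combined, and the tokens, two-dimensional vectors, and function name below are all made up for illustration.

```python
def article_embedding(tokens: list, word_vectors: dict, dim: int) -> list:
    """Average the vectors of known tokens; return a zero vector if none are known."""
    known = [word_vectors[token] for token in tokens if token in word_vectors]
    if not known:
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


# Toy vocabulary with 2-dimensional vectors (real embeddings are far larger).
TOY_VECTORS = {
    "kyiv": [1.0, 0.0],
    "city": [0.0, 1.0],
}

print(article_embedding(["kyiv", "city", "unknownword"], TOY_VECTORS, dim=2))
```

Tokens missing from the vocabulary are skipped here, which is one of several possible conventions.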
Training data
Training data was automatically and randomly separated from test data during training using the drafttopic git repository (which trains both drafttopic and articletopic models).
Test data
Test data was automatically and randomly split off from train data using the drafttopic git repository (which trains both drafttopic and articletopic models). The model then makes a prediction on that data, which is compared to the underlying ground truth to calculate performance statistics.
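
A random split of this kind can be sketched as below. This is not the drafttopic repository's code: the function name, the 20% test fraction, and the fixed seed are all assumptions chosen for illustration.

```python
import random


def random_train_test_split(rev_ids, test_fraction: float = 0.2, seed: int = 0):
    """Shuffle revision IDs reproducibly and split them into (train, test) lists."""
    shuffled = list(rev_ids)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


train_ids, test_ids = random_train_test_split(range(100))
print(len(train_ids), len(test_ids))
```

Keeping the two sets disjoint is what lets the test statistics act as an estimate of performance on unseen articles.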

Licenses

Citation

Cite this model card as:

@misc{
  Triedman_Bazira_2023_Ukrainian_Wikipedia_article_topic,
  title={ Ukrainian Wikipedia article topic model card },
  author={ Triedman, Harold and Bazira, Kevin },
  year={ 2023 },
  url={ https://meta.wikimedia.org/wiki/Machine_learning_models/Production/Ukrainian_Wikipedia_article_topic }
}