
Machine learning models/Production/Hungarian Wikipedia article topic

From Meta, a Wikimedia project coordination wiki


Model card
This page is an on-wiki machine learning model card.
A model card is a document about a machine learning model that seeks to answer basic questions about the model.
Model Information Hub
Model creator(s): Aaron Halfaker (User:EpochFail) and Amir Sarabadani
Model owner(s): WMF Machine Learning Team (ml@wikimediafoundation.org)
Model interface: ORES homepage
Code: drafttopic GitHub, ORES training data, and ORES model binaries
Uses PII: No
In production?: Yes
Which projects?: Hungarian Wikipedia
This model uses article text to predict the likelihood that the article belongs to a set of topics.


Motivation


How can we predict what general topic an article is in? Answering this question is useful for various analyses of Wikipedia dynamics. However, it is difficult to group a very diverse range of Wikipedia articles into coherent, consistent topics manually.

This model, part of the ORES suite of models, analyzes an article to predict its likelihood of belonging to each of a set of topics. Similar models (though not necessarily with the same performance level or topics) are deployed across about a dozen other projects. There is also a language-agnostic article topic model.

This model may be useful for high-level analyses of Wikipedia dynamics (pageviews, article quality, edit trends) and filtering articles.

Users and uses

Use this model for
  • high-level analyses of Wikipedia dynamics such as pageviews, article quality, or edit trends, e.g. how do pageview dynamics differ between the physics and biology categories?
  • filtering to relevant articles, e.g. keeping only articles in the music category.
Don't use this model for
  • definitively establishing what topic an article pertains to
  • automated editing of articles or topics without a human in the loop
Current uses

This model is part of ORES and is generally accessible via the ORES API. It is used for high-level analysis of Wikipedia, platform research, and other on-wiki tasks.

Example API call:
https://ores.wikimedia.org/v3/scores/huwiki/1234/articletopic
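
The call above can be reproduced from a script. The following is a minimal sketch, assuming the URL layout shown in the example (`/v3/scores/<wiki>/<revision ID>/<model>`). The helper names `score_url` and `fetch_score` are hypothetical, and the endpoint may not be reachable from every environment.

```python
import json
import urllib.request

# Base URL taken from the example call; everything else here is illustrative.
ORES_BASE = "https://ores.wikimedia.org/v3/scores"


def score_url(wiki: str, rev_id: int, model: str = "articletopic") -> str:
    """Build a scoring URL that follows the pattern of the example call."""
    return f"{ORES_BASE}/{wiki}/{rev_id}/{model}"


def fetch_score(wiki: str, rev_id: int, model: str = "articletopic") -> dict:
    """Request the score and decode the JSON body (performs a network call)."""
    with urllib.request.urlopen(score_url(wiki, rev_id, model), timeout=30) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Only builds and prints the URL; call fetch_score() to query the service.
    print(score_url("huwiki", 1234))
```

Keeping URL construction separate from the request makes the pattern easy to check without touching the network.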

Ethical considerations, caveats, and recommendations

  • This model was trained on data that is now several years old (from mid-2020). Underlying data drift may skew model outputs.
  • This model uses word2vec as a training feature. Word2vec, like other natural language embeddings, encodes the linguistic biases of underlying datasets — along the lines of gender, race, ethnicity, religion etc. Since Wikipedia has known biases in its text, this model may encode and at times reproduce those biases.
  • This model has highly variable performance across different topics — consult the test statistics below to get a sense of inter-topic performance.

Model


Performance


Test data confusion matrix:

Label n True positive False negative False positive True negative
Culture.Biography.Biography* 15526 14154 1372 779 43039
Culture.Biography.Women 3586 2498 1088 353 55405
Culture.Food and drink 1513 985 528 131 57700
Culture.Internet culture 3266 2754 512 198 55880
Culture.Linguistics 1643 1170 473 97 57604
Culture.Literature 5570 4162 1408 514 53260
Culture.Media.Books 1732 1384 348 105 57507
Culture.Media.Entertainment 2340 1064 1276 284 56720
Culture.Media.Films 2888 2403 485 145 56311
Culture.Media.Media* 14392 12706 1686 1219 43733
Culture.Media.Music 3309 2800 509 246 55789
Culture.Media.Radio 330 161 169 15 58999
Culture.Media.Software 2170 1752 418 274 56900
Culture.Media.Television 2469 1986 483 136 56739
Culture.Media.Video games 2087 1941 146 48 57209
Culture.Performing arts 1491 948 543 114 57739
Culture.Philosophy and religion 4003 2356 1647 417 54924
Culture.Sports 6230 5587 643 179 52935
Culture.Visual arts.Architecture 2030 1416 614 250 57064
Culture.Visual arts.Comics and Anime 1854 1526 328 95 57395
Culture.Visual arts.Fashion 806 485 321 54 58484
Culture.Visual arts.Visual arts* 5379 3710 1669 421 53544
Geography.Geographical 3887 2579 1308 537 54920
Geography.Regions.Africa.Africa* 4049 2762 1287 331 54964
Geography.Regions.Africa.Central Africa 880 570 310 72 58392
Geography.Regions.Africa.Eastern Africa 369 233 136 41 58934
Geography.Regions.Africa.Northern Africa 1503 940 563 126 57715
Geography.Regions.Africa.Southern Africa 695 515 180 47 58602
Geography.Regions.Africa.Western Africa 155 64 91 29 59160
Geography.Regions.Americas.Central America 1381 789 592 102 57861
Geography.Regions.Americas.North America 5628 3631 1997 930 52786
Geography.Regions.Americas.South America 1610 1097 513 126 57608
Geography.Regions.Asia.Asia* 10749 8632 2117 945 47650
Geography.Regions.Asia.Central Asia 1349 938 411 103 57892
Geography.Regions.Asia.East Asia 3424 2703 721 219 55701
Geography.Regions.Asia.North Asia 1888 1385 503 229 57227
Geography.Regions.Asia.South Asia 1826 1268 558 110 57408
Geography.Regions.Asia.Southeast Asia 1601 974 627 116 57627
Geography.Regions.Asia.West Asia 2495 1794 701 200 56649
Geography.Regions.Europe.Eastern Europe 4951 3665 1286 556 53837
Geography.Regions.Europe.Europe* 16846 13610 3236 2259 40239
Geography.Regions.Europe.Northern Europe 3617 2386 1231 378 55349
Geography.Regions.Europe.Southern Europe 4056 2998 1058 359 54929
Geography.Regions.Europe.Western Europe 5217 4004 1213 551 53576
Geography.Regions.Oceania 1853 1310 543 121 57370
History and Society.Business and economics 3192 1728 1464 324 55828
History and Society.Education 1612 735 877 165 57567
History and Society.History 6143 3880 2263 739 52462
History and Society.Military and warfare 5039 3575 1464 471 53834
History and Society.Politics and government 4453 2632 1821 436 54455
History and Society.Society 6223 2897 3326 552 52569
History and Society.Transportation 4094 3714 380 159 55091
STEM.Biology 4751 4126 625 137 54456
STEM.Chemistry 1620 1312 308 166 57558
STEM.Computing 2449 1986 463 283 56612
STEM.Earth and environment 1882 1335 547 131 57331
STEM.Engineering 2505 1786 719 215 56624
STEM.Libraries & Information 510 289 221 38 58796
STEM.Mathematics 990 785 205 63 58291
STEM.Medicine & Health 1939 1405 534 160 57245
STEM.Physics 1419 958 461 166 57759
STEM.STEM* 18523 16670 1853 917 39904
STEM.Space 2372 2227 145 42 56930
STEM.Technology 4874 3669 1205 597 53873

Test data sample rates:

Label Sample Population
Culture.Biography.Biography* 0.262 0.123
Culture.Biography.Women 0.06 0.015
Culture.Food and drink 0.025 0.002
Culture.Internet culture 0.055 0.003
Culture.Linguistics 0.028 0.007
Culture.Literature 0.094 0.015
Culture.Media.Books 0.029 0.004
Culture.Media.Entertainment 0.039 0.004
Culture.Media.Films 0.049 0.011
Culture.Media.Media* 0.243 0.058
Culture.Media.Music 0.056 0.024
Culture.Media.Radio 0.006 0.002
Culture.Media.Software 0.037 0.001
Culture.Media.Television 0.042 0.009
Culture.Media.Video games 0.035 0.003
Culture.Performing arts 0.025 0.003
Culture.Philosophy and religion 0.067 0.011
Culture.Sports 0.105 0.071
Culture.Visual arts.Architecture 0.034 0.011
Culture.Visual arts.Comics and Anime 0.031 0.002
Culture.Visual arts.Fashion 0.014 0.001
Culture.Visual arts.Visual arts* 0.091 0.018
Geography.Geographical 0.065 0.024
Geography.Regions.Africa.Africa* 0.068 0.008
Geography.Regions.Africa.Central Africa 0.015 0.001
Geography.Regions.Africa.Eastern Africa 0.006 0
Geography.Regions.Africa.Northern Africa 0.025 0.001
Geography.Regions.Africa.Southern Africa 0.012 0.001
Geography.Regions.Africa.Western Africa 0.003 0.001
Geography.Regions.Americas.Central America 0.023 0.003
Geography.Regions.Americas.North America 0.095 0.064
Geography.Regions.Americas.South America 0.027 0.006
Geography.Regions.Asia.Asia* 0.181 0.045
Geography.Regions.Asia.Central Asia 0.023 0.001
Geography.Regions.Asia.East Asia 0.058 0.011
Geography.Regions.Asia.North Asia 0.032 0.001
Geography.Regions.Asia.South Asia 0.031 0.015
Geography.Regions.Asia.Southeast Asia 0.027 0.006
Geography.Regions.Asia.West Asia 0.042 0.011
Geography.Regions.Europe.Eastern Europe 0.083 0.013
Geography.Regions.Europe.Europe* 0.284 0.076
Geography.Regions.Europe.Northern Europe 0.061 0.031
Geography.Regions.Europe.Southern Europe 0.068 0.013
Geography.Regions.Europe.Western Europe 0.088 0.019
Geography.Regions.Oceania 0.031 0.015
History and Society.Business and economics 0.054 0.01
History and Society.Education 0.027 0.007
History and Society.History 0.104 0.011
History and Society.Military and warfare 0.085 0.014
History and Society.Politics and government 0.075 0.028
History and Society.Society 0.105 0.013
History and Society.Transportation 0.069 0.015
STEM.Biology 0.08 0.034
STEM.Chemistry 0.027 0.002
STEM.Computing 0.041 0.003
STEM.Earth and environment 0.032 0.005
STEM.Engineering 0.042 0.005
STEM.Libraries & Information 0.009 0.001
STEM.Mathematics 0.017 0
STEM.Medicine & Health 0.033 0.006
STEM.Physics 0.024 0.001
STEM.STEM* 0.312 0.069
STEM.Space 0.04 0.006
STEM.Technology 0.082 0.005

Test data performance:

Label Match rate Filter rate Recall Precision f1 Accuracy ROC AUC PR AUC
Culture.Biography.Biography* 0.128 0.872 0.912 0.878 0.895 0.974 0.982 0.953
Culture.Biography.Women 0.016 0.984 0.697 0.619 0.656 0.989 0.978 0.687
Culture.Food and drink 0.004 0.996 0.651 0.415 0.507 0.997 0.976 0.477
Culture.Internet culture 0.006 0.994 0.843 0.456 0.592 0.996 0.985 0.678
Culture.Linguistics 0.007 0.993 0.712 0.758 0.734 0.996 0.973 0.719
Culture.Literature 0.021 0.979 0.747 0.552 0.635 0.987 0.977 0.722
Culture.Media.Books 0.005 0.995 0.799 0.639 0.71 0.997 0.984 0.798
Culture.Media.Entertainment 0.007 0.993 0.455 0.247 0.32 0.993 0.965 0.226
Culture.Media.Films 0.011 0.989 0.832 0.775 0.802 0.996 0.984 0.844
Culture.Media.Media* 0.077 0.923 0.883 0.669 0.761 0.968 0.98 0.866
Culture.Media.Music 0.025 0.975 0.846 0.825 0.836 0.992 0.985 0.88
Culture.Media.Radio 0.001 0.999 0.488 0.806 0.608 0.999 0.938 0.438
Culture.Media.Software 0.006 0.994 0.807 0.183 0.298 0.995 0.987 0.347
Culture.Media.Television 0.009 0.991 0.804 0.749 0.776 0.996 0.985 0.807
Culture.Media.Video games 0.003 0.997 0.93 0.744 0.827 0.999 0.99 0.857
Culture.Performing arts 0.004 0.996 0.636 0.483 0.549 0.997 0.976 0.575
Culture.Philosophy and religion 0.014 0.986 0.589 0.457 0.514 0.988 0.958 0.495
Culture.Sports 0.067 0.933 0.897 0.953 0.924 0.99 0.981 0.952
Culture.Visual arts.Architecture 0.012 0.988 0.698 0.631 0.662 0.992 0.979 0.683
Culture.Visual arts.Comics and Anime 0.003 0.997 0.823 0.523 0.64 0.998 0.987 0.701
Culture.Visual arts.Fashion 0.001 0.999 0.602 0.346 0.439 0.999 0.97 0.303
Culture.Visual arts.Visual arts* 0.02 0.98 0.69 0.622 0.654 0.987 0.969 0.667
Geography.Geographical 0.025 0.975 0.663 0.623 0.643 0.983 0.971 0.66
Geography.Regions.Africa.Africa* 0.011 0.989 0.682 0.472 0.558 0.992 0.971 0.534
Geography.Regions.Africa.Central Africa 0.002 0.998 0.648 0.249 0.36 0.999 0.981 0.266
Geography.Regions.Africa.Eastern Africa 0.001 0.999 0.631 0.292 0.4 0.999 0.96 0.195
Geography.Regions.Africa.Northern Africa 0.003 0.997 0.625 0.261 0.368 0.997 0.975 0.336
Geography.Regions.Africa.Southern Africa 0.002 0.998 0.741 0.521 0.612 0.999 0.976 0.51
Geography.Regions.Africa.Western Africa 0.001 0.999 0.413 0.366 0.388 0.999 0.881 0.24
Geography.Regions.Americas.Central America 0.004 0.996 0.571 0.518 0.543 0.997 0.969 0.442
Geography.Regions.Americas.North America 0.058 0.942 0.645 0.719 0.68 0.961 0.963 0.752
Geography.Regions.Americas.South America 0.006 0.994 0.681 0.665 0.673 0.996 0.975 0.663
Geography.Regions.Asia.Asia* 0.055 0.945 0.803 0.663 0.726 0.972 0.971 0.794
Geography.Regions.Asia.Central Asia 0.002 0.998 0.695 0.253 0.371 0.998 0.98 0.305
Geography.Regions.Asia.East Asia 0.013 0.987 0.789 0.699 0.742 0.994 0.98 0.799
Geography.Regions.Asia.North Asia 0.005 0.995 0.734 0.145 0.243 0.996 0.984 0.202
Geography.Regions.Asia.South Asia 0.012 0.988 0.694 0.848 0.764 0.993 0.973 0.796
Geography.Regions.Asia.Southeast Asia 0.006 0.994 0.608 0.647 0.627 0.996 0.972 0.578
Geography.Regions.Asia.West Asia 0.011 0.989 0.719 0.693 0.706 0.993 0.977 0.73
Geography.Regions.Europe.Eastern Europe 0.02 0.98 0.74 0.485 0.586 0.987 0.975 0.616
Geography.Regions.Europe.Europe* 0.111 0.889 0.808 0.556 0.659 0.936 0.958 0.762
Geography.Regions.Europe.Northern Europe 0.027 0.973 0.66 0.754 0.704 0.983 0.971 0.767
Geography.Regions.Europe.Southern Europe 0.016 0.984 0.739 0.601 0.663 0.99 0.974 0.714
Geography.Regions.Europe.Western Europe 0.025 0.975 0.767 0.596 0.671 0.986 0.979 0.755
Geography.Regions.Oceania 0.013 0.987 0.707 0.838 0.767 0.993 0.976 0.803
History and Society.Business and economics 0.011 0.989 0.541 0.489 0.514 0.99 0.958 0.472
History and Society.Education 0.006 0.994 0.456 0.542 0.495 0.993 0.961 0.432
History and Society.History 0.021 0.979 0.632 0.332 0.436 0.982 0.961 0.486
History and Society.Military and warfare 0.019 0.981 0.709 0.539 0.612 0.987 0.973 0.668
History and Society.Politics and government 0.024 0.976 0.591 0.683 0.634 0.981 0.954 0.662
History and Society.Society 0.016 0.984 0.466 0.364 0.409 0.983 0.93 0.37
History and Society.Transportation 0.017 0.983 0.907 0.828 0.866 0.996 0.986 0.905
STEM.Biology 0.032 0.968 0.868 0.923 0.895 0.993 0.982 0.931
STEM.Chemistry 0.004 0.996 0.81 0.305 0.443 0.997 0.984 0.544
STEM.Computing 0.007 0.993 0.811 0.305 0.444 0.995 0.986 0.489
STEM.Earth and environment 0.005 0.995 0.709 0.586 0.642 0.996 0.973 0.648
STEM.Engineering 0.007 0.993 0.713 0.498 0.586 0.995 0.979 0.624
STEM.Libraries & Information 0.001 0.999 0.567 0.353 0.435 0.999 0.963 0.22
STEM.Mathematics 0.001 0.999 0.793 0.234 0.362 0.999 0.981 0.389
STEM.Medicine & Health 0.007 0.993 0.725 0.626 0.672 0.995 0.979 0.648
STEM.Physics 0.003 0.997 0.675 0.167 0.267 0.997 0.981 0.168
STEM.STEM* 0.083 0.917 0.9 0.748 0.817 0.972 0.978 0.896
STEM.Space 0.006 0.994 0.939 0.885 0.911 0.999 0.994 0.963
STEM.Technology 0.015 0.985 0.753 0.262 0.388 0.988 0.977 0.51
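
The three tables above can be connected with a little arithmetic. The sketch below is an assumption about how the statistics relate, not a description of the ORES code: it re-weights the sampled confusion counts so that positives occur at the listed population rate. The Culture.Biography.Biography* row is read here as 14154 true positives, 1372 false negatives, 779 false positives and 43039 true negatives; that reading is chosen because the first two counts sum to n = 15526 and give the listed recall. The function name `population_scaled_stats` is hypothetical.

```python
def population_scaled_stats(tp, fn, fp, tn, population_rate):
    """Re-weight sample counts so that positives occur at `population_rate`."""
    total = tp + fn + fp + tn
    sample_rate = (tp + fn) / total  # share of positives in the test sample

    # Weights that shrink (or grow) each class to its population share.
    w_pos = population_rate / sample_rate
    w_neg = (1 - population_rate) / (1 - sample_rate)

    tp_s, fn_s = tp * w_pos, fn * w_pos
    fp_s = fp * w_neg

    return {
        "match_rate": (tp_s + fp_s) / total,  # share of articles predicted positive
        "recall": tp_s / (tp_s + fn_s),       # unaffected by the re-weighting
        "precision": tp_s / (tp_s + fp_s),
    }


# Counts and population rate for Culture.Biography.Biography* from the tables.
stats = population_scaled_stats(tp=14154, fn=1372, fp=779, tn=43039,
                                population_rate=0.123)
for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```

For this row the sketch gives a recall of 0.912, a precision of 0.878 and a match rate of 0.128, matching the performance table to three decimals. Because the published population rates are rounded, other rows may differ slightly in the last digit.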

Implementation

Model architecture
{
    "type": "GradientBoosting",
    "params": {
        "scale": false,
        "center": false,
        "labels": [
            "Culture.Biography.Biography*",
            "Culture.Biography.Women",
            "Culture.Food and drink",
            "Culture.Internet culture",
            "Culture.Linguistics",
            "Culture.Literature",
            "Culture.Media.Books",
            "Culture.Media.Entertainment",
            "Culture.Media.Films",
            "Culture.Media.Media*",
            "Culture.Media.Music",
            "Culture.Media.Radio",
            "Culture.Media.Software",
            "Culture.Media.Television",
            "Culture.Media.Video games",
            "Culture.Performing arts",
            "Culture.Philosophy and religion",
            "Culture.Sports",
            "Culture.Visual arts.Architecture",
            "Culture.Visual arts.Comics and Anime",
            "Culture.Visual arts.Fashion",
            "Culture.Visual arts.Visual arts*",
            "Geography.Geographical",
            "Geography.Regions.Africa.Africa*",
            "Geography.Regions.Africa.Central Africa",
            "Geography.Regions.Africa.Eastern Africa",
            "Geography.Regions.Africa.Northern Africa",
            "Geography.Regions.Africa.Southern Africa",
            "Geography.Regions.Africa.Western Africa",
            "Geography.Regions.Americas.Central America",
            "Geography.Regions.Americas.North America",
            "Geography.Regions.Americas.South America",
            "Geography.Regions.Asia.Asia*",
            "Geography.Regions.Asia.Central Asia",
            "Geography.Regions.Asia.East Asia",
            "Geography.Regions.Asia.North Asia",
            "Geography.Regions.Asia.South Asia",
            "Geography.Regions.Asia.Southeast Asia",
            "Geography.Regions.Asia.West Asia",
            "Geography.Regions.Europe.Eastern Europe",
            "Geography.Regions.Europe.Europe*",
            "Geography.Regions.Europe.Northern Europe",
            "Geography.Regions.Europe.Southern Europe",
            "Geography.Regions.Europe.Western Europe",
            "Geography.Regions.Oceania",
            "History and Society.Business and economics",
            "History and Society.Education",
            "History and Society.History",
            "History and Society.Military and warfare",
            "History and Society.Politics and government",
            "History and Society.Society",
            "History and Society.Transportation",
            "STEM.Biology",
            "STEM.Chemistry",
            "STEM.Computing",
            "STEM.Earth and environment",
            "STEM.Engineering",
            "STEM.Libraries & Information",
            "STEM.Mathematics",
            "STEM.Medicine & Health",
            "STEM.Physics",
            "STEM.STEM*",
            "STEM.Space",
            "STEM.Technology"
        ],
        "multilabel": true,
        "population_rates": null,
        "ccp_alpha": 0.0,
        "criterion": "friedman_mse",
        "init": null,
        "learning_rate": 0.1,
        "loss": "deviance",
        "max_depth": 5,
        "max_features": "log2",
        "max_leaf_nodes": null,
        "min_impurity_decrease": 0.0,
        "min_impurity_split": null,
        "min_samples_leaf": 1,
        "min_samples_split": 2,
        "min_weight_fraction_leaf": 0.0,
        "n_estimators": 150,
        "n_iter_no_change": null,
        "presort": "deprecated",
        "random_state": null,
        "subsample": 1.0,
        "tol": 0.0001,
        "validation_fraction": 0.1,
        "verbose": 0,
        "warm_start": false,
        "label_weights": {}
    }
}
Output schema
{
    "title": "Scikit learn-based classifier score with probability",
    "type": "object",
    "properties": {
        "prediction": {
            "description": "The most likely labels predicted by the estimator",
            "type": "array",
            "items": {
                "type": "string"
            }
        },
        "probability": {
            "description": "A mapping of probabilities onto each of the potential output labels",
            "type": "object",
            "properties": {
                "Culture.Biography.Biography*": {
                    "type": "number"
                },
                "Culture.Biography.Women": {
                    "type": "number"
                },
                "Culture.Food and drink": {
                    "type": "number"
                },
                "Culture.Internet culture": {
                    "type": "number"
                },
                "Culture.Linguistics": {
                    "type": "number"
                },
                "Culture.Literature": {
                    "type": "number"
                },
                "Culture.Media.Books": {
                    "type": "number"
                },
                "Culture.Media.Entertainment": {
                    "type": "number"
                },
                "Culture.Media.Films": {
                    "type": "number"
                },
                "Culture.Media.Media*": {
                    "type": "number"
                },
                "Culture.Media.Music": {
                    "type": "number"
                },
                "Culture.Media.Radio": {
                    "type": "number"
                },
                "Culture.Media.Software": {
                    "type": "number"
                },
                "Culture.Media.Television": {
                    "type": "number"
                },
                "Culture.Media.Video games": {
                    "type": "number"
                },
                "Culture.Performing arts": {
                    "type": "number"
                },
                "Culture.Philosophy and religion": {
                    "type": "number"
                },
                "Culture.Sports": {
                    "type": "number"
                },
                "Culture.Visual arts.Architecture": {
                    "type": "number"
                },
                "Culture.Visual arts.Comics and Anime": {
                    "type": "number"
                },
                "Culture.Visual arts.Fashion": {
                    "type": "number"
                },
                "Culture.Visual arts.Visual arts*": {
                    "type": "number"
                },
                "Geography.Geographical": {
                    "type": "number"
                },
                "Geography.Regions.Africa.Africa*": {
                    "type": "number"
                },
                "Geography.Regions.Africa.Central Africa": {
                    "type": "number"
                },
                "Geography.Regions.Africa.Eastern Africa": {
                    "type": "number"
                },
                "Geography.Regions.Africa.Northern Africa": {
                    "type": "number"
                },
                "Geography.Regions.Africa.Southern Africa": {
                    "type": "number"
                },
                "Geography.Regions.Africa.Western Africa": {
                    "type": "number"
                },
                "Geography.Regions.Americas.Central America": {
                    "type": "number"
                },
                "Geography.Regions.Americas.North America": {
                    "type": "number"
                },
                "Geography.Regions.Americas.South America": {
                    "type": "number"
                },
                "Geography.Regions.Asia.Asia*": {
                    "type": "number"
                },
                "Geography.Regions.Asia.Central Asia": {
                    "type": "number"
                },
                "Geography.Regions.Asia.East Asia": {
                    "type": "number"
                },
                "Geography.Regions.Asia.North Asia": {
                    "type": "number"
                },
                "Geography.Regions.Asia.South Asia": {
                    "type": "number"
                },
                "Geography.Regions.Asia.Southeast Asia": {
                    "type": "number"
                },
                "Geography.Regions.Asia.West Asia": {
                    "type": "number"
                },
                "Geography.Regions.Europe.Eastern Europe": {
                    "type": "number"
                },
                "Geography.Regions.Europe.Europe*": {
                    "type": "number"
                },
                "Geography.Regions.Europe.Northern Europe": {
                    "type": "number"
                },
                "Geography.Regions.Europe.Southern Europe": {
                    "type": "number"
                },
                "Geography.Regions.Europe.Western Europe": {
                    "type": "number"
                },
                "Geography.Regions.Oceania": {
                    "type": "number"
                },
                "History and Society.Business and economics": {
                    "type": "number"
                },
                "History and Society.Education": {
                    "type": "number"
                },
                "History and Society.History": {
                    "type": "number"
                },
                "History and Society.Military and warfare": {
                    "type": "number"
                },
                "History and Society.Politics and government": {
                    "type": "number"
                },
                "History and Society.Society": {
                    "type": "number"
                },
                "History and Society.Transportation": {
                    "type": "number"
                },
                "STEM.Biology": {
                    "type": "number"
                },
                "STEM.Chemistry": {
                    "type": "number"
                },
                "STEM.Computing": {
                    "type": "number"
                },
                "STEM.Earth and environment": {
                    "type": "number"
                },
                "STEM.Engineering": {
                    "type": "number"
                },
                "STEM.Libraries & Information": {
                    "type": "number"
                },
                "STEM.Mathematics": {
                    "type": "number"
                },
                "STEM.Medicine & Health": {
                    "type": "number"
                },
                "STEM.Physics": {
                    "type": "number"
                },
                "STEM.STEM*": {
                    "type": "number"
                },
                "STEM.Space": {
                    "type": "number"
                },
                "STEM.Technology": {
                    "type": "number"
                }
            }
        }
    }
}
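
A consumer of the API may want to confirm that a score has the shape the schema describes before using it. The following is a minimal hand-rolled check, not full JSON Schema validation. The function name `check_score` is hypothetical, and treating both fields as required is an assumption, since the schema does not list required properties.

```python
def check_score(score: dict) -> list:
    """Return a list of problems; an empty list means the score looks well-formed."""
    problems = []

    # "prediction" is described as an array of strings.
    prediction = score.get("prediction")
    if not isinstance(prediction, list) or not all(
        isinstance(label, str) for label in prediction
    ):
        problems.append("prediction must be an array of strings")

    # "probability" is described as an object mapping labels to numbers.
    probability = score.get("probability")
    if not isinstance(probability, dict):
        problems.append("probability must be an object")
    else:
        for label, value in probability.items():
            # bool is a subclass of int in Python, so exclude it explicitly.
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                problems.append(f"probability[{label!r}] must be a number")

    return problems


print(check_score({"prediction": ["STEM.Space"], "probability": {"STEM.Space": 0.9}}))
```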
Example input and output
Input:
https://ores.wikimedia.org/v3/scores/huwiki/1234/articletopic

Output:

{
    "huwiki": {
        "models": {
            "articletopic": {
                "version": "1.4.0"
            }
        },
        "scores": {
            "1234": {
                "articletopic": {
                    "score": {
                        "prediction": [
                            "Geography.Regions.Asia.East Asia",
                            "Geography.Regions.Europe.Europe*",
                            "STEM.STEM*"
                        ],
                        "probability": {
                            "Culture.Biography.Biography*": 0.3423533920926803,
                            "Culture.Biography.Women": 0.14025995403786506,
                            "Culture.Food and drink": 0.06340726339907797,
                            "Culture.Internet culture": 0.011131262002287472,
                            "Culture.Linguistics": 0.04803618594888942,
                            "Culture.Literature": 0.022006683732088216,
                            "Culture.Media.Books": 0.002187089040235113,
                            "Culture.Media.Entertainment": 0.15309575643241796,
                            "Culture.Media.Films": 0.034026455544947425,
                            "Culture.Media.Media*": 0.20341149710773634,
                            "Culture.Media.Music": 0.0497533083246081,
                            "Culture.Media.Radio": 0.32875096388132496,
                            "Culture.Media.Software": 0.013034094135796932,
                            "Culture.Media.Television": 0.0031210534375715336,
                            "Culture.Media.Video games": 0.00030343872173972046,
                            "Culture.Performing arts": 0.005608603834940765,
                            "Culture.Philosophy and religion": 0.08066624396157131,
                            "Culture.Sports": 0.06433894067592255,
                            "Culture.Visual arts.Architecture": 0.03750949389175476,
                            "Culture.Visual arts.Comics and Anime": 0.0064514547243535985,
                            "Culture.Visual arts.Fashion": 0.05307244012171412,
                            "Culture.Visual arts.Visual arts*": 0.11900694660877364,
                            "Geography.Geographical": 0.032374625359398314,
                            "Geography.Regions.Africa.Africa*": 0.15894948721508503,
                            "Geography.Regions.Africa.Central Africa": 0.016011778190990485,
                            "Geography.Regions.Africa.Eastern Africa": 0.004741053756007876,
                            "Geography.Regions.Africa.Northern Africa": 0.0244144581524136,
                            "Geography.Regions.Africa.Southern Africa": 0.0019768365502830605,
                            "Geography.Regions.Africa.Western Africa": 3.962206086248686e-05,
                            "Geography.Regions.Americas.Central America": 0.00522749251093348,
                            "Geography.Regions.Americas.North America": 0.014687908835933414,
                            "Geography.Regions.Americas.South America": 0.002403901278264949,
                            "Geography.Regions.Asia.Asia*": 0.20487191607636618,
                            "Geography.Regions.Asia.Central Asia": 0.09112897244334056,
                            "Geography.Regions.Asia.East Asia": 0.9998386100569907,
                            "Geography.Regions.Asia.North Asia": 0.0027527477451360595,
                            "Geography.Regions.Asia.South Asia": 0.020285604017724872,
                            "Geography.Regions.Asia.Southeast Asia": 0.03250982686466025,
                            "Geography.Regions.Asia.West Asia": 0.05840215484971228,
                            "Geography.Regions.Europe.Eastern Europe": 0.03298542948603228,
                            "Geography.Regions.Europe.Europe*": 0.5315974960652942,
                            "Geography.Regions.Europe.Northern Europe": 0.007914585235952828,
                            "Geography.Regions.Europe.Southern Europe": 0.09252560636319572,
                            "Geography.Regions.Europe.Western Europe": 0.021413181244936737,
                            "Geography.Regions.Oceania": 0.026964542276338526,
                            "History and Society.Business and economics": 0.11345817838571329,
                            "History and Society.Education": 0.018546943483473882,
                            "History and Society.History": 0.16482995511902204,
                            "History and Society.Military and warfare": 0.057955265993041155,
                            "History and Society.Politics and government": 0.0422117254411273,
                            "History and Society.Society": 0.20169635418355597,
                            "History and Society.Transportation": 0.015610557061165244,
                            "STEM.Biology": 0.037673572379875016,
                            "STEM.Chemistry": 0.010794729056374292,
                            "STEM.Computing": 0.04398691739762987,
                            "STEM.Earth and environment": 0.0039011304821686735,
                            "STEM.Engineering": 0.16717936084988483,
                            "STEM.Libraries & Information": 0.022633922980662434,
                            "STEM.Mathematics": 0.08494932357374003,
                            "STEM.Medicine & Health": 0.010577875534037367,
                            "STEM.Physics": 0.0009472592704866106,
                            "STEM.STEM*": 0.7982423641999686,
                            "STEM.Space": 0.003438360400580638,
                            "STEM.Technology": 0.08261264399841732
                        }
                    }
                }
            }
        }
    }
}
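
For filtering tasks, the `probability` mapping is usually more useful than `prediction` alone, because the caller can choose a cutoff. The sketch below works on an abbreviated copy of the example output (only five of the probabilities are kept). In this example the labels in `prediction` are exactly those with probability above 0.5; whether the service applies that cutoff in general is an assumption. The function name `topics_above` is hypothetical.

```python
import json

# Abbreviated copy of the example response: most probabilities are omitted.
response = json.loads("""
{
  "huwiki": {
    "scores": {
      "1234": {
        "articletopic": {
          "score": {
            "prediction": [
              "Geography.Regions.Asia.East Asia",
              "Geography.Regions.Europe.Europe*",
              "STEM.STEM*"
            ],
            "probability": {
              "Culture.Biography.Biography*": 0.3423533920926803,
              "Culture.Media.Radio": 0.32875096388132496,
              "Geography.Regions.Asia.East Asia": 0.9998386100569907,
              "Geography.Regions.Europe.Europe*": 0.5315974960652942,
              "STEM.STEM*": 0.7982423641999686
            }
          }
        }
      }
    }
  }
}
""")


def topics_above(response, wiki, rev_id, threshold):
    """Return (label, probability) pairs at or above `threshold`, most likely first."""
    score = response[wiki]["scores"][str(rev_id)]["articletopic"]["score"]
    hits = [(label, p) for label, p in score["probability"].items() if p >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


likely = topics_above(response, "huwiki", 1234, threshold=0.5)
print([label for label, _ in likely])
```

A higher threshold trades recall for precision, so the right cutoff depends on the topic and on how costly a wrong match is for the analysis at hand.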

Data

Data pipeline
The training data was fetched from a set of revision IDs. Automated processes then extracted various pieces of information about each revision, and the revision text was fed into word2vec to produce an article embedding. Finally, labels were derived from the mid-level WikiProject categories that each article is associated with.
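
The card does not say how per-word vectors are combined into a single article embedding. One common approach is to average the vectors of the words that appear in the text; the sketch below illustrates that approach as an assumption, using invented three-dimensional toy vectors rather than real word2vec output. The function name `article_embedding` is hypothetical.

```python
def article_embedding(tokens, word_vectors, dim):
    """Average the vectors of known tokens; return a zero vector if none are known."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return [0.0] * dim
    # zip(*known) groups the i-th component of every vector together.
    return [sum(component) / len(known) for component in zip(*known)]


# Toy vectors invented for illustration only.
toy_vectors = {
    "budapest": [1.0, 0.0, 2.0],
    "duna": [3.0, 2.0, 0.0],
}

# "ismeretlen" has no vector and is skipped.
embedding = article_embedding(["budapest", "duna", "ismeretlen"], toy_vectors, dim=3)
print(embedding)  # [2.0, 1.0, 1.0]
```
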
Training data
Training data was automatically and randomly separated from test data during training using the drafttopic git repository (which trains both drafttopic and articletopic models).
Test data
Test data was automatically and randomly split off from train data using the drafttopic git repository (which trains both drafttopic and articletopic models). The model then makes a prediction on that data, which is compared to the underlying ground truth to calculate performance statistics.
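
A random train/test split of the kind described above can be sketched as follows. This is an illustration only and not the code in the drafttopic repository: the test fraction, the seed, and the function name `split_train_test` are hypothetical, and the card does not state the actual split ratio.

```python
import random


def split_train_test(items, test_fraction=0.2, seed=0):
    """Shuffle a copy of `items` and split it into (train, test) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


rev_ids = list(range(1000, 1010))  # placeholder revision IDs
train, test = split_train_test(rev_ids, test_fraction=0.2, seed=42)
print(len(train), len(test))  # 8 2
```

Holding the test items out before training is what allows the predictions on them to be compared against ground truth without leakage.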

Licenses


Citation


Cite this model card as:

@misc{
  Triedman_Bazira_2023_Hungarian_Wikipedia_article_topic,
  title={ Hungarian Wikipedia article topic model card },
  author={ Triedman, Harold and Bazira, Kevin },
  year={ 2023 },
  url={ https://meta.wikimedia.org/wiki/Machine_learning_models/Production/Hungarian_Wikipedia_article_topic }
}