Machine learning models/Production/Czech Wikipedia article topic

From Meta, a Wikimedia project coordination wiki


Model card
This page is an on-wiki machine learning model card.
A model card is a document about a machine learning model that seeks to answer basic questions about the model.
Model Information Hub
Model creator(s): Aaron Halfaker (User:EpochFail) and Amir Sarabadani
Model owner(s): WMF Machine Learning Team (ml@wikimediafoundation.org)
Model interface: ORES homepage
Code: drafttopic GitHub, ORES training data, and ORES model binaries
Uses PII: No
In production? Yes
Which projects? Czech Wikipedia
This model uses article text to predict the likelihood that the article belongs to a set of topics.


Motivation

How can we predict what general topic an article is in? Answering this question is useful for various analyses of Wikipedia dynamics. However, it is difficult to group a very diverse range of Wikipedia articles into coherent, consistent topics manually.

This model, part of the ORES suite of models, analyzes an article to predict its likelihood of belonging to a set of topics. Similar models (though not necessarily with the same performance level or topics) are deployed across about a dozen other projects. There is also a language-agnostic article topic model.

This model may be useful for high-level analyses of Wikipedia dynamics (pageviews, article quality, edit trends) and filtering articles.

Users and uses

Use this model for
  • high-level analyses of Wikipedia dynamics such as pageview, article quality, or edit trends (e.g. how do pageview dynamics differ between the physics and biology categories?)
  • filtering to relevant articles (e.g. keeping only articles in the music category)
Don't use this model for
  • definitively establishing what topic an article pertains to
  • automated editing of articles or topics without a human in the loop
Current uses

This model is part of ORES and is accessible through the ORES API. It is used for high-level analysis of Wikipedia, platform research, and other on-wiki tasks.

Example API call:
https://ores.wikimedia.org/v3/scores/cswiki/1234/articletopic
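
The call above can be sketched in Python as follows. This is a minimal illustration, assuming the URL pattern generalises as `/v3/scores/<wiki>/<revision ID>/<model>` (inferred from the single example, not confirmed here). The helper names `score_url` and `fetch_score` are hypothetical.

```python
import json
import urllib.request

# Host and path prefix copied from the example call above.
ORES_BASE = "https://ores.wikimedia.org/v3/scores"


def score_url(wiki: str, rev_id: int, model: str = "articletopic") -> str:
    """Build a score URL that follows the pattern of the example call."""
    return f"{ORES_BASE}/{wiki}/{rev_id}/{model}"


def fetch_score(wiki: str, rev_id: int, model: str = "articletopic") -> dict:
    """Request the score and decode the JSON body (requires network access)."""
    with urllib.request.urlopen(score_url(wiki, rev_id, model), timeout=30) as resp:
        return json.load(resp)


# Reproduces the example URL without touching the network.
print(score_url("cswiki", 1234))
```

Calling `fetch_score("cswiki", 1234)` would then return the decoded response, provided the service is reachable.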

Ethical considerations, caveats, and recommendations

  • This model was trained on data that is now several years old (from mid-2020). Underlying data drift may skew model outputs.
  • This model uses word2vec embeddings as input features. Word2vec, like other natural language embeddings, encodes the linguistic biases of its underlying datasets, for example along the lines of gender, race, ethnicity, and religion. Since Wikipedia has known biases in its text, this model may encode and at times reproduce those biases.
  • This model has highly variable performance across different topics — consult the test statistics below to get a sense of inter-topic performance.

Model

Performance

Test data confusion matrix (n is the number of test articles that carry the label, so n = TP + FN; TP, FN, FP and TN are true positives, false negatives, false positives and true negatives):

Label n TP FN FP TN
Culture.Biography.Biography* 16411 15081 1330 830 42956
Culture.Biography.Women 4309 3613 696 285 55603
Culture.Food and drink 1598 1140 458 116 58483
Culture.Internet culture 3175 2600 575 242 56780
Culture.Linguistics 1610 1149 461 86 58501
Culture.Literature 4968 3573 1395 516 54713
Culture.Media.Books 1697 1332 365 136 58364
Culture.Media.Entertainment 2268 1029 1239 307 57622
Culture.Media.Films 2531 2078 453 115 57551
Culture.Media.Media* 14349 12485 1864 1361 44487
Culture.Media.Music 3307 2782 525 261 56629
Culture.Media.Radio 425 198 227 41 59731
Culture.Media.Software 2392 1841 551 352 57453
Culture.Media.Television 2548 2167 381 121 57528
Culture.Media.Video games 1896 1727 169 49 58252
Culture.Performing arts 1520 960 560 134 58543
Culture.Philosophy and religion 4401 2479 1922 454 55342
Culture.Sports 5891 5267 624 191 54115
Culture.Visual arts.Architecture 2309 1650 659 240 57648
Culture.Visual arts.Comics and Anime 1679 1436 243 43 58475
Culture.Visual arts.Fashion 1059 702 357 83 59055
Culture.Visual arts.Visual arts* 5915 4343 1572 432 53850
Geography.Geographical 4652 3358 1294 725 54820
Geography.Regions.Africa.Africa* 4090 2581 1509 309 55798
Geography.Regions.Africa.Central Africa 732 332 400 74 59391
Geography.Regions.Africa.Eastern Africa 474 259 215 37 59686
Geography.Regions.Africa.Northern Africa 1558 938 620 122 58517
Geography.Regions.Africa.Southern Africa 675 312 363 59 59463
Geography.Regions.Africa.Western Africa 187 82 105 32 59978
Geography.Regions.Americas.Central America 1354 637 717 77 58766
Geography.Regions.Americas.North America 6020 3695 2325 981 53196
Geography.Regions.Americas.South America 1478 997 481 104 58615
Geography.Regions.Asia.Asia* 10847 8568 2279 888 48462
Geography.Regions.Asia.Central Asia 1311 813 498 117 58769
Geography.Regions.Asia.East Asia 2966 2257 709 188 57043
Geography.Regions.Asia.North Asia 1944 1268 676 292 57961
Geography.Regions.Asia.South Asia 1814 1327 487 96 58287
Geography.Regions.Asia.Southeast Asia 1603 925 678 88 58506
Geography.Regions.Asia.West Asia 2874 2191 683 175 57148
Geography.Regions.Europe.Eastern Europe 5252 3745 1507 665 54280
Geography.Regions.Europe.Europe* 16883 13054 3829 2544 40770
Geography.Regions.Europe.Northern Europe 4019 2435 1584 447 55731
Geography.Regions.Europe.Southern Europe 3488 2254 1234 338 56371
Geography.Regions.Europe.Western Europe 5209 3621 1588 641 54347
Geography.Regions.Oceania 1774 1198 576 87 58336
History and Society.Business and economics 3567 2068 1499 361 56269
History and Society.Education 1821 872 949 154 58222
History and Society.History 5592 3331 2261 695 53910
History and Society.Military and warfare 5797 4112 1685 591 53809
History and Society.Politics and government 4709 2502 2207 519 54969
History and Society.Society 6861 3083 3778 693 52643
History and Society.Transportation 4282 3865 417 165 55750
STEM.Biology 3828 3077 751 187 56182
STEM.Chemistry 1607 1290 317 147 58443
STEM.Computing 2851 2322 529 429 56917
STEM.Earth and environment 2224 1458 766 159 57814
STEM.Engineering 2978 2273 705 228 56991
STEM.Libraries & Information 717 467 250 57 59423
STEM.Mathematics 1208 937 271 77 58912
STEM.Medicine & Health 2098 1387 711 215 57884
STEM.Physics 1466 960 506 158 58573
STEM.STEM* 19061 16828 2233 1061 40075
STEM.Space 2057 1868 189 48 58092
STEM.Technology 5209 3765 1444 711 54277

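Test data sample rates (Sample is the share of test articles that carry the label; Population is the label's estimated prevalence in the wider article population):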

Label Sample Population
Culture.Biography.Biography* 0.273 0.123
Culture.Biography.Women 0.072 0.015
Culture.Food and drink 0.027 0.002
Culture.Internet culture 0.053 0.003
Culture.Linguistics 0.027 0.007
Culture.Literature 0.083 0.015
Culture.Media.Books 0.028 0.004
Culture.Media.Entertainment 0.038 0.004
Culture.Media.Films 0.042 0.011
Culture.Media.Media* 0.238 0.058
Culture.Media.Music 0.055 0.024
Culture.Media.Radio 0.007 0.002
Culture.Media.Software 0.04 0.001
Culture.Media.Television 0.042 0.009
Culture.Media.Video games 0.031 0.003
Culture.Performing arts 0.025 0.003
Culture.Philosophy and religion 0.073 0.011
Culture.Sports 0.098 0.071
Culture.Visual arts.Architecture 0.038 0.011
Culture.Visual arts.Comics and Anime 0.028 0.002
Culture.Visual arts.Fashion 0.018 0.001
Culture.Visual arts.Visual arts* 0.098 0.018
Geography.Geographical 0.077 0.024
Geography.Regions.Africa.Africa* 0.068 0.008
Geography.Regions.Africa.Central Africa 0.012 0.001
Geography.Regions.Africa.Eastern Africa 0.008 0
Geography.Regions.Africa.Northern Africa 0.026 0.001
Geography.Regions.Africa.Southern Africa 0.011 0.001
Geography.Regions.Africa.Western Africa 0.003 0.001
Geography.Regions.Americas.Central America 0.022 0.003
Geography.Regions.Americas.North America 0.1 0.064
Geography.Regions.Americas.South America 0.025 0.006
Geography.Regions.Asia.Asia* 0.18 0.045
Geography.Regions.Asia.Central Asia 0.022 0.001
Geography.Regions.Asia.East Asia 0.049 0.011
Geography.Regions.Asia.North Asia 0.032 0.001
Geography.Regions.Asia.South Asia 0.03 0.015
Geography.Regions.Asia.Southeast Asia 0.027 0.006
Geography.Regions.Asia.West Asia 0.048 0.011
Geography.Regions.Europe.Eastern Europe 0.087 0.013
Geography.Regions.Europe.Europe* 0.28 0.076
Geography.Regions.Europe.Northern Europe 0.067 0.031
Geography.Regions.Europe.Southern Europe 0.058 0.013
Geography.Regions.Europe.Western Europe 0.087 0.019
Geography.Regions.Oceania 0.029 0.015
History and Society.Business and economics 0.059 0.01
History and Society.Education 0.03 0.007
History and Society.History 0.093 0.011
History and Society.Military and warfare 0.096 0.014
History and Society.Politics and government 0.078 0.028
History and Society.Society 0.114 0.013
History and Society.Transportation 0.071 0.015
STEM.Biology 0.064 0.034
STEM.Chemistry 0.027 0.002
STEM.Computing 0.047 0.003
STEM.Earth and environment 0.037 0.005
STEM.Engineering 0.049 0.005
STEM.Libraries & Information 0.012 0.001
STEM.Mathematics 0.02 0
STEM.Medicine & Health 0.035 0.006
STEM.Physics 0.024 0.001
STEM.STEM* 0.317 0.069
STEM.Space 0.034 0.006
STEM.Technology 0.087 0.005

Test data performance:

Label Match rate Filter rate Recall Precision f1 Accuracy ROC AUC PR AUC
Culture.Biography.Biography* 0.13 0.87 0.919 0.872 0.895 0.973 0.982 0.953
Culture.Biography.Women 0.017 0.983 0.838 0.708 0.768 0.993 0.983 0.781
Culture.Food and drink 0.004 0.996 0.713 0.471 0.567 0.997 0.979 0.603
Culture.Internet culture 0.007 0.993 0.819 0.404 0.541 0.995 0.984 0.726
Culture.Linguistics 0.007 0.993 0.714 0.782 0.746 0.996 0.976 0.74
Culture.Literature 0.02 0.98 0.719 0.548 0.622 0.986 0.974 0.667
Culture.Media.Books 0.005 0.995 0.785 0.577 0.665 0.997 0.982 0.689
Culture.Media.Entertainment 0.007 0.993 0.454 0.235 0.31 0.993 0.965 0.244
Culture.Media.Films 0.011 0.989 0.821 0.814 0.817 0.996 0.983 0.825
Culture.Media.Media* 0.079 0.921 0.87 0.645 0.741 0.964 0.978 0.854
Culture.Media.Music 0.025 0.975 0.841 0.818 0.829 0.992 0.984 0.849
Culture.Media.Radio 0.002 0.998 0.466 0.595 0.522 0.998 0.947 0.334
Culture.Media.Software 0.007 0.993 0.77 0.144 0.242 0.994 0.985 0.337
Culture.Media.Television 0.01 0.99 0.85 0.783 0.815 0.997 0.985 0.873
Culture.Media.Video games 0.003 0.997 0.911 0.739 0.816 0.999 0.99 0.904
Culture.Performing arts 0.004 0.996 0.632 0.445 0.522 0.997 0.976 0.516
Culture.Philosophy and religion 0.014 0.986 0.563 0.427 0.486 0.987 0.952 0.471
Culture.Sports 0.067 0.933 0.894 0.951 0.922 0.989 0.982 0.955
Culture.Visual arts.Architecture 0.012 0.988 0.715 0.648 0.68 0.993 0.98 0.68
Culture.Visual arts.Comics and Anime 0.003 0.997 0.855 0.719 0.781 0.999 0.988 0.793
Culture.Visual arts.Fashion 0.002 0.998 0.663 0.277 0.39 0.998 0.977 0.4
Culture.Visual arts.Visual arts* 0.021 0.979 0.734 0.632 0.679 0.987 0.973 0.74
Geography.Geographical 0.03 0.97 0.722 0.572 0.638 0.981 0.975 0.651
Geography.Regions.Africa.Africa* 0.01 0.99 0.631 0.474 0.541 0.992 0.968 0.537
Geography.Regions.Africa.Central Africa 0.002 0.998 0.454 0.187 0.265 0.998 0.973 0.15
Geography.Regions.Africa.Eastern Africa 0.001 0.999 0.546 0.286 0.376 0.999 0.961 0.219
Geography.Regions.Africa.Northern Africa 0.003 0.997 0.602 0.262 0.365 0.997 0.974 0.321
Geography.Regions.Africa.Southern Africa 0.002 0.998 0.462 0.354 0.401 0.998 0.948 0.215
Geography.Regions.Africa.Western Africa 0.001 0.999 0.439 0.36 0.395 0.999 0.909 0.194
Geography.Regions.Americas.Central America 0.003 0.997 0.47 0.543 0.504 0.997 0.963 0.473
Geography.Regions.Americas.North America 0.056 0.944 0.614 0.7 0.654 0.958 0.957 0.712
Geography.Regions.Americas.South America 0.006 0.994 0.675 0.707 0.691 0.996 0.973 0.693
Geography.Regions.Asia.Asia* 0.053 0.947 0.79 0.677 0.729 0.973 0.97 0.809
Geography.Regions.Asia.Central Asia 0.003 0.997 0.62 0.213 0.317 0.998 0.973 0.232
Geography.Regions.Asia.East Asia 0.012 0.988 0.761 0.728 0.744 0.994 0.98 0.791
Geography.Regions.Asia.North Asia 0.006 0.994 0.652 0.107 0.184 0.995 0.983 0.214
Geography.Regions.Asia.South Asia 0.013 0.987 0.732 0.873 0.796 0.994 0.97 0.816
Geography.Regions.Asia.Southeast Asia 0.005 0.995 0.577 0.699 0.632 0.996 0.968 0.587
Geography.Regions.Asia.West Asia 0.011 0.989 0.762 0.734 0.748 0.994 0.979 0.8
Geography.Regions.Europe.Eastern Europe 0.021 0.979 0.713 0.434 0.539 0.984 0.971 0.573
Geography.Regions.Europe.Europe* 0.113 0.887 0.773 0.52 0.622 0.928 0.946 0.687
Geography.Regions.Europe.Northern Europe 0.026 0.974 0.606 0.706 0.652 0.98 0.963 0.711
Geography.Regions.Europe.Southern Europe 0.014 0.986 0.646 0.589 0.616 0.99 0.97 0.643
Geography.Regions.Europe.Western Europe 0.025 0.975 0.695 0.538 0.607 0.983 0.97 0.63
Geography.Regions.Oceania 0.012 0.988 0.675 0.874 0.762 0.994 0.975 0.806
History and Society.Business and economics 0.012 0.988 0.58 0.481 0.526 0.989 0.962 0.507
History and Society.Education 0.006 0.994 0.479 0.574 0.522 0.994 0.963 0.494
History and Society.History 0.019 0.981 0.596 0.339 0.432 0.983 0.956 0.446
History and Society.Military and warfare 0.021 0.979 0.709 0.482 0.574 0.985 0.971 0.654
History and Society.Politics and government 0.024 0.976 0.531 0.622 0.573 0.978 0.95 0.596
History and Society.Society 0.018 0.982 0.449 0.306 0.364 0.98 0.917 0.325
History and Society.Transportation 0.017 0.983 0.903 0.824 0.862 0.996 0.987 0.912
STEM.Biology 0.03 0.97 0.804 0.894 0.846 0.99 0.979 0.898
STEM.Chemistry 0.004 0.996 0.803 0.333 0.471 0.997 0.987 0.418
STEM.Computing 0.01 0.99 0.814 0.227 0.355 0.992 0.986 0.367
STEM.Earth and environment 0.006 0.994 0.656 0.521 0.581 0.996 0.976 0.596
STEM.Engineering 0.008 0.992 0.763 0.502 0.606 0.995 0.98 0.71
STEM.Libraries & Information 0.001 0.999 0.651 0.297 0.408 0.999 0.968 0.385
STEM.Mathematics 0.002 0.998 0.776 0.199 0.316 0.999 0.983 0.494
STEM.Medicine & Health 0.008 0.992 0.661 0.535 0.591 0.994 0.974 0.599
STEM.Physics 0.003 0.997 0.655 0.171 0.272 0.997 0.981 0.24
STEM.STEM* 0.085 0.915 0.883 0.718 0.792 0.968 0.975 0.883
STEM.Space 0.006 0.994 0.908 0.869 0.888 0.999 0.993 0.951
STEM.Technology 0.017 0.983 0.723 0.224 0.342 0.986 0.974 0.479
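
The precision, match rate and accuracy figures above do not follow from the raw test counts alone, because topics are heavily over-represented in the test sample. They do appear consistent with reweighting the confusion matrix by the population rates. The sketch below illustrates that reading. It assumes the third count column of the confusion matrix holds the positives the model missed and the fourth holds the negatives it wrongly flagged (so that n equals the sum of the second and third columns). The function name is hypothetical, and this is not taken from the ORES code.

```python
def population_scaled_metrics(tp, fn, fp, tn, population_rate):
    """Reweight test counts so positives make up `population_rate` of the total,
    then compute the usual classification metrics on the reweighted counts."""
    total = tp + fn + fp + tn
    # Scale positives down (or up) to the population rate, negatives to the rest.
    pos_weight = population_rate / ((tp + fn) / total)
    neg_weight = (1 - population_rate) / ((fp + tn) / total)
    tp, fn = tp * pos_weight, fn * pos_weight
    fp, tn = fp * neg_weight, tn * neg_weight

    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    match_rate = (tp + fp) / total  # share of articles the model flags
    return {
        "match_rate": match_rate,
        "filter_rate": 1 - match_rate,
        "recall": recall,
        "precision": precision,
        "f1": 2 * precision * recall / (precision + recall),
        "accuracy": (tp + tn) / total,
    }


# Culture.Biography.Biography*: counts from the confusion matrix,
# population rate from the sample-rates table.
biography = population_scaled_metrics(
    tp=15081, fn=1330, fp=830, tn=42956, population_rate=0.123
)
for name, value in biography.items():
    print(f"{name}: {value:.3f}")
```

For this row the sketch gives values that round to the table's 0.13, 0.87, 0.919, 0.872, 0.895 and 0.973. That agreement suggests, but does not prove, that the published statistics are computed this way. Recall is unaffected by the reweighting, since it depends only on the positive articles.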

Implementation

Model architecture
{
    "type": "GradientBoosting",
    "params": {
        "init": null,
        "population_rates": null,
        "max_depth": 5,
        "verbose": 0,
        "ccp_alpha": 0.0,
        "validation_fraction": 0.1,
        "warm_start": false,
        "loss": "deviance",
        "min_weight_fraction_leaf": 0.0,
        "max_leaf_nodes": null,
        "subsample": 1.0,
        "tol": 0.0001,
        "presort": "deprecated",
        "min_samples_split": 2,
        "random_state": null,
        "min_samples_leaf": 1,
        "center": false,
        "multilabel": true,
        "max_features": "log2",
        "learning_rate": 0.1,
        "n_estimators": 150,
        "min_impurity_split": null,
        "min_impurity_decrease": 0.0,
        "label_weights": {},
        "criterion": "friedman_mse",
        "labels": [
            "Culture.Biography.Biography*",
            "Culture.Biography.Women",
            "Culture.Food and drink",
            "Culture.Internet culture",
            "Culture.Linguistics",
            "Culture.Literature",
            "Culture.Media.Books",
            "Culture.Media.Entertainment",
            "Culture.Media.Films",
            "Culture.Media.Media*",
            "Culture.Media.Music",
            "Culture.Media.Radio",
            "Culture.Media.Software",
            "Culture.Media.Television",
            "Culture.Media.Video games",
            "Culture.Performing arts",
            "Culture.Philosophy and religion",
            "Culture.Sports",
            "Culture.Visual arts.Architecture",
            "Culture.Visual arts.Comics and Anime",
            "Culture.Visual arts.Fashion",
            "Culture.Visual arts.Visual arts*",
            "Geography.Geographical",
            "Geography.Regions.Africa.Africa*",
            "Geography.Regions.Africa.Central Africa",
            "Geography.Regions.Africa.Eastern Africa",
            "Geography.Regions.Africa.Northern Africa",
            "Geography.Regions.Africa.Southern Africa",
            "Geography.Regions.Africa.Western Africa",
            "Geography.Regions.Americas.Central America",
            "Geography.Regions.Americas.North America",
            "Geography.Regions.Americas.South America",
            "Geography.Regions.Asia.Asia*",
            "Geography.Regions.Asia.Central Asia",
            "Geography.Regions.Asia.East Asia",
            "Geography.Regions.Asia.North Asia",
            "Geography.Regions.Asia.South Asia",
            "Geography.Regions.Asia.Southeast Asia",
            "Geography.Regions.Asia.West Asia",
            "Geography.Regions.Europe.Eastern Europe",
            "Geography.Regions.Europe.Europe*",
            "Geography.Regions.Europe.Northern Europe",
            "Geography.Regions.Europe.Southern Europe",
            "Geography.Regions.Europe.Western Europe",
            "Geography.Regions.Oceania",
            "History and Society.Business and economics",
            "History and Society.Education",
            "History and Society.History",
            "History and Society.Military and warfare",
            "History and Society.Politics and government",
            "History and Society.Society",
            "History and Society.Transportation",
            "STEM.Biology",
            "STEM.Chemistry",
            "STEM.Computing",
            "STEM.Earth and environment",
            "STEM.Engineering",
            "STEM.Libraries & Information",
            "STEM.Mathematics",
            "STEM.Medicine & Health",
            "STEM.Physics",
            "STEM.STEM*",
            "STEM.Space",
            "STEM.Technology"
        ],
        "n_iter_no_change": null,
        "scale": false
    }
}
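
To read the configuration above, it can help to separate the settings that shape the boosted trees from those that describe the labelling task. The sketch below does this on a trimmed copy of the configuration. The set `WRAPPER_KEYS` is an assumption about which keys belong to the model wrapper rather than to the tree learner, and `split_params` is a hypothetical helper.

```python
import json

# Trimmed copy of the configuration above (label list and most defaults omitted).
config = json.loads("""
{
    "type": "GradientBoosting",
    "params": {
        "max_depth": 5,
        "max_features": "log2",
        "learning_rate": 0.1,
        "n_estimators": 150,
        "multilabel": true,
        "population_rates": null,
        "scale": false,
        "center": false,
        "label_weights": {}
    }
}
""")

# Assumption: these keys are consumed by the model wrapper, not the tree learner.
WRAPPER_KEYS = {"labels", "multilabel", "population_rates",
                "label_weights", "scale", "center"}


def split_params(params):
    """Return (learner settings, wrapper settings) from a params mapping."""
    learner = {k: v for k, v in params.items() if k not in WRAPPER_KEYS}
    wrapper = {k: v for k, v in params.items() if k in WRAPPER_KEYS}
    return learner, wrapper


learner, wrapper = split_params(config["params"])
print("learner:", learner)
print("wrapper:", wrapper)
```

Read this way, the learner is an ensemble of 150 trees of depth at most 5, trained with a learning rate of 0.1 and considering log2 of the feature count at each split, while `"multilabel": true` indicates that an article can receive several topics at once.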
Output schema
{
    "title": "Scikit learn-based classifier score with probability",
    "properties": {
        "prediction": {
            "description": "The most likely labels predicted by the estimator",
            "items": {
                "type": "string"
            },
            "type": "array"
        },
        "probability": {
            "description": "A mapping of probabilities onto each of the potential output labels",
            "properties": {
                "Geography.Regions.Asia.Southeast Asia": {
                    "type": "number"
                },
                "Culture.Media.Media*": {
                    "type": "number"
                },
                "History and Society.Transportation": {
                    "type": "number"
                },
                "Geography.Regions.Asia.West Asia": {
                    "type": "number"
                },
                "Culture.Media.Radio": {
                    "type": "number"
                },
                "Geography.Regions.Americas.South America": {
                    "type": "number"
                },
                "Geography.Regions.Africa.Western Africa": {
                    "type": "number"
                },
                "Culture.Philosophy and religion": {
                    "type": "number"
                },
                "Geography.Regions.Europe.Eastern Europe": {
                    "type": "number"
                },
                "Geography.Regions.Asia.South Asia": {
                    "type": "number"
                },
                "STEM.Earth and environment": {
                    "type": "number"
                },
                "Geography.Regions.Africa.Central Africa": {
                    "type": "number"
                },
                "Culture.Food and drink": {
                    "type": "number"
                },
                "History and Society.Society": {
                    "type": "number"
                },
                "Culture.Media.Software": {
                    "type": "number"
                },
                "STEM.Chemistry": {
                    "type": "number"
                },
                "Culture.Media.Films": {
                    "type": "number"
                },
                "STEM.Space": {
                    "type": "number"
                },
                "STEM.Medicine & Health": {
                    "type": "number"
                },
                "Geography.Regions.Europe.Northern Europe": {
                    "type": "number"
                },
                "Culture.Media.Television": {
                    "type": "number"
                },
                "STEM.Biology": {
                    "type": "number"
                },
                "Geography.Regions.Asia.East Asia": {
                    "type": "number"
                },
                "Geography.Regions.Americas.North America": {
                    "type": "number"
                },
                "Culture.Media.Video games": {
                    "type": "number"
                },
                "Geography.Regions.Asia.Asia*": {
                    "type": "number"
                },
                "Culture.Visual arts.Fashion": {
                    "type": "number"
                },
                "Culture.Sports": {
                    "type": "number"
                },
                "Culture.Media.Books": {
                    "type": "number"
                },
                "History and Society.Business and economics": {
                    "type": "number"
                },
                "STEM.Physics": {
                    "type": "number"
                },
                "History and Society.Military and warfare": {
                    "type": "number"
                },
                "Culture.Internet culture": {
                    "type": "number"
                },
                "Geography.Regions.Europe.Europe*": {
                    "type": "number"
                },
                "Culture.Biography.Biography*": {
                    "type": "number"
                },
                "Geography.Regions.Asia.North Asia": {
                    "type": "number"
                },
                "Geography.Regions.Africa.Southern Africa": {
                    "type": "number"
                },
                "Geography.Regions.Africa.Eastern Africa": {
                    "type": "number"
                },
                "Geography.Regions.Americas.Central America": {
                    "type": "number"
                },
                "Culture.Visual arts.Architecture": {
                    "type": "number"
                },
                "Culture.Media.Entertainment": {
                    "type": "number"
                },
                "STEM.Mathematics": {
                    "type": "number"
                },
                "Culture.Biography.Women": {
                    "type": "number"
                },
                "Culture.Literature": {
                    "type": "number"
                },
                "STEM.Libraries & Information": {
                    "type": "number"
                },
                "Geography.Regions.Asia.Central Asia": {
                    "type": "number"
                },
                "STEM.STEM*": {
                    "type": "number"
                },
                "History and Society.History": {
                    "type": "number"
                },
                "Culture.Linguistics": {
                    "type": "number"
                },
                "Geography.Geographical": {
                    "type": "number"
                },
                "STEM.Computing": {
                    "type": "number"
                },
                "Geography.Regions.Africa.Northern Africa": {
                    "type": "number"
                },
                "Culture.Visual arts.Comics and Anime": {
                    "type": "number"
                },
                "History and Society.Education": {
                    "type": "number"
                },
                "Culture.Performing arts": {
                    "type": "number"
                },
                "Culture.Visual arts.Visual arts*": {
                    "type": "number"
                },
                "Geography.Regions.Europe.Western Europe": {
                    "type": "number"
                },
                "STEM.Engineering": {
                    "type": "number"
                },
                "Geography.Regions.Oceania": {
                    "type": "number"
                },
                "Culture.Media.Music": {
                    "type": "number"
                },
                "STEM.Technology": {
                    "type": "number"
                },
                "History and Society.Politics and government": {
                    "type": "number"
                },
                "Geography.Regions.Africa.Africa*": {
                    "type": "number"
                },
                "Geography.Regions.Europe.Southern Europe": {
                    "type": "number"
                }
            },
            "type": "object"
        }
    },
    "type": "object"
}
Example input and output
Input:
https://ores.wikimedia.org/v3/scores/cswiki/1234/articletopic

Output:

{
    "cswiki": {
        "models": {
            "articletopic": {
                "version": "1.3.0"
            }
        },
        "scores": {
            "1234": {
                "articletopic": {
                    "score": {
                        "prediction": [
                            "STEM.STEM*",
                            "STEM.Space"
                        ],
                        "probability": {
                            "Culture.Biography.Biography*": 0.0037581446757912217,
                            "Culture.Biography.Women": 0.0005751504978634069,
                            "Culture.Food and drink": 0.00013629102583591308,
                            "Culture.Internet culture": 0.00024705233108557957,
                            "Culture.Linguistics": 4.5023636851888105e-05,
                            "Culture.Literature": 0.000693481254355697,
                            "Culture.Media.Books": 0.0003302654788886533,
                            "Culture.Media.Entertainment": 0.0004557010040644442,
                            "Culture.Media.Films": 0.000745755840342992,
                            "Culture.Media.Media*": 0.0027738377186352623,
                            "Culture.Media.Music": 0.00017563847268040663,
                            "Culture.Media.Radio": 5.406588440219966e-05,
                            "Culture.Media.Software": 0.0008980023457288975,
                            "Culture.Media.Television": 0.000548303780375146,
                            "Culture.Media.Video games": 3.341663101629864e-05,
                            "Culture.Performing arts": 0.00010382012651605603,
                            "Culture.Philosophy and religion": 0.0018340714233612992,
                            "Culture.Sports": 0.0003340386603068036,
                            "Culture.Visual arts.Architecture": 0.0006493433462786338,
                            "Culture.Visual arts.Comics and Anime": 5.501988377413894e-05,
                            "Culture.Visual arts.Fashion": 0.00020240637526355807,
                            "Culture.Visual arts.Visual arts*": 0.0011734964014660275,
                            "Geography.Geographical": 0.003862913548780497,
                            "Geography.Regions.Africa.Africa*": 0.00283510137778092,
                            "Geography.Regions.Africa.Central Africa": 0.0007717336771633315,
                            "Geography.Regions.Africa.Eastern Africa": 5.4382462367102405e-05,
                            "Geography.Regions.Africa.Northern Africa": 0.0002468113306020045,
                            "Geography.Regions.Africa.Southern Africa": 0.005759778854208486,
                            "Geography.Regions.Africa.Western Africa": 4.758283244842901e-06,
                            "Geography.Regions.Americas.Central America": 0.0007230522994775956,
                            "Geography.Regions.Americas.North America": 0.008559027362312165,
                            "Geography.Regions.Americas.South America": 0.0010225938156441852,
                            "Geography.Regions.Asia.Asia*": 0.004625047209643945,
                            "Geography.Regions.Asia.Central Asia": 0.0005234422947221679,
                            "Geography.Regions.Asia.East Asia": 0.0036567708976141464,
                            "Geography.Regions.Asia.North Asia": 0.004117880625705837,
                            "Geography.Regions.Asia.South Asia": 0.001231314226046212,
                            "Geography.Regions.Asia.Southeast Asia": 0.0008340922220495317,
                            "Geography.Regions.Asia.West Asia": 0.0003554093910782288,
                            "Geography.Regions.Europe.Eastern Europe": 0.0065926663899134414,
                            "Geography.Regions.Europe.Europe*": 0.013979080013835353,
                            "Geography.Regions.Europe.Northern Europe": 0.00154029611316488,
                            "Geography.Regions.Europe.Southern Europe": 0.00068486964155431,
                            "Geography.Regions.Europe.Western Europe": 0.0027471763060663364,
                            "Geography.Regions.Oceania": 0.0003717339481496353,
                            "History and Society.Business and economics": 0.002207248293968202,
                            "History and Society.Education": 0.0004226747257376408,
                            "History and Society.History": 0.003777251589564783,
                            "History and Society.Military and warfare": 0.0047491280843299875,
                            "History and Society.Politics and government": 0.005142941696595345,
                            "History and Society.Society": 0.033398761452264884,
                            "History and Society.Transportation": 0.11443321001987419,
                            "STEM.Biology": 0.002447465453906354,
                            "STEM.Chemistry": 0.005747175118744066,
                            "STEM.Computing": 0.00045352819823348217,
                            "STEM.Earth and environment": 0.0033121889091357634,
                            "STEM.Engineering": 0.004281775261544969,
                            "STEM.Libraries & Information": 7.468451141451151e-05,
                            "STEM.Mathematics": 4.996334970267717e-05,
                            "STEM.Medicine & Health": 0.0007865993537385966,
                            "STEM.Physics": 0.29800230259682176,
                            "STEM.STEM*": 0.9983754315472244,
                            "STEM.Space": 0.996157360237683,
                            "STEM.Technology": 0.14559689671134318
                        }
                    }
                }
            }
        }
    }
}
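
A response shaped like the one above can be unpacked with a few lines of Python. The sketch below uses a trimmed copy of the example output (five of the probabilities) and a hypothetical helper, `top_topics`, that returns the predicted labels together with the k most probable topics. Note that the revision ID appears as a string key in the response.

```python
def top_topics(response, wiki, rev_id, model="articletopic", k=3):
    """Return (predicted labels, k highest-probability (label, p) pairs)."""
    score = response[wiki]["scores"][str(rev_id)][model]["score"]
    ranked = sorted(score["probability"].items(),
                    key=lambda item: item[1], reverse=True)
    return score["prediction"], ranked[:k]


# Trimmed copy of the example output above.
sample = {
    "cswiki": {
        "models": {"articletopic": {"version": "1.3.0"}},
        "scores": {
            "1234": {
                "articletopic": {
                    "score": {
                        "prediction": ["STEM.STEM*", "STEM.Space"],
                        "probability": {
                            "History and Society.Transportation": 0.11443321001987419,
                            "STEM.Physics": 0.29800230259682176,
                            "STEM.STEM*": 0.9983754315472244,
                            "STEM.Space": 0.996157360237683,
                            "STEM.Technology": 0.14559689671134318,
                        },
                    }
                }
            }
        },
    }
}

predicted, ranked = top_topics(sample, "cswiki", 1234)
print(predicted)
for label, p in ranked:
    print(f"{label}: {p:.3f}")
```

Ranking by probability is useful when the `prediction` list is empty or when a caller wants to apply a stricter or looser cut-off than the model's own.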

Data

Data pipeline
The training data was fetched for a set of revision IDs. Various pieces of information about each revision were then extracted using automated processes, and the revision text was fed into word2vec to get an article embedding. Finally, labels were derived from the mid-level WikiProject categories that the article is associated with.
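
The card does not say how individual word vectors are combined into one article embedding. Averaging them is a common choice and is assumed in the sketch below. The toy vocabulary, its three-dimensional vectors and the function name are invented for illustration only.

```python
def article_embedding(tokens, vectors, dim):
    """Average the vectors of all tokens found in the vocabulary.

    Tokens without a vector are skipped; an article with no known
    tokens gets an all-zero embedding.
    """
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


# Toy vocabulary (real word2vec vectors have hundreds of dimensions).
toy_vectors = {
    "planeta": [1.0, 0.0, 2.0],
    "kometa": [3.0, 2.0, 0.0],
}

embedding = article_embedding(["planeta", "a", "kometa"], toy_vectors, dim=3)
print(embedding)  # [2.0, 1.0, 1.0]
```

The resulting fixed-length vector is what a classifier can consume, regardless of how long the article is.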
Training data
Training data was automatically and randomly separated from test data during training using the drafttopic git repository (which trains both drafttopic and articletopic models).
Test data
Test data was automatically and randomly split off from the training data using the drafttopic git repository (which trains both drafttopic and articletopic models). The model then makes predictions on the test data, which are compared to the underlying ground truth to calculate performance statistics.
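
A random train/test split of this kind can be sketched as follows. The test fraction and the seed are placeholders, not the values used by the drafttopic repository, and `split_train_test` is a hypothetical helper.

```python
import random


def split_train_test(rev_ids, test_fraction=0.2, seed=0):
    """Shuffle the revision IDs and hold out a fraction of them for testing."""
    shuffled = list(rev_ids)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


train_ids, test_ids = split_train_test(range(1000, 1010))
print(len(train_ids), len(test_ids))  # 8 2
```

Keeping the two sets disjoint is what makes the performance statistics an estimate of how the model behaves on articles it has not seen.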

Licenses

Citation

Cite this model card as:

@misc{
  Triedman_Bazira_2023_Czech_Wikipedia_article_topic,
  title={ Czech Wikipedia article topic model card },
  author={ Triedman, Harold and Bazira, Kevin },
  year={ 2023 },
  url={ https://meta.wikimedia.org/wiki/Machine_learning_models/Production/Czech_Wikipedia_article_topic }
}