Machine learning models/Production/Wikidata item topic


Model card
This page is an on-wiki machine learning model card.
A model card is a document about a machine learning model that seeks to answer basic questions about the model.
Model Information Hub
Model creator(s): Aaron Halfaker (User:EpochFail) and Amir Sarabadani
Model owner(s): WMF Machine Learning Team (ml@wikimediafoundation.org)
Model interface: ORES homepage
Code: drafttopic GitHub, ORES training data, and ORES model binaries
Uses PII: No
In production? Yes
Which projects? Wikidata
This model uses item features to predict the likelihood that the item belongs to a set of topics.


Motivation


How can we predict what general topic an item is in? Answering this question is useful for various analyses of Wikidata dynamics. However, it is difficult to group a very diverse range of Wikidata items into coherent, consistent topics manually.

This model, part of the ORES suite of models, analyzes an item to predict its likelihood of belonging to each of a set of topics. Similar models (though not necessarily with the same performance level or topics) are deployed across about a dozen other projects. There is also a language-agnostic article topic model.

This model may be useful for high-level analyses of Wikidata dynamics (pageviews, item quality, edit trends) and filtering items.

Users and uses

Use this model for
  • high-level analyses of Wikidata dynamics such as pageview, item quality, or edit trends — e.g. How are pageview dynamics different between the physics and biology categories?
  • filtering to relevant items — e.g. filter items only to those in the music category.
Don't use this model for
  • definitively establishing what topic an item pertains to
  • automated editing of items or topics without a human in the loop
Current uses

This model is part of ORES and is generally accessible via the ORES API. It is used for high-level analysis of Wikidata, platform research, and other on-wiki tasks.

Example API call:
https://ores.wikimedia.org/v3/scores/wikidatawiki/1907686315/itemtopic
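
A minimal sketch of how the example call could be assembled and fetched from Python. The helper names `build_score_url` and `fetch_score` are illustrative and not part of any ORES client library; only the URL layout is taken from the example call, and the sketch assumes the endpoint is still reachable.

```python
import json
import urllib.request

ORES_BASE = "https://ores.wikimedia.org/v3/scores"  # taken from the example call


def build_score_url(rev_id, context="wikidatawiki", model="itemtopic"):
    """Build a scoring URL of the form <base>/<wiki>/<revision id>/<model>."""
    return f"{ORES_BASE}/{context}/{int(rev_id)}/{model}"


def fetch_score(rev_id, timeout=10):
    """Fetch and decode the JSON score for one revision (needs network access)."""
    # The User-Agent value is a hypothetical placeholder; use one that identifies your tool.
    request = urllib.request.Request(
        build_score_url(rev_id),
        headers={"User-Agent": "itemtopic-example/0.1"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Reproduces the example call without touching the network.
print(build_score_url(1907686315))
```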

Ethical considerations, caveats, and recommendations

  • This model was trained on data that is now several years old (from mid-2020). Underlying data drift may skew model outputs.
  • This model uses word2vec as a training feature. Word2vec, like other natural language embeddings, encodes the linguistic biases of underlying datasets — along the lines of gender, race, ethnicity, religion etc. Since Wikidata has known biases in its text, this model may encode and at times reproduce those biases.
  • This model has highly variable performance across different topics — consult the test statistics below to get a sense of inter-topic performance.

Model


Performance


Test data confusion matrix:

Label n True positive False negative False positive True negative
Culture.Biography.Biography* 16670 15762 908 464 46810
Culture.Biography.Women 4110 3125 985 679 59155
Culture.Food and drink 1318 613 705 126 62500
Culture.Internet culture 2966 1948 1018 140 60838
Culture.Linguistics 1466 934 532 56 62422
Culture.Literature 5367 3996 1371 404 58173
Culture.Media.Books 1974 1560 414 136 61834
Culture.Media.Entertainment 1733 857 876 162 62049
Culture.Media.Films 2295 1896 399 122 61527
Culture.Media.Media* 14383 11572 2811 1135 48426
Culture.Media.Music 2583 2027 556 247 61114
Culture.Media.Radio 1156 857 299 44 62744
Culture.Media.Software 1750 685 1065 307 61887
Culture.Media.Television 2230 1510 720 176 61538
Culture.Media.Video games 2147 1758 389 54 61743
Culture.Performing arts 1334 741 593 116 62494
Culture.Philosophy and religion 2702 1074 1628 285 60957
Culture.Sports 5925 5186 739 249 57770
Culture.Visual arts.Architecture 2648 1867 781 230 61066
Culture.Visual arts.Comics and Anime 1508 1007 501 140 62296
Culture.Visual arts.Fashion 1199 669 530 98 62647
Culture.Visual arts.Visual arts* 6070 4131 1939 554 57320
Geography.Geographical 3464 2226 1238 359 60121
Geography.Regions.Africa.Africa* 6449 4664 1785 414 57081
Geography.Regions.Africa.Central Africa 1145 697 448 83 62716
Geography.Regions.Africa.Eastern Africa 1114 704 410 56 62774
Geography.Regions.Africa.Northern Africa 1280 774 506 108 62556
Geography.Regions.Africa.Southern Africa 1244 859 385 81 62619
Geography.Regions.Africa.Western Africa 1142 774 368 75 62727
Geography.Regions.Americas.Central America 1331 707 624 87 62526
Geography.Regions.Americas.North America 7625 5064 2561 1169 55150
Geography.Regions.Americas.South America 1532 1082 450 142 62270
Geography.Regions.Asia.Asia* 11647 8432 3215 835 51462
Geography.Regions.Asia.Central Asia 1086 671 415 70 62788
Geography.Regions.Asia.East Asia 2717 1727 990 241 60986
Geography.Regions.Asia.North Asia 2076 1336 740 163 61705
Geography.Regions.Asia.South Asia 2366 1612 754 135 61443
Geography.Regions.Asia.Southeast Asia 1721 1059 662 119 62104
Geography.Regions.Asia.West Asia 2160 1473 687 129 61655
Geography.Regions.Europe.Eastern Europe 3533 2472 1061 234 60177
Geography.Regions.Europe.Europe* 12939 9372 3567 1810 49195
Geography.Regions.Europe.Northern Europe 4221 2571 1650 601 59122
Geography.Regions.Europe.Southern Europe 2438 1565 873 268 61238
Geography.Regions.Europe.Western Europe 3076 1934 1142 417 60451
Geography.Regions.Oceania 2638 1859 779 138 61168
History and Society.Business and economics 3502 1544 1958 569 59873
History and Society.Education 2243 1113 1130 255 61446
History and Society.History 3172 1154 2018 360 60412
History and Society.Military and warfare 3238 1677 1561 296 60410
History and Society.Politics and government 4590 2406 2184 329 59025
History and Society.Society 2971 897 2074 166 60807
History and Society.Transportation 3629 2615 1014 169 60146
STEM.Biology 2916 2237 679 91 60937
STEM.Chemistry 1270 690 580 138 62536
STEM.Computing 1968 828 1140 332 61644
STEM.Earth and environment 1627 918 709 114 62203
STEM.Engineering 2195 1284 911 141 61608
STEM.Libraries & Information 1174 605 569 87 62683
STEM.Mathematics 1137 307 830 107 62700
STEM.Medicine & Health 1726 769 957 180 62038
STEM.Physics 1219 448 771 107 62618
STEM.STEM* 16449 12609 3840 2766 44729
STEM.Space 1365 932 433 47 62532
STEM.Technology 3648 1396 2252 424 59872

Test data sample rates:

Label Sample Population
Culture.Biography.Biography* 0.261 0.12
Culture.Biography.Women 0.064 0.015
Culture.Food and drink 0.021 0.003
Culture.Internet culture 0.046 0.004
Culture.Linguistics 0.023 0.008
Culture.Literature 0.084 0.015
Culture.Media.Books 0.031 0.004
Culture.Media.Entertainment 0.027 0.004
Culture.Media.Films 0.036 0.012
Culture.Media.Media* 0.225 0.055
Culture.Media.Music 0.04 0.021
Culture.Media.Radio 0.018 0.002
Culture.Media.Software 0.027 0.001
Culture.Media.Television 0.035 0.009
Culture.Media.Video games 0.034 0.003
Culture.Performing arts 0.021 0.003
Culture.Philosophy and religion 0.042 0.01
Culture.Sports 0.093 0.06
Culture.Visual arts.Architecture 0.041 0.011
Culture.Visual arts.Comics and Anime 0.024 0.002
Culture.Visual arts.Fashion 0.019 0.001
Culture.Visual arts.Visual arts* 0.095 0.018
Geography.Geographical 0.054 0.021
Geography.Regions.Africa.Africa* 0.101 0.008
Geography.Regions.Africa.Central Africa 0.018 0.001
Geography.Regions.Africa.Eastern Africa 0.017 0.001
Geography.Regions.Africa.Northern Africa 0.02 0.001
Geography.Regions.Africa.Southern Africa 0.019 0.001
Geography.Regions.Africa.Western Africa 0.018 0.001
Geography.Regions.Americas.Central America 0.021 0.003
Geography.Regions.Americas.North America 0.119 0.063
Geography.Regions.Americas.South America 0.024 0.007
Geography.Regions.Asia.Asia* 0.182 0.052
Geography.Regions.Asia.Central Asia 0.017 0.001
Geography.Regions.Asia.East Asia 0.042 0.012
Geography.Regions.Asia.North Asia 0.032 0.006
Geography.Regions.Asia.South Asia 0.037 0.016
Geography.Regions.Asia.Southeast Asia 0.027 0.006
Geography.Regions.Asia.West Asia 0.034 0.012
Geography.Regions.Europe.Eastern Europe 0.055 0.018
Geography.Regions.Europe.Europe* 0.202 0.081
Geography.Regions.Europe.Northern Europe 0.066 0.029
Geography.Regions.Europe.Southern Europe 0.038 0.014
Geography.Regions.Europe.Western Europe 0.048 0.02
Geography.Regions.Oceania 0.041 0.016
History and Society.Business and economics 0.055 0.01
History and Society.Education 0.035 0.008
History and Society.History 0.05 0.011
History and Society.Military and warfare 0.051 0.015
History and Society.Politics and government 0.072 0.028
History and Society.Society 0.046 0.008
History and Society.Transportation 0.057 0.016
STEM.Biology 0.046 0.034
STEM.Chemistry 0.02 0.002
STEM.Computing 0.031 0.003
STEM.Earth and environment 0.025 0.005
STEM.Engineering 0.034 0.006
STEM.Libraries & Information 0.018 0.001
STEM.Mathematics 0.018 0
STEM.Medicine & Health 0.027 0.006
STEM.Physics 0.019 0.001
STEM.STEM* 0.257 0.065
STEM.Space 0.021 0.004
STEM.Technology 0.057 0.005
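
The Sample column appears to be each label's share of the test set, which can be checked against the n column of the confusion matrix. A minimal sketch, assuming a test set of 63,944 items; that total is not stated on this page but is derived here as the sum of the four count columns in a confusion-matrix row. The function name `sample_rate` is illustrative.

```python
# Test set size, derived from the Culture.Biography.Biography* confusion-matrix row.
TEST_SET_SIZE = 15762 + 908 + 464 + 46810  # 63944

# label -> (n from the confusion matrix, Population rate from the table above)
ROWS = {
    "Culture.Biography.Biography*": (16670, 0.12),
    "Culture.Sports": (5925, 0.06),
    "STEM.Biology": (2916, 0.034),
}


def sample_rate(n, total=TEST_SET_SIZE):
    """Share of test items that carry the label."""
    return n / total


for label, (n, population_rate) in ROWS.items():
    rate = sample_rate(n)
    # A ratio above 1 means the label is over-represented in the test set.
    oversampling = rate / population_rate
    print(f"{label}: sample={rate:.3f}, oversampling={oversampling:.1f}x")
```

For these three rows the computed sample rates round to the values in the table (0.261, 0.093, 0.046). Since most labels are over-represented in the test set relative to the population, raw test counts should not be read directly as population-level rates.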

Test data performance:

Label Match rate Filter rate Recall Precision f1 Accuracy ROC AUC PR AUC
Culture.Biography.Biography* 0.122 0.878 0.946 0.929 0.937 0.985 0.982 0.952
Culture.Biography.Women 0.023 0.977 0.76 0.504 0.606 0.985 0.975 0.589
Culture.Food and drink 0.003 0.997 0.465 0.371 0.413 0.997 0.937 0.352
Culture.Internet culture 0.005 0.995 0.657 0.517 0.578 0.996 0.96 0.549
Culture.Linguistics 0.006 0.994 0.637 0.852 0.729 0.996 0.954 0.656
Culture.Literature 0.018 0.982 0.745 0.619 0.676 0.989 0.965 0.726
Culture.Media.Books 0.006 0.994 0.79 0.609 0.688 0.997 0.971 0.659
Culture.Media.Entertainment 0.005 0.995 0.495 0.429 0.459 0.995 0.947 0.433
Culture.Media.Films 0.011 0.989 0.826 0.829 0.828 0.996 0.974 0.813
Culture.Media.Media* 0.066 0.934 0.805 0.67 0.731 0.968 0.966 0.813
Culture.Media.Music 0.02 0.98 0.785 0.807 0.795 0.992 0.974 0.818
Culture.Media.Radio 0.002 0.998 0.741 0.711 0.726 0.999 0.962 0.741
Culture.Media.Software 0.005 0.995 0.391 0.094 0.152 0.994 0.946 0.094
Culture.Media.Television 0.009 0.991 0.677 0.68 0.678 0.994 0.964 0.664
Culture.Media.Video games 0.003 0.997 0.819 0.732 0.773 0.999 0.977 0.801
Culture.Performing arts 0.004 0.996 0.555 0.478 0.514 0.997 0.947 0.414
Culture.Philosophy and religion 0.009 0.991 0.397 0.472 0.431 0.989 0.91 0.339
Culture.Sports 0.057 0.943 0.875 0.929 0.901 0.988 0.976 0.933
Culture.Visual arts.Architecture 0.011 0.989 0.705 0.672 0.688 0.993 0.969 0.673
Culture.Visual arts.Comics and Anime 0.004 0.996 0.668 0.416 0.513 0.997 0.966 0.558
Culture.Visual arts.Fashion 0.002 0.998 0.558 0.242 0.338 0.998 0.95 0.215
Culture.Visual arts.Visual arts* 0.022 0.978 0.681 0.566 0.618 0.985 0.952 0.666
Geography.Geographical 0.019 0.981 0.643 0.7 0.67 0.987 0.956 0.698
Geography.Regions.Africa.Africa* 0.013 0.987 0.723 0.462 0.564 0.991 0.96 0.639
Geography.Regions.Africa.Central Africa 0.002 0.998 0.609 0.244 0.349 0.998 0.954 0.321
Geography.Regions.Africa.Eastern Africa 0.001 0.999 0.632 0.262 0.371 0.999 0.957 0.253
Geography.Regions.Africa.Northern Africa 0.003 0.997 0.605 0.321 0.42 0.998 0.945 0.343
Geography.Regions.Africa.Southern Africa 0.002 0.998 0.691 0.411 0.515 0.998 0.959 0.514
Geography.Regions.Africa.Western Africa 0.002 0.998 0.678 0.297 0.413 0.999 0.96 0.277
Geography.Regions.Americas.Central America 0.003 0.997 0.531 0.569 0.55 0.997 0.932 0.494
Geography.Regions.Americas.North America 0.061 0.939 0.664 0.682 0.673 0.959 0.949 0.726
Geography.Regions.Americas.South America 0.007 0.993 0.706 0.681 0.693 0.996 0.963 0.691
Geography.Regions.Asia.Asia* 0.053 0.947 0.724 0.715 0.719 0.97 0.949 0.756
Geography.Regions.Asia.Central Asia 0.002 0.998 0.618 0.306 0.41 0.999 0.952 0.462
Geography.Regions.Asia.East Asia 0.012 0.988 0.636 0.665 0.65 0.992 0.95 0.625
Geography.Regions.Asia.North Asia 0.006 0.994 0.644 0.579 0.609 0.995 0.946 0.55
Geography.Regions.Asia.South Asia 0.013 0.987 0.681 0.839 0.752 0.993 0.953 0.708
Geography.Regions.Asia.Southeast Asia 0.006 0.994 0.615 0.668 0.641 0.996 0.942 0.557
Geography.Regions.Asia.West Asia 0.01 0.99 0.682 0.794 0.734 0.994 0.953 0.662
Geography.Regions.Europe.Eastern Europe 0.017 0.983 0.7 0.771 0.733 0.991 0.951 0.71
Geography.Regions.Europe.Europe* 0.091 0.909 0.724 0.642 0.681 0.945 0.943 0.744
Geography.Regions.Europe.Northern Europe 0.027 0.973 0.609 0.643 0.626 0.979 0.948 0.644
Geography.Regions.Europe.Southern Europe 0.013 0.987 0.642 0.673 0.657 0.991 0.949 0.618
Geography.Regions.Europe.Western Europe 0.02 0.98 0.629 0.657 0.643 0.986 0.95 0.63
Geography.Regions.Oceania 0.014 0.986 0.705 0.839 0.766 0.993 0.96 0.75
History and Society.Business and economics 0.014 0.986 0.441 0.315 0.367 0.985 0.936 0.248
History and Society.Education 0.008 0.992 0.496 0.489 0.493 0.992 0.944 0.4
History and Society.History 0.01 0.99 0.364 0.403 0.382 0.987 0.916 0.315
History and Society.Military and warfare 0.013 0.987 0.518 0.622 0.565 0.988 0.933 0.521
History and Society.Politics and government 0.02 0.98 0.524 0.731 0.611 0.981 0.925 0.603
History and Society.Society 0.005 0.995 0.302 0.48 0.371 0.992 0.871 0.318
History and Society.Transportation 0.014 0.986 0.721 0.809 0.762 0.993 0.963 0.712
STEM.Biology 0.028 0.972 0.767 0.948 0.848 0.991 0.962 0.816
STEM.Chemistry 0.003 0.997 0.543 0.294 0.382 0.997 0.958 0.27
STEM.Computing 0.007 0.993 0.421 0.182 0.254 0.993 0.951 0.149
STEM.Earth and environment 0.004 0.996 0.564 0.594 0.579 0.996 0.947 0.522
STEM.Engineering 0.006 0.994 0.585 0.596 0.591 0.995 0.947 0.51
STEM.Libraries & Information 0.002 0.998 0.515 0.203 0.291 0.998 0.948 0.238
STEM.Mathematics 0.002 0.998 0.27 0.068 0.109 0.998 0.942 0.125
STEM.Medicine & Health 0.006 0.994 0.446 0.499 0.471 0.994 0.933 0.398
STEM.Physics 0.002 0.998 0.368 0.168 0.23 0.998 0.945 0.126
STEM.STEM* 0.104 0.896 0.767 0.477 0.588 0.93 0.955 0.768
STEM.Space 0.004 0.996 0.683 0.795 0.735 0.998 0.963 0.686
STEM.Technology 0.009 0.991 0.383 0.219 0.279 0.99 0.928 0.213
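
The figures in this table appear to be reproducible from the confusion matrix once the raw counts are reweighted from sample rates to population rates. The sketch below rests on two assumptions that are inferred from the numbers rather than stated on this page: the count that sums with the true positives to n is treated as the false negatives, and positives and negatives are rescaled so that the label's share matches its Population rate. The function name `scaled_metrics` is illustrative.

```python
def scaled_metrics(tp, fn, fp, tn, population_rate):
    """Recompute summary statistics from raw test counts.

    Assumption (inferred, not confirmed by the page): counts are reweighted so
    that the label's share equals population_rate before metrics are computed.
    """
    total = tp + fn + fp + tn
    sample_rate = (tp + fn) / total
    pos_weight = population_rate / sample_rate              # applied to tp and fn
    neg_weight = (1 - population_rate) / (1 - sample_rate)  # applied to fp and tn
    tp_w, fn_w = tp * pos_weight, fn * pos_weight
    fp_w, tn_w = fp * neg_weight, tn * neg_weight

    recall = tp_w / (tp_w + fn_w)
    precision = tp_w / (tp_w + fp_w)
    match_rate = (tp_w + fp_w) / total  # share of items the model would flag
    return {
        "match_rate": match_rate,
        "filter_rate": 1 - match_rate,
        "recall": recall,
        "precision": precision,
        "f1": 2 * precision * recall / (precision + recall),
        "accuracy": (tp_w + tn_w) / total,
    }


# Culture.Biography.Biography*: n = 16670 = 15762 + 908, Population rate 0.12
metrics = scaled_metrics(tp=15762, fn=908, fp=464, tn=46810, population_rate=0.12)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

For this row the sketch yields values that round to the table's figures (match rate 0.122, recall 0.946, precision 0.929, f1 0.937, accuracy 0.985). Other rows can differ in the third decimal place, most likely because the published population rates are themselves rounded.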

Implementation

Model architecture
{
    "type": "GradientBoosting",
    "params": {
        "scale": false,
        "center": false,
        "labels": [
            "Culture.Biography.Biography*",
            "Culture.Biography.Women",
            "Culture.Food and drink",
            "Culture.Internet culture",
            "Culture.Linguistics",
            "Culture.Literature",
            "Culture.Media.Books",
            "Culture.Media.Entertainment",
            "Culture.Media.Films",
            "Culture.Media.Media*",
            "Culture.Media.Music",
            "Culture.Media.Radio",
            "Culture.Media.Software",
            "Culture.Media.Television",
            "Culture.Media.Video games",
            "Culture.Performing arts",
            "Culture.Philosophy and religion",
            "Culture.Sports",
            "Culture.Visual arts.Architecture",
            "Culture.Visual arts.Comics and Anime",
            "Culture.Visual arts.Fashion",
            "Culture.Visual arts.Visual arts*",
            "Geography.Geographical",
            "Geography.Regions.Africa.Africa*",
            "Geography.Regions.Africa.Central Africa",
            "Geography.Regions.Africa.Eastern Africa",
            "Geography.Regions.Africa.Northern Africa",
            "Geography.Regions.Africa.Southern Africa",
            "Geography.Regions.Africa.Western Africa",
            "Geography.Regions.Americas.Central America",
            "Geography.Regions.Americas.North America",
            "Geography.Regions.Americas.South America",
            "Geography.Regions.Asia.Asia*",
            "Geography.Regions.Asia.Central Asia",
            "Geography.Regions.Asia.East Asia",
            "Geography.Regions.Asia.North Asia",
            "Geography.Regions.Asia.South Asia",
            "Geography.Regions.Asia.Southeast Asia",
            "Geography.Regions.Asia.West Asia",
            "Geography.Regions.Europe.Eastern Europe",
            "Geography.Regions.Europe.Europe*",
            "Geography.Regions.Europe.Northern Europe",
            "Geography.Regions.Europe.Southern Europe",
            "Geography.Regions.Europe.Western Europe",
            "Geography.Regions.Oceania",
            "History and Society.Business and economics",
            "History and Society.Education",
            "History and Society.History",
            "History and Society.Military and warfare",
            "History and Society.Politics and government",
            "History and Society.Society",
            "History and Society.Transportation",
            "STEM.Biology",
            "STEM.Chemistry",
            "STEM.Computing",
            "STEM.Earth and environment",
            "STEM.Engineering",
            "STEM.Libraries & Information",
            "STEM.Mathematics",
            "STEM.Medicine & Health",
            "STEM.Physics",
            "STEM.STEM*",
            "STEM.Space",
            "STEM.Technology"
        ],
        "multilabel": true,
        "population_rates": null,
        "ccp_alpha": 0.0,
        "criterion": "friedman_mse",
        "init": null,
        "learning_rate": 0.1,
        "loss": "deviance",
        "max_depth": 5,
        "max_features": "log2",
        "max_leaf_nodes": null,
        "min_impurity_decrease": 0.0,
        "min_impurity_split": null,
        "min_samples_leaf": 1,
        "min_samples_split": 2,
        "min_weight_fraction_leaf": 0.0,
        "n_estimators": 150,
        "n_iter_no_change": null,
        "presort": "deprecated",
        "random_state": null,
        "subsample": 1.0,
        "tol": 0.0001,
        "validation_fraction": 0.1,
        "verbose": 0,
        "warm_start": false,
        "label_weights": {}
    }
}
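
One way to read the parameter block above is as two groups: settings for the ORES-side model wrapper and hyperparameters for the underlying gradient boosting estimator. The sketch below separates them. The grouping in `WRAPPER_KEYS` is an assumption rather than something this page states, `split_params` is an illustrative helper, and the config is abridged (shortened label list, fewer hyperparameters).

```python
import json

# Abridged copy of the parameter block above.
CONFIG = json.loads("""
{
  "type": "GradientBoosting",
  "params": {
    "scale": false, "center": false,
    "labels": ["Culture.Biography.Biography*", "STEM.STEM*"],
    "multilabel": true, "population_rates": null, "label_weights": {},
    "learning_rate": 0.1, "max_depth": 5, "max_features": "log2",
    "n_estimators": 150, "subsample": 1.0
  }
}
""")

# Assumption: these keys configure the wrapper around the estimator,
# not the estimator itself.
WRAPPER_KEYS = {
    "scale", "center", "labels", "multilabel", "population_rates", "label_weights",
}


def split_params(config):
    """Return (wrapper settings, estimator hyperparameters) as two dicts."""
    params = config["params"]
    wrapper = {k: v for k, v in params.items() if k in WRAPPER_KEYS}
    estimator = {k: v for k, v in params.items() if k not in WRAPPER_KEYS}
    return wrapper, estimator


wrapper, estimator = split_params(CONFIG)
print("wrapper:", sorted(wrapper))
print("estimator:", sorted(estimator))
```

The estimator-side keys match hyperparameter names of scikit-learn's `GradientBoostingClassifier`. Some values in the full block, such as `"loss": "deviance"`, `presort`, and `min_impurity_split`, come from older scikit-learn releases and may not be accepted by recent ones.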
Output schema
{
    "title": "Scikit learn-based classifier score with probability",
    "type": "object",
    "properties": {
        "prediction": {
            "description": "The most likely labels predicted by the estimator",
            "type": "array",
            "items": {
                "type": "string"
            }
        },
        "probability": {
            "description": "A mapping of probabilities onto each of the potential output labels",
            "type": "object",
            "properties": {
                "Culture.Biography.Biography*": {
                    "type": "number"
                },
                "Culture.Biography.Women": {
                    "type": "number"
                },
                "Culture.Food and drink": {
                    "type": "number"
                },
                "Culture.Internet culture": {
                    "type": "number"
                },
                "Culture.Linguistics": {
                    "type": "number"
                },
                "Culture.Literature": {
                    "type": "number"
                },
                "Culture.Media.Books": {
                    "type": "number"
                },
                "Culture.Media.Entertainment": {
                    "type": "number"
                },
                "Culture.Media.Films": {
                    "type": "number"
                },
                "Culture.Media.Media*": {
                    "type": "number"
                },
                "Culture.Media.Music": {
                    "type": "number"
                },
                "Culture.Media.Radio": {
                    "type": "number"
                },
                "Culture.Media.Software": {
                    "type": "number"
                },
                "Culture.Media.Television": {
                    "type": "number"
                },
                "Culture.Media.Video games": {
                    "type": "number"
                },
                "Culture.Performing arts": {
                    "type": "number"
                },
                "Culture.Philosophy and religion": {
                    "type": "number"
                },
                "Culture.Sports": {
                    "type": "number"
                },
                "Culture.Visual arts.Architecture": {
                    "type": "number"
                },
                "Culture.Visual arts.Comics and Anime": {
                    "type": "number"
                },
                "Culture.Visual arts.Fashion": {
                    "type": "number"
                },
                "Culture.Visual arts.Visual arts*": {
                    "type": "number"
                },
                "Geography.Geographical": {
                    "type": "number"
                },
                "Geography.Regions.Africa.Africa*": {
                    "type": "number"
                },
                "Geography.Regions.Africa.Central Africa": {
                    "type": "number"
                },
                "Geography.Regions.Africa.Eastern Africa": {
                    "type": "number"
                },
                "Geography.Regions.Africa.Northern Africa": {
                    "type": "number"
                },
                "Geography.Regions.Africa.Southern Africa": {
                    "type": "number"
                },
                "Geography.Regions.Africa.Western Africa": {
                    "type": "number"
                },
                "Geography.Regions.Americas.Central America": {
                    "type": "number"
                },
                "Geography.Regions.Americas.North America": {
                    "type": "number"
                },
                "Geography.Regions.Americas.South America": {
                    "type": "number"
                },
                "Geography.Regions.Asia.Asia*": {
                    "type": "number"
                },
                "Geography.Regions.Asia.Central Asia": {
                    "type": "number"
                },
                "Geography.Regions.Asia.East Asia": {
                    "type": "number"
                },
                "Geography.Regions.Asia.North Asia": {
                    "type": "number"
                },
                "Geography.Regions.Asia.South Asia": {
                    "type": "number"
                },
                "Geography.Regions.Asia.Southeast Asia": {
                    "type": "number"
                },
                "Geography.Regions.Asia.West Asia": {
                    "type": "number"
                },
                "Geography.Regions.Europe.Eastern Europe": {
                    "type": "number"
                },
                "Geography.Regions.Europe.Europe*": {
                    "type": "number"
                },
                "Geography.Regions.Europe.Northern Europe": {
                    "type": "number"
                },
                "Geography.Regions.Europe.Southern Europe": {
                    "type": "number"
                },
                "Geography.Regions.Europe.Western Europe": {
                    "type": "number"
                },
                "Geography.Regions.Oceania": {
                    "type": "number"
                },
                "History and Society.Business and economics": {
                    "type": "number"
                },
                "History and Society.Education": {
                    "type": "number"
                },
                "History and Society.History": {
                    "type": "number"
                },
                "History and Society.Military and warfare": {
                    "type": "number"
                },
                "History and Society.Politics and government": {
                    "type": "number"
                },
                "History and Society.Society": {
                    "type": "number"
                },
                "History and Society.Transportation": {
                    "type": "number"
                },
                "STEM.Biology": {
                    "type": "number"
                },
                "STEM.Chemistry": {
                    "type": "number"
                },
                "STEM.Computing": {
                    "type": "number"
                },
                "STEM.Earth and environment": {
                    "type": "number"
                },
                "STEM.Engineering": {
                    "type": "number"
                },
                "STEM.Libraries & Information": {
                    "type": "number"
                },
                "STEM.Mathematics": {
                    "type": "number"
                },
                "STEM.Medicine & Health": {
                    "type": "number"
                },
                "STEM.Physics": {
                    "type": "number"
                },
                "STEM.STEM*": {
                    "type": "number"
                },
                "STEM.Space": {
                    "type": "number"
                },
                "STEM.Technology": {
                    "type": "number"
                }
            }
        }
    }
}
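
A minimal sketch of checking a score object against the shape described by this schema, using only the standard library. `validate_score` is an illustrative helper, not an ORES function. It is slightly stricter than the schema, which lists the label properties without marking them required: here every expected label must be present with a numeric value.

```python
def validate_score(score, labels):
    """Return a list of problems found in a score object; empty means it conforms."""
    problems = []

    prediction = score.get("prediction")
    if not isinstance(prediction, list) or not all(isinstance(p, str) for p in prediction):
        problems.append("prediction must be an array of strings")

    probability = score.get("probability")
    if not isinstance(probability, dict):
        problems.append("probability must be an object")
        return problems

    for label in labels:
        value = probability.get(label)
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            problems.append(f"probability[{label!r}] must be a number")
    return problems


labels = ["STEM.STEM*", "STEM.Space"]
good = {"prediction": ["STEM.STEM*"],
        "probability": {"STEM.STEM*": 0.879, "STEM.Space": 0.115}}
bad = {"prediction": "STEM.STEM*", "probability": {"STEM.STEM*": "high"}}

print(validate_score(good, labels))
print(validate_score(bad, labels))
```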
Example input and output
Input:
https://ores.wikimedia.org/v3/scores/wikidatawiki/1907686315/itemtopic

Output:

{
    "wikidatawiki": {
        "models": {
            "itemtopic": {
                "version": "1.2.0"
            }
        },
        "scores": {
            "1907686315": {
                "itemtopic": {
                    "score": {
                        "prediction": [
                            "STEM.STEM*"
                        ],
                        "probability": {
                            "Culture.Biography.Biography*": 0.009059893345632097,
                            "Culture.Biography.Women": 0.0006924491258526178,
                            "Culture.Food and drink": 0.0006399242658997215,
                            "Culture.Internet culture": 0.0009384780459913412,
                            "Culture.Linguistics": 0.0018606277391225432,
                            "Culture.Literature": 0.003990388751181737,
                            "Culture.Media.Books": 0.0006214752106656115,
                            "Culture.Media.Entertainment": 0.001104834881085509,
                            "Culture.Media.Films": 0.0011465477594696284,
                            "Culture.Media.Media*": 0.009497882960118977,
                            "Culture.Media.Music": 0.0005314326820035878,
                            "Culture.Media.Radio": 0.0001418663128807519,
                            "Culture.Media.Software": 0.0006122966374525156,
                            "Culture.Media.Television": 0.0011153877562536376,
                            "Culture.Media.Video games": 0.0006372239889671269,
                            "Culture.Performing arts": 0.0006531356388159476,
                            "Culture.Philosophy and religion": 0.01399521934257544,
                            "Culture.Sports": 0.0018462250677368348,
                            "Culture.Visual arts.Architecture": 0.0016560396166840437,
                            "Culture.Visual arts.Comics and Anime": 0.0005305955236163667,
                            "Culture.Visual arts.Fashion": 0.000537788411976724,
                            "Culture.Visual arts.Visual arts*": 0.009907875401930734,
                            "Geography.Geographical": 0.01571363482516823,
                            "Geography.Regions.Africa.Africa*": 0.020280349224975614,
                            "Geography.Regions.Africa.Central Africa": 0.0007006250310735848,
                            "Geography.Regions.Africa.Eastern Africa": 0.000981468869640802,
                            "Geography.Regions.Africa.Northern Africa": 0.015712323087656205,
                            "Geography.Regions.Africa.Southern Africa": 0.001221937118377821,
                            "Geography.Regions.Africa.Western Africa": 0.0008305320623083369,
                            "Geography.Regions.Americas.Central America": 0.001306842712476455,
                            "Geography.Regions.Americas.North America": 0.030570993625411366,
                            "Geography.Regions.Americas.South America": 0.009381192562807516,
                            "Geography.Regions.Asia.Asia*": 0.08763779333893186,
                            "Geography.Regions.Asia.Central Asia": 0.0021630529281042718,
                            "Geography.Regions.Asia.East Asia": 0.01090968773383821,
                            "Geography.Regions.Asia.North Asia": 0.029951228290233667,
                            "Geography.Regions.Asia.South Asia": 0.005977584426712786,
                            "Geography.Regions.Asia.Southeast Asia": 0.0028688045552628266,
                            "Geography.Regions.Asia.West Asia": 0.0009526856502617891,
                            "Geography.Regions.Europe.Eastern Europe": 0.029972291587851183,
                            "Geography.Regions.Europe.Europe*": 0.13296776378542635,
                            "Geography.Regions.Europe.Northern Europe": 0.016907973275604154,
                            "Geography.Regions.Europe.Southern Europe": 0.005813048163270592,
                            "Geography.Regions.Europe.Western Europe": 0.005037055635498127,
                            "Geography.Regions.Oceania": 0.007780720915153282,
                            "History and Society.Business and economics": 0.005890874106250135,
                            "History and Society.Education": 0.0017680572320172617,
                            "History and Society.History": 0.01973006391755843,
                            "History and Society.Military and warfare": 0.006573635883462243,
                            "History and Society.Politics and government": 0.007573132449112524,
                            "History and Society.Society": 0.04381007914549254,
                            "History and Society.Transportation": 0.002797769913886188,
                            "STEM.Biology": 0.005780672890531569,
                            "STEM.Chemistry": 0.0022570835539507676,
                            "STEM.Computing": 0.0018290751421398967,
                            "STEM.Earth and environment": 0.0795914195853073,
                            "STEM.Engineering": 0.004058097854564882,
                            "STEM.Libraries & Information": 0.0010339015208737487,
                            "STEM.Mathematics": 0.0017040157655581244,
                            "STEM.Medicine & Health": 0.005650365932513206,
                            "STEM.Physics": 0.020150498627265184,
                            "STEM.STEM*": 0.8790296717258461,
                            "STEM.Space": 0.11458869168317454,
                            "STEM.Technology": 0.012381701761546463
                        }
                    }
                }
            }
        }
    }
}
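
A short sketch of pulling the score out of a response shaped like the example output and ranking topics by probability. The helper names are illustrative, and the response below is an abridged copy that keeps only a few probabilities. The 0.5 cut-off in `topics_above` is an assumption; it happens to reproduce the `prediction` list in this example, but the page does not state how predictions are thresholded.

```python
# Abridged copy of the example response above.
RESPONSE = {
    "wikidatawiki": {
        "models": {"itemtopic": {"version": "1.2.0"}},
        "scores": {"1907686315": {"itemtopic": {"score": {
            "prediction": ["STEM.STEM*"],
            "probability": {
                "STEM.STEM*": 0.8790296717258461,
                "Geography.Regions.Europe.Europe*": 0.13296776378542635,
                "STEM.Space": 0.11458869168317454,
                "Geography.Regions.Asia.Asia*": 0.08763779333893186,
                "Culture.Media.Radio": 0.0001418663128807519,
            },
        }}}},
    }
}


def get_score(response, rev_id, context="wikidatawiki", model="itemtopic"):
    """Navigate the nested response down to the score object for one revision."""
    return response[context]["scores"][str(rev_id)][model]["score"]


def top_topics(score, k=3):
    """Return the k (label, probability) pairs with the highest probability."""
    ranked = sorted(score["probability"].items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


def topics_above(score, threshold=0.5):
    """Return labels whose probability meets the (assumed) threshold."""
    return [label for label, p in score["probability"].items() if p >= threshold]


score = get_score(RESPONSE, 1907686315)
for label, probability in top_topics(score):
    print(f"{label}: {probability:.3f}")
print(topics_above(score))
```

Because the model is multilabel, the probabilities are per-label and need not sum to 1, so ranking or thresholding each label independently is a reasonable way to consume them.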

Data

Data pipeline
The training data was fetched from a set of revision IDs. Various pieces of information about each revision were then extracted using automated processes, and the revision text was fed into word2vec to get an item embedding. Finally, labels were derived from the mid-level WikiProject categories that the item is associated with.
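
To illustrate the embedding step, the sketch below turns a list of tokens into a single vector by averaging per-word vectors. This is one common way to build a text embedding from word2vec and is an assumption here: the page does not say how word vectors are combined. The toy three-dimensional vectors and the `embed` helper are hypothetical stand-ins for a trained vocabulary.

```python
# Toy 3-dimensional vectors standing in for a trained word2vec vocabulary.
TOY_VECTORS = {
    "galaxy": [0.9, 0.1, 0.0],
    "star": [0.7, 0.3, 0.0],
    "painter": [0.0, 0.2, 0.8],
}


def embed(tokens, vectors=TOY_VECTORS, dims=3):
    """Average the vectors of known tokens; unknown tokens are skipped."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        # No known tokens: fall back to a zero vector of the right size.
        return [0.0] * dims
    return [sum(component) / len(known) for component in zip(*known)]


# "spiral" is not in the toy vocabulary, so only two vectors are averaged.
print(embed(["spiral", "galaxy", "star"]))
```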
Training data
Training data was automatically and randomly separated from test data during training using the drafttopic git repository (which trains both drafttopic and articletopic models).
Test data
Test data was automatically and randomly split off from train data using the drafttopic git repository (which trains both drafttopic and articletopic models). The model then makes a prediction on that data, which is compared to the underlying ground truth to calculate performance statistics.
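
A minimal sketch of a random train/test split of the kind described above. The 20% test fraction, the fixed seed, and the `random_split` helper are assumptions for illustration; the page does not state the actual proportions or how the drafttopic repository implements the split.

```python
import random


def random_split(items, test_fraction=0.2, seed=0):
    """Shuffle a copy of items and split it into (train, test) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = round(len(shuffled) * test_fraction)
    cut = len(shuffled) - n_test
    return shuffled[:cut], shuffled[cut:]


# Stand-in observations; in practice each would be a labeled revision.
train, test = random_split(range(100))
print(len(train), len(test))
```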

Licenses


Citation


Cite this model card as:

@misc{
  Triedman_Bazira_2023_Wikidata_item_topic,
  title={ Wikidata item topic model card },
  author={ Triedman, Harold and Bazira, Kevin },
  year={ 2023 },
  url={ https://meta.wikimedia.org/wiki/Machine_learning_models/Production/Wikidata_item_topic }
}