Machine learning models/Production/Wikidata item topic
Model card | |
---|---|
This page is an on-wiki machine learning model card. | |
Model Information Hub | |
Model creator(s) | Aaron Halfaker (User:EpochFail) and Amir Sarabadani |
Model owner(s) | WMF Machine Learning Team (ml@wikimediafoundation.org) |
Model interface | ORES homepage |
Code | drafttopic GitHub, ORES training data, and ORES model binaries |
Uses PII | No |
In production? | Yes |
Which projects? | Wikidata |
This model uses item features to predict the likelihood that the item belongs to a set of topics. | |
Motivation
How can we predict which general topic an item belongs to? Answering this question is useful for various analyses of Wikidata dynamics. However, it is difficult to manually group a very diverse range of Wikidata items into coherent, consistent topics.
This model, part of the ORES suite of models, analyzes an item to predict its likelihood of belonging to a set of topics. Similar models (though not necessarily with the same performance level or topics) are deployed across about a dozen other projects. There is also a language-agnostic article topic model.
This model may be useful for high-level analyses of Wikidata dynamics (pageviews, item quality, edit trends) and filtering items.
Users and uses
This model should be used for:
- high-level analyses of Wikidata dynamics such as pageviews, item quality, or edit trends, e.g. how do pageview dynamics differ between the physics and biology categories?
- filtering to relevant items, e.g. keeping only items in the music category.
This model should not be used for:
- definitively establishing what topic an item pertains to
- automated editing of items or topics without a human in the loop
This model is a part of ORES, and generally accessible via API. It is used for high-level analysis of Wikidata, platform research, and other on-wiki tasks.
Example API call: https://ores.wikimedia.org/v3/scores/wikidatawiki/1907686315/itemtopic
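The example call above appears to follow the pattern `/v3/scores/{context}/{revision ID}/{model}`. The following is a minimal Python sketch of building such a URL and reading the score out of a parsed response. It assumes the response shape shown in the example output on this page and that the numeric path segment is a revision ID; the function names `build_score_url`, `extract_score`, and `fetch_score` are hypothetical helpers, not part of ORES.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Base path taken from the example API call on this page.
ORES_BASE = "https://ores.wikimedia.org/v3/scores"


def build_score_url(context, rev_id, model="itemtopic"):
    """Build the scoring URL: <base>/<context>/<revision id>/<model>."""
    return "/".join([ORES_BASE, quote(context), str(int(rev_id)), quote(model)])


def extract_score(response, context, rev_id, model="itemtopic"):
    """Pull the score object out of an already parsed JSON response.

    Assumes the nesting shown in the example output:
    context -> "scores" -> revision id -> model -> "score".
    """
    return response[context]["scores"][str(rev_id)][model]["score"]


def fetch_score(context, rev_id, model="itemtopic", timeout=10):
    """Fetch and parse one score. This performs a network request."""
    with urlopen(build_score_url(context, rev_id, model), timeout=timeout) as resp:
        return extract_score(json.load(resp), context, rev_id, model)
```

Keeping URL construction and response parsing separate from the network call makes both easy to check offline.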
Ethical considerations, caveats, and recommendations
- This model was trained on data that is now several years old (from mid-2020). Underlying data drift may skew model outputs.
- This model uses word2vec as a training feature. Word2vec, like other natural language embeddings, encodes the linguistic biases of its underlying datasets along the lines of gender, race, ethnicity, religion, etc. Since Wikidata has known biases in its text, this model may encode and at times reproduce those biases.
- This model has highly variable performance across different topics — consult the test statistics below to get a sense of inter-topic performance.
Model
Performance
Test data confusion matrix: (table values not preserved)
Test data sample rates: (table values not preserved)
Test data performance: (table values not preserved)
Implementation
Model architecture |
---|
{
"type": "GradientBoosting",
"params": {
"scale": false,
"center": false,
"labels": [
"Culture.Biography.Biography*",
"Culture.Biography.Women",
"Culture.Food and drink",
"Culture.Internet culture",
"Culture.Linguistics",
"Culture.Literature",
"Culture.Media.Books",
"Culture.Media.Entertainment",
"Culture.Media.Films",
"Culture.Media.Media*",
"Culture.Media.Music",
"Culture.Media.Radio",
"Culture.Media.Software",
"Culture.Media.Television",
"Culture.Media.Video games",
"Culture.Performing arts",
"Culture.Philosophy and religion",
"Culture.Sports",
"Culture.Visual arts.Architecture",
"Culture.Visual arts.Comics and Anime",
"Culture.Visual arts.Fashion",
"Culture.Visual arts.Visual arts*",
"Geography.Geographical",
"Geography.Regions.Africa.Africa*",
"Geography.Regions.Africa.Central Africa",
"Geography.Regions.Africa.Eastern Africa",
"Geography.Regions.Africa.Northern Africa",
"Geography.Regions.Africa.Southern Africa",
"Geography.Regions.Africa.Western Africa",
"Geography.Regions.Americas.Central America",
"Geography.Regions.Americas.North America",
"Geography.Regions.Americas.South America",
"Geography.Regions.Asia.Asia*",
"Geography.Regions.Asia.Central Asia",
"Geography.Regions.Asia.East Asia",
"Geography.Regions.Asia.North Asia",
"Geography.Regions.Asia.South Asia",
"Geography.Regions.Asia.Southeast Asia",
"Geography.Regions.Asia.West Asia",
"Geography.Regions.Europe.Eastern Europe",
"Geography.Regions.Europe.Europe*",
"Geography.Regions.Europe.Northern Europe",
"Geography.Regions.Europe.Southern Europe",
"Geography.Regions.Europe.Western Europe",
"Geography.Regions.Oceania",
"History and Society.Business and economics",
"History and Society.Education",
"History and Society.History",
"History and Society.Military and warfare",
"History and Society.Politics and government",
"History and Society.Society",
"History and Society.Transportation",
"STEM.Biology",
"STEM.Chemistry",
"STEM.Computing",
"STEM.Earth and environment",
"STEM.Engineering",
"STEM.Libraries & Information",
"STEM.Mathematics",
"STEM.Medicine & Health",
"STEM.Physics",
"STEM.STEM*",
"STEM.Space",
"STEM.Technology"
],
"multilabel": true,
"population_rates": null,
"ccp_alpha": 0.0,
"criterion": "friedman_mse",
"init": null,
"learning_rate": 0.1,
"loss": "deviance",
"max_depth": 5,
"max_features": "log2",
"max_leaf_nodes": null,
"min_impurity_decrease": 0.0,
"min_impurity_split": null,
"min_samples_leaf": 1,
"min_samples_split": 2,
"min_weight_fraction_leaf": 0.0,
"n_estimators": 150,
"n_iter_no_change": null,
"presort": "deprecated",
"random_state": null,
"subsample": 1.0,
"tol": 0.0001,
"validation_fraction": 0.1,
"verbose": 0,
"warm_start": false,
"label_weights": {}
}
}
|
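The 64 labels in the architecture above are dotted paths whose first segment is one of four top-level groups (Culture, Geography, History and Society, STEM). As a rough illustration of that hierarchy, the sketch below groups a handful of the labels by their first segment. The helper name `group_by_top_level` is hypothetical, and the sample is a subset of the labels rather than the full list.

```python
from collections import defaultdict


def group_by_top_level(labels):
    """Group dotted topic labels by their first path segment."""
    groups = defaultdict(list)
    for label in labels:
        # Split only on the first dot; the remainder stays intact.
        top_level, _, rest = label.partition(".")
        groups[top_level].append(rest)
    return dict(groups)


# A small subset of the labels listed in the model architecture.
labels = [
    "Culture.Media.Music",
    "Culture.Sports",
    "Geography.Regions.Asia.East Asia",
    "History and Society.History",
    "STEM.Physics",
    "STEM.STEM*",
]
groups = group_by_top_level(labels)
for top_level, members in groups.items():
    print(top_level, members)
```

Grouping this way can be handy for coarse analyses where only the top-level topic matters.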
Output schema |
---|
{
"title": "Scikit learn-based classifier score with probability",
"type": "object",
"properties": {
"prediction": {
"description": "The most likely labels predicted by the estimator",
"type": "array",
"items": {
"type": "string"
}
},
"probability": {
"description": "A mapping of probabilities onto each of the potential output labels",
"type": "object",
"properties": {
"Culture.Biography.Biography*": {
"type": "number"
},
"Culture.Biography.Women": {
"type": "number"
},
"Culture.Food and drink": {
"type": "number"
},
"Culture.Internet culture": {
"type": "number"
},
"Culture.Linguistics": {
"type": "number"
},
"Culture.Literature": {
"type": "number"
},
"Culture.Media.Books": {
"type": "number"
},
"Culture.Media.Entertainment": {
"type": "number"
},
"Culture.Media.Films": {
"type": "number"
},
"Culture.Media.Media*": {
"type": "number"
},
"Culture.Media.Music": {
"type": "number"
},
"Culture.Media.Radio": {
"type": "number"
},
"Culture.Media.Software": {
"type": "number"
},
"Culture.Media.Television": {
"type": "number"
},
"Culture.Media.Video games": {
"type": "number"
},
"Culture.Performing arts": {
"type": "number"
},
"Culture.Philosophy and religion": {
"type": "number"
},
"Culture.Sports": {
"type": "number"
},
"Culture.Visual arts.Architecture": {
"type": "number"
},
"Culture.Visual arts.Comics and Anime": {
"type": "number"
},
"Culture.Visual arts.Fashion": {
"type": "number"
},
"Culture.Visual arts.Visual arts*": {
"type": "number"
},
"Geography.Geographical": {
"type": "number"
},
"Geography.Regions.Africa.Africa*": {
"type": "number"
},
"Geography.Regions.Africa.Central Africa": {
"type": "number"
},
"Geography.Regions.Africa.Eastern Africa": {
"type": "number"
},
"Geography.Regions.Africa.Northern Africa": {
"type": "number"
},
"Geography.Regions.Africa.Southern Africa": {
"type": "number"
},
"Geography.Regions.Africa.Western Africa": {
"type": "number"
},
"Geography.Regions.Americas.Central America": {
"type": "number"
},
"Geography.Regions.Americas.North America": {
"type": "number"
},
"Geography.Regions.Americas.South America": {
"type": "number"
},
"Geography.Regions.Asia.Asia*": {
"type": "number"
},
"Geography.Regions.Asia.Central Asia": {
"type": "number"
},
"Geography.Regions.Asia.East Asia": {
"type": "number"
},
"Geography.Regions.Asia.North Asia": {
"type": "number"
},
"Geography.Regions.Asia.South Asia": {
"type": "number"
},
"Geography.Regions.Asia.Southeast Asia": {
"type": "number"
},
"Geography.Regions.Asia.West Asia": {
"type": "number"
},
"Geography.Regions.Europe.Eastern Europe": {
"type": "number"
},
"Geography.Regions.Europe.Europe*": {
"type": "number"
},
"Geography.Regions.Europe.Northern Europe": {
"type": "number"
},
"Geography.Regions.Europe.Southern Europe": {
"type": "number"
},
"Geography.Regions.Europe.Western Europe": {
"type": "number"
},
"Geography.Regions.Oceania": {
"type": "number"
},
"History and Society.Business and economics": {
"type": "number"
},
"History and Society.Education": {
"type": "number"
},
"History and Society.History": {
"type": "number"
},
"History and Society.Military and warfare": {
"type": "number"
},
"History and Society.Politics and government": {
"type": "number"
},
"History and Society.Society": {
"type": "number"
},
"History and Society.Transportation": {
"type": "number"
},
"STEM.Biology": {
"type": "number"
},
"STEM.Chemistry": {
"type": "number"
},
"STEM.Computing": {
"type": "number"
},
"STEM.Earth and environment": {
"type": "number"
},
"STEM.Engineering": {
"type": "number"
},
"STEM.Libraries & Information": {
"type": "number"
},
"STEM.Mathematics": {
"type": "number"
},
"STEM.Medicine & Health": {
"type": "number"
},
"STEM.Physics": {
"type": "number"
},
"STEM.STEM*": {
"type": "number"
},
"STEM.Space": {
"type": "number"
},
"STEM.Technology": {
"type": "number"
}
}
}
}
}
|
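The schema above describes two fields: `prediction`, an array of strings, and `probability`, an object mapping labels to numbers. A minimal sketch of checking a score against those two constraints follows. It uses only the standard library rather than a JSON Schema validator, the function name `validate_score` is hypothetical, and treating a missing field as a problem is an assumption, since the schema does not mark either field as required.

```python
from numbers import Real


def validate_score(score):
    """Return a list of problems found in a score object (empty if none)."""
    problems = []

    prediction = score.get("prediction")
    if not (isinstance(prediction, list)
            and all(isinstance(p, str) for p in prediction)):
        problems.append("prediction must be an array of strings")

    probability = score.get("probability")
    if not isinstance(probability, dict):
        problems.append("probability must be an object")
        return problems

    for label, value in probability.items():
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, Real):
            problems.append(f"probability[{label!r}] must be a number")
    return problems
```

For example, `validate_score({"prediction": ["STEM.STEM*"], "probability": {"STEM.STEM*": 0.879}})` returns an empty list.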
Input: https://ores.wikimedia.org/v3/scores/wikidatawiki/1907686315/itemtopic
Output:
Example output |
---|
{
"wikidatawiki": {
"models": {
"itemtopic": {
"version": "1.2.0"
}
},
"scores": {
"1907686315": {
"itemtopic": {
"score": {
"prediction": [
"STEM.STEM*"
],
"probability": {
"Culture.Biography.Biography*": 0.009059893345632097,
"Culture.Biography.Women": 0.0006924491258526178,
"Culture.Food and drink": 0.0006399242658997215,
"Culture.Internet culture": 0.0009384780459913412,
"Culture.Linguistics": 0.0018606277391225432,
"Culture.Literature": 0.003990388751181737,
"Culture.Media.Books": 0.0006214752106656115,
"Culture.Media.Entertainment": 0.001104834881085509,
"Culture.Media.Films": 0.0011465477594696284,
"Culture.Media.Media*": 0.009497882960118977,
"Culture.Media.Music": 0.0005314326820035878,
"Culture.Media.Radio": 0.0001418663128807519,
"Culture.Media.Software": 0.0006122966374525156,
"Culture.Media.Television": 0.0011153877562536376,
"Culture.Media.Video games": 0.0006372239889671269,
"Culture.Performing arts": 0.0006531356388159476,
"Culture.Philosophy and religion": 0.01399521934257544,
"Culture.Sports": 0.0018462250677368348,
"Culture.Visual arts.Architecture": 0.0016560396166840437,
"Culture.Visual arts.Comics and Anime": 0.0005305955236163667,
"Culture.Visual arts.Fashion": 0.000537788411976724,
"Culture.Visual arts.Visual arts*": 0.009907875401930734,
"Geography.Geographical": 0.01571363482516823,
"Geography.Regions.Africa.Africa*": 0.020280349224975614,
"Geography.Regions.Africa.Central Africa": 0.0007006250310735848,
"Geography.Regions.Africa.Eastern Africa": 0.000981468869640802,
"Geography.Regions.Africa.Northern Africa": 0.015712323087656205,
"Geography.Regions.Africa.Southern Africa": 0.001221937118377821,
"Geography.Regions.Africa.Western Africa": 0.0008305320623083369,
"Geography.Regions.Americas.Central America": 0.001306842712476455,
"Geography.Regions.Americas.North America": 0.030570993625411366,
"Geography.Regions.Americas.South America": 0.009381192562807516,
"Geography.Regions.Asia.Asia*": 0.08763779333893186,
"Geography.Regions.Asia.Central Asia": 0.0021630529281042718,
"Geography.Regions.Asia.East Asia": 0.01090968773383821,
"Geography.Regions.Asia.North Asia": 0.029951228290233667,
"Geography.Regions.Asia.South Asia": 0.005977584426712786,
"Geography.Regions.Asia.Southeast Asia": 0.0028688045552628266,
"Geography.Regions.Asia.West Asia": 0.0009526856502617891,
"Geography.Regions.Europe.Eastern Europe": 0.029972291587851183,
"Geography.Regions.Europe.Europe*": 0.13296776378542635,
"Geography.Regions.Europe.Northern Europe": 0.016907973275604154,
"Geography.Regions.Europe.Southern Europe": 0.005813048163270592,
"Geography.Regions.Europe.Western Europe": 0.005037055635498127,
"Geography.Regions.Oceania": 0.007780720915153282,
"History and Society.Business and economics": 0.005890874106250135,
"History and Society.Education": 0.0017680572320172617,
"History and Society.History": 0.01973006391755843,
"History and Society.Military and warfare": 0.006573635883462243,
"History and Society.Politics and government": 0.007573132449112524,
"History and Society.Society": 0.04381007914549254,
"History and Society.Transportation": 0.002797769913886188,
"STEM.Biology": 0.005780672890531569,
"STEM.Chemistry": 0.0022570835539507676,
"STEM.Computing": 0.0018290751421398967,
"STEM.Earth and environment": 0.0795914195853073,
"STEM.Engineering": 0.004058097854564882,
"STEM.Libraries & Information": 0.0010339015208737487,
"STEM.Mathematics": 0.0017040157655581244,
"STEM.Medicine & Health": 0.005650365932513206,
"STEM.Physics": 0.020150498627265184,
"STEM.STEM*": 0.8790296717258461,
"STEM.Space": 0.11458869168317454,
"STEM.Technology": 0.012381701761546463
}
}
}
}
}
}
}
|
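Because the model is multilabel, the probabilities in the example output need not sum to 1, and each label can be read on its own. The sketch below shows one way to rank topics and to filter them by a cutoff, using the five highest entries from the example output. The helper names `top_topics` and `topics_above` are hypothetical, and the 0.1 cutoff is an arbitrary illustration, not the threshold the model uses to build `prediction`.

```python
def top_topics(probability, k=3):
    """Return the k (label, probability) pairs with the highest probability."""
    ranked = sorted(probability.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


def topics_above(probability, threshold):
    """Return the labels whose probability is at least `threshold`, sorted by name."""
    return sorted(label for label, p in probability.items() if p >= threshold)


# The five highest-probability entries from the example output.
probability = {
    "Geography.Regions.Asia.Asia*": 0.08763779333893186,
    "Geography.Regions.Europe.Europe*": 0.13296776378542635,
    "STEM.Earth and environment": 0.0795914195853073,
    "STEM.STEM*": 0.8790296717258461,
    "STEM.Space": 0.11458869168317454,
}

print(top_topics(probability))
print(topics_above(probability, 0.1))
```

A cutoff-based filter like this is one way to approach the "filtering to relevant items" use case, with the cutoff chosen per topic based on the test statistics.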
Data
Licenses
- Code: MIT license
- Model: MIT license
Citation
Cite this model card as:
@misc{
Triedman_Bazira_2023_Wikidata_item_topic,
title={ Wikidata item topic model card },
author={ Triedman, Harold and Bazira, Kevin },
year={ 2023 },
url={ https://meta.wikimedia.org/wiki/Machine_learning_models/Production/Wikidata_item_topic }
}