Machine learning models/Proposed/Wikidata item completeness
This model card page currently has a draft status. It is a piece of model documentation that is in the process of being written. Once the model card is completed, this template should be removed.
This model aims to determine the completeness of a given Wikidata item by predicting which properties, labels, and references are still missing and should be added. It is similar to, but distinct from, the approach taken by prior Wikidata item quality models, which correlate much more strongly with how many claims are present in an item regardless of that item's type (instance-of). It is inspired by approaches like Recoin that recommend missing claims for items, and it adds support for labels (based on sitelinks) and references (based on Amaral et al.[1]).
Model card | |
---|---|
This page is an on-wiki machine learning model card. | |
Model Information Hub | |
Model creator(s) | Isaac Johnson |
Model owner(s) | Isaac Johnson |
Model interface | https://wikidata-quality.wmcloud.org/api/item-scores |
Past performance | task T321224 |
Code | https://gitlab.wikimedia.org/isaacj/miscellaneous-wikimedia/-/tree/master/annotation-gap |
Uses PII | No |
In production? | No |
Which projects? | Wikidata |
This model uses Wikidata items to predict missing claims, references, and labels for Wikidata items. | |
Motivation
This model can serve a few important purposes:
- Estimating item completeness to help understand gaps in the Wikimedia projects. This is not just about Wikidata given that many Wikipedia articles draw important information from their corresponding Wikidata items.
- Helping editors on Wikidata to identify tasks -- having an estimate of completeness combined with some metric for priority (e.g., total number of sitelinks) would enable better recommender systems for Wikidata.
Users and uses
- Suggesting properties/labels/references for editors to add (task recommender)
- Estimating Wikidata completeness (analysis)
- Automatically adding properties to Wikidata items
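As a rough illustration of the task-recommender use, the sketch below ranks items by combining a completeness estimate with a priority signal such as sitelink count. The scoring formula, the `ItemScore` type, and the example values are all hypothetical and chosen only for illustration; nothing here is the model's actual ranking logic.

```python
import math
from dataclasses import dataclass


@dataclass
class ItemScore:
    """Hypothetical container for one item's scores (not part of the model's API)."""
    qid: str
    completeness: float  # assumed to be in [0, 1]; higher means more complete
    sitelinks: int       # number of sitelinks, used here as a priority signal


def priority(item: ItemScore) -> float:
    # ASSUMPTION: weight the completeness gap by log-scaled sitelinks so that
    # heavily linked items do not completely dominate the ranking.
    return (1.0 - item.completeness) * math.log1p(item.sitelinks)


def rank_tasks(items: list[ItemScore]) -> list[ItemScore]:
    """Return items ordered from highest to lowest editing priority."""
    return sorted(items, key=priority, reverse=True)


# Placeholder identifiers and made-up values, not real Wikidata items.
items = [
    ItemScore("A", completeness=0.9, sitelinks=300),
    ItemScore("B", completeness=0.4, sitelinks=50),
    ItemScore("C", completeness=0.4, sitelinks=2),
]
ranked = [item.qid for item in rank_tasks(items)]
print(ranked)
```

Other weightings (for example, a plain product or a threshold on sitelinks) would be equally plausible; the right choice depends on what a given recommender wants to prioritize.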
Ethical considerations, caveats, and recommendations
This model is based on the properties and references that currently exist on Wikidata. As such, to the degree that Wikidata is incomplete, the estimates of completeness made by the model will also be underestimates. It should be used as a starting point and will evolve as Wikidata evolves.
Model
Performance
Implementation
{
"item": "https://www.wikidata.org/wiki/Q20909",
"predicted-completeness": "D",
"predicted-quality": "A",
"features": {
"claim-completeness": 0.8063980787390058,
"label-desc-completeness": 0.75,
"num-claims": 79,
"ref-completeness": 0.4174470862702438
}
}
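A minimal sketch of how a client might consume a response shaped like the sample above. The field names come from that sample; the `qid` query parameter and the helper names `build_request_url` and `summarize` are assumptions made for illustration and are not confirmed by this page. The sketch parses an embedded copy of the sample rather than calling the live service.

```python
import json
from urllib.parse import urlencode

API_URL = "https://wikidata-quality.wmcloud.org/api/item-scores"


def build_request_url(qid: str) -> str:
    # ASSUMPTION: the endpoint accepts the item ID as a `qid` query parameter.
    return f"{API_URL}?{urlencode({'qid': qid})}"


def summarize(payload: dict) -> dict:
    """Extract the predicted classes and the lowest-scoring completeness feature."""
    features = payload["features"]
    # Keep only the completeness components (this skips the raw `num-claims` count).
    components = {k: v for k, v in features.items() if k.endswith("-completeness")}
    weakest = min(components, key=components.get)
    return {
        "qid": payload["item"].rsplit("/", 1)[-1],
        "completeness_class": payload["predicted-completeness"],
        "quality_class": payload["predicted-quality"],
        "weakest_component": weakest,
    }


# Copy of the sample response shown above.
SAMPLE = json.loads("""
{
  "item": "https://www.wikidata.org/wiki/Q20909",
  "predicted-completeness": "D",
  "predicted-quality": "A",
  "features": {
    "claim-completeness": 0.8063980787390058,
    "label-desc-completeness": 0.75,
    "num-claims": 79,
    "ref-completeness": 0.4174470862702438
  }
}
""")

request_url = build_request_url("Q20909")
summary = summarize(SAMPLE)
print(request_url)
print(summary)
```

For this sample, the lowest of the three completeness features is the reference score, which is the kind of signal a task recommender could use to suggest adding references first.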
Data
Licenses
- Code: MIT License
- Model: CC0 License
Citation
Cite this model as:
@misc{name_year_modeltype,
title={Wikidata Item Completeness},
author={Johnson, Isaac},
year={2024},
url={https://meta.wikimedia.org/wiki/Machine_learning_models/Proposed/Wikidata_item_completeness}
}
References
- [1] Amaral, Gabriel; Piscopo, Alessandro; Kaffee, Lucie-Aimée; Rodrigues, Odinaldo; Simperl, Elena (2021-10-15). "Assessing the Quality of Sources in Wikidata Across Languages: A Hybrid Approach". J. Data and Information Quality 13 (4): 23:1–23:35. ISSN 1936-1955. doi:10.1145/3484828.