Research:Wikidata Item Quality Classification
This page documents a research project in progress.
Information may be incomplete and change as the project progresses.
Please contact the project lead before formally citing or reusing results from this page.
Revision scoring as a service has been predicting the quality of article edits on Wikipedia, which is proving quite useful for editors. This research extends that work to Wikidata: it aims to classify items by quality, based on a set of metrics.
Item Quality classification is the parent project task on Phabricator.
Possible metrics for good quality items
The Showcase items page on Wikidata lists the possible metrics for a good quality item on Wikidata. Drawing from there, a good quality item should have:
- At least 10 statements with:
  - Sources for non-trivial statements (sources other than Wikimedia projects)
  - Appropriate ranks
  - Qualifiers where applicable
  - Adequate ordering
- A reasonable set (~4) of completed translations: Labels, descriptions and properties
- Aliases when appropriate, in each language
- Sitelinks to a complete and correct set of applicable pages on Wikimedia projects
- An image associated with it (a plus)
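Several of the criteria above can be counted directly from an item's JSON. The following is a minimal sketch, assuming the item is already loaded as a Python dict in the standard Wikidata entity JSON layout (`claims`, `labels`, `descriptions`, `aliases`, `sitelinks`). The function names and feature names are hypothetical, and treating a reference as "external" whenever it uses any property other than P143 (imported from Wikimedia project) is a simplifying assumption, not a rule from the Showcase criteria.

```python
# Property IDs: P18 = image, P143 = "imported from Wikimedia project".
IMAGE_PROPERTY = "P18"
WIKIMEDIA_IMPORT_PROPERTY = "P143"


def has_external_reference(statement):
    """Return True if any reference uses a property other than P143.

    Assumption: a reference made only of P143 snaks points back to a
    Wikimedia project and therefore does not count as an external source.
    """
    for reference in statement.get("references", []):
        snaks = reference.get("snaks", {})
        if any(pid != WIKIMEDIA_IMPORT_PROPERTY for pid in snaks):
            return True
    return False


def item_features(item):
    """Count simple quality signals for one item (hypothetical feature set)."""
    claims = item.get("claims", {})
    statements = [s for group in claims.values() for s in group]
    total = len(statements)
    sourced = sum(1 for s in statements if has_external_reference(s))
    # A language counts as "translated" here if it has both a label
    # and a description.
    translated = set(item.get("labels", {})) & set(item.get("descriptions", {}))
    return {
        "statement_count": total,
        "externally_sourced_share": sourced / total if total else 0.0,
        "qualified_statement_count": sum(1 for s in statements if s.get("qualifiers")),
        "translated_languages": len(translated),
        "alias_languages": len(item.get("aliases", {})),
        "sitelink_count": len(item.get("sitelinks", {})),
        "has_image": IMAGE_PROPERTY in claims,
    }
```

Criteria such as "appropriate ranks" or a "complete and correct" set of sitelinks need human judgment or extra data, so this sketch only covers the signals that can be counted.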
A tentative roadmap
The following basic steps would be involved in getting a quality classifier up and running for Wikidata (to be expanded further once the project starts):
Discussion
Discussion on Wikidata about what makes an item good or bad quality.
Features
Coming up with a set of actionable features of an item on which to base the decision about its quality.
Labelling
Developing a MediaWiki gadget to poll Wikidata users on the quality of items, in order to generate the training and test sets.
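One way the poll results could be turned into training and test sets is sketched below. It assumes the gadget yields `(item_id, label)` pairs, keeps only items on which a strict majority of voters agree, and then splits the labelled items at random. The function names, the `"good"`/`"bad"` labels, the `min_votes` threshold, and the 80/20 split are all illustrative assumptions rather than decisions made by the project.

```python
import random
from collections import Counter


def aggregate_votes(votes, min_votes=2):
    """Map each item to its majority label.

    votes: iterable of (item_id, label) pairs from the poll.
    Items with fewer than min_votes votes, or without a strict
    majority for one label, are left out.
    """
    by_item = {}
    for item_id, label in votes:
        by_item.setdefault(item_id, []).append(label)

    labelled = {}
    for item_id, labels in by_item.items():
        if len(labels) < min_votes:
            continue
        top_label, top_count = Counter(labels).most_common(1)[0]
        if top_count * 2 > len(labels):  # strict majority only
            labelled[item_id] = top_label
    return labelled


def train_test_split(labelled, test_fraction=0.2, seed=0):
    """Split labelled items into (train, test) dicts, reproducibly."""
    item_ids = sorted(labelled)
    random.Random(seed).shuffle(item_ids)
    n_test = int(len(item_ids) * test_fraction)
    test = {i: labelled[i] for i in item_ids[:n_test]}
    train = {i: labelled[i] for i in item_ids[n_test:]}
    return train, test
```

Dropping items without a clear majority trades set size for label reliability; a project with few voters per item might prefer to keep them and record the disagreement instead.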
Machine Learning
Developing a supervised learning approach to begin with, integrating the proposed actionable features to see the results.
Improvement
Refinement of the model to improve the accuracy of its predictions and to decide which features are useful and which are not.
Quality of Linked Data
Linked data quality has been assessed before, and several dimensions have been identified [1].
Some of them, in relation to Wikidata, are:
- Amount of data - The number of statements can be a criterion for the wealth of information.
- Reliability - The existence of external links can be a dimension of quality.
- Interpretability - The existence of translations for labels could affect the reach of the item.
- Classification and format - Whether the data is in the appropriate unit and the classification of the entity is correct. The classification should also be at the appropriate level.
- Correctness - Whether the label, alias and description are correct.
- Correctness - Whether the information is correct. E.g. some statements may require qualification by a time period, such as the president of a nation.
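The last point above, that some statements need a time period, lends itself to an automatic check. The sketch below flags statements on time-sensitive properties that carry no temporal qualifier. It assumes the standard Wikidata entity JSON layout; the function name and the choice of which properties count as time-sensitive (P39 position held, P35 head of state, P6 head of government) are assumptions made for illustration.

```python
# Temporal qualifiers: P580 = start time, P582 = end time, P585 = point in time.
TEMPORAL_QUALIFIERS = {"P580", "P582", "P585"}

# Assumption: an illustrative, incomplete set of properties whose values
# usually only hold for a limited period.
TIME_SENSITIVE_PROPERTIES = {"P39", "P35", "P6"}


def statements_missing_time_qualifier(item, properties=TIME_SENSITIVE_PROPERTIES):
    """Return (property_id, statement_index) pairs lacking any temporal qualifier."""
    missing = []
    claims = item.get("claims", {})
    for pid in sorted(properties):
        for index, statement in enumerate(claims.get(pid, [])):
            qualifiers = statement.get("qualifiers", {})
            if not TEMPORAL_QUALIFIERS & set(qualifiers):
                missing.append((pid, index))
    return missing
```

The number of flagged statements could serve as one candidate feature for the classifier, though a missing qualifier is only a hint that a statement may be incomplete, not proof that it is wrong.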
References
- [1] Amrapali Zaveri, Anisa Rula, Andrea Maurino, Ricardo Pietrobon, Jens Lehmann and Sören Auer (2012). Quality Assessment for Linked Data: A Survey (PDF). Semantic Web 1 (2012) 1–5.