Research:Wikidata Item Quality Classification

Created
19:34, 16 October 2014 (UTC)
Contact
Duration:  2016-August – ??
Open source project

This page documents a research project in progress.
Information may be incomplete and change as the project progresses.
Please contact the project lead before formally citing or reusing results from this page.


Revision scoring as a service has been predicting the quality of article edits on Wikipedia, which is proving quite useful for editors. This research extends that work to Wikidata: the aim is to classify items by quality, based on a set of metrics.

Item Quality classification is the parent task for this project on Phabricator.

Possible metrics for good quality items


Showcase Items on Wikidata lists the possible metrics for a good quality item on Wikidata. Drawing from there, a good quality item should have:

  • At least 10 statements with:
    • Sources for non-trivial statements (sources other than Wikimedia projects)
    • Appropriate ranks
    • Qualifiers where applicable
    • Ordered adequately
  • A reasonable set (~4) of completed translations: Labels, descriptions and properties
  • Aliases when appropriate, in each language
  • Sitelinks to a complete and correct set of applicable pages on Wikimedia projects
  • An image associated with it (a plus)
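As a rough illustration, some of the criteria above can be checked mechanically against an item's JSON. The sketch below assumes the item is a dict in the Wikibase entity JSON layout (`claims`, `labels`, `descriptions`, `aliases`, `sitelinks`). The function names, the returned keys and the thresholds passed as defaults are hypothetical, and treating P143 ("imported from Wikimedia project") as the marker of a Wikimedia-only source is an assumption. Ranks, qualifier applicability and statement ordering are not checked here.

```python
WIKIMEDIA_IMPORT = "P143"  # "imported from Wikimedia project" (assumed marker of a Wikimedia-only source)
IMAGE = "P18"              # "image"


def has_external_source(statement):
    """Return True if any reference uses a property other than a Wikimedia import."""
    for ref in statement.get("references", []):
        if any(prop != WIKIMEDIA_IMPORT for prop in ref.get("snaks", {})):
            return True
    return False


def showcase_checks(item, min_statements=10, min_languages=4):
    """Check an item dict against a subset of the showcase criteria (hypothetical helper)."""
    claims = item.get("claims", {})
    # Flatten {property: [statement, ...]} into one list of statements.
    statements = [s for group in claims.values() for s in group]
    labels = item.get("labels", {})
    descriptions = item.get("descriptions", {})
    # A language counts as translated only if it has both a label and a description.
    translated = [lang for lang in labels if lang in descriptions]
    sourced = sum(1 for s in statements if has_external_source(s))
    return {
        "enough_statements": len(statements) >= min_statements,
        "sourced_fraction": sourced / len(statements) if statements else 0.0,
        "enough_translations": len(translated) >= min_languages,
        "has_aliases": bool(item.get("aliases")),
        "has_sitelinks": bool(item.get("sitelinks")),
        "has_image": IMAGE in claims,
    }
```

Returning a dict of separate checks, rather than a single score, keeps each criterion visible so that they can later be weighed individually.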

A tentative roadmap


The following are the basic steps involved in getting a quality classifier up and running for Wikidata (to be expanded further once the project starts):

Discussion


Discussion on Wikidata about what makes an item good quality or bad.

Features


Coming up with a set of actionable features of an item on which to base a decision about its quality.

Labelling


Developing a MediaWiki gadget to poll Wikidata users on the quality of items, in order to generate the training and test sets.
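One way the polled judgements might be turned into training and test sets is sketched below. It assumes the gadget yields `(item_id, label)` pairs. The function names, the label strings and the 20% test fraction are all hypothetical, and majority voting is only one possible way to reconcile users who disagree.

```python
import random
from collections import Counter


def majority_labels(votes):
    """Map each item id to its most common label.

    Ties go to the label that was seen first for that item.
    """
    by_item = {}
    for item_id, label in votes:
        by_item.setdefault(item_id, Counter())[label] += 1
    return {item_id: counts.most_common(1)[0][0] for item_id, counts in by_item.items()}


def train_test_split(labelled, test_fraction=0.2, seed=0):
    """Split {item_id: label} into (train, test) dicts, reproducibly for a given seed."""
    ids = sorted(labelled)
    random.Random(seed).shuffle(ids)
    # Keep at least one item in the test set whenever there is any data.
    n_test = max(1, round(len(ids) * test_fraction)) if ids else 0
    test_ids = set(ids[:n_test])
    train = {i: labelled[i] for i in ids if i not in test_ids}
    test = {i: labelled[i] for i in test_ids}
    return train, test
```

Fixing the seed keeps the split stable between runs, so later changes in accuracy can be attributed to the model rather than to a different split.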

Machine Learning


Develop a supervised learning approach to begin with, integrating the proposed actionable features to see the results.
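To make the idea of a supervised approach concrete, here is a minimal sketch using a nearest-centroid classifier. This technique was picked only because it fits in a few lines of standard-library Python; it is not the project's chosen model. The feature order (statement count, label languages, referenced fraction), the labels and the numbers in the usage example are invented for illustration.

```python
import math
from collections import defaultdict


def fit_centroids(rows):
    """rows: list of (feature_vector, label). Return {label: mean feature vector}."""
    sums = {}
    counts = defaultdict(int)
    for features, label in rows:
        if label not in sums:
            sums[label] = [0.0] * len(features)
        for k, value in enumerate(features):
            sums[label][k] += value
        counts[label] += 1
    return {label: [v / counts[label] for v in total] for label, total in sums.items()}


def predict(centroids, features):
    """Return the label whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(centroids[label], features))


# Invented toy data: [statement count, label languages, referenced fraction]
training = [
    ([25, 12, 0.8], "good"),
    ([18, 6, 0.6], "good"),
    ([2, 1, 0.0], "bad"),
    ([4, 2, 0.1], "bad"),
]
centroids = fit_centroids(training)
prediction = predict(centroids, [20, 8, 0.7])
```

The features are not rescaled here, so large-valued ones such as the statement count dominate the distance. A real model would likely normalise them first.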

Improvement


Refinement of the model to improve the accuracy of the predictions and to decide which features are useful and which are not.

Quality of Linked Data


Linked data quality has been assessed before, and several dimensions have been identified [1].
Some of them, as they relate to Wikidata, are:

  • Amount of data - The number of statements can be a criterion for the wealth of information.
  • Reliability - The existence of external links can be a dimension of quality.
  • Interpretability - The existence of label translations in multiple languages could affect the reach of the item.
  • Classification and format - Whether the data is in the appropriate unit and the entity is classified correctly, at an appropriate level.
  • Correctness - Whether the label, alias and description are correct.
  • Correctness - Whether the information is correct. E.g. some statements, such as the president of a nation, need to be qualified by a time period.
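A few of these dimensions lend themselves to simple counts over an item's JSON, as the sketch below suggests. It assumes the Wikibase entity JSON layout, and its mapping choices are assumptions rather than definitions from the survey: "external links" is read as references carrying P854 ("reference URL"), and the time-period example is approximated by P39 ("position held") statements lacking a P580 ("start time") or P582 ("end time") qualifier. The function name and result keys are hypothetical. Correctness of labels or facts cannot be measured this way.

```python
REFERENCE_URL = "P854"              # "reference URL"
POSITION_HELD = "P39"               # "position held"
TIME_QUALIFIERS = {"P580", "P582"}  # "start time", "end time"


def dimension_features(item):
    """Turn some linked-data quality dimensions into counts (hypothetical helper)."""
    claims = item.get("claims", {})
    statements = [s for group in claims.values() for s in group]
    # Reliability proxy: statements with at least one reference that has a URL.
    with_url = sum(
        1
        for s in statements
        if any(REFERENCE_URL in ref.get("snaks", {}) for ref in s.get("references", []))
    )
    # Position statements with neither a start time nor an end time qualifier.
    undated_positions = sum(
        1
        for s in claims.get(POSITION_HELD, [])
        if not TIME_QUALIFIERS & set(s.get("qualifiers", {}))
    )
    return {
        "amount_of_data": len(statements),
        "reliability": with_url,
        "interpretability": len(item.get("labels", {})),
        "undated_positions": undated_positions,
    }
```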

References

  1. Amrapali Zaveri, Anisa Rula, Andrea Maurino, Ricardo Pietrobon, Jens Lehmann and Sören Auer (2012). Quality Assessment for Linked Data: A Survey (PDF). Semantic Web 1 (2012) 1–5.