Research:Towards Modeling Citation Quality

Created: 12:52, 21 May 2018 (UTC)
Duration:  2018-May – 2018-?
citations, references, accessibility, machine learning
This page documents a completed research project.

Idea

We would like to understand and map the quality of citations and references in Wikipedia. Reference 'quality' is a broad notion that includes reliability, accessibility, neutrality, etc. In this project, we map a set of citation dimensions as a step towards a more complete understanding of citation quality.

Citation Quality Dimensions

As a step towards modeling a full, rich notion of citation quality, we are exploring how the topic distribution and accessibility of citations look across different languages.

Topics

First, we define a topic for each publication by:

  • Collecting all articles in which the publication is cited
  • Mapping each article from a Wikipedia edition other than enwiki to the corresponding article in enwiki. This is done by finding the Wikidata item corresponding to each article, then retrieving the enwiki page linked from that Wikidata item.
  • Assigning a topic to each article, at the top of the WikiProject hierarchy, using Scoring Platform's drafttopic tool
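The cross-language lookup in the second step could be sketched roughly as follows. This is a minimal sketch and not the project's actual code: it assumes the lookup goes through the Wikidata `wbgetentities` API, the helper names `build_sitelink_query` and `enwiki_title_from_response` are hypothetical, and the sample response is hand-written with a placeholder item id.

```python
from urllib.parse import urlencode

WIKIDATA_API = "https://www.wikidata.org/w/api.php"


def build_sitelink_query(site, title):
    """Build a wbgetentities URL asking for the enwiki sitelink of a page.

    `site` is the database name of the source wiki (e.g. "fawiki").
    """
    params = {
        "action": "wbgetentities",
        "sites": site,
        "titles": title,
        "props": "sitelinks",
        "sitefilter": "enwiki",
        "format": "json",
    }
    return WIKIDATA_API + "?" + urlencode(params)


def enwiki_title_from_response(response):
    """Return the enwiki title in a parsed wbgetentities response, or None."""
    for entity in response.get("entities", {}).values():
        sitelink = entity.get("sitelinks", {}).get("enwiki")
        if sitelink is not None:
            return sitelink["title"]
    return None


# Hand-written response shaped like the API output; "Q_EXAMPLE" is a
# placeholder, not a real Wikidata item id.
sample = {
    "entities": {
        "Q_EXAMPLE": {
            "id": "Q_EXAMPLE",
            "sitelinks": {"enwiki": {"site": "enwiki", "title": "Xenon"}},
        }
    }
}
print(enwiki_title_from_response(sample))
```

Articles with no Wikidata item, or whose item has no enwiki sitelink, come back as `None` in this sketch; how the project handled such cases is not documented here.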

Accessibility

We mark each publication of type doi as Open Access or Closed Access as follows:

  • We download the Unpaywall dataset, which contains, for each DOI, a reference to the open access version of the publication, if any.
  • We match the Unpaywall dataset against the entries in our data, and assign an accessibility label to each of them
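The matching step might look something like the sketch below. It rests on several assumptions: the Unpaywall data is read as JSON Lines with `doi`, `is_oa` and `best_oa_location` fields, DOIs are compared case-insensitively, and a DOI missing from the Unpaywall data is treated as closed access. The function names are hypothetical.

```python
import json


def normalize_doi(doi):
    """Trim and lower-case a DOI so both sources compare equal."""
    return doi.strip().lower()


def build_oa_index(jsonl_lines):
    """Map each open-access DOI in an Unpaywall-style JSON Lines dump to a URL.

    Assumes each record has `doi`, `is_oa` and `best_oa_location` fields.
    """
    index = {}
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if record.get("is_oa"):
            location = record.get("best_oa_location") or {}
            index[normalize_doi(record["doi"])] = location.get("url")
    return index


def label_citation(publication_type, publication_id, oa_index):
    """Return (open_access, open_access_url); non-DOI citations get (None, None)."""
    if publication_type != "doi":
        return None, None
    key = normalize_doi(publication_id)
    if key in oa_index:
        return True, oa_index[key]
    # Assumption: a DOI absent from the index is labelled closed access.
    return False, None
```

In practice the full Unpaywall snapshot is large, so an in-memory dictionary like this one may need to be replaced by a join in a database or a distributed framework.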

Data

We published a dataset of citations with identifiers. There is one file for each Wikipedia edition (e.g. English Wikipedia, Farsi Wikipedia). Each line contains the following tab-separated values:

page_id,page_name,revision_id,timestamp,publication_type,publication_id,topic,open_access,open_access_url

Example:

249,Xenon,3771178,2010-01-20T12:40:23Z,doi,10.1093/bmb/ldh034,STEM.Chemistry,True,https://academic.oup.com/bmb/article-pdf/71/1/115/772965/ldh034.pdf

  • page_id - the id of the Wikipedia page that cites the publication
  • page_name - the title of the page citing the publication
  • revision_id - the id of the revision in which the citation was added
  • timestamp - the time at which the revision was saved
  • publication_type - the type of the publication cited, one of: isbn, doi, pmid, pmc, arxiv
  • publication_id - the identifier of the publication; the format differs according to the type
  • topic - the publication topic, inherited from the pages where it is cited
  • open_access - boolean: true if the publication is open access, false if it is not [for DOI publications only]
  • open_access_url - the url of the open access version of the publication, if 'open_access' is true
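A reader for these files could be sketched as below. This is only an illustration under stated assumptions: the `Citation` type and `read_citations` function are hypothetical names, and the delimiter is left as a parameter because the example line shown on this page is comma-separated while the files are described as tab-separated.

```python
import csv
import io
from typing import NamedTuple, Optional

FIELDS = [
    "page_id", "page_name", "revision_id", "timestamp", "publication_type",
    "publication_id", "topic", "open_access", "open_access_url",
]


class Citation(NamedTuple):
    page_id: int
    page_name: str
    revision_id: int
    timestamp: str
    publication_type: str
    publication_id: str
    topic: str
    open_access: Optional[bool]
    open_access_url: Optional[str]


def parse_bool(value):
    """Map 'True'/'False' (any case) to bool; anything else to None."""
    return {"true": True, "false": False}.get(value.strip().lower())


def read_citations(handle, delimiter="\t"):
    """Yield one Citation per well-formed row of an open file handle."""
    for row in csv.reader(handle, delimiter=delimiter):
        if len(row) != len(FIELDS) or row[0] == "page_id":
            continue  # skip malformed rows and an optional header row
        rec = dict(zip(FIELDS, row))
        yield Citation(
            page_id=int(rec["page_id"]),
            page_name=rec["page_name"],
            revision_id=int(rec["revision_id"]),
            timestamp=rec["timestamp"],
            publication_type=rec["publication_type"],
            publication_id=rec["publication_id"],
            topic=rec["topic"],
            open_access=parse_bool(rec["open_access"]),
            open_access_url=rec["open_access_url"] or None,
        )


# The example line from this page, read with a comma delimiter.
EXAMPLE = (
    "249,Xenon,3771178,2010-01-20T12:40:23Z,doi,10.1093/bmb/ldh034,"
    "STEM.Chemistry,True,"
    "https://academic.oup.com/bmb/article-pdf/71/1/115/772965/ldh034.pdf"
)
for citation in read_citations(io.StringIO(EXAMPLE), delimiter=","):
    print(citation.page_name, citation.publication_id, citation.open_access)
```

For non-DOI rows, `open_access` is expected to be empty, which this sketch maps to `None`.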

Visualizations

We produced some visualizations to show the distribution of publications by topic and accessibility. Non-interactive examples are shown below.
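The numbers behind such a chart could be derived with an aggregation like the one below. This is a hedged sketch rather than the code used for the visualizations: the function name is hypothetical, each publication is counted once per topic (an assumption), and rows without an accessibility label are ignored.

```python
from collections import Counter, defaultdict


def open_access_share_by_topic(records):
    """Return {topic: fraction of open-access publications}.

    `records` is an iterable of (publication_id, topic, open_access) tuples,
    where open_access is True, False, or None (no label, e.g. non-DOI rows).
    """
    seen = set()
    counts = defaultdict(Counter)
    for publication_id, topic, open_access in records:
        if open_access is None or (publication_id, topic) in seen:
            continue  # skip unlabelled rows and repeated citations
        seen.add((publication_id, topic))
        counts[topic][open_access] += 1
    return {
        topic: c[True] / (c[True] + c[False])
        for topic, c in counts.items()
    }
```

The resulting dictionary can then be passed to any plotting library, for example as a bar chart with one bar per topic.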