WikiCite 2018/Program/Building a WikiCite corpus

Abstract:

In WikiCite contexts, a corpus is a set of Wikidata entries that share some common characteristics, for instance having the same author, translator, language or topic, being cited from the same Wikipedia page, or having been published within a given geographic region or a specific period. In this talk, I will explore some examples of such corpora that have been or are being assembled on Wikidata and highlight how they can be used to improve data quality, data models, tools and workflows, or simply to gain a deeper understanding of the relationships between elements of the corpus. These examples include corpora under the auspices of the WikiProjects Wikipedia Sources, Retractions, Invasive Species and Kākāpō, as well as the Zika Corpus and others.

Presentation

Defining a WikiCite corpus

Multiple approaches are possible here; I am just illustrating some.

Primary corpora

  • things that have been
    • authored
    • published
    • cited
    • archived
    • used as a reference on Wikimedia platforms
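
One way to make such a definition concrete is a SPARQL query over the shared characteristic. The following Python sketch only builds the query text for the Wikidata Query Service and does not contact it. The property IDs (P50 author, P921 main subject, P2860 cites work) are assumptions added for illustration, and the function name and the item ID in the usage line are hypothetical placeholders.

```python
import re

# Property IDs assumed for illustration; the talk notes do not name them.
PROPERTIES = {
    "author": "P50",
    "main subject": "P921",
    "cites work": "P2860",
}

_QID = re.compile(r"Q[1-9][0-9]*")


def corpus_query(characteristic: str, value_qid: str, limit: int = 100) -> str:
    """Build SPARQL selecting items whose given property points at value_qid."""
    if not _QID.fullmatch(value_qid):
        raise ValueError(f"not a Wikidata item ID: {value_qid!r}")
    prop = PROPERTIES[characteristic]
    # wd: and wdt: are prefixes predefined by the Wikidata Query Service.
    return (
        "SELECT ?item WHERE {\n"
        f"  ?item wdt:{prop} wd:{value_qid} .\n"
        "}\n"
        f"LIMIT {int(limit)}"
    )


# Placeholder item ID, not a real corpus topic.
print(corpus_query("main subject", "Q1"))
```

The same pattern covers "cited" corpora by using "cites work" with the cited item as the value; characteristics such as a publication period would need a different query shape than this single triple pattern.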

Secondary corpora

What to consider before getting started on a new corpus

  • What is already there?
    • items
    • properties
    • What about lexemes/forms/senses?
  • How is it modeled?
  • What is the purpose of the existing and new corpora?
    • discovery
      • e.g. of knowledge, connections, potential collaborators
    • quality control
      • might involve
        • constraint statements
        • Shape Expressions
        • maintenance queries
          • for constraints, benchmarks etc. (some examples)
        • Scholia
          • see also next talk
    • research assessment
  • What about starting your corpus as a subset of one of the existing ones?
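
As an illustration of the quality-control purpose, a maintenance check can list the items of a corpus that lack statements one would expect. The sketch below runs over a hypothetical in-memory snapshot rather than live Wikidata; the function name, the item IDs and the choice of P50 (author) and P577 (publication date) as expected properties are assumptions, not part of the talk.

```python
from typing import Dict, List, Mapping, Sequence


def missing_statements(
    corpus: Mapping[str, Mapping[str, Sequence[str]]],
    required: Sequence[str],
) -> Dict[str, List[str]]:
    """Map each item ID to the required properties it has no value for."""
    report: Dict[str, List[str]] = {}
    for item_id, statements in corpus.items():
        # A property counts as missing if absent or present with no values.
        missing = [p for p in required if not statements.get(p)]
        if missing:
            report[item_id] = missing
    return report


# Hypothetical snapshot: item ID -> property ID -> list of values.
snapshot = {
    "Q1001": {"P50": ["Q2001"], "P577": ["2018-11-27"]},
    "Q1002": {"P50": ["Q2002"]},
}
print(missing_statements(snapshot, ["P50", "P577"]))
```

On Wikidata itself, the same kind of check would more likely be expressed as a constraint statement, a Shape Expression or a maintenance query, as listed above; the local version merely shows the logic.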

Notes

  • How does it relate to the past, present and future of WikiCite?
  • create new properties
  • revise data models
  • write Shape Expressions
  • build/adapt tools and workflows
  • SourceMD

Additional considerations

  • Complete corpora are good for testing purposes, so watch out for
    • things that do not change (much any more), e.g.
      • all citations from a given version of a publication
      • publishers, journals, organizations, authors, countries etc. that do not exist any more
      • publications in extinct languages
  • Scholia for quality control
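
A complete, stable corpus can serve as a test fixture: because the expected membership is known and no longer changes, whatever a tool or query returns can be compared against it. A minimal sketch, assuming the reference list is available as a set of identifiers; the function name and the DOI-like strings are hypothetical.

```python
from typing import Dict, Iterable, Set


def compare_to_reference(
    reference: Iterable[str], found: Iterable[str]
) -> Dict[str, Set[str]]:
    """Report identifiers missing from, or unexpected in, the found set."""
    ref, got = set(reference), set(found)
    return {"missing": ref - got, "unexpected": got - ref}


# Hypothetical identifiers standing in for all citations of one
# fixed version of a publication.
reference = {"10.1234/a", "10.1234/b", "10.1234/c"}
found = {"10.1234/a", "10.1234/c", "10.1234/x"}
print(compare_to_reference(reference, found))
```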

Presenter

Daniel in 2017

Daniel Mietchen is trained as a biophysicist and now works for the Data Science Institute of the University of Virginia on opening up research and education workflows for large-scale collaboration, including with machines. More details via Scholia or this user page.