WikiCite 2018/Program/Building a WikiCite corpus
Abstract:
In WikiCite contexts, a corpus is a set of Wikidata entries that share some common characteristics, for instance having the same author, translator, language or topic, being cited from the same Wikipedia page or having been published from within a geographic region or within a specific period. In this talk, I will explore some examples of such corpora that have been or are being assembled on Wikidata and highlight how they can be used to improve data quality, data models, tools and workflows or simply to gather a deeper understanding of the relationships between elements of the corpus. These examples include corpora under the auspices of the WikiProjects Wikipedia Sources, Retractions, Invasive Species, Kākāpō as well as Zika Corpus and others.
Presentation
Defining a WikiCite corpus
Multiple approaches are possible here; I am just illustrating some.
Primary corpora
- things that have been
- authored
- published
- cited
- archived
- used as a reference on Wikimedia platforms
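A primary corpus of this kind can be written down as a query against Wikidata. The following is a minimal sketch, not a tool used in the talk: it only builds SPARQL text for "all items sharing one property value". The names `CORPUS_PROPERTIES` and `corpus_query` are hypothetical, and Q42 in the usage line is just a placeholder item.

```python
import re

# Hypothetical mapping from a corpus-defining characteristic
# to the Wikidata property that expresses it.
CORPUS_PROPERTIES = {
    "author": "P50",
    "translator": "P655",
    "main subject": "P921",
    "published in": "P1433",
    "cites work": "P2860",
}


def corpus_query(characteristic: str, value_qid: str, limit: int = 100) -> str:
    """Build SPARQL text selecting items whose `characteristic` is `value_qid`."""
    prop = CORPUS_PROPERTIES[characteristic]
    if not re.fullmatch(r"Q[1-9][0-9]*", value_qid):
        raise ValueError(f"not a Wikidata item ID: {value_qid!r}")
    return (
        "SELECT ?work ?workLabel WHERE {\n"
        f"  ?work wdt:{prop} wd:{value_qid} .\n"
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }\n'
        "}\n"
        f"LIMIT {limit}"
    )


if __name__ == "__main__":
    # Placeholder example: items with author = Q42.
    print(corpus_query("author", "Q42", limit=10))
```

The resulting text can be pasted into the Wikidata Query Service; swapping the characteristic (for instance "translator" or "cites work") yields a different corpus from the same template.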
Secondary corpora
- things related to things that form primary corpora, e.g.
- topics of things that have been published
- authors of things that have been cited
- people with the same author name string
- events attended by authors
- for instance WikiCite 2018 (see next talk)
- institutions with which authors are/were affiliated
- e.g. James Mason University people
- things published in a given language
- collections of things that form primary corpora, e.g.
- corpus of things cited from Wikipedia that have a persistent identifier for publications
- items or properties required by things that form primary corpora, e.g.
- Bibliographic properties
- newspaper, book or journal items need publisher items
- see also WikiProject Books or WikiProject Periodicals
- journal article items need journal items
- tools or workflows around things that form primary corpora
- e.g. SourceMD or Using OpenRefine to extract affiliation information from ORCID
- note that work on corpora stimulated the development of such tools
- Wiki(m|p)edia citation templates for things that form primary corpora
- publications licensed compatibly with Wikimedia projects
- good foundation for reuse of text and media in Wikisource, Wikimedia Commons etc.
- see also Ina Blümel's lightning talk
- statements supported by the same source (cf. Dario's slide 17)
- all statements citing
- any article from the New York Times (or Daily Mail, or Berkeley News) (for those working on such issues around newspapers, see: WikiProject Periodicals on Wikidata, and WikiProject Newspapers on English Wikipedia)
- a specific article amongst these
- any works of Joseph Stiglitz
- journal articles by physicists who worked at Oxford University in the 1970s
- a journal article that was retracted
- timelines of things that form primary corpora, e.g.
- things translated by the same translator
- e.g. by Yanka Kupala
- type specimens of biological taxa or minerals
- research published last week
- see also this FORCE 2018 session
What to consider before getting started on a new corpus
- What is already there?
- items
- properties
- What about lexemes/forms/senses?
- How is it modeled?
- What is the purpose of the existing and new corpora?
- discovery
- e.g. of knowledge, connections, potential collaborators
- quality control
- might involve
- constraint statements
- Shape Expressions
- see also WikiProject ShEx and lightning talks by Eric Prud'hommeaux and Jose Emilio Labra Gayo
- maintenance queries
- for constraints, benchmarks etc. (some examples)
- Scholia
- see also next talk
- research assessment
- What about starting your corpus as a subset of one of the existing ones?
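As an illustration of the maintenance queries mentioned under quality control, the sketch below builds SPARQL text that lists items of a given class lacking a statement the data model expects, for instance article items without a "published in" (P1433) statement. The function name `missing_statement_query` is hypothetical, and this is only one possible shape for such a check. Running it unrestricted over a very large class may time out on the public query service, so in practice it would likely be narrowed to the corpus at hand.

```python
import re


def missing_statement_query(class_qid: str, required_prop: str, limit: int = 50) -> str:
    """Build SPARQL text for items of `class_qid` that have no `required_prop` statement."""
    if not re.fullmatch(r"Q[1-9][0-9]*", class_qid):
        raise ValueError(f"not a Wikidata item ID: {class_qid!r}")
    if not re.fullmatch(r"P[1-9][0-9]*", required_prop):
        raise ValueError(f"not a Wikidata property ID: {required_prop!r}")
    return (
        "SELECT ?item WHERE {\n"
        f"  ?item wdt:P31 wd:{class_qid} .\n"
        # Keep only items with no value at all for the required property.
        f"  FILTER NOT EXISTS {{ ?item wdt:{required_prop} [] }}\n"
        "}\n"
        f"LIMIT {limit}"
    )


if __name__ == "__main__":
    # Scholarly articles (Q13442814) without a "published in" (P1433) statement.
    print(missing_statement_query("Q13442814", "P1433"))
```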
Notes
- How does it relate to the past, present and future of WikiCite?
- Extends across the three possible scenarios in the WikiCite Roadmap
- create new properties
- revise data models
- write Shape Expressions
- build/adapt tools and workflows
- SourceMD
Additional considerations
- Complete corpora are good for testing purposes, so watch out for
- things that do not change (much any more), e.g.
- all citations from a given version of a publication
- publishers, journals, organizations, authors, countries etc. that do not exist any more
- publications in extinct languages
- Scholia for quality control
Presenter
Daniel Mietchen is trained as a biophysicist and now works for the Data Science Institute of the University of Virginia on opening up research and education workflows for large-scale collaboration, including with machines. More details via Scholia or this user page.