WikiCite 2016/Report/Group 2/Notes
Links and notes
Goal
- Extractors of bibliographic information, improve metadata lookup tools
Notes
- Librarybase
- BILBO
- Automatic adding of DOIs to references, based on OpenEdition BILBO (Marin, Marseille)
- They have converted 1.4 million references, finding 140,000 DOIs
- Uses the Crossref service
- Bilbo Demo
- Other approaches for reference extraction:
- AnyStyle: parsing scholarly references into different formats AnyStyle
- This online service attempts to parse a reference into its components. It does not resolve to the DOI.
- Grobid: Information Extraction from Scientific Publications https://github.com/kermitt2/grobid
- Finn Årup Nielsen showed information from OpenfMRI and Q21100980 with links to external databases and representation of numerical scientific data, respectively.
- Antonin Delpeuch - OABot
- A web service that suggests an open-access version for a citation in a Wikipedia article, based on the information in the citation template
- There are issues of persistence and of linking to possibly 'pirated' versions.
- University of Trento: Mattia Lago, Alessio Melandri
- Maintgraph on Toolserver (for it.wiki): Maintgraph on Toolserver
- Alessio Bogon
- Study on citations in Wikipedia and the Microsoft Academic Graph.
- Pageview dataset for 2014: Wikipedia Pagecounts Sorted by Page Year 2014
- The Citoid tool makes citing within Wikipedia easy: you just input a URL. In the background, the Zotero translators extract information from the web page, and the structured parts are filled into the citation template in Wikipedia (see the sketch below).
- Citoid on MediaWiki: Citoid
- Citoid API: Citoid API
- Zotero Translators (>480): Zotero Translators on Github
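A minimal sketch of calling Citoid for a URL from a script. The endpoint used here is the public Wikimedia REST path `/api/rest_v1/data/citation/{format}/{query}`; the exact endpoint and response shape are assumptions, not taken from these notes.

```python
# Hedged sketch: ask Citoid for citation metadata for a web page.
import requests
from urllib.parse import quote

def citoid_lookup(url, fmt="mediawiki"):
    """Return Citoid's citation metadata (a list of citation objects) for a URL."""
    endpoint = "https://en.wikipedia.org/api/rest_v1/data/citation/{}/{}".format(
        fmt, quote(url, safe=""))
    response = requests.get(endpoint, timeout=30)
    response.raise_for_status()
    return response.json()

# Example (hypothetical URL):
# citations = citoid_lookup("https://www.nytimes.com/<some-article>")
# print(citations[0].get("title"))
```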
- Author disambiguation and separation
- Open APIs for ISBN:
- SRU Library of Congress, e.g., lx2.loc.gov:210/LCDB?operation=searchRetrieve&version=1.1&query=bath.ISBN=9780820488660&maximumRecords=1
- see also link
- SRU from GBV, e.g. http://sru.gbv.de/gvk?version=1.1&operation=searchRetrieve&query=pica.isb=9780820488660 AND pica.mat%3DB&maximumRecords=1
- see also link
- One group works on the OAbot Proposal
- First, we want to write down the specs for the bot.
Aaron's notes:
Tom Arrow
- Extracting & fetching metadata from PubMed
- Using a list of DOIs and other IDs to feed a script that extracts PubMed Central metadata
- Made items in a Wikibase installation.
- A Python script runs; it uses the SPARQL endpoint to check whether items exist (see the sketch below).
- Items for missing articles (DOIs) -- or otherwise linked
- Items for missing authors (ORCID) -- or otherwise linked
- Items for missing institutions -- or otherwise linked
- Items for missing publishers/journals -- or otherwise linked
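A minimal sketch of the "check if an item exists" step, run here against the public Wikidata SPARQL endpoint. P356 is Wikidata's DOI property; a private Wikibase installation such as Librarybase would have its own endpoint and property IDs, so treat both as placeholders.

```python
# Hedged sketch: look up whether an item with a given DOI already exists.
import requests

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"  # placeholder endpoint

def find_item_by_doi(doi):
    """Return the item URI for a DOI, or None if no item exists yet."""
    query = 'SELECT ?item WHERE {{ ?item wdt:P356 "{}" }} LIMIT 1'.format(doi.upper())
    response = requests.get(SPARQL_ENDPOINT,
                            params={"query": query, "format": "json"},
                            timeout=30)
    response.raise_for_status()
    bindings = response.json()["results"]["bindings"]
    return bindings[0]["item"]["value"] if bindings else None

# e.g. find_item_by_doi("10.1234/example-doi")  # hypothetical DOI
```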
James Hare
- Librarybase -- A Wikibase installation
- 100-150k-ish items -- 10-15k articles
- Similar to Wikidata -- But will accept everything that Wikidata doesn't
- Focused on sources and where they are used in Wikipedia
- Has hierarchy of item types
- Source, Author, Publisher, Institution
- A little messy. Lots of stuff from pubmed has been loaded
- Has a SPARQL query service
- Current load includes all <ref> tags
- Would also include source metadata that is not included in Wikipedia
- What is the growth plan?
- Want to hear what you think. Aaron will dump his DOI data in.
- After we do DOIs, move on to harder things -- like using Citoid.
Aaron Halfaker
- Extracting history of scholarly identifier (DOI, arXiv, ISBN, PubMed) additions to Wikipedia
- Have fast code & datasets.
- Want robust metadata extraction -- Want to integrate tarrow's work
- Goal: Integrate and experiment with DOI extraction (see the sketch below)
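A rough sketch of pulling scholarly identifiers out of revision wikitext with regular expressions. The patterns below are illustrative only, not the ones used in Aaron's actual extraction code.

```python
# Hedged sketch: extract DOI / arXiv / PMID / ISBN mentions from wikitext.
import re

DOI_RE = re.compile(r'\b10\.\d{4,9}/[^\s|}<>\]]+')           # e.g. 10.1371/journal.pone.0029797
ARXIV_RE = re.compile(r'\barXiv:\s*(\d{4}\.\d{4,5}(v\d+)?)', re.IGNORECASE)
PMID_RE = re.compile(r'\bpmid\s*=\s*(\d+)', re.IGNORECASE)    # template-parameter form
ISBN_RE = re.compile(r'\bISBN[\s:=]*([0-9Xx][0-9Xx -]{8,16})')

def extract_identifiers(wikitext):
    """Return the scholarly identifiers mentioned in a revision's wikitext."""
    return {
        "doi": set(DOI_RE.findall(wikitext)),
        "arxiv": {m[0] for m in ARXIV_RE.findall(wikitext)},
        "pmid": set(PMID_RE.findall(wikitext)),
        "isbn": set(ISBN_RE.findall(wikitext)),
    }
```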
Jon
- Play with citations
- paleontologist as a day job
- Gets papers with rplos (https://github.com/ropensci/rplos) -- there are dumps, but this is pretty fast
Phillipp
- Extracting data citations out of papers (future)
- Zotero translators (Citoid -- URL --> Citation) -- Open source -- Active development
Marin
- Bilbo.openeditionlab.org -- can test with a UI
- Parses a bibliography -- can add DOIs
- Machine learning based approach
- ~92% recall and ~80% precision
- Alternatives
- The Crossref service (allegedly) doesn't work as well -- and is "very slow"
- https://anystyle.io/ -- Parses citation stuff
- About two weeks for 1.4 million
- http://lab.hypotheses.org/1532
- "Enrichment process"
- Uses the Crossref API with the parsed fields (see the sketch below)
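A minimal sketch of the DOI "enrichment" step: send the parsed reference text to the Crossref REST API and take the top match. This illustrates the general approach, not BILBO's actual implementation; a real pipeline would also check the match score before accepting the DOI.

```python
# Hedged sketch: guess a DOI for a reference string via api.crossref.org.
import requests

def guess_doi(reference_text):
    """Return (doi, score) for the best Crossref match, or (None, 0.0)."""
    response = requests.get(
        "https://api.crossref.org/works",
        params={"query.bibliographic": reference_text, "rows": 1},
        timeout=30,
    )
    response.raise_for_status()
    items = response.json()["message"]["items"]
    if not items:
        return None, 0.0
    return items[0]["DOI"], items[0].get("score", 0.0)

# Example:
# guess_doi("Nielsen F. Scientific citations in Wikipedia. First Monday. 2007.")
```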
Finn Årup Nielsen (User:Fnielsen)
- Neuroinformatics
- Trying to see if data can be represented in wikis
- For metadata
- Downloads data from databases, adds it to Wikidata, and then links back
- Using the SourceMD and QuickStatements tools (example below)
- Considering making own extractor for PubMed.
- Trying to represent data in scientific papers
- https://wikidata.org/wiki/Q17141282 -- See "numeric value".
- Alternatives
- DataCite registers connections between datasets and publications
- http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/6937/pdf/imm6937.pdf
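A sketch of what QuickStatements (V1) commands for a new article item might look like, generated from fetched metadata. The property and item IDs are Wikidata's (P31 = instance of, Q13442814 = scientific article, P1476 = title, P356 = DOI); the metadata values are hypothetical, and this is not exactly what SourceMD emits.

```python
# Hedged sketch: build QuickStatements V1 commands for a paper item.
def quickstatements_for_paper(title, doi, lang="en"):
    return "\n".join([
        "CREATE",
        "LAST\tP31\tQ13442814",                       # instance of: scientific article
        'LAST\tP1476\t{}:"{}"'.format(lang, title),   # title (monolingual text)
        'LAST\tP356\t"{}"'.format(doi.upper()),       # DOI (stored uppercase by convention)
    ])

print(quickstatements_for_paper("An example paper title", "10.1234/example"))
```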
Mike (OCLC)
- Interest in OABot
- Looking at databases that make a connection between an author and a paper (not just a string name)
Antonin
- OABot
- Takes citations in Wikipedia articles and tries to find a non-paywalled version of the PDF
- Guesses whether a citation links to a paywalled PDF
- Page name --> reference metadata --> finds an accessible version --> new reference metadata (see the sketch below)
- What is the minimum amount of information for a positive result?
- Title, date and sometimes authors.
- Bot approval in process -- Maybe semi-automated.
- Also have the option of making it an OAuth tool
- How do we know that it is not a copyright infringing PDF?
- We don't. We rely on the publisher to track down whoever is hosting it illegally.
- Might need to get lawyers involved. The community might not like it either.
- Maybe could have whitelisted domains.
- Targeting repositories -- might still link to author's homepage
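A high-level sketch of the OABot flow described above. `get_references` and `find_open_access_url` are hypothetical placeholders passed in by the caller; the bot's actual metadata extraction and open-access lookup service are not specified in these notes.

```python
# Hedged sketch: page name --> reference metadata --> accessible version --> new metadata.
def add_open_access_links(page_title, get_references, find_open_access_url):
    """Yield (reference, oa_url) pairs for references that have a free version."""
    for ref in get_references(page_title):           # parsed citation templates
        # Minimum information for a useful lookup: title, date, sometimes authors.
        if not ref.get("title"):
            continue
        oa_url = find_open_access_url(title=ref["title"],
                                      date=ref.get("date"),
                                      authors=ref.get("authors", []))
        if oa_url and not ref.get("url"):
            yield ref, oa_url                         # candidate edit for review
```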
Sebastian
- Data-repositorian by day
- Translator lead on Zotero & Citation Style Language (CSL)
- Citoid demo
- (NYTimes page) --> VE "Cite"
- The ref toolbar has ISBN lookup -- why no ISBN here?
- Probably just not implemented yet. Also, WorldCat is bad. See the Library of Congress SRU API.
- Can you just run Zotero code on Node.js?
- No. There is a lot of integration with the browser.
Diego
- Learning ecosystem
- Worked with Wikipedia data for studies of communities
- Information retrieval and text mining
- Writing crawlers, parsers, and backend services
Cristian
- Started on template usage data
- Parsed the whole dump and extracted this data
- tools.wmflabs.org/maintgraph -- Queries the database once per day
- Ongoing study
- Looked at citation identifiers -- DOI, arXiv, ISBN, PubMed -- and compared with the MS Academic Graph
- Looked at when each identifier was first introduced in Wikipedia
- Published dataset
- Pagecounts -- reshuffled
- Allows computing journal- and conference-level statistics based on pageviews
Scott
- Here to learn
- Make tools for researchers to get data
- Targeting R. Used to work with Ironholds.
Day 2
I suggest that we focus on one specific project. People from the UK and US are invited to speak slowly for non-native English speakers. Thank you!
LoC ISBN lookup:
- // Sends an SRU request formatted as CQL to the Library of Congress, asking for MARCXML back
Requests that work (see the Python sketch below):
- http://lx2.loc.gov:210/LCDB?operation=searchRetrieve&version=1.1&maximumRecords=1&query=bath.ISBN={ISBN} # full MARC21 data
- http://lx2.loc.gov:210/LCDB?operation=searchRetrieve&version=1.1&maximumRecords=1&recordSchema=dc&query=bath.ISBN={ISBN} # summarized, human-readable (Dublin Core)
- LEGAL ISSUES: https://www.loc.gov/legal/
- "We reserve the right to block IP address that fail to honor our websites’ robot.txt files, or submit requests at a rate that negatively impacts service delivery to patrons. Current guidelines recommend that software programs submit a total of no more than 10 requests per minute to Library applications, regardless of the number of machines used to submit requests. We also reserve the right to terminate programs that require more than 24 hours to complete. "
// Search for the ISBN over the SRU interface of the GBV, and take the result as MARCXML
- //documentation: https://www.gbv.de/wikis/cls/SRU
- var url = "http://sru.gbv.de/gvk?version=1.1&operation=searchRetrieve&query=pica.isb=" + queryISBN + " AND pica.mat%3DB&maximumRecords=1";
Citoid Group etherpad: https://etherpad.wikimedia.org/p/wikicite-citoid
- Marc21 Mapping https://www.loc.gov/marc/bibliographic/ecbdlist.html
DOI metadata lookup:
$ cat datasets/500_dois.tsv | python demonstrate_doi_fetch_performance.py
Running against api.crossref.org
[... progress output ...]
Processing 500 DOIs took 32.718 seconds. So, ~0.065 seconds per lookup
Running against doi.org
[... progress output ...]
Processing 500 DOIs took 160.348 seconds. So, ~0.321 seconds per lookup
Running against citoid.wikimedia.org
[... progress output ...]
Processing 500 DOIs took 1718.462 seconds.
So, ~3.437 seconds per lookup
>1s lookup = morally wrong
So, if we're going to extract metadata for ~601k DOIs, that will take ~29 days. I'll be checking into using a threadpool to speed this up (see the sketch below). It might be OK with the Citoid folks.
doi.org lookup code: https://gist.github.com/halfak/7113348ab3496a3af3b7c2b2de14a526
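This is not the code from the gist above, just a sketch of the two ideas mentioned: DOI metadata lookup against doi.org via content negotiation, and a thread pool to overlap the slow network round-trips. Mind the request rate against the upstream services.

```python
# Hedged sketch: CSL JSON metadata for DOIs via doi.org content negotiation, in parallel.
from concurrent.futures import ThreadPoolExecutor
import requests

def fetch_doi_metadata(doi):
    """Fetch CSL JSON metadata for a DOI from doi.org (follows the redirect)."""
    response = requests.get(
        "https://doi.org/" + doi,
        headers={"Accept": "application/vnd.citationstyles.csl+json"},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()

def fetch_all(dois, workers=8):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch_doi_metadata, dois))
```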
OAbot group:
- we have expanded the documentation of the project, now centralized here: http://en.wikipedia.org/wiki/Wikipedia:OABOT
- we have added more debugging information to the web interface
- we have tested the software and identified some bugs; some were corrected, others are being corrected.
mwlinks:
- library and command-line tool for extracting wikilinks from Wikipedia XML dump history files (see the sketch below);
- https://github.com/mediawiki-utilities/python-mwlinks
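This is not the mwlinks implementation itself, just a sketch of the underlying task (pulling wikilink targets out of revision text in an XML dump), written with the mwxml library and a simple regex. The dump file name is a placeholder.

```python
# Hedged sketch: iterate a Wikipedia XML dump and yield wikilink targets per revision.
import re
import mwxml

WIKILINK_RE = re.compile(r"\[\[([^\]|#]+)")  # link target, before any '|' or '#'

def iter_wikilinks(dump_path):
    dump = mwxml.Dump.from_file(open(dump_path))
    for page in dump:
        for revision in page:
            for target in WIKILINK_RE.findall(revision.text or ""):
                yield page.title, revision.id, target.strip()

# Example (placeholder file name):
# for title, rev_id, target in iter_wikilinks("enwiki-history.xml"):
#     print(title, rev_id, target)
```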