Grants talk:Project/DBpedia/GlobalFactSyncRE/Timeline/Tasks
Documentation of the current prefusion-dump/MongoDB setup
Marvin has documented the current prefusion-dump/MongoDB setup: https://git.informatik.uni-leipzig.de/gfs/main/blob/master/global.dbpedia.org.md
Tina Schmeissner (talk) 13:14, 11 June 2019 (UTC)
Sebastian Hellmann commented here:
https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(proposals)#A_new_use_for_Wikidata_external_IDs_in_Wikipedia_(template)
Tina Schmeissner (talk) 13:21, 11 June 2019 (UTC)
Challenge
We want to announce a challenge in the hope of finding a good intern to work on the project. A first draft can be found here: challenge. Any ideas for improvement are welcome. Tina Schmeissner (talk) 12:10, 14 June 2019 (UTC)
Initial version of references extraction from infoboxes
Email from Krzysztof:
We have an initial version of references extraction from infoboxes. The project URL is https://git.informatik.uni-leipzig.de/kwecel/infoboxes-refs
So far the script extracts raw references, i.e. without further parsing. It simply outputs whatever appears between <ref> and </ref>. Please note that some references are named; for those we keep just the name, to be processed further during the extraction phase. This is also more convenient for a potential join with another table, so that a reference can be extracted once and used in many places.
The following columns can be found in the output.
1- Wikipedia_article: name/title of the Wikipedia article
2- Infobox_name: name of the infobox; the list of infoboxes is kept in a separate directory and was prepared by analysing which templates really are infoboxes
3- Parameter_name: raw property in DBpedia terms; identifies a row in an infobox
4- Reference_name: name of the reference, if provided; if not, the value "<noname_ref>" is used instead; names are unique only within a given article; sometimes a reference name is defined outside of the infobox
5- Reference_direct_code: raw code, as explained above; this is the main input for further development
Włodek will upload the code. There are also some examples in the output folder, ca. 10000 rows for selected languages. We can upload these samples directly to GitLab, just as an overview. For the full dumps we need to discuss the destination. Where should the data be uploaded?
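The raw extraction and the five output columns described in the email could look roughly like the Python sketch below. This is not the code from the infoboxes-refs repository: the function name `extract_raw_refs`, the regular expressions and the one-parameter-per-line splitting are assumptions made for illustration only.

```python
import re

NONAME = "<noname_ref>"  # placeholder named in the email for unnamed references

# Matches a self-closing <ref ... /> or a full <ref ...>body</ref> pair.
# \b keeps the pattern from matching tags such as <references />.
REF_RE = re.compile(
    r"<ref\b(?P<attrs>[^>/]*)/>|<ref\b(?P<attrs2>[^>]*)>(?P<body>.*?)</ref>",
    re.DOTALL | re.IGNORECASE,
)
NAME_RE = re.compile(r'name\s*=\s*"?([^"/>]+)"?', re.IGNORECASE)


def extract_raw_refs(article, infobox_name, infobox_body):
    """Return one row (dict) per reference found in the infobox parameters."""
    rows = []
    # Naive assumption: each parameter sits on its own line as "| name = value".
    for line in infobox_body.splitlines():
        line = line.strip()
        if not line.startswith("|") or "=" not in line:
            continue
        param, value = line[1:].split("=", 1)
        for match in REF_RE.finditer(value):
            attrs = match.group("attrs") or match.group("attrs2") or ""
            name = NAME_RE.search(attrs)
            rows.append({
                "Wikipedia_article": article,
                "Infobox_name": infobox_name,
                "Parameter_name": param.strip(),
                "Reference_name": name.group(1).strip() if name else NONAME,
                # Raw code between the tags; empty for a name-only reference.
                "Reference_direct_code": (match.group("body") or "").strip(),
            })
    return rows
```

A named reference that is only reused (`<ref name="..." />`) yields a row with an empty Reference_direct_code, which is what makes a later join on Reference_name useful.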
Factual Consensus Finder - UI
I understand what the FCF does, but there are still a bunch of questions:
1. How or where do I enter the subject / entity that the infobox belongs to on the page? Do I always need the DBpedia identifier?
2. How will the user be able to reach this page from a Wikipedia page? I assume the ideal scenario would be a link to the FCF page somewhere in the infoboxes.
3. Using DBpedia as an example:
predicate | # of values and sources | questions | Feedback from Marvin |
---|---|---|---|
description | 1 | Result is “semantic web” for German wiki, but this is not shown anywhere in the infobox of the German wiki. | “semantic web” is listed in the IB with the predicate "Beschreibung", but not shown in the actual IB |
latest release version | 5 | First value is empty, with 4 wikis as sources. | There are empty but valid triples being extracted |
developer | 5 | Why are the universities listed in all these languages (why not just in the language of the respective wiki?), and why are they linked to their respective FCF pages? | not yet discussed |
Two docs about fixing mappings
You can also see https://docs.google.com/document/d/1yZLNKZ802pC-U0PYMqnyem9KZn5qADccXR2Te2wlr6Q/edit and https://svn.aksw.org/papers/2018/SAC_DBpedia_mappings_alignment/public.pdf. Sent from Dimitris, 13:00, 8 July 2019 (UTC)
MusicBrainz - SameAs Problem
Found this paper: "Automatic Interlinking of Music Datasets on the Semantic Web", ftp.informatik.rwth-aachen.de/Publications/CEUR-WS/Vol-369/paper18.pdf SebastianHellmann (talk) 08:16, 9 July 2019 (UTC)
DBpedia extractor + Infobox references extractor
Example of extracting references from the article about Facebook in English Wikipedia:
- http://dbpedia.informatik.uni-leipzig.de:8111/infobox/references?article=https://en.wikipedia.org/wiki/Facebook&format=json
- Works for: de, en, es, fr, it, nl, pl, pt, ru, sv
- TODO: a clear list of all related infoboxes in each language (with redirects)
DBpedia extraction framework on this page:
- http://dbpedia.informatik.uni-leipzig.de:9998/server/extraction/en/extract?title=Facebook&format=json&extractors=mappings
- Problems: there are a lot of parameters which are not extracted. Examples:
- the parameter "rww źródło" is not extracted from the article about Aceton in PL Wiki: http://dbpedia.informatik.uni-leipzig.de:9998/server/extraction/pl/extract?title=Aceton&revid=&format=json&extractors=custom
- Specific structure of the Taxobox in FR Wiki: http://dbpedia.informatik.uni-leipzig.de:9998/server/extraction/fr/extract?title=Lasaeidae&revid=&format=json&extractors=custom
Updates
- It is now possible to see each parameter of the citation templates (if any exist): http://dbpedia.informatik.uni-leipzig.de:8111/infobox/references?article=https://en.wikipedia.org/wiki/Albert_Einstein&format=json
- The parser can also use data from the DBpedia extraction framework with the custom option (by adding '&dbpedia' to the URL): http://dbpedia.informatik.uni-leipzig.de:8111/infobox/references?article=https://en.wikipedia.org/wiki/Albert_Einstein&format=json&dbpedia
- Problems:
- the parameter 'spouse' has two values (names) and each value has additional data (dates)
- the parameter 'award' is not parsed correctly (its value is a list inside the template 'Plainlist') -> the values and the reference 'frs' are not found.
--Lewoniewski (talk) 09:06, 12 August 2019 (UTC)
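For scripted access, the request URLs shown above can be assembled as in the following Python sketch. The helper names `build_references_url` and `fetch_references` are hypothetical; only the endpoint, the `article` and `format` parameters and the bare `&dbpedia` flag come from the examples above. The structure of the JSON response is not described here, so the sketch returns it unparsed beyond `json.load`, and it assumes the service is reachable.

```python
import json
from urllib.request import urlopen

# Endpoint taken from the example URLs above.
BASE = "http://dbpedia.informatik.uni-leipzig.de:8111/infobox/references"


def build_references_url(article_url, fmt="json", use_dbpedia=False):
    """Build a request URL in the same form as the examples above."""
    # The article URL is passed unencoded, exactly as in the examples.
    url = f"{BASE}?article={article_url}&format={fmt}"
    if use_dbpedia:
        url += "&dbpedia"  # bare flag: also use DBpedia extraction framework data
    return url


def fetch_references(article_url, use_dbpedia=False, timeout=30):
    """Fetch and decode the JSON response (requires network access)."""
    url = build_references_url(article_url, use_dbpedia=use_dbpedia)
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)
```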
Upgraded version of Python Infobox Reference Extractor (PIRE):
- updated list of infoboxes
- works with "r" templates (added new parameter "Reference_mode" for such templates): http://dbpedia.informatik.uni-leipzig.de:8111/infobox/references?article=https://pl.wikipedia.org/wiki/Andrespol&format=json&dbpedia
- and others. --Lewoniewski (talk) 20:08, 8 September 2019 (UTC)
Extraction statistics in September 2019:
- http://stats.infoboxes.net
- Presentation at SEMANTiCS 2019, 14th DBpedia Community Meeting in Karlsruhe
- all extracted reference data in JSON: http://dbpedia.informatik.uni-leipzig.de/repo/lewoniewski/gfs/infobox-refs/2019.09.01/
citation_id
In both versions of the parser, a special 'citation_id' parameter is generated for each citation template, based on the value of one of the following citation template parameters:
- doi -> http://doi.org/...
- jstor -> https://jstor.org/stable/...
- pmc -> https://ncbi.nlm.nih.gov/pmc/articles/PMC...
- pmid -> https://ncbi.nlm.nih.gov/pubmed/...
- arxiv -> http://arxiv.org/abs/...
- isbn -> http://books.google.com/books?vid=ISBN...
- issn -> https://worldcat.org/ISSN/...
- oclc -> https://worldcat.org/oclc/...
- url -> http....
- website -> http....
The order is important: the parser generates the ID from whichever of these parameters is found first. If none of these parameters is present, the parser generates a hash-based ID of the form 'http://citation.dbpedia.org/hash2/...', computed from the 'title' parameter or (if that is empty) from the content of the citation template. --Lewoniewski (talk) 09:05, 12 August 2019 (UTC)
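The priority rule can be sketched in Python as follows. This is an illustration, not the parser's actual code: the function name `citation_id` is hypothetical, the value is assumed to be appended to the prefix unchanged, and SHA-1 is an assumption because the hash function is not named above.

```python
import hashlib

# (parameter, URL prefix) pairs in the priority order listed above.
ID_PREFIXES = [
    ("doi", "http://doi.org/"),
    ("jstor", "https://jstor.org/stable/"),
    ("pmc", "https://ncbi.nlm.nih.gov/pmc/articles/PMC"),
    ("pmid", "https://ncbi.nlm.nih.gov/pubmed/"),
    ("arxiv", "http://arxiv.org/abs/"),
    ("isbn", "http://books.google.com/books?vid=ISBN"),
    ("issn", "https://worldcat.org/ISSN/"),
    ("oclc", "https://worldcat.org/oclc/"),
    ("url", ""),      # value is already a URL, used as-is
    ("website", ""),  # value is already a URL, used as-is
]
HASH_PREFIX = "http://citation.dbpedia.org/hash2/"


def citation_id(params, template_code=""):
    """Return an ID for one citation template given its parameters."""
    # The first non-empty parameter in priority order wins.
    for key, prefix in ID_PREFIXES:
        value = (params.get(key) or "").strip()
        if value:
            return prefix + value
    # Fallback: hash the title, or the whole template code if there is no title.
    basis = (params.get("title") or "").strip() or template_code
    # Assumption: SHA-1 hex digest; the hash function is not specified above.
    return HASH_PREFIX + hashlib.sha1(basis.encode("utf-8")).hexdigest()
```

With this ordering, a template carrying both a `doi` and a `url` gets the DOI-based ID, so the same publication cited through different URLs maps to one identifier.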
References names/metadata
- <ref name="" />
- https://en.wikipedia.org/wiki/Template:R
- Specific templates for selected sources (metadata not directly available):
- https://fr.wikipedia.org/wiki/Mod%C3%A8le:Bioref
- https://pl.wikipedia.org/wiki/Szablon:FP9
- many others, for example: https://en.wikipedia.org/wiki/Category:Chemistry_citation_templates or https://en.wikipedia.org/wiki/Category:Specific-source_templates
- Are there other options? --Lewoniewski (talk) 10:32, 4 September 2019 (UTC)
Error handling in wikicode
- The closing brackets of a template are missing in the infobox about Warsaw in Polish Wikipedia (this revision):
|rok = |liczba ludności = 1 777 972 (31.12.2018)</small><ref name="GUS 2018">{{Cytuj stronę |url = http://demografia.stat.gov.pl/bazademografia/Tables.aspx</ref> |gęstość zaludnienia = 3412 <small>(1.01.2018)</small><ref name="GUS 2018" />
- There is no "=" between the name and the value of a parameter. Example for the wiceprezydent parameter from this revision:
|pierwsza dama = [[Margarita Penón]] |wiceprezydent<br />1. [[Jorge Manuel Dengo Obregón]] (1986-1990)<br />2. [[Victoria Garrón Orozco]](1986-1990)<br />1. [[Laura Chinchilla]] (2006-2010)<br />2. [[Kevin Casas Zamora]] (2006-2010) | quote =
- Parameter separator in the wrong place (this revision):
'''R5 (silnik)|R5'''
- Pay attention to (marked in the code with the comment PPnPP):
- the length of the infobox parameter name;
- the length of the parameter value and the number of references;
- very large parameter values, which must be taken into account, e.g. the parameter trasa in the infobox of Droga krajowa nr 11 (Polska).
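Two of the error classes listed above, a template without its closing brackets and a parameter without "=", can be detected before parsing with simple checks such as the Python sketch below. The function names are hypothetical and the checks are deliberately naive (for example, triple braces are not treated specially); they are not taken from the extractor's code.

```python
def braces_balanced(wikicode):
    """Return True if every '{{' in the text has a matching '}}'."""
    depth = 0
    i = 0
    while i < len(wikicode) - 1:
        pair = wikicode[i:i + 2]
        if pair == "{{":
            depth += 1
            i += 2
        elif pair == "}}":
            depth -= 1
            if depth < 0:
                return False  # closing brackets without an opening pair
            i += 2
        else:
            i += 1
    return depth == 0


def split_top_level(body):
    """Split on '|' characters that are not nested in {{...}} or [[...]]."""
    parts, depth, start, i = [], 0, 0, 0
    while i < len(body):
        pair = body[i:i + 2]
        if pair in ("{{", "[["):
            depth += 1
            i += 2
        elif pair in ("}}", "]]"):
            depth = max(depth - 1, 0)
            i += 2
        else:
            if body[i] == "|" and depth == 0:
                parts.append(body[start:i])
                start = i + 1
            i += 1
    parts.append(body[start:])
    return parts


def params_missing_equals(body):
    """Return the top-level parameters that contain no '=' at all."""
    return [p.strip() for p in split_top_level(body) if p.strip() and "=" not in p]
```

Applied to the examples above, the first check flags the unclosed `{{Cytuj stronę` template and the second flags the `wiceprezydent` segment; what the extractor should do with a flagged parameter (skip, repair, log) is a separate decision.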
URL extraction from references
Wikipedia infoboxes
Here are statistics on the extraction of reference URLs from infoboxes in different language versions of Wikipedia (based on dumps from September 2019):
Files with "_domains" in the name show how often each domain is used in the references; "all_domains.txt" sums the results over all considered language versions of Wikipedia.
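A per-language domain frequency count and its summation over languages can be sketched in Python as below. The function names are hypothetical, and the normalisation (lower-cased host name, no stripping of "www.") is an assumption; the statistics files may have been produced differently.

```python
from collections import Counter
from urllib.parse import urlparse


def domain_counts(urls):
    """Count how often each host name occurs in a list of reference URLs."""
    counts = Counter()
    for url in urls:
        host = urlparse(url).hostname  # lower-cased, without port; None if absent
        if host:
            counts[host] += 1
    return counts


def merge_counts(per_language):
    """Sum per-language counters into one overall counter."""
    total = Counter()
    for counts in per_language.values():
        total.update(counts)
    return total
```

`merge_counts({"en": ..., "pl": ...})` corresponds to the kind of summation that "all_domains.txt" represents, and `total.most_common(20)` would list the most frequently cited domains.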
Wikidata
Similar statistics for Wikidata (based on dumps from October 2019):
In files with "_unique" in the name, each URL is counted only once per Wikidata item.
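The "_unique" counting rule can be illustrated with a short Python sketch: a URL that occurs several times in the references of one item is kept once for that item, but still counts separately for every other item. The function name and the `(item_id, url)` input format are assumptions for illustration.

```python
def unique_urls_per_item(pairs):
    """Drop repeated (item_id, url) pairs while keeping first-seen order."""
    # dict.fromkeys preserves insertion order and removes exact duplicates.
    return list(dict.fromkeys(pairs))
```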