Grants:Project/DBpedia/GlobalFactSyncRE/Timeline
This project is funded by a Project Grant
Timeline for DBpedia
Timeline | Date |
Study (choose two initial sync targets and analyse the lack of references in Wikidata) | Day Month Year |
GlobalFactSync tool (extend the current prototype with new features) | Day Month Year |
Mapping Refinements | Day Month Year |
GlobalFactSync Wikidata ingest | Day Month Year |
GlobalFactSync Sprints | Day Month Year |
Monthly updates
Please prepare a brief project update each month, in a format of your choice, to share progress and learnings with the community along the way. Submit the link below as you complete each update.
Current tasks
A log of current tasks is kept here. Ongoing discussions should be held using the corresponding discussion page.
(Preparation) April/May
- getting people on board:
- User:M1ci and Diffbot to work in parallel on extraction of information from article abstracts and external sources.
- User:KrzysztofWecel for the reference extraction
- Paper about the underlying engine: Frey J, Hofer M, Hellmann S, Obraczka D. DBpedia FlexiFusion Best of Wikipedia > Wikidata > Your Data. ISWC Resource Track 2019 (submitted). Available at: https://svn.aksw.org/papers/2019/ISWC_FlexiFusion/public.pdf
- We connected with WikidataCon 2019
- Set up gfs@infai.org to collect feedback for our project.
- presentation of the project idea to the participants of the 13th DBpedia Community Meeting to bring awareness to the project
June 2019 (official start)
- June 6: first official team meeting
- Submission for Wikimania 2019 to present the project and gather feedback from the wiki community
July 2019
First Release Report: A first release containing detailed information about our micro-services is published on the DBpedia Blog
Containing:
- First success story
- Deployment of first micro-services on the server
- Initial User Interface here
- PreFusion JSON API here (user: read, pw: gfs)
- Reference Extraction Service here
- Reference Data Download here
- Infobox Extraction Service here
- ID service here
- definition of a set of problems with different layers of complexity
- analysis of various groups of subjects with respect to these synchronization problems
August 2019
- Continuing improvements of the first deployments, which will be an ongoing process. Especially the GFS Data Browser is being worked on:
- users can now insert any Wikipedia URL into the subject search field
- overall layout improvements
- reference information is being added
- Johannes Frey presented the GFS project at Wikimania
- We created a news page within our Meta-Wiki project page framework for volunteers to keep them in the loop and encourage exchange. So far this has led to three more volunteers signing up for our 'GFS Feedback Squad' and two users leaving feedback about our sync target study.
September 2019
- more work towards the sync target study, with a focus on targets that were brought up by Wikidata users (e.g., geo coordinates, employer, Nobel Prize)
- intensive work on creating the complement to Wikidata and Wikipedia by collecting and providing data that is currently missing in both
October 2019
- refining sync target study to look at properties of properties
- brainstorming ways to allow for a fast integration of external sources (e.g. IMDb, DNB (German National Library), MusicBrainz, Polish Census Data)
- creating an overview of the most frequent references used in Wikipedia infoboxes and Wikidata, and filtering out the ones with easily downloadable and integrable data, see http://dbpedia.informatik.uni-leipzig.de/repo/lewoniewski/gfs/infobox-refs/2019.09.01/stats/
- Global Fact Sync Poster at WikidataCon 2019
November 2019
- re-extraction of GFS data and fusion
- some work on the UI
- identifying and testing ways to generate lists of the Wikipedia articles related to selected topics: categories, infoboxes, Wikidata queries and other articles (lists).
December 2019
- extraction of reference data for Polish cities; studied sources: BDL - Bank Danych Lokalnych, Wikipedia, Wikidata
- analysis of available mappings between various geographical identifiers for Polish administrative units
- showing current understanding of the fusion challenge
January 2020
- updated Global Fact Sync Browser, including external sources, e.g. German National Library, GeoNames, MusicBrainz (deployed at http://dbpedia.informatik.uni-leipzig.de:9015)
- repeat recreation of GFS data (PreFusion dump) and change GnB links from 'http' -> 'https'
- release statistics about GFS data, for example value source information such as overlap and data coverage w.r.t. each source
- update Official GFS browser at http://global.dbpedia.org and include representation of value references (merge external_sources branch into master https://github.com/dbpedia/gfs)
- we filed an RFC at Wikidata: https://www.wikidata.org/wiki/Wikidata:Requests_for_comment/Harvesting_of_Wikipedia_Infoboxes_for_Wikidata._Proposal_for_extension_of_Harvest_Templates
February 2020
- created JSON dump file with references mapped to statements http://infoboxes.net/filez/multiwiki-20200101-properties-refs.jsonl.gz similar in format to online / ad-hoc reference extraction
- created fresh DBpedia TQL release to realize point 1
- created a list of templates for several languages https://git.informatik.uni-leipzig.de/kwecel/infoboxes-refs/tree/master/infoboxes for reference extraction
- new GFS data dump with integrated DNB data (https://databus.dbpedia.org/vehnem/flexifusion/prefusion/2020.01.01) and added VIAF identifiers and ID management (https://databus.dbpedia.org/jj-author/id-management/global-ids/2020.01.30) and loaded for GFS browser
- collected new statistics for GFS database https://docs.google.com/spreadsheets/d/1mLEmnJ92qq0RfIXHjIJsSqYpFEjOY3mQGXyILYpiHJA/edit#gid=1657173090
- first experiments to integrate reference extraction into DBpedia Framework
- discussion and extension of harvest template extension mockup
- created and released interactive harvesttemplate mockup http://temporary.dbpedia.org/temporary/harvesttemplates/PLtools_%20Harvest_Templates_mockup.html
- research on Wikidata usage and adoption in infoboxes
- refactoring of GFS browser using VueJS to include new features (e.g. provenance to Databus, reliable links to original sources) and responsive UI
March 2020
- experiment prototype for improved harvesttemplate
- index Infoboxes / Templates
April 2020
- experiment prototype for improved harvesttemplate
- index Infoboxes / Templates
May 2020
- watch for feedback on the new mockup
June 2020
- incorporate demo (hard-coded) references view into GFS browser using the novel JSON references dump
Planned Next Steps for July, August and September 2020
- incorporate demo (hard-coded) references view into GFS browser using the novel JSON references dump
- GFS browser features
- include mapping management to allow search for properties of new external sources
Is your final report due but you need more time?
Extension request
September 30, 2020
In the last month the output of our project was barely visible because we (1) worked a lot on the data and (2) had to deal with corona and all its consequences, such as missing child care. On the positive side, we have quite a lot of budget (9000€) left and would like to extend the project by four months as a budget-neutral extension. We need time until the end of September 2020. Project-wise we found this dump: enwiki-20200401-wbc_entity_usage.sql.gz
- It tracks which pages use which Wikidata items or properties and which aspect (e.g. item label) is used. So we consider it realistic to provide the following:
- We have one of the best infobox parsers and we have full information about all properties there. This means we can produce a reliable Wikidata adoption report, which shows how much Wikidata is adopted, where it is well adopted in Wikipedia and where it can be improved.
- We can use this to calculate "good imports" from Wikipedia to Wikidata, i.e. where data in WP infoboxes is especially plentiful and well referenced, but missing in Wikidata
- With the improvements on https://tools.wmflabs.org/pltools/harvesttemplates/ we would have a powerful User Interface to exactly tackle these spots
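The idea of calculating "good imports" can be illustrated with a minimal sketch. It assumes that per-property counts are already available from the infobox extraction; the class `PropertyStats`, its field names and the scoring formula are hypothetical and only show one possible way to rank properties, not the method used in the project.

```python
from dataclasses import dataclass


@dataclass
class PropertyStats:
    prop: str               # hypothetical property label
    infobox_values: int     # values found in Wikipedia infoboxes
    referenced_values: int  # of those, values that carry a reference
    in_wikidata: int        # of those, values already present in Wikidata


def import_score(s: PropertyStats) -> float:
    """Share of values missing from Wikidata times share of referenced values."""
    if s.infobox_values == 0:
        return 0.0
    missing_share = 1 - s.in_wikidata / s.infobox_values
    referenced_share = s.referenced_values / s.infobox_values
    return missing_share * referenced_share


def rank_candidates(stats):
    """Return properties ordered from most to least promising import target."""
    return sorted(stats, key=import_score, reverse=True)
```

A property scores high when its infobox values are plentiful in references but rarely present in Wikidata, which matches the "good import" description above. Other weightings are equally plausible.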
In addition, we started to index authoritative datasets that are often referenced in WP and WD. Taking this data from the source, we can build an interface, e.g. a user script, to suggest relevant data points from these datasets to users for inclusion. This part might be experimental, but it would work like this: On https://pl.wikipedia.org/wiki/Pozna%C5%84 the infobox shows: Populacja (30.06.2019) • liczba ludności 535 802[3] (in English: Population (30.06.2019), number of inhabitants: 535 802)
[3] is the population count from stat.gov.pl holding the official census for Poland. If this gets updated, we might be able to autodetect that a change is required either in the infobox or on Wikidata (that is up to the community policy).
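A minimal sketch of such a check could look as follows. It assumes the infobox value arrives as display text like "535 802[3]" and that the authoritative source delivers a plain whole number; the function names `parse_count` and `needs_update` are hypothetical, and fetching the value from the source is left out.

```python
import re


def parse_count(text: str) -> int:
    """Parse a whole-number count such as '535 802[3]'; footnote markers are dropped."""
    without_refs = re.sub(r"\[\d+\]", "", text)  # remove markers like [3]
    digits = re.sub(r"\D", "", without_refs)     # drop spaces and other separators
    if not digits:
        raise ValueError(f"no digits found in {text!r}")
    return int(digits)


def needs_update(infobox_text: str, source_value: int) -> bool:
    """True if the infobox value differs from the value reported by the source."""
    return parse_count(infobox_text) != source_value
```

For example, `needs_update("535 802[3]", 535802)` is false, while any newer figure from the source would flag the entry for review. Whether the change is then applied to the infobox or to Wikidata stays a community decision.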
This will not be complete, but it will probably work for 10-50 million entries in Wikipedia and Wikidata, depending on the quality of the source and how official it is. In the next few months we need to work on the following topics:
- incorporate demo (hard-coded) references view into GFS browser using the novel JSON references dump
- GFS browser features
- include mapping management to allow search for properties of new external sources
- @Juliaholze: Hi Julia, thanks for this request and context over your remaining budget as well as the disruptions you experienced due to the pandemic. We can appreciate that work on the project needed to be paused in order to focus on other, more important priorities, as we have experienced these same needs at the Wikimedia Foundation as well. This extension until 30 September 2020 to complete the above activities is formally approved. Your final report will be due on 30 October 2020. I JethroBT (WMF) (talk) 21:25, 6 July 2020 (UTC)
- @JethroBT (WMF): Hi Chris, many thanks for your reply. We will complete the above activities and tasks.
Extension request
New end date
November 30, 2020
Rationale
We would like to request another budget-neutral extension. The main reason is very similar to the previous one. We are currently in the process of adding many authoritative datasets to the GFS browser, which will then enable "official" data from the appropriate sources to be included in Wikipedia/Wikidata. In the next two months we need to work on the following topics:
- GFS browser features
- include mapping management to allow search for properties of new external sources
Please also see our email to the WMF Grants Administrator.
Approval
This request is approved. Your new Project end date is November 30, 2020, and your Final Report is due on December 30, 2020.
Marti (WMF) (talk) 19:08, 15 October 2020 (UTC)
Extension request
New end date
January 31, 2021
Rationale
Since the beginning of December 2020 we have again been dealing with corona and all its consequences, such as a national lockdown and missing child care. I am sorry to inform you that we need more time to finish our final report for the GlobalFactSyncRE project. We have already started to write the report and have requested bank statements to document all expenses. We need more time to summarize all project results and document the outcome. We hope that you and your families are safe and well, despite the disruptions and consequences of covid. Kind regards, Julia
Approval
This request is approved. Your new project end date is January 31, 2021.
--Marti (WMF) (talk) 22:23, 15 January 2021 (UTC)
- Noting here that your new final report due date is 2 March 2021. Thank you. -- JTud (WMF), Grants Administrator (talk) 23:13, 15 January 2021 (UTC)