Grants:Project/MFFUK/Wikidata & ETL/Timeline
This project is funded by a Project Grant
Timeline for MFFUK
Timeline | Date |
Analysis done | 30 06 2019 |
Wikimania Demo | 18 08 2019 |
Proof of concept transformations done | 31 10 2019 |
Documentation done | 30 11 2019 |
Monthly updates
Please prepare a brief project update each month, in a format of your choice, to share progress and learnings with the community along the way. Submit the link below as you complete each update.
March
Agreement signed. The actual start of the project was scheduled for April because contracts with the University had to be signed first.
April
- We have our own Wikibase instance set up for testing our bulk loading processes.
- We have successfully created the first few Wikibase items via a pipeline in LinkedPipes ETL (LP-ETL), showing where some development towards optimization needs to be done.
- We had a technical meeting with Czech Wikidata contributors, discussing possible approaches, pitfalls and potential new data sources for the project.
- We have identified improvements needed in LP-ETL to provide a better user experience during setup and debugging, and we have started implementing them. Specifically, it is now possible to browse debug data via HTTP (previously only via FTP), which will be useful to pipeline developers.
May
- We have further analysed the Wikibase API and its token handling, resulting in a more complex LinkedPipes ETL pipeline (screenshot attached). It works like this:
  - Get data from its source
  - Query the Wikibase Blazegraph instance for existing items
  - Create non-existent items
  - Update items (both pre-existing and newly created)
- The pipeline looks rather complex, but this is due to the nature of the Wikibase API, which is primarily designed for manual, webpage-based edits rather than machine-to-machine interaction.
- We attended the Wikimedia Hackathon 2019, where we met with developers of the Wikibase API and Wikidata to discuss our approach.
  - They confirmed that our strategy is correct and showed interest in LinkedPipes ETL.
  - They also confirmed that the API and token issues we identified are intentional: Wikidata prefers manual curation of items over bots, so handling the inconvenient bulk load (mass import) process is left to libraries and bots, which acts as a barrier against mass edits by non-experts.
  - They indicated interest in becoming users of our proof of concept.
- We analysed Wikidata Toolkit; so far it seems we will use it as a library to deal with the Wikibase API issues.
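The four pipeline steps listed above can be illustrated with a minimal Python sketch. It is not the LP-ETL pipeline itself: the `existing` mapping stands in for the result of the Blazegraph query, and `create_item`, `update_item`, and the `source_id` key are hypothetical names introduced here for illustration only.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Record = Dict[str, str]


def plan_load(
    records: Iterable[Record],
    existing: Dict[str, str],
    key: str = "source_id",
) -> Tuple[List[Record], List[Tuple[str, Record]]]:
    """Split source records into those needing a new item and those to update.

    `existing` maps a source identifier to an item ID; in a real pipeline
    this mapping would come from a query against the Wikibase query service.
    """
    to_create: List[Record] = []
    to_update: List[Tuple[str, Record]] = []
    for record in records:
        item_id = existing.get(record[key])
        if item_id is None:
            to_create.append(record)
        else:
            to_update.append((item_id, record))
    return to_create, to_update


def run_load(
    records: Iterable[Record],
    existing: Dict[str, str],
    create_item: Callable[[Record], str],
    update_item: Callable[[str, Record], None],
    key: str = "source_id",
) -> List[str]:
    """Create missing items, then update pre-existing and new items alike."""
    to_create, to_update = plan_load(records, existing, key)
    for record in to_create:
        # create_item is assumed to return the ID of the newly created item.
        to_update.append((create_item(record), record))
    for item_id, record in to_update:
        update_item(item_id, record)
    return [item_id for item_id, _ in to_update]
```

Separating planning from writing keeps the lookup step testable without touching a live Wikibase instance.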
June
- The analysis and requirements document, the output of work package 1, was created and published.
- Initial work has begun on the implementation of the new Wikibase loader component in LinkedPipes ETL.
- The original pipeline could be simplified significantly using the new component, as can be seen in the attached screenshot.
- We are registered for Wikimania 2019, where we have a workshop accepted. In addition, we will present a poster about the project. See you in Stockholm!
July
- For Wikimania 2019, we have created a poster showing the process of loading RDF data into Wikibase instances such as Wikidata.
- We are continuing to implement the Wikibase loader component. Specifically, we now have support for complex data types (quantity, geo, timevalue), somevalue, novalue, and initial support for references and qualifiers.
- There is a teaser for our Wikimania presentation on the LinkedPipes ETL news feed.
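For readers unfamiliar with the value kinds mentioned above, the following Python sketch builds snaks in the shape of the Wikibase JSON data model. It does not show how the LP-ETL component works internally (the component consumes RDF and relies on Wikidata Toolkit). The `make_snak` helper is hypothetical, and the property IDs and values are chosen purely for illustration.

```python
from typing import Any, Dict, Optional


def make_snak(
    prop: str,
    snaktype: str = "value",
    datatype: Optional[str] = None,
    value: Optional[Any] = None,
) -> Dict[str, Any]:
    """Build a snak dictionary in the shape of the Wikibase JSON data model."""
    if snaktype not in ("value", "somevalue", "novalue"):
        raise ValueError(f"unknown snaktype: {snaktype}")
    snak: Dict[str, Any] = {"snaktype": snaktype, "property": prop}
    if snaktype == "value":
        if datatype is None or value is None:
            raise ValueError("a value snak needs both a datatype and a value")
        snak["datavalue"] = {"type": datatype, "value": value}
    # somevalue ("unknown value") and novalue ("no value") carry no datavalue.
    return snak


# Quantity: the amount is a signed decimal string, the unit is an entity URI.
height = make_snak(
    "P2048",
    datatype="quantity",
    value={"amount": "+25", "unit": "http://www.wikidata.org/entity/Q11573"},
)

# Geographic coordinates on Earth (Q2), precision given in degrees.
location = make_snak(
    "P625",
    datatype="globecoordinate",
    value={
        "latitude": 50.0,
        "longitude": 14.4,
        "precision": 0.0001,
        "globe": "http://www.wikidata.org/entity/Q2",
    },
)

# Time value with day precision (11) in the proleptic Gregorian calendar.
planted = make_snak(
    "P571",
    datatype="time",
    value={
        "time": "+2019-08-18T00:00:00Z",
        "timezone": 0,
        "before": 0,
        "after": 0,
        "precision": 11,
        "calendarmodel": "http://www.wikidata.org/entity/Q1985727",
    },
)

unknown_date = make_snak("P571", snaktype="somevalue")
no_date = make_snak("P571", snaktype="novalue")
```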
August
- We prepared for the Wikimania 2019 demo workshop.
- We attended Wikimania 2019. Feedback from the poster session and the demo workshop was positive.
- We implemented most of the Wikidata RDF data format in the Wikibase loader LP-ETL component - it is now ready to be used in actual pipelines
- We also contributed to Wikidata Toolkit - the library our component uses, fixing a bug introduced with the recent MediaWiki version
September
- LP-ETL has been dockerized, so it can now be deployed easily.
- During our work on the proof of concept data loading pipelines, we identified a usability problem when working with Wikidata statements with multiple references. Therefore, we added a new loading mode to the component, which merges statements with references instead of replacing them (and possibly losing references).
- We are now in the process of obtaining bot permission for production loading of data about veteran trees in the Czech Republic into Wikidata.
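The difference between replacing and merging statements can be sketched in Python as follows. This is an illustration of the idea only, not the component's actual implementation: statements are simplified to dictionaries with hypothetical `property`, `value`, and `references` fields, and two statements are assumed to match when property and value are equal.

```python
from typing import Any, Dict, List, Tuple

Statement = Dict[str, Any]


def merge_statements(
    existing: List[Statement], incoming: List[Statement]
) -> List[Statement]:
    """Merge incoming statements into existing ones, keeping all references.

    A statement already present (same property and value) gains any new
    references; it is not replaced, so references added by others survive.
    """
    # Copy so that the caller's lists are not modified.
    merged = [dict(s, references=list(s["references"])) for s in existing]
    index: Dict[Tuple[str, str], Statement] = {
        (s["property"], s["value"]): s for s in merged
    }
    for stmt in incoming:
        key = (stmt["property"], stmt["value"])
        target = index.get(key)
        if target is None:
            new_stmt = dict(stmt, references=list(stmt["references"]))
            merged.append(new_stmt)
            index[key] = new_stmt
            continue
        for ref in stmt["references"]:
            if ref not in target["references"]:
                target["references"].append(ref)
    return merged
```

A plain replace would instead discard the existing statement together with its references, which is the loss the merge mode is meant to avoid.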
October
- The LP-ETL dockerization was improved: it no longer needs to run as the root user.
- Most of the work on resuming long running loads has been done
- We came into contact with the Theatre institute, which expressed interest in loading its data to Wikidata, and we will present our results to them in November
- We are a bit behind on the proof of concept transformations originally planned for October, mainly because we had to attend multiple conferences that month. We will catch up in November.
November
- Proof of concept pipelines
  - The November Wikidata Query Service lag complicated the development of the proof of concept pipelines.
  - The proof of concept pipeline loading data about Czech Remarkable Trees from the authoritative source has been approved.
  - The proof of concept pipeline loading data about Czech streets has been approved and has run successfully.
  - The proof of concept pipeline linking languages in Wikidata to languages in the Language EU Vocabulary has been developed and approved, and it has now run several times.
  - A volunteer, Martin Nečaský, created a pipeline based on the tutorial, loading data about theatres from the Arts and Theatre Institute; its approval is pending.
- Documentation and Communication
  - The documentation of the LP-ETL component has been significantly updated.
  - Based on the Remarkable Trees pipeline, a tutorial documenting our approach was created.
  - A blog post about our experiences during the development of the transformation pipelines was included in the tutorial on the LP-ETL website.
  - The Wikidata GLAM Facebook group was notified about the tutorial.