Grants:IEG/StrepHit: Wikidata Statements Validation via References/Renewal/Final
This project is funded by an Individual Engagement Grant
This Individual Engagement Grant is renewed
Welcome to this project's final report! This report shares the outcomes, impact and learnings from the Individual Engagement Grantee's 12-month project.
Part 1: The Project
Summary
In a few short sentences, give the main highlights of what happened with your project. Please include a few key outcomes or learnings from your project in bullet points, for readers who may not make it all the way through your report.
Planned outcomes, based on the scope and the timeline:
- Primary sources tool back end version 2: a Wikidata Query Service module;
- Primary sources tool front end version 2: a MediaWiki extension;
- StrepHit confident dataset version 2: 497,247 statements, 3,326,446 RDF triples;
- StrepHit supervised dataset version 2: 574,143 statements, 4,040,460 RDF triples.
Extra outcomes, beyond the scope:
- QuickStatements to Wikidata RDF converter: transform a community-specific format to a mature Web standard;
- Dataset booster: elevate simple reference URLs to fully structured references;
- Several contributions to Wiki pages.
- Get all the resources here: #Project_resources
Methods and activities
What did you do in your project?
Please list and describe the activities you've undertaken during this grant. Since you already told us about the setup and first 6 months of activities in your midpoint report, feel free to link back to those sections to give your readers the background, rather than repeating yourself here, and mostly focus on what's happened since your midpoint report in this section.
The project was conducted in the same way as described in Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References/Renewal/Midpoint#Methods_and_activities.
Software development
The team focused almost entirely on the release of version 2 of the primary sources tool (PST from now on), which required work until the very end of the project.
Datasets
- We still managed to publish version 2 of the StrepHit datasets, with refreshed URLs and fully structured references;
- we started a discussion with Tpt and Lydia_Pintscher_(WMDE) to understand why the import of Freebase has stalled, and identified two main causes:
  - the datasets have quality issues, with lots of unreferenced statements and blacklisted references;
  - version 1 of the PST was pretty much unusable.
In this project, we tried to tackle the latter cause. The former is out of scope and is left open in the Related to Freebase column on the Phabricator work board: phab:project/board/2788/.
Side projects
Besides the PST, we:
- completed the QuickStatements to Wikidata RDF converter (see Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References/Renewal/Midpoint#Side project: QuickStatements to Wikidata RDF converter);
- developed the dataset booster, a script that enhances the references of a dataset in QuickStatements format. Among its features, the URL validation can serve as a starting point for the validator component of the upcoming soweego project (see Grants:Project/Hjfocs/soweego#Project_plan).
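The report does not describe how the dataset booster validates URLs. As a minimal sketch of what an offline, syntax-only check could look like, the following assumes that a reference URL is acceptable when it uses the http or https scheme and names a host. The function names `is_plausible_reference_url` and `split_by_validity` are hypothetical and are not taken from the booster's code.

```python
from urllib.parse import urlsplit


def is_plausible_reference_url(url: str) -> bool:
    """Syntax-only check (assumption): http(s) scheme plus a host name.

    It does not test whether the page is reachable.
    """
    try:
        parts = urlsplit(url.strip())
    except ValueError:
        # urlsplit raises ValueError on some malformed inputs,
        # for instance an unbalanced IPv6 bracket.
        return False
    return parts.scheme in ("http", "https") and bool(parts.hostname)


def split_by_validity(urls):
    """Partition URLs into (valid, invalid) lists, preserving order."""
    valid, invalid = [], []
    for url in urls:
        target = valid if is_plausible_reference_url(url) else invalid
        target.append(url)
    return valid, invalid
```

A real validator would probably also resolve each URL over the network. That step is left out here so that the sketch stays self-contained.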
Outcomes and impact
Outcomes
What are the results of your project?
Please discuss the outcomes of your experiments or pilot, telling us what you created or changed (organized, built, grew, etc) as a result of your project.
Planned
Please use the below table to:
- List each of your original measures of success (your targets) from your project plan.
- List the actual outcome that was achieved.
- Explain how your outcome compares with the original target. Did you reach your targets? Why or why not?
Planned measure of success (include numeric target, if applicable) | Actual result | Explanation |
PST back end redesign | Back end version 2 beta | Component rewritten from scratch as a Wikidata Query Service module. The choice to fork such a central project for Wikidata was made for self-sustainability and standardization purposes. |
PST front end redesign | Front end version 2 beta | Component ported from a Wikidata gadget to a MediaWiki extension and partially rewritten. The same rationale as for the back end applies. |
Make the primary sources list usable | Filter module | Module rewritten from scratch, with a workflow inspired by the previous version. Now a key part of the tool. |
Developer community engagement | Code reviews | The Gerrit patches and GitHub pull requests we contributed attracted developers from outside the team. |
Standard dataset release flow for third-party providers | Ingestion API | Data providers can now upload and update their datasets. |
StrepHit datasets version 2 | Confident dataset + supervised dataset, version 2 | Datasets with fully structured up-to-date references. |
StrepHit lexical database version 2 | None | Out of time |
StrepHit direct inclusion dataset | None | Out of time |
StrepHit unresolved entities dataset | None | Out of time |
Think back to your overall project goals. Do you feel you achieved your goals? Why or why not?
With respect to the PST part, yes. On the other hand, we only partially met the StrepHit goals, due to the unexpected workload that emerged during the development of the tool version 2. See the reasons in #What didn’t work.
Extra
The scheduled tasks on StrepHit led to the development of additional projects, aiming at the same target: to homogenize the release flow for third-party data providers.
- the QuickStatements to Wikidata RDF converter keeps the support for the QuickStatements format (compact and widespread in the Wikidata community, albeit non-standard) and alleviates the burden of the RDF format (more complex, yet a mature Web standard);
- the dataset booster is designed to make statement references uniform and as rich as possible.
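To illustrate the gap between the two formats, here is a heavily simplified sketch that turns one tab-separated QuickStatements line (item, property, value, then optional `S`-prefixed reference pairs) into Turtle-like triples that follow the general shape of the Wikidata RDF model. It is not the converter's implementation. The hash-based statement and reference node identifiers are an assumption made for illustration, qualifiers are skipped, and only item and URL values are handled.

```python
import hashlib
import re

# Prefixes used by the Wikidata RDF model.
PREFIXES = """\
@prefix wd: <http://www.wikidata.org/entity/> .
@prefix wds: <http://www.wikidata.org/entity/statement/> .
@prefix wdref: <http://www.wikidata.org/reference/> .
@prefix p: <http://www.wikidata.org/prop/> .
@prefix ps: <http://www.wikidata.org/prop/statement/> .
@prefix pr: <http://www.wikidata.org/prop/reference/> .
@prefix prov: <http://www.w3.org/ns/prov#> .
"""

ITEM = re.compile(r"^Q\d+$")


def to_term(value: str) -> str:
    """Map a QuickStatements value to an RDF term (items and URLs only)."""
    if ITEM.match(value):
        return f"wd:{value}"
    if len(value) >= 2 and value.startswith('"') and value.endswith('"'):
        inner = value[1:-1]
        if inner.startswith(("http://", "https://")):
            return f"<{inner}>"
    # Assumption: anything else is passed through unchanged; dates,
    # quantities and coordinates would need dedicated handling.
    return value


def qs_line_to_turtle(line: str) -> str:
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3 or (len(fields) - 3) % 2:
        raise ValueError("expected item, property, value and key/value pairs")
    subject, prop, value = fields[:3]
    # Hypothetical node identifiers derived from a hash of the input line.
    digest = hashlib.sha1(line.encode("utf-8")).hexdigest()
    statement = f"wds:{subject}-{digest}"
    rows = [
        f"wd:{subject} p:{prop} {statement} .",
        f"{statement} ps:{prop} {to_term(value)} .",
    ]
    pairs = [(fields[i], fields[i + 1]) for i in range(3, len(fields), 2)]
    references = [(k, v) for k, v in pairs if k.startswith("S")]
    if references:
        node = f"wdref:{digest}"
        rows.append(f"{statement} prov:wasDerivedFrom {node} .")
        for key, ref_value in references:
            # S854 in QuickStatements stands for reference property P854.
            rows.append(f"{node} pr:P{key[1:]} {to_term(ref_value)} .")
    return "\n".join(rows)
```

A single five-field QuickStatements line becomes four triples here, which gives an idea of why the RDF datasets listed in the summary hold several times more triples than statements.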
Local Metrics
Metric | Achieved outcome | Explanation |
1. Number of statements curated (approved + rejected) through the PST | 394,870 | Version 2 is not deployed in Wikidata yet: the reported value refers to the previous version. We still view it as an impact measure of this project. |
2. Engagement of open data organizations | None | We have not reached this phase, due to the unforeseen implementation efforts required by the PST. |
3. Number of PST users | 695 | Same reasons as metric 1. |
Global Metrics
We are trying to understand the overall outcomes of the work being funded across all grantees. In addition to the measures of success for your specific program (in above section), please use the table below to let us know how your project contributed to the "Global Metrics." We know that not all projects will have results for each type of metric, so feel free to put "0" as often as necessary.
- Next to each metric, list the actual numerical outcome achieved through this project.
- Where necessary, explain the context behind your outcome. For example, if you were funded for a research project which resulted in 0 new images, your explanation might be "This project focused solely on participation and articles written/improved, the goal was not to collect images."
For more information and a sample, see Global Metrics.
Metric | Achieved outcome | Explanation |
1. Number of active editors involved | 695 | This global metric corresponds to local metric 3. |
2. Number of new editors | 268 | Difference between metric 1 and the total users reported in the proposal: Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References/Renewal#cite_note-4. |
3. Number of individuals involved | 150 (estimated) | Sum of the participants in the main dissemination events. |
4. Number of new images/media added to Wikimedia articles/pages | 13 | Not a target of this project, but still measurable. |
5. Number of articles added or improved on Wikimedia projects | 1,071,390 Wikidata statements | Sum of the StrepHit datasets version 2 statements. Not actually added to Wikidata until curation through the PST occurs. |
6. Absolute value of bytes added to or deleted from Wikimedia projects | N.A. | Not measurable until curation through the PST occurs. |
- Learning question
- Did your work increase the motivation of contributors, and how do you know?
It is not easy to give an objective answer. We list below some relevant feedback, both positive and negative:
- d:Wikidata_talk:Primary_sources_tool#Quality_of_sources;
- phab:T148150;
- d:Wikidata:Requests_for_comment/Semi-automatic_Addition_of_References_to_Wikidata_Statements#A_whitelist_for_sources;
- phab:T148142;
- phab:T145930.
Indicators of impact
Do you see any indication that your project has had impact towards Wikimedia's strategic priorities? We've provided 3 options below for the strategic priorities that IEG projects are most likely to impact. Select one or more that you think are relevant and share any measures of success you have that point to this impact. You might also consider any other kinds of impact you had not anticipated when you planned this project.
- Improve quality
- This has been the central goal since the very beginning of StrepHit: the lack of references in Wikidata is a significant obstacle to its data quality. With version 2, we refined the StrepHit statements and enriched their references. At the time of writing this report, we do not have any specific indicator for them, because the new version of the PST is not deployed in Wikidata yet. Nevertheless, a total of 287,988 statements have been approved (that is, included in Wikidata) so far: this is clearly a sign of impact.
- Increase participation
- 268 new users of the PST since the StrepHit renewal proposal evidently indicate the impact.
- Increase reach
- We expect that our efforts towards the flow standardization for third-party data providers will yield impact on this strategic priority.
Project resources
Please provide links to all public, online documents and other artifacts that you created during the course of this project. Examples include: meeting notes, participant lists, photos or graphics uploaded to Wikimedia Commons, template messages sent to participants, wiki pages, social media (Facebook groups, Twitter accounts), datasets, surveys, questionnaires, code repositories... If possible, include a brief summary with each link.
Wikidata primary sources tool
- Main repository: https://github.com/Wikidata/primarysources/
- Back end as a Wikidata Query Service module: https://github.com/marfox/pst-backend
  - Web API interactive documentation: https://tools.wmflabs.org/primary-sources-v2/
  - Java API documentation: https://tools.wmflabs.org/primary-sources-v2/javadoc/
- Front end as a MediaWiki extension:
  - GitHub repository: https://github.com/marfox/pst-frontend
  - Gerrit repository: https://gerrit.wikimedia.org/g/mediawiki/extensions/PrimarySources
  - MediaWiki extension page: mw:Extension:PrimarySources
Datasets
- StrepHit confident version 2, QuickStatements format: https://tools.wmflabs.org/primary-sources-v2/strephit_confident_v2.qs
- StrepHit confident version 2, Turtle format: https://tools.wmflabs.org/primary-sources-v2/strephit_confident_v2.ttl
- StrepHit supervised version 2, QuickStatements format: https://tools.wmflabs.org/primary-sources-v2/strephit_supervised_v2.qs
- StrepHit supervised version 2, Turtle format: https://tools.wmflabs.org/primary-sources-v2/strephit_supervised_v2.ttl
Dissemination
- Slides of the talk given at WikiCite 2017: commons:File:The Primary Sources Tool.pdf;
- the Italian community at ItWikiCon 2017, the first national Wiki conference: commons:File:ItWikiCon 2017 - Group photo 08.jpg;
- can you spot Hjfocs in this ItWikiCon 2017 video recorded by a national TV channel, local news (in Italian)? https://www.facebook.com/TgrRaiTrentino/videos/1214433018701415/
Contributions to Wiki pages
- PST uplift proposal: d:Wikidata:Primary_sources_tool;
- Wikidata edits made by the main grantee during the project time span: https://www.wikidata.org/w/index.php?limit=500&title=Special%3AContributions&contribs=user&target=Hjfocs&hideMinor=1&start=2017-05-22&end=2018-05-21;
- walkthrough to enable the Wikidata role in MediaWiki Vagrant: mw:MediaWiki-Vagrant#wikidata;
- tutorial on how to import a Wikidata dump into a MediaWiki Vagrant instance: mw:MediaWiki-Vagrant#How_to_import_a_Wikidata_dump;
- updated documentation on Cloud VPS instance management: wikitech:Help:Instances#Managing_Instances;
- a note highlighting that a Web proxy is enough if you just need a Web server on your Cloud VPS instance: wikitech:Help:Addresses;
- more detailed procedure to set up a MediaWiki Vagrant instance: wikitech:Help:MediaWiki-Vagrant_in_Cloud_VPS;
- uploads to Commons: commons:Category:Primary_sources_tool.
Side projects
- QuickStatements to Wikidata RDF converter: https://github.com/marfox/qs2rdf/
- QuickStatements dataset booster: https://github.com/EdoardoLenzi9/Wikipedia.StrepHit
Learning
The best thing about trying something new is that you learn from it. We want to follow in your footsteps and learn along with you, and we want to know that you took enough risks in your project to have learned something really interesting! Think about what recommendations you have for others who may follow in your footsteps, and use the below sections to describe what worked and what didn’t.
What worked well
What did you try that was successful and you'd recommend others do? To help spread successful strategies so that they can be of use to others in the movement, rather than writing lots of text here, we'd like you to share your finding in the form of a link to a learning pattern.
- Learning_patterns/Working_with_developers_who_are_not_Wikimedians;
- Learning_patterns/Writing_a_new_MediaWiki_extension_for_deployment_on_a_Wikimedia_project.
What didn’t work
What did you try that you learned didn't work? What would you think about doing differently in the future? Please list these as short bullet points.
The following set of issues related to the PST part had a highly negative effect on the whole project plan, preventing the team from effectively addressing the StrepHit one:
- the front end version 1 was designed in an inflexible way, which precluded both the implementation of some desired features and appropriate code refactoring;
- we felt that MediaWiki software development often follows peculiar practices, so it took us longer than expected to understand what to do and how;
- scattered, outdated, and sometimes redundant documentation made the task even harder;
- the development of a MediaWiki extension has a steep learning curve;
- the Wikidata Query Service is a big Java project and turned out to be overkill for our purposes. More specifically:
  - relatively straightforward Web services actually demanded a lot of code;
  - the deployment had high memory requirements, which ruled out the option of a Toolforge machine.
Other recommendations
If you have additional recommendations or reflections that don’t fit into the above sections, please list them here.
Quoting a sentence from the midpoint report:
“The main lesson learnt may sound rather sad, but would certainly allow one to save the vast majority of the overall implementation effort in future projects: it is probably more reasonable to develop a new version of a piece of software from scratch, rather than to try to put the hands on existing code that was built in a totally undocumented, quick-and-dirty fashion.”
Next steps and opportunities
Are there opportunities for future growth of this project, or new areas you have uncovered in the course of this grant that could be fruitful for more exploration (either by yourself, or others)? What ideas or suggestions do you have for future projects based on the work you’ve completed? Please list these as short bullet points.
As already mentioned, the team devoted almost all of its efforts to the PST. Therefore, a significant amount of work is still needed to take StrepHit to the next level, and it may result in a new grant proposal. See the section below.
StrepHit datasets
- The entity reconciliation task was responsible for most errors in the final output, as highlighted in Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References/Final#Claim_Correctness_Evaluation (Resolver column). This will be a key activity of the upcoming soweego project, where we expect to develop methods that are likely to be useful for StrepHit as well;
- the knowledge representation task caters for the conversion between facts extracted from natural language and Wikidata statements. It plays an essential role in the actual correctness of the final output. The lexical database behind this task still needs a rethink;
- the planned bot to import statements above a high confidence threshold has to be implemented;
- the dataset of unresolved entities has yet to be analyzed.
Primary sources tool
- Open tasks: phab:project/board/2788/query/open/;
- Technical details for the back end:
  - the Blazegraph data loader already performs an RDF syntax check;
  - for scalability purposes, read RDF in streaming mode with the N-Triples (NT) serialization, instead of loading Turtle in memory;
  - add rdf:type triples to qualifiers and references, and investigate the trade-off between faster execution of queries and higher volume of datasets;
  - implement a way to automatically reindex the StrepHit corpus when reference URLs change;
- Technical details for the front end:
  - optimize the query for the Entity of interest filter;
  - understand the optimal SPARQL query limit threshold to avoid empty result tables in the autocompletion filters;
  - enable the support for on-the-fly reference editing on suggested statements.
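The streaming item in the back end list relies on a property of N-Triples: each line holds exactly one complete triple, so a reader can process a dataset line by line without building the whole graph in memory. The sketch below illustrates the idea in Python, although the back end itself is written in Java. It is a naive splitter under stated assumptions, not a conformant parser: it expects the subject and predicate to contain no whitespace and does not handle a comment placed after a triple on the same line. The name `iter_ntriples` is hypothetical.

```python
from typing import Iterable, Iterator, Tuple

Triple = Tuple[str, str, str]


def iter_ntriples(lines: Iterable[str]) -> Iterator[Triple]:
    """Yield (subject, predicate, object) one line at a time.

    Memory use stays bounded by the longest line, whatever the dataset size.
    """
    for number, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and whole-line comments
        if not line.endswith("."):
            raise ValueError(f"line {number}: missing final '.'")
        # Subject and predicate never contain whitespace in N-Triples,
        # so the remainder after two splits is the object term.
        parts = line[:-1].rstrip().split(None, 2)
        if len(parts) != 3:
            raise ValueError(f"line {number}: expected three terms")
        yield parts[0], parts[1], parts[2]


def count_triples(lines: Iterable[str]) -> int:
    """Consume the stream without keeping any triple in memory."""
    return sum(1 for _ in iter_ntriples(lines))
```

Since an open file object is itself an iterable of lines, passing one to `iter_ntriples` streams the file from disk. Turtle, by contrast, allows prefixes and statements that span several lines, so a line-by-line reader like this one would not work on it.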
Freebase (third-party dataset)
Significant effort is required to slice this huge dataset into a suitable subset for the PST.
See the Related to Freebase column in the Phabricator project page for more details: phab:project/board/2788/.
Part 2: The Grant
Finances
Actual spending
Please copy and paste the completed table from your project finances page. Check that you’ve listed the actual expenditures compared with what was originally planned. If there are differences between the planned and actual use of funds, please use the column provided to explain them.
Expense | Approved amount | Actual funds spent | Difference |
Project Leader | 41,379 € | 43,260 € | Balancing based on the other items. |
WikiCite 2017(1) | 1,196 € | 106 € | The rescheduled event took place near the grantee's physical location. |
Wikimania 2017 | 2,079 € | 1,838 € | Flight costs were a little lower than the estimated ones. |
Training Set | 550 € | 0 € | Not enough time for task S2. |
Total | 45,204 € | 45,204 € |
(1) replaces Wikimedia Developer Summit 2017.
Remaining funds
Do you have any unspent funds from the grant?
Please answer yes or no. If yes, list the amount you did not use and explain why.
No.
Documentation
Did you send documentation of all expenses paid with grant funds to grantsadmin wikimedia.org, according to the guidelines here?
Please answer yes or no. If no, include an explanation.
Yes.
Confirmation of project status
Did you comply with the requirements specified by WMF in the grant agreement?
Please answer yes or no.
Yes.
Is your project completed?
Please answer yes or no.
Yes, although some tasks are left open, particularly those tagged as Epic in the To do column: phab:project/board/2788/.
Grantee reflection
We’d love to hear any thoughts you have on what this project has meant to you, or how the experience of being an IEGrantee has gone overall. Is there something that surprised you, or that you particularly enjoyed, or that you’ll do differently going forward as a result of the IEG experience? Please share it here!
We would just like to thank all the great people from the Wikimedia world who helped us build the new version of the PST. The list is in chronological order and may not be exhaustive, so please forgive us if we missed you!
- LZia_(WMF), research scientist at WMF, for her irreplaceable technical advice and precise reviews of this project;
- Mjohnson_(WMF), program officer at WMF, for being a tower of strength throughout the grant, and for setting up the grantees luncheon at Wikimania 2017;
- Jtud_(WMF), grants administrator at WMF, for the prompt, crystal-clear written communications;
- Dario_(WMF), head of research at WMF, for organizing the key event WikiCite and for converging to a shared research vision;
- Lydia_Pintscher_(WMDE) and the whole Wikidata team for their incalculable, ceaseless support;
- Tpt, core developer of the gadget version and publisher of the Freebase datasets, for his keen guidance;
- T_Arrow and the WikiFactMine team, for the crucial conversations;
- Sjoerddebruin, power Wikidata user, for the regular interactions;
- Smalyshev_(WMF), core developer of the mw:Wikidata_query_service, for his code reviews and the fruitful discussion at Wikimania 2017;
- GLederrey_(WMF), operations engineer at WMF, for his priceless Java tips and best practices;
- BDavis_(WMF), engineering manager at WMF, for his vital assistance on the Cloud VPS infrastructure.