Grants:IEG/StrepHit: Wikidata Statements Validation via References/Final
This project is funded by an Individual Engagement Grant
This Individual Engagement Grant is renewed
Welcome to this project's final report! This report shares the outcomes, impact and learnings from the Individual Engagement Grantee's 6-month project.
Part 1: The Project
Summary
In a few short sentences, give the main highlights of what happened with your project. Please include a few key outcomes or learnings from your project in bullet points, for readers who may not make it all the way through your report.
Planned achievements, as per the project goals and timeline:
- Web Sources Production Corpus: 1.8 M items, 515 k documents (biographies), 53 reliable sources;
- Candidate Relations Set: 49 frames, 229 total frame elements, 133 unique frame elements, 69 unique Wikidata relations;
- StrepHit Pipeline Beta: v. 1.0 beta, v. 1.1 beta;
- Web Sources Knowledge Base: 842 k confident claims + 958 k supervised + 808 k rule-based = 2.6 M total claims;
- Primary Sources Tool: 5 merged pull requests, active request for comment.
Bonus achievements, beyond the goals:
- Web Sources Corpus: +265 k (+106%) documents, +3 sources;
- Candidate Relations Set: +19 (+38%) Wikidata relations;
- Web Sources Knowledge Base: +359 k (+16%) Wikidata claims;
- Candidate Items dataset: a set of entities found in the corpus that could be added to Wikidata (needs validation);
- Wiki Loves Monuments Italy: a prototype dataset for Wikidata;
- Italian companies dataset: a proof-of-scalability dataset (another language, another domain), as a result of the HackAtoka hackathon.
Codebase: 9,425 lines of Python code, 2 releases, 474 commits, 16 open issues, 37 closed issues.
- Access all the resources here: #Project_resources
- Play with the datasets. Read the instructions and provide feedback here: wikidata:Wikidata:Requests_for_comment/Semi-automatic_Addition_of_References_to_Wikidata_Statements
Methods and activities
What did you do in your project?
Please list and describe the activities you've undertaken during this grant. Since you already told us about the setup and first 3 months of activities in your midpoint report, feel free to link back to those sections to give your readers the background, rather than repeating yourself here, and mostly focus on what's happened since your midpoint report in this section.
The project has been managed as per Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References/Midpoint#Methods_and_activities.
Dissemination
As detailed in the planned outreach activities and in the April and May monthly reports, we conducted the following dissemination efforts after the midpoint.
- HackAtoka hackathon at SpazioDati: http://blog.atoka.io/hackatoka-open-innovation-al-lavoro-per-testare-le-nuove-atoka-api/ (in Italian)
- the StrepHit team in action, picture 1: http://blog.atoka.io/wp-content/uploads/2016/05/hackAtoka-brainstorming.jpg
- picture 2: http://blog.atoka.io/wp-content/uploads/2016/05/hackAtoka-MachineReadingNewsAPI-1024x683.jpg
- major revision of the research article submitted to the Semantic Web Journal: http://semantic-web-journal.org/content/n-ary-relation-extraction-simultaneous-t-box-and-box-knowledge-base-augmentation
- hackathon at Spaghetti Open Data Reunion: http://www.spaghettiopendata.org/content/wikidata-la-banca-di-conoscenza-libera-casa-wikimedia
- see the attendees: https://twitter.com/SignoraRamsay/status/728873548643770368/
- WikiCite 2016: WikiCite_2016, WikiCite_2016/Proposals/Generation_of_referenced_Wikidata_statements_with_StrepHit, WikiCite_2016/Report/Group_4
- Poster at Wikimania 2016: https://wikimania2016.wikimedia.org/wiki/Posters#StrepHit
Side Projects
Besides StrepHit, we have been contributing to the following projects:
- Primary Sources Tool, with 5 merged pull requests [1], [2], [3], [4], [5]
- Prototype import of Wiki Loves Monument Italy into Wikidata: http://it.dbpedia.org/downloads/strephit/wlm_italy_prototype/, wikidata:Wikidata:Project_chat/Archive/2016/06#Importing_Wiki_Loves_Monuments_lists_into_Wikidata
- Sphinx Python documentation builder: https://github.com/sphinx-doc/sphinx/pull/2444, https://github.com/Wikidata/StrepHit/tree/master/strephit/sphinx_wikisyntax
Outcomes and impact
Outcomes as per stated goals
What are the results of your project?
Please discuss the outcomes of your experiments or pilot, telling us what you created or changed (organized, built, grew, etc) as a result of your project.
The key planned outcomes of StrepHit are:
- the Web Sources Corpus, composed of circa 1.8 M items gathered from 53 reliable Web sources;
- the Natural Language Processing pipeline to extract Wikidata claims from free text (a high-level sketch is shown after this list);
- the Web Sources Knowledge Base, composed of circa 2.6 M Wikidata claims.
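As an illustration of how the pipeline turns free text into claims, here is a minimal sketch of the data flow through the components evaluated in #Claim_Correctness_Evaluation (linker, classifier, normalizer, resolver). All function names and data shapes below are hypothetical placeholders, not the actual StrepHit API; the real implementation lives in the GitHub repository linked under #Project_resources.

```python
def extract_claims(sentence, link_entities, classify, normalize, resolve):
    """Hypothetical sketch of the claim-generation flow, not the actual StrepHit API.

    `sentence` is assumed to be a dict with at least 'text' and 'url' (the source
    document URL); the four callables stand in for the linker, classifier,
    normalizer and resolver components.
    """
    entities = link_entities(sentence['text'])        # entity linking
    frames = classify(sentence['text'], entities)     # frame classification
    claims = []
    for frame in frames:
        subject_qid = resolve(frame['subject'])       # map the subject to a Wikidata QID
        for prop, raw_value in frame['elements']:
            value = normalize(prop, raw_value)        # e.g. dates to the QuickStatements format
            if subject_qid and value:
                # (subject, property, value, source URL later serialized as the S854 reference)
                claims.append((subject_qid, prop, value, sentence['url']))
    return claims
```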
Please use the table below to:
- List each of your original measures of success (your targets) from your project plan.
- List the actual outcome that was achieved.
- Explain how your outcome compares with the original target. Did you reach your targets? Why or why not?
Planned measure of success (include numeric target, if applicable) | Actual result | Explanation |
---|---|---|
Web Sources Production Corpus: 250 k documents, 50 sources | 1,778,149 items, 515,212 documents, 53 sources | +265,212 (+106%) documents, +3 sources |
Candidate Relations Set: 50 Wikidata relations | 49 frames, 229 total frame elements, 133 unique frame elements, 69 unique Wikidata relations | +19 (+38%) Wikidata relations |
StrepHit Pipeline Beta | releases: v. 1.0 beta, v. 1.1 beta | the first version is a working NLP pipeline. The second one contains improvements to the supervised classification system. |
Web Sources Knowledge Base: 2.25 M Wikidata claims | 842,208 confident + 958,491 supervised + 808,708 rule-based = 2,609,407 total claims | +359,407 (+16%) Wikidata claims. Note that we picked a subset of the rule-based output, with confidence scores > 0.8. The whole output actually contains 2,822,538 claims, and we would have obtained a much larger knowledge base: 4,623,237 total claims, thus +2,373,237 (+105%). However, we decided to discard potentially low-quality ones (a filtering sketch is shown after this table). |
Primary Sources Tool | 5 merged pull requests, active community discussion | On one hand, we have implemented the planned features. On the other, we have centralized the discussion on the tool usability and the available datasets. |
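For illustration only, the confidence-based selection mentioned in the Web Sources Knowledge Base row above could be reproduced along these lines, assuming each rule-based claim is serialized as one JSON object per line with a hypothetical `confidence` field (the actual StrepHit serialization may differ).

```python
import json

def filter_confident(in_path, out_path, threshold=0.8):
    """Keep only claims whose confidence score exceeds the threshold.

    Assumes one JSON object per line with a 'confidence' field.
    """
    kept = 0
    with open(in_path) as infile, open(out_path, 'w') as outfile:
        for line in infile:
            claim = json.loads(line)
            if claim.get('confidence', 0.0) > threshold:  # discard potentially low-quality claims
                outfile.write(line)
                kept += 1
    return kept
```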
Think back to your overall project goals. Do you feel you achieved your goals? Why or why not?
Yes. Not only did we achieve all the in-scope goals as per Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References#In_Scope, but we also exceeded all the quantitative expectations.
Furthermore, we produced a set of bonus achievements.
Bonus Outcomes
Besides the planned goals, we reached the following bonus outcomes, in order of relevance to the Wikidata community:
- the unresolved entities dataset. When generating the Web Sources Knowledge Base, a (rather large) set of entities could not be resolved to Wikidata QIDs. They may serve as candidates for new Wikidata Items;
- the Wiki Loves Monuments for Wikidata prototype dataset. We were contacted by Wikimedia Italy to implement a very first integration of a WLM Italy dataset into Wikidata;
- a rule-based statement extraction technique, which does not require any training set, although it may yield less accurate extractions. It can be thought of as a trade-off between the text annotation and the statement validation costs (see the sketch after this list);
- the Italian companies dataset, as a result of the HackAtoka hackathon. It is a proof of scalability for the StrepHit pipeline: the rule-based technique has been successfully applied to another domain (companies), in another language (Italian).
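To make the trade-off concrete, a rule-based extractor can emit a claim whenever a lexical unit and a simple argument pattern co-occur in a sentence, with no training data involved. The sketch below is purely illustrative, with hand-written example rules; the actual StrepHit rules are derived from the lexical database linked under #Project_resources.

```python
import re

# Illustrative rules only: lexical unit -> (Wikidata property, pattern capturing the object)
RULES = {
    'born': ('P19', re.compile(r'born in ([A-Z][\w\s]+)')),          # place of birth
    'educated': ('P69', re.compile(r'educated at ([A-Z][\w\s]+)')),  # educated at
}

def rule_based_extract(subject_qid, sentence, source_url):
    """Emit (subject, property, object text, reference URL) tuples for matching rules."""
    claims = []
    for lexical_unit, (prop, pattern) in RULES.items():
        if lexical_unit in sentence:
            match = pattern.search(sentence)
            if match:
                # The object string still needs to be resolved to a QID downstream.
                claims.append((subject_qid, prop, match.group(1).strip(), source_url))
    return claims

# Example:
# rule_based_extract('Q515632', 'Ossie Davis was educated at Howard University',
#                    'http://www.nndb.com/people/215/000042089/')
```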
Classification Output
Claim Correctness Evaluation
We carried out an empirical evaluation of the final output by randomly sampling 48 claims from each of the supervised and rule-based datasets. Since StrepHit is a pipeline with several components, we computed the accuracy of those responsible for the actual generation of claims. The results indicate the ratio of correct outputs for each component, as well as the overall claim correctness (a sketch of the computation follows the table). The reader may refer to the BSc thesis [6] and the article [7] for full details of the system architecture.
Dataset | Sampled claims | Linker | Classifier | Normalizer | Resolver | Overall |
---|---|---|---|---|---|---|
supervised | 48 | 0.8125 | 0.781 | 1 | 0.285 | 0.638 |
rule-based | 48 | 0.709 | 0.607 | 1 | 0.5 | 0.588 |
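For clarity, each per-component figure above is the fraction of the 48 sampled claims for which that component produced a correct output, while the Overall column refers to the end-to-end correctness of the final claim. A minimal sketch of the computation, assuming the judgments are stored as booleans (the actual evaluation scripts may differ):

```python
def accuracy_report(judgments):
    """Per-component and overall correctness ratios from a list of judgments.

    `judgments` is a list of dicts, one per sampled claim, e.g.
    {'linker': True, 'classifier': True, 'normalizer': True, 'resolver': False, 'overall': False}
    """
    components = ['linker', 'classifier', 'normalizer', 'resolver', 'overall']
    total = len(judgments)
    return {c: sum(j[c] for j in judgments) / total for c in components}

# With 48 sampled claims, 39 correct linker outputs yield 39 / 48 = 0.8125, and so on.
```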
Sample Claims
Machine-readable claims below are expressed in the QuickStatements syntax [8].
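For readers unfamiliar with the format, each QuickStatements line in the examples below carries a subject QID, a property PID, a value, and the source as an S854 (reference URL) pair. A minimal parsing sketch, which only handles the simple shape used in these examples:

```python
import shlex

def parse_quickstatement(line):
    """Split a simple QuickStatements line into subject, property, value and references."""
    tokens = shlex.split(line)  # shlex keeps the quoted reference URL in one token
    statement = {'subject': tokens[0], 'property': tokens[1], 'value': tokens[2]}
    # Remaining tokens come in (reference/qualifier property, value) pairs.
    for key, value in zip(tokens[3::2], tokens[4::2]):
        statement[key] = value
    return statement

# parse_quickstatement('Q515632 P69 Q1068752 S854 "http://www.nndb.com/people/215/000042089/"')
# -> {'subject': 'Q515632', 'property': 'P69', 'value': 'Q1068752',
#     'S854': 'http://www.nndb.com/people/215/000042089/'}
```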
Correct Examples
Machine | Human |
---|---|
Q18526540 P569 +00000001815-02-24T00:00:00Z/11 S854 "http://adb.anu.edu.au/biography/barkly-sir-henry-2936" | According to the Australian Dictionary of Biography, Arthur Barkly was born on February 24, 1815 |
Q16058737 P106 Q80687 S854 "https://ia902707.us.archive.org/1/items/biographicaldict08johnuoft/biographicaldict08johnuoft_djvu.txt" | According to The Biographical Dictionary of America, Charles Millard Pratt has been a secretary |
Q515632 P69 Q1068752 S854 "http://www.nndb.com/people/215/000042089/" | According to the Notable Names Database, Ossie Davis was educated at Howard University |
Q18922309 P937 Q777039 S854 "http://munksroll.rcplondon.ac.uk/Biography/Details/140" | According to the Royal College of Physicians, Henry Ashby has worked at Guy's Hospital |
Q4861627 P19 Q739700 S854 "http://www.bbc.co.uk/arts/yourpaintings/artists/barnett-freedman" | According to the BBC Your Paintings (now Art UK), Barnett Freedman was born in the East End of London |
Wrong Examples
Machine | Human | Comments |
---|---|---|
Q21454578 P463 Q42482 S854 "http://www.metal-archives.com/artists/Hugh_Gilmour/84280" | According to Encyclopædia Metallum, Hugh Gilmour was a member of the Iron Maiden | possibly homonymous subject (incorrect resolution), incorrect classification |
Q28144 P101 Q1193470 S854 "http://www.museothyssen.org/en/thyssen/ficha_artista/301" | According to the Thyssen-Bornemisza Museum, Willem Kalf's field of work is theme music | incorrect entity linking, incorrect classification |
Q3437676 P170 Q3908516 S854 "https://www.daao.org.au/bio/david-granger/" | According to Design & Art Australia Online, David Granger is the creator of entrepreneurship | homonymous subject (incorrect resolution), incorrect classification |
References Statistics
Domain | Confident | Supervised | Rule-based |
---|---|---|---|
adb.anu.edu.au | 52,419 | 154,979 | 119,239 |
collection.britishmuseum.org | 238,308 | 20,912 | 29,046 |
gameo.org | 2,113 | 6,544 | 7,334 |
munksroll.rcplondon.ac.uk | 4,114 | 18,438 | 12,649 |
archive.org | 8,103 | 39,062 | 30,146 |
collection.cooperhewitt.org | 2,383 | 11,550 | 13,677 |
sculpture.gla.ac.uk | 1,663 | 1,474 | 1,182 |
dictionaryofarthistorians.org | 1,358 | 3,620 | 4,969 |
en.wikisource.org | 51,232 | 227,346 | 209,411 |
rkd.nl | 44,690 | N.A. | N.A. |
structurae.net | 1,851 | N.A. | N.A. |
vocab.getty.edu | 213,436 | 6,137 | 4,052 |
www.bbc.co.uk | 54,070 | 2,109 | 2,254 |
www.brown.edu | N.A. | 1,200 | 1,144 |
www.daao.org.au | N.A. | 26,848 | 21,256 |
www.genealogics.org | 19,870 | 10,186 | 14,536 |
www.metal-archives.com | N.A. | 760 | 1,796 |
www.museothyssen.org | 1,468 | 1,498 | 2,096 |
www.newulsterbiography.co.uk | 3,284 | 3,438 | 5,379 |
www.nndb.com | 106,782 | 26,402 | 30,101 |
www.uni-stuttgart.de | 20,627 | N.A. | N.A. |
www.wga.hu | 9,762 | 5,088 | 5,944 |
yba.llgc.org.uk | 4,645 | 6,912 | 9,599 |
Total | 842,191 | 574,503 | 525,811 |
Grand total | 1,942,505 |
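For reference, per-domain counts like the ones above can be reproduced from the released datasets (see #Project_resources) by grouping claims on the host of their S854 reference URL. A minimal sketch, assuming the gzipped QuickStatements files linked in the Data section, with one claim and its S854 reference per line:

```python
import gzip
import re
from collections import Counter
from urllib.parse import urlparse

REFERENCE_URL = re.compile(r'S854\s+"([^"]+)"')

def count_by_domain(qs_gz_path):
    """Count claims per reference domain in a gzipped QuickStatements dataset."""
    counts = Counter()
    with gzip.open(qs_gz_path, 'rt', encoding='utf-8') as infile:
        for line in infile:
            match = REFERENCE_URL.search(line)
            if match:
                counts[urlparse(match.group(1)).netloc] += 1
    return counts

# e.g. count_by_domain('confident_dataset.qs.gz').most_common(5)
```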
Local Metrics
Metric | Achieved outcome | Explanation |
---|---|---|
1. Number of statements curated (approved + rejected) via the primary sources tool | 127,072 | It was not possible to measure the number of StrepHit-specific statements during the course of the project, since the final dataset (i.e., the Web Sources Knowledge Base) was expected at the end (cf. the last milestone in Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References/Timeline). We report instead the total number of curated statements, which we still believe to be a valid indicator of how StrepHit fostered the use of the primary sources tool. |
2. Number of primary sources tool users | 282 total, 10 of them also edited StrepHit data | It was not possible to fully assess this metric during the course of the project, for the same reason as above. The result shown was measured on an unplanned bonus outcome, namely the Semi-structured Dataset (cf. Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References/Midpoint#Bonus_Milestone:_Semi-structured_Development_Dataset). |
3. Number of involved data donors from Open Data organizations | 1 explicit, 2 potential | ContentMine has expressed its interest in a data donation to Wikidata via the primary sources tool, as per Grants_talk:IEG/StrepHit:_Wikidata_Statements_Validation_via_References#Support_from_ContentMine. We have also received informal expressions of interest from Openpolis (governmental data) and Till Sauerwein (biological data), which have not yet resulted in any written statement. |
4. Wikidata request for comment process | wikidata:Wikidata:Requests_for_comment/Semi-automatic_Addition_of_References_to_Wikidata_Statements | The discussion on the primary sources tool and its datasets is centralized. |
Global Metrics
We are trying to understand the overall outcomes of the work being funded across all grantees. In addition to the measures of success for your specific program (in above section), please use the table below to let us know how your project contributed to the "Global Metrics." We know that not all projects will have results for each type of metric, so feel free to put "0" as often as necessary.
- Next to each metric, list the actual numerical outcome achieved through this project.
- Where necessary, explain the context behind your outcome. For example, if you were funded for a research project which resulted in 0 new images, your explanation might be "This project focused solely on participation and articles written/improved, the goal was not to collect images."
For more information and a sample, see Global Metrics.
Metric | Achieved outcome | Explanation |
---|---|---|
1. Number of active editors involved | 282 total, 10 of them also edited StrepHit data | This global metric naturally maps to #2 of the local ones (cf. #Local_Metrics). |
2. Number of new editors | unknown | Not sure how to measure this. |
3. Number of individuals involved | 80 (estimated) | The dissemination activities have led to the involvement of several individuals, ranging from seminar attendees (both physical and remote), to hackathon participants, all the way to software contributors. The reported outcome is a rough estimate. |
4. Number of new images/media added to Wikimedia articles/pages | 0 | Not a goal of the project. |
5. Number of articles added or improved on Wikimedia projects | 2,609,407 Wikidata claims | This global metric naturally maps to the Web Sources Knowledge Base in-scope goal (cf. #Outcomes_as_per_stated_goals). |
6. Absolute value of bytes added to or deleted from Wikimedia projects | N.A. | Most of the StrepHit data will undergo a validation step via the primary sources tool before its eventual inclusion into Wikidata. Hence, this metric can only be measured after that step, and in any case it is not a meaningful indicator of this project's actual content contribution. On the other hand, the confident subset of the data will be directly uploaded after approval by the community. |
- Learning question
- Did your work increase the motivation of contributors, and how do you know?
Probably yes, but this is difficult to measure. Besides the request for comment and the past endorsements, we list below the most prominent examples of the positive feedback collected so far:
- Grants_talk:IEG/StrepHit:_Wikidata_Statements_Validation_via_References#Letters_of_Support
- Grants_talk:IEG/StrepHit:_Wikidata_Statements_Validation_via_References#Support_from_ContentMine
- https://twitter.com/gplynch/status/743836190713974784
- https://twitter.com/ReaderMeter/status/743141602781192192
- https://twitter.com/SignoraRamsay/status/728873548643770368
Indicators of impact
Do you see any indication that your project has had impact towards Wikimedia's strategic priorities? We've provided 3 options below for the strategic priorities that IEG projects are most likely to impact. Select one or more that you think are relevant and share any measures of success you have that point to this impact. You might also consider any other kinds of impact you had not anticipated when you planned this project.
Improve quality
The trustworthiness of Wikidata is an essential requirement for high quality: this entails the addition of references to authenticate its content. However, a large number of statements still lacks references. StrepHit ultimately aims at enhancing the quality of Wikidata statements via references to third-party authoritative Web sources.
The system is designed to guarantee at least one reference per statement, and has done so for a total of 1,942,505 statements. This is a clear indicator of quality improvement.
Increase participation
We believe we have invested considerable effort in requesting feedback on the primary sources tool and its datasets. This was mainly achieved through the hackathons and the work group at WikiCite 2016.
Increase reach
Outreach and dissemination activities aimed at increasing readership have been a high priority from the very beginning. Pointers to the full list are provided at #Dissemination.
Project resources
Please provide links to all public, online documents and other artifacts that you created during the course of this project. Examples include: meeting notes, participant lists, photos or graphics uploaded to Wikimedia Commons, template messages sent to participants, wiki pages, social media (Facebook groups, Twitter accounts), datasets, surveys, questionnaires, code repositories... If possible, include a brief summary with each link.
Data
- Web Sources Corpus
- Lexical database: http://it.dbpedia.org/downloads/strephit/lexical_db.json
- Web Sources Knowledge Base
- Confident dataset: http://it.dbpedia.org/downloads/strephit/web_sources_knowledge_base/confident_dataset.qs.gz
- Supervised dataset: http://it.dbpedia.org/downloads/strephit/web_sources_knowledge_base/supervised_dataset.qs.gz
- Rule-based dataset: http://it.dbpedia.org/downloads/strephit/web_sources_knowledge_base/rule-based_dataset.qs.gz
- Unresolved entities
- Confident: http://it.dbpedia.org/downloads/strephit/unresolved_entities/confident_unresolved.jsonl.gz
- Supervised: http://it.dbpedia.org/downloads/strephit/unresolved_entities/supervised_unresolved.jsonl.gz
- Rule-based: http://it.dbpedia.org/downloads/strephit/unresolved_entities/rule-based_unresolved.jsonl.gz
- Wiki Loves Monuments Italy prototype: http://it.dbpedia.org/downloads/strephit/wlm_italy_prototype/
- Italian Companies
- Corpus: http://it.dbpedia.org/downloads/strephit/italian_companies_dataset/hackatoka_corpus.jsonl.gz
- Lexical database: http://it.dbpedia.org/downloads/strephit/italian_companies_dataset/hackatoka_lexical_db.json
- Dataset (not resolved to Wikidata): http://it.dbpedia.org/downloads/strephit/italian_companies_dataset/hackatoka_dataset.jsonl.gz
- All other resources at: http://it.dbpedia.org/downloads/strephit/
Technical
- Codebase: https://github.com/Wikidata/StrepHit
- Documentation: https://www.mediawiki.org/wiki/StrepHit
Research
- Article
- First revision: http://semantic-web-journal.org/system/files/swj1270.pdf
- Second revision: http://semantic-web-journal.org/system/files/swj1380.pdf
- BSc thesis: http://it.dbpedia.org/downloads/strephit/emilio_dorigatti_bsc_thesis.pdf
Dissemination
- Kick-off seminar
- Event at Lugano: http://www.ated.ch/manifestazioni/7/web-30-il-potenziale-del-web-semantico-e-dei-dati-strutturati_3194.html (in Italian)
- HackAtoka hackathon: http://blog.atoka.io/hackatoka-open-innovation-al-lavoro-per-testare-le-nuove-atoka-api/ (in Italian)
- Spaghetti Open Data Reunion hackathon: http://www.spaghettiopendata.org/content/wikidata-la-banca-di-conoscenza-libera-casa-wikimedia
- WikiCite 2016:
- Main page: WikiCite_2016
- Proposal: WikiCite_2016/Proposals/Generation_of_referenced_Wikidata_statements_with_StrepHit
- Work group: WikiCite_2016/Report/Group_4
- Wikimania 2016 poster: https://wikimania2016.wikimedia.org/wiki/Posters#StrepHit
- Request for comment: wikidata:Wikidata:Requests_for_comment/Semi-automatic_Addition_of_References_to_Wikidata_Statements
Learning
The best thing about trying something new is that you learn from it. We want to follow in your footsteps and learn along with you, and we want to know that you took enough risks in your project to have learned something really interesting! Think about what recommendations you have for others who may follow in your footsteps, and use the below sections to describe what worked and what didn’t.
What worked well
What did you try that was successful and you'd recommend others do? To help spread successful strategies so that they can be of use to others in the movement, rather than writing lots of text here, we'd like you to share your finding in the form of a link to a learning pattern.
What didn’t work
What did you try that you learned didn't work? What would you think about doing differently in the future? Please list these as short bullet points.
- The lessons learnt reported in Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References/Midpoint#Challenges still apply;
- in general, we faced two major unplanned tasks, which affected the overall schedule of the project:
- construction of a suitable lexical database, since FrameNet failed to meet our needs;
- second revision of the scientific article.
- Both had a negative impact on the most delicate planned task, namely building the crowdsourced training set.
- On top of this, we faced additional issues related to the crowdsourcing platform and to the nature of the input corpus. Respectively:
- high execution time for certain lexical units that are not trivial to annotate (at the time of writing this report, some jobs are still running);
- a high percentage of sentence chunks that cannot be labeled with any frame element (more than 50% on average), which resulted in a relatively large number of empty sentences even after the annotation.
- This prevented us from gathering a sufficient number of training samples, which led to generally low performance of the supervised classifier, varying by lexical unit;
- Finding a general-purpose method to serialize the classification results into Wikidata assertions proved impossible, since we needed to understand the intended meaning of each Wikidata property, i.e., how it is actually used to represent the Wikidata world.
Next steps and opportunities
Are there opportunities for future growth of this project, or new areas you have uncovered in the course of this grant that could be fruitful for more exploration (either by yourself, or others)? What ideas or suggestions do you have for future projects based on the work you’ve completed? Please list these as short bullet points.
- StrepHit:
- Extend language capabilities to high-coverage languages, namely Spanish, French, and Italian;
- improve the performance of the supervised extraction;
- fix open issues.
- Primary sources tool:
- take control over the codebase;
- tackle usability issues;
- implement known feature requests.
Part 2: The Grant
Finances
Actual spending
Please copy and paste the completed table from your project finances page. Check that you’ve listed the actual expenditures compared with what was originally planned. If there are differences between the planned and actual use of funds, please use the column provided to explain them.
As mentioned in Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References/Midpoint#Finances, we adjusted the dissemination budget item, due to the unexpectedly low cost of the planned activities. Due to the issues detailed in #What_didn't_work, we also adjusted the training set item.
Expense | Approved amount | Actual funds spent | Difference |
---|---|---|---|
Project Leader | $21,908 | | |
NLP Developer | $7,160 | | |
Training Set | $500 | | |
Dissemination | $432 | | |
Total | $30,000 | | |
Remaining funds
Do you have any unspent funds from the grant?
Please answer yes or no. If yes, list the amount you did not use and explain why.
No.
If you have unspent funds, they must be returned to WMF. Please see the instructions for returning unspent funds and indicate here if this is still in progress, or if this is already completed:
Documentation
Did you send documentation of all expenses paid with grant funds to grantsadmin@wikimedia.org, according to the guidelines here?
Please answer yes or no. If no, include an explanation.
Yes.
Confirmation of project status
Did you comply with the requirements specified by WMF in the grant agreement?
Please answer yes or no.
Yes.
Is your project completed?
Please answer yes or no.
Regarding this 6-month scope, yes.
Grantee reflection
We’d love to hear any thoughts you have on what this project has meant to you, or how the experience of being an IEGrantee has gone overall. Is there something that surprised you, or that you particularly enjoyed, or that you’ll do differently going forward as a result of the IEG experience? Please share it here!
- The grant really allowed us to get involved in the community. We believe the IEG program is a complete success with respect to the individual engagement process (and the program title is a perfect fit);
- During the events we attended, we met lots of Wikimedians in person, and felt that we all share a huge amount of enthusiasm;
- Quoting an earlier reflection: "the Wikimedian community seems to have a silent minority, instead of majority: when asking for feedback, we always received constructive answers". This is indeed a virtuous circle: during the round 1 2016 IEG call, Ester, a grantee candidate, approached us. We were just pleased to help her improve the proposal, just as we received invaluable support when writing ours;
- Thanks to Marti's arrangements, we had the chance to have lunch with other IEGrantees at Wikimania 2016. It was just great to meet them; we really felt part of a community.
- We are extremely thankful to all the people who helped us during the course of the project. Making an exhaustive list would be impossible! Special thanks go to the Wikidata team, the Wikimedia research team, the IEG program officers, the IEG reviewers, and the endorsers.