Grants:IEG/StrepHit: Wikidata Statements Validation via References/Midpoint
This project is funded by an Individual Engagement Grant
This Individual Engagement Grant is renewed
Welcome to this project's midpoint report! This report shares progress and learnings from the Individual Engagement Grantee's first 3 months.
Summary
In a few short sentences or bullet points, give the main highlights of what happened with your project so far.
Planned achievements, as per the project goals and timeline:
- Web Sources Development Corpus: 1.6 M items, 500 k documents (biographies), 53 reliable sources;
- Candidate Relations Set: 50 relations;
- Primary Sources Tool: increased development activity [1], 2 merged pull requests [2], [3].
Bonus achievements, beyond the goals:
- Web Sources Corpus: +300 k (+150%) documents, +13 sources, compared to the expected size of the development corpus (cf. Work Package T1);
- Semi-structured Development Dataset: 100 k Wikidata statements.
Codebase: 6.7 k lines of Python code, 311 commits, 10 open issues, 13 closed issues.
Methods and activities
How have you set up your project, and what work has been completed so far?
Describe how you've set up your experiment or pilot, sharing your key focuses so far and including links to any background research or past learning that has guided your decisions. List and describe the activities you've undertaken as part of your project to this point.
Technical Setup
- We requested the credentials and created a GitHub repository within the Wikidata organization [4];
- the official documentation page is hosted at mediawiki.org [5] (work in progress);
- besides the planned work package, special development efforts have been devoted to:
  - a modular architecture;
  - parallel processing;
  - caching;
  - usability of StrepHit both as a library and as a set of command line tools;
  - an easy-to-use command line to run all the pipeline steps;
  - a flexible logging facility.
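As an illustration of how parallel processing, caching, and logging can fit together, here is a minimal sketch using only the Python standard library. It is not taken from the StrepHit codebase: the `normalize` step, the `process_all` helper, and the logger name are hypothetical.

```python
import logging
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("pipeline_sketch")  # hypothetical logger name


@lru_cache(maxsize=None)
def normalize(text):
    """Hypothetical processing step; results are cached per distinct input."""
    logger.debug("computing result for %r", text)
    return " ".join(text.lower().split())


def process_all(documents, workers=4):
    """Apply the step to every document in parallel, preserving input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(normalize, documents))


docs = ["Born in  Trento", "born in trento", "Born in  Trento"]
results = process_all(docs)
print(results)
```

Threads are a reasonable choice for I/O-bound steps such as Web requests; for CPU-bound steps, `ProcessPoolExecutor` from the same module may be a better fit.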
Project Management
- Monday face-to-face meetings for brainstorming ideas and weekly planning;
- daily scrums, especially for unexpected technical issues, but also for brainstorming;
- whiteboard for crystallized ideas;
- yellow stickers on the whiteboard for ideas to be investigated;
- regular interaction with relevant mailing lists and key people to discuss potential impacts and to gather suggestions;
- project dissemination in the form of seminars and talks.
Research Activities
As a further outreach channel towards research communities, we have submitted a full article to the Semantic Web Journal [6], one of the top-ranked journals worldwide in the Information Systems field [7]. The review process is known to be time-consuming: so far, we have uploaded a first version [8], which focuses on past efforts carried out with DBpedia and has passed the first round of reviews. We are currently working on a major revision that will include more details on StrepHit.
Midpoint outcomes
What are the results of your project or any experiments you’ve worked on so far?
Please discuss anything you have created or changed (organized, built, grown, etc) as a result of your project to date.
From a technical perspective, the project has so far delivered software and content. Specifically, with respect to software, the following modules have reached a mature state:
- Web Sources Corpus [9], i.e., a set of Web spiders that harvest data from the selected biographical authoritative sources [10];
- Corpus Analysis [11], i.e., a set of scripts to process the corpus and to generate a ranking of the Candidate Relations;
- Commons [12], i.e., several facilities to ensure a scalable and reusable codebase. The general-purpose ones include parallel processing, fine-grained logging, and caching. On the Natural Language Processing (NLP) side, special attention is paid to fostering future multilingual implementations, thanks to the modularity of the NLP components, such as tokenization [13], sentence splitting [14], and part-of-speech tagging [15].
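To give an idea of what modular, language-specific NLP components can look like, the sketch below registers a tokenizer and a sentence splitter per language code, so that a new language only needs new registrations. The registries, the `register` decorator, and the naive regular expressions are assumptions made for illustration, not the actual StrepHit design.

```python
import re

# Hypothetical registries mapping a language code to its component.
TOKENIZERS = {}
SENTENCE_SPLITTERS = {}


def register(registry, language):
    """Decorator that stores a component under the given language code."""
    def decorator(func):
        registry[language] = func
        return func
    return decorator


@register(TOKENIZERS, "en")
def tokenize_en(sentence):
    # Naive rule: words (with internal apostrophes) or single punctuation marks.
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", sentence)


@register(SENTENCE_SPLITTERS, "en")
def split_sentences_en(text):
    # Naive rule: split after '.', '!' or '?' when followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def analyze(text, language="en"):
    """Split the text into sentences, then tokenize each sentence."""
    split = SENTENCE_SPLITTERS[language]
    tokenize = TOKENIZERS[language]
    return [tokenize(sentence) for sentence in split(text)]


print(analyze("He was born in Trento. He painted portraits."))
```

With this layout, the pipeline code only calls `analyze` with a language code and never needs to know which concrete implementation is behind it.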
The following modules have been started and are in active development:
- Extraction [16], i.e., the logic needed to extract different sets of sentences, to be used for training and testing the classifier, as well as for the actual production of Wikidata content;
- Annotation [17], i.e., a set of scripts to interact with the CrowdFlower crowdsourcing platform APIs, in order to create and post annotation jobs, and to pull results.
Content outcomes are presented in the next sections.
Milestone #1: Web Sources Development Corpus
Items & Biographies across Web Domains
Source domain | # items | # biographies |
---|---|---|
www.genealogics.org | 447,045 | 10,621 |
www.metal-archives.com | 355,784 | 7,988 |
rkd.nl | 206,993 | 0 |
vocab.getty.edu | 199,502 | 199,496 |
collection.britishmuseum.org | 118,883 | 101,117 |
en.wikisource.org | 60,403 | 60,355 |
www.nndb.com | 40,331 | 40,331 |
www.bbc.co.uk | 38,018 | 1,321 |
www.catholic-hierarchy.org | 37,313 | 0 |
www.daao.org.au | 19,696 | 9,848 |
adb.anu.edu.au | 19,086 | 19,086 |
gameo.org | 13,858 | 13,850 |
www.uni-stuttgart.de | 10,679 | 0 |
archive.org | 8,721 | 8,719 |
cesar.org.uk | 7,044 | 0 |
munksroll.rcplondon.ac.uk | 6,959 | 6,921 |
sculpture.gla.ac.uk | 6,378 | 5,631 |
structurae.net | 6,340 | 0 |
yba.llgc.org.uk | 4,470 | 4,470 |
www.wga.hu | 3,952 | 3,927 |
collection.cooperhewitt.org | 3,407 | 3,407 |
dictionaryofarthistorians.org | 2,442 | 2,259 |
www.newulsterbiography.co.uk | 2,060 | 2,060 |
royalsociety.org | 1,596 | 1,580 |
www.parliament.uk | 650 | 0 |
www.museothyssen.org | 627 | 585 |
www.brown.edu | 601 | 601 |
www.academia-net.org | 525 | 0 |
Total | 1,623,381 | 504,189 |
Items & Biographies Wikisource Breakdown
Source | # items | # biographies |
---|---|---|
DNB | 28,001 | 27,997 |
Catholic Encyclopedia | 11,466 | 11,462 |
Naval Bio | 4,692 | 4,688 |
Indian Bio | 2,440 | 2,427 |
American Bio | 2,209 | 2,207 |
National Bio 1912 | 1,631 | 1,631 |
Australasian Bio | 1,590 | 1,590 |
Irish Officers | 1,530 | 1,524 |
Bio English Lit | 1,346 | 1,340 |
Men at the Bar | 1,115 | 1,115 |
National Bio 1901 | 1,033 | 1,033 |
Christian Bio | 921 | 921 |
Musicians | 702 | 702 |
Freethinkers | 546 | 546 |
Men of Time | 432 | 431 |
Chinese Bio | 245 | 245 |
English Artists | 223 | 223 |
Medical Bio | 109 | 109 |
Portraits and Sketches | 50 | 50 |
Who is who in China | 47 | 47 |
Greek Roman bio Myth | 37 | 37 |
Modern English Bio | 11 | 11 |
Who is who America | 10 | 10 |
Total | 60,403 | 60,355 |
Milestone #2: Candidate Relations Set
- Download the ranking of verbs only;
- Download the full ranking with frame data;
- Download the set of Frame Elements (FEs).
The ranking is composed of verbs discovered via the corpus analysis module. Each verb will trigger a set of Wikidata properties, depending on the number of FEs (cf. the set above).
Currently, a total of 173 distinct FEs are extracted. The final number of Wikidata properties will depend on a mapping, planned as per Work Package T8.1. We have already implemented a straightforward automatic mapping facility, based on string matching.
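A string-matching mapping of this kind could be sketched as follows. This is only an illustration under stated assumptions: the exact-match-after-normalization rule, the function names, and the sample FE names are hypothetical, while P19, P106, and P569 are existing Wikidata properties with the labels shown.

```python
def normalize(label):
    """Lowercase and unify separators, so 'Place_of_birth' matches 'place of birth'."""
    return " ".join(label.replace("_", " ").lower().split())


def map_frame_elements(frame_elements, property_labels):
    """Return {FE name: property ID} for FEs whose normalized name equals a label."""
    index = {normalize(label): pid for pid, label in property_labels.items()}
    mapping = {}
    for fe in frame_elements:
        pid = index.get(normalize(fe))
        if pid is not None:
            mapping[fe] = pid
    return mapping


# Illustrative inputs: the FE names are made up for this example.
properties = {"P19": "place of birth", "P106": "occupation", "P569": "date of birth"}
fes = ["Place_of_birth", "Occupation", "Manner"]

print(map_frame_elements(fes, properties))
```

FEs without a matching label, such as `Manner` here, are simply left out, which is where a manually curated mapping would have to step in.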
Ranking
bear
issue
work
print
play
live
include
exhibit
write
paint
serve
study
appoint
return
go
name
appear
call
leave
draw
lead
record
move
found
join
begin
teach
elect
remain
succeed
produce
act
enter
establish
add
create
continue
travel
win
visit
form
send
command
bring
attend
retire
promote
meet
kill
employ
Bonus Milestone: Semi-structured Development Dataset
During the corpus collection phase, we were asked (thanks Spinster!) to include sources with semi-structured data (cf. the list of selected sources), typically names and dates.
The result is a dataset that caters for the following Wikidata properties:
- birth name;
- given name;
- family name;
- pseudonym;
- honorific suffix;
- date of birth;
- date of death;
- sex or gender.
Sample Statements
Machine-readable statements are expressed in the QuickStatements syntax [18].
Correct Examples
Machine | Human |
---|---|
Q389547 P570 +00000001837-01-01T00:00:00Z/9 S854 "http://www.bbc.co.uk/arts/yourpaintings/artists/hodges-charles-howard-17641837" | According to BBC Your Paintings, Charles Howard Hodges died in 1837 |
Q17355708 P1477 "emma nicol" S854 "https://en.wikisource.org/wiki/Nicol,_Emma_(DNB00)" | According to the Dictionary of National Biography, Emma Nicol's birth name is "emma nicol" |
Q594729 P21 Q6581097 S854 "http://vocab.getty.edu/ulan/500110819" | According to the Union List of Artist Names, Anton Teichlein is a male |
Q215502 P742 "Morgan, Henry" S854 "http://collection.britishmuseum.org/id/person-institution/156902" | According to the British Museum, Henry Morgan's pseudonym is "Morgan, Henry" |
Q1562861 P569 +00000001939-08-21T00:00:00Z/11 S854 "http://www.nndb.com/people/103/000024031/" | According to the Notable Names Database, Clarence Williams III was born in 1939 |
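The machine-readable statements can be read programmatically. The sketch below assumes the whitespace-separated layout shown in the samples (item, property, value, then pairs of source property and source value) and decodes the trailing precision code of time values, where 9 stands for year, 10 for month, and 11 for day. The function names are hypothetical.

```python
import re
import shlex

TIME_VALUE = re.compile(r"^([+-])(\d+)-(\d{2})-(\d{2})T\d{2}:\d{2}:\d{2}Z/(\d+)$")
PRECISIONS = {9: "year", 10: "month", 11: "day"}


def parse_statement(line):
    """Split one statement: item, property, value, then (source property, value) pairs."""
    fields = shlex.split(line)  # whitespace-separated; double quotes group a value
    item, prop, value = fields[:3]
    sources = dict(zip(fields[3::2], fields[4::2]))
    return {"item": item, "property": prop, "value": value, "sources": sources}


def describe_time(value):
    """Decode a time value such as +00000001837-01-01T00:00:00Z/9, or return None."""
    match = TIME_VALUE.match(value)
    if match is None:
        return None
    sign, year, month, day, precision = match.groups()
    return {
        "year": int(year) * (-1 if sign == "-" else 1),
        "month": int(month),
        "day": int(day),
        "precision": PRECISIONS.get(int(precision), "other"),
    }


line = ('Q389547 P570 +00000001837-01-01T00:00:00Z/9 S854 '
        '"http://www.bbc.co.uk/arts/yourpaintings/artists/hodges-charles-howard-17641837"')
statement = parse_statement(line)
print(statement["item"], statement["property"], describe_time(statement["value"]))
```

The precision code explains why the first sample is read as "died in 1837" even though the value carries a full date: with year precision, month and day are placeholders.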
Questionable Examples
Machine | Human | Comments |
---|---|---|
Q3770981 P1477 "giusepe melani" S854 "http://vocab.getty.edu/ulan/500051662" | According to Union List of Artist Names, Giuseppe Melani's birth name is "giusepe melani" | the source is wrong (typo?) |
Q598060 P742 "Martyr Vermigli, Peter" S854 "http://collection.britishmuseum.org/id/person-institution/112005" | According to the British Museum, Peter Martyr Vermigli's pseudonym is "Martyr Vermigli, Peter" | debatable source assertion and Wikidata property label |
Q57297 P742 "E.W.L.T.; Ernesto Guglielmo Temple ; http://viaf.org/viaf/45102696" S854 "http://www.uni-stuttgart.de/hi/gnt/dsi2/index.php?table_name=dsi&function=details&where_field=id&where_value=5752" | According to the Database of Scientific Illustrators, Wilhelm Tempel's pseudonym is "E.W.L.T.; Ernesto Guglielmo Temple ; http://viaf.org/viaf/45102696" | wrong parsing of the source data |
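The last example suggests that some source fields pack several values into one string. One possible way to handle such a field, sketched here as an assumption rather than as the fix adopted by the project, is to split on the separator and keep names apart from links. The `split_multi_value` helper is hypothetical.

```python
def split_multi_value(raw, separator=";"):
    """Split a packed field into (names, links), dropping empty parts."""
    names, links = [], []
    for part in raw.split(separator):
        part = part.strip()
        if not part:
            continue
        if part.startswith(("http://", "https://")):
            links.append(part)
        else:
            names.append(part)
    return names, links


raw_value = "E.W.L.T.; Ernesto Guglielmo Temple ; http://viaf.org/viaf/45102696"
names, links = split_multi_value(raw_value)
print(names)
print(links)
```

Each name could then yield its own statement, while the link could be treated separately instead of ending up inside a pseudonym value.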
References Statistics
Domain | # references |
---|---|
adb.anu.edu.au | 6,262 |
collection.britishmuseum.org | 17,456 |
gameo.org | 238 |
munksroll.rcplondon.ac.uk | 418 |
archive.org | 1,166 |
collection.cooperhewitt.org | 366 |
sculpture.gla.ac.uk | 247 |
dictionaryofarthistorians.org | 103 |
en.wikisource.org | 5,923 |
rkd.nl | 2,416 |
structurae.net | 254 |
viaf.org | 387 |
vocab.getty.edu | 33,452 |
www.bbc.co.uk | 9,847 |
www.museothyssen.org | 240 |
www.newulsterbiography.co.uk | 501 |
www.nndb.com | 17,296 |
www.uni-stuttgart.de | 2,465 |
www.wga.hu | 1,577 |
yba.llgc.org.uk | 39 |
Total | 100,266 |
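Per-domain counts like the ones above can be derived from the statements themselves. The following sketch assumes the whitespace-separated layout of the sample statements and counts the domains of the URLs given with the S854 source property; the `count_reference_domains` function is hypothetical and is not the script used to produce the table.

```python
import shlex
from collections import Counter
from urllib.parse import urlsplit


def count_reference_domains(lines, source_property="S854"):
    """Count reference URLs per domain across statement lines."""
    counts = Counter()
    for line in lines:
        fields = shlex.split(line)
        # Source fields follow the first three (item, property, value) in pairs.
        for prop, url in zip(fields[3::2], fields[4::2]):
            if prop == source_property:
                counts[urlsplit(url).netloc] += 1
    return counts


sample = [
    'Q17355708 P1477 "emma nicol" S854 "https://en.wikisource.org/wiki/Nicol,_Emma_(DNB00)"',
    'Q594729 P21 Q6581097 S854 "http://vocab.getty.edu/ulan/500110819"',
    'Q3770981 P1477 "giusepe melani" S854 "http://vocab.getty.edu/ulan/500051662"',
]
domain_counts = count_reference_domains(sample)
for domain, count in domain_counts.most_common():
    print(domain, count)
```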
Community Outreach
Done
- Kick-off seminar at FBK, Trento, Italy
- Talk at Wikipedia's 15th anniversary, co-located with the event "Web 3.0, il potenziale del web semantico e dei dati strutturati", Lugano, Switzerland
- Semantic Web Journal submission
Planned
- Semantic Web Journal revision
- Hackathon at Spaghetti Open Data Reunion: http://www.spaghettiopendata.org/page/benvenut-sod16
- WikiCite 2016
- Poster at Wikimania 2016: https://wikimania2016.wikimedia.org/wiki/Posters
Finances
Please take some time to update the table in your project finances page. Check that you’ve listed all approved and actual expenditures as instructed. If there are differences between the planned and actual use of funds, please use the column provided there to explain them.
Then, answer the following question here: Have you spent your funds according to plan so far?
Yes. Compared to the initial plan, the only variance (below 10% of the overall budget) concerns the NLP developer's starting date, which was expected to coincide with the beginning of the project (11th January 2016), but was actually 1st February 2016. Consequently, the expense difference should move to the project leader budget item. The Finances page has been updated accordingly: items are converted to USD and rounded to fit the total budget.
Please briefly describe any major changes to budget or expenditures that you anticipate for the second half of your project.
The dissemination budget item may be lower than planned, due to the relatively low cost of the scheduled events. We expect the variance to be negligible: if needed, we will adjust the item and reallocate the difference to the training set creation or the project leader items.
Learning
The best thing about trying something new is that you learn from it. We want to follow in your footsteps and learn along with you, and we want to know that you are taking enough risks to learn something really interesting! Please use the below sections to describe what is working and what you plan to change for the second half of your project.
What are the challenges
What challenges or obstacles have you encountered? What will you do differently going forward? Please list these as short bullet points.
Challenges
Almost every challenge is technical, and most of them stem from NLP. In order of decreasing impact:
- input corpus
  - a relatively big input corpus from several sources introduces the need to cope with high language variability;
  - certain documents are written in old English, others stem from the OCR output of a paper scan, etc.
- target lexical database
  - it is unlikely that FrameNet would be a perfect fit for the data we aim at generating;
  - this especially applies to the crowdsourcing part, since labels and definitions are written by expert linguists, but shown to non-expert workers.
- primary sources tool
  - contributing to the maintenance of a third-party resource with generally low development activity can be time-consuming;
  - it entails various tasks, from understanding possibly undocumented source code, to nudging the maintainers to address issues, all the way to accessing the machine that hosts the tool.
- scalability
  - it should always be taken into account when writing code.
Going Forward
Keeping a strong pragmatic attitude in mind:
- praise the unexpected
  - we would like to give higher priority to unexpected findings that may have an overall positive impact;
  - specifically, we will invest more time in the improvement of the semi-structured dataset, which may cater for a huge amount of unsourced Wikidata statements.
- adapt as needed
  - the work package can be modified to suit new tasks, as long as it does not prevent the implementation of planned ones.
- let people play with the data
  - the first half of the project was devoted to the back-end development. We expect to engage more and more users once we are able to generate data.
What is working well
What have you found works best so far? To help spread successful strategies so that they can be of use to others in the movement, rather than writing lots of text here, we'd like you to share your finding in the form of a link to a learning pattern.
Next steps and opportunities
What are the next steps and opportunities you’ll be focusing on for the second half of your project? Please list these as short bullet points. If you're considering applying for a 6-month renewal of this IEG at the end of your project, please also mention this here.
Technical steps
- to take special care in running pilot crowdsourcing experiments;
- to build a fully crowdsourced training set;
- to reach a satisfactory performance in the automatic classification;
- to find a suitable mapping to Wikidata for the final datasets.
Opportunities
- presentations and networking at the scheduled events are crucial to engage data donors from third-party Open Data organizations;
- during the development of StrepHit, the team is led to contribute to external software via standard social coding practices. These are tremendous opportunities that may have a great impact: for instance, we have submitted a pull request to the popular Python documentation generator Sphinx, in order to support the MediaWiki syntax.
Renewal
We are definitely considering a renewal of this IEG to extend StrepHit capabilities towards widespread languages other than English (the only language supported by the current implementation). The project leader has both the linguistic and the NLP skills needed to plan the implementation for Spanish, French, and Italian.
Grantee reflection
We’d love to hear any thoughts you have on how the experience of being an IEGrantee has been so far. What is one thing that surprised you, or that you particularly enjoyed from the past 3 months?
- The IEG/IdeaLab hangout sessions during the project proposal period were really useful and motivating;
- the Wikimedia community seems to have a silent minority rather than a silent majority: whenever we asked for feedback, we received constructive answers;
- monthly reports are a nice way to keep fine-grained track of the progress;
- iterative planning is essential to face everyday technical issues (mostly related to code libraries and Web service downtime);
- the team should always take into account completely unexpected changes.