Grants talk:IEG/StrepHit: Wikidata Statements Validation via References/Renewal
Feedback #1
I have been testing Primary Tools with StrepHit for a few days now. I think it's very promising and in some cases very helpful.
Here are a few things that would help me use it more often:
- Primary Tools interface
- Detect an existing value and suggest references to this value instead of creating a new claim.
- Have a way to change the property before approving. Or have a gadget to move a claim from a property to another.
- StrepHit data
- Some trouble identifying the right item.
- The tool makes suggestions for unusual properties and/or unusual values.
For both, have a look at the suggestions for William Anderson (Q15999545) :
- Many birth dates from ADB but none matches, as Q15999545 is Canadian, not Australian. Maybe the tool should first suggest linking an item to external resources. And once one link has been approved, suggest claims based on the approved resources.
- Unusual properties : I fail to understand why P1542(Cause of) is suggested for a human being .. or P571(Inception) or P1444(destination point).
- Unusual values : Nickname:Hospital ? Really ??
--Melderick (talk) 16:36, 26 July 2016 (UTC)
- @Melderick: thank you for all the insightful comments.
- Concerning the primary sources tool, they definitely support the ongoing discussion (cf. for instance wikidata:Wikidata:Requests_for_comment/Semi-automatic_Addition_of_References_to_Wikidata_Statements#Edit_the_claim_before_approval).
- With respect to StrepHit data, all the claims are generated automatically by a complex natural language processing pipeline. Let me explain the main reasons for the errors you pointed out:
- the wrong birth dates from ADB attached to William Anderson (Q15999545) entail a wrong disambiguation of the Wikidata Item. In other words, it is very hard to automatically understand which of the 142 (ouch!) William Anderson (cf. https://www.wikidata.org/w/index.php?title=Special:Search&profile=default&fulltext=Search&search=william+anderson&searchToken=e634a92olupbiyfsv13efdb0g) the ADB source is talking about. StrepHit tried to guess, but failed;
- the unusual properties are due to the data model mapping. We use Frame Semantics (events) to understand what is happening in a given fragment of text, and this is difficult to translate into Wikidata claims. For instance, if StrepHit generates a claim with d:Property:P1444, it means it has found a travelling event, where the subject has travelled from some origin to some destination. We couldn't find any way to model this in Wikidata, and we believed the generation of a plain claim was the best trade-off;
- the unusual values are likely to be errors of the entity linking module, namely the facility that assigns a Wikipedia link to a fragment of text.
- I hope this was clear! --Hjfocs (talk) 10:41, 2 August 2016 (UTC)
- P.S.: please consider adding a specific section for this discussion topic.
- "We couldn't find any way to model this in Wikidata, and we believed the generation of a plain claim was the best trade-off" I don't think that's a good heuristic. Data quality is very important and we already have many more statements by the tool than we have reviews. In cases where there's no obvious way to model the data in Wikidata I would prefer that no statement is created. It makes sense for the tool to be conservative. ChristianKl (talk) 12:14, 29 August 2016 (UTC)
- @ChristianKl: the first StrepHit datasets are experimental, and the data model mapping was one of the toughest challenges during the development of the pipeline. While I totally agree with your comment, I have given priority to the claim generation in the first release. This was meant to maximize feedback, so thanks a lot for yours! It is noted for the next release. --Hjfocs (talk) 12:29, 6 September 2016 (UTC)
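The conservative behaviour requested above could look roughly like the following. This is only a minimal sketch, not StrepHit's actual code: the mapping table, the frame and role names, and the function names are all hypothetical and chosen for illustration. The idea is that an extraction without a vetted mapping produces no claim at all, instead of a plain fallback claim.

```python
from typing import Optional

# Hypothetical mapping from (frame, role) pairs to Wikidata properties.
# Only pairs with an agreed-upon modelling are listed; the entries are
# illustrative and are not taken from the StrepHit code base.
FRAME_ROLE_TO_PROPERTY = {
    ("Being_born", "Time"): "P569",   # date of birth
    ("Being_born", "Place"): "P19",   # place of birth
}


def map_to_property(frame: str, role: str) -> Optional[str]:
    """Return a property ID, or None when no vetted mapping exists."""
    return FRAME_ROLE_TO_PROPERTY.get((frame, role))


def generate_claims(subject, extractions):
    """Keep only extractions whose (frame, role) pair has a vetted mapping."""
    claims = []
    for frame, role, value in extractions:
        prop = map_to_property(frame, role)
        if prop is None:
            continue  # conservative: skip rather than emit a plain claim
        claims.append((subject, prop, value))
    return claims
```

With such a filter, a travelling event with no agreed modelling would simply be dropped until the community settles on how to represent it.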
Feedback #2
It says "All the numerical success metrics apply to the primary sources tool and can be assessed via the status page". I don't see how one could write that. Based on that page, the bulk of the edits seem to be done by users like d:Special:Contributions/Crazy1880. Looking at the edits copied below, somehow I doubt much checking was involved and a mere import bot could have done the same. Maybe it's worth looking into the actual StrepHit dataset and attempting to determine what statements were imported, whether some check could have been involved, and what the spread on a user basis is. This unless one thinks Crazy1880's edits are representative.
[Collapsed table: sample edits]
--Jura1 (talk) 09:04, 29 July 2016 (UTC)
- @Jura1: thank you for the valuable feedback.
- The indicated measures are strictly quantitative. Hence, they do not take into account the quality of the approval edits, since the main goal of this renewal is to foster the use of the primary sources tool.
- However, I see your point and would be glad to include more detailed qualitative metrics: could you help me understand how the points you raised can be implemented? Specifically, how can we compute:
- whether some check was made before approval?
- the spread on a user basis?
- Cheers, --Hjfocs (talk) 11:28, 2 August 2016 (UTC)
- P.S.: please consider adding a specific section for this discussion topic.
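One possible way to quantify the "spread on a user basis" asked about above, as a minimal sketch: it assumes a list of approving user names, one entry per approved statement, can be exported from the tool (how to obtain that list is not shown here, and the function names are hypothetical).

```python
from collections import Counter


def approval_spread(approvals):
    """Return (user, count, share) tuples, most active user first.

    `approvals` is an iterable of user names, one per approved statement.
    """
    counts = Counter(approvals)
    total = sum(counts.values())
    return [(user, n, n / total) for user, n in counts.most_common()]


def top_share(approvals, k=1):
    """Fraction of all approvals made by the k most active users."""
    spread = approval_spread(approvals)
    return sum(share for _, _, share in spread[:k])
```

A high `top_share` for a small `k` would indicate that a handful of users account for most approvals, which is the concentration pointed out above. It says nothing about whether a check was made before each approval.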
- "Primary Sources" is actually a nice tool. It works mostly and needs maintenance as any tool. I think the main problem with the tool isn't necessarily how it works, but what it's being used for.
- It would probably work well for gathering statements from random sources and suggesting these for checking before addition. Supposedly that is what StrepHit also aims (or aimed) for.
- Initially the tool was used for that: G provided us with a lot of statements and a series of links for these. A problem we found there was that many of the statements may have been good, but the links sometimes didn't even mention the fact.
- Later the tool was suggested for bulk import of various sources (Freebase notably, but also Wikipedia). The first didn't exactly result in as many new statements as some of us had hoped for, the latter was mostly done by bot, but also some other tools. I don't think it is suitable for bulk import of specific sources: either these (or parts thereof) are sufficiently reliable that we also want a statement that they may have conflicting data or they are sufficiently bad that we don't want them at all. There are several other tools that work for bulk imports (Quickstatements, Mix'n'match, Pasleim's, some bots, etc) and it may be worth developing these for such imports instead.
- Given that we may want a working "Primary Sources"-tool for suitable statements, the question is how to improve this tool. G doesn't seem to have developed it much recently and WMDE should care for its maintenance as they took it on not too long ago. I'm not aware of much they may have done since and what their plan for this may be. Your software development proposal could cover this gap, but maybe you should attempt to formulate at least one or two goals that insert it into WMDE's development going forward.
- BTW the above sample statements should probably be removed as we mostly add them to items for persons instead. --Jura1 (talk) 10:36, 3 August 2016 (UTC)
- Moin Moin Jura1, please tell me, what are the problems with my edits? mfg --Crazy1880 (talk) 17:41, 22 August 2016 (UTC)
- @Jura1: Yes, the code repository has been recently migrated from G to Wikidata [1]. StrepHit has devoted lots of efforts to smooth that migration: Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References/Renewal#Primary_Sources_Tool_2. Hence, I believe this renewal proposal is in sync with WMDE's development roadmap, which has already stated its support (cf. the first endorsement, i.e., Lydia's: Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References/Renewal#Endorsements). Most recently, people have been adding a considerable set of issues on the repository, [2] while no progress is being made on the code side (the latest commit was some months ago [3]). I will definitely discuss this aspect with WMDE's management. Cheers --Hjfocs (talk) 13:07, 6 September 2016 (UTC)
Examples
I thank User:Hjfocs for the continuous work. I was wondering if it would be possible to point to some concrete examples. I have tried the Primary Sources tool, and it was never clear to me what was Google's contribution and what was the contribution of StrepHit. — Finn Årup Nielsen (fnielsen) (talk) 16:41, 15 August 2016 (UTC)
- @Fnielsen: thank you for your comment. You can have a look at Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References/Final#Sample_Claims and Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References/Midpoint#Sample_Statements for some StrepHit examples. Google's contribution was to make Freebase datasets available, while StrepHit released its own ones. You can activate either through the tool preferences. Please follow the instructions for full details: wikidata:Wikidata:Requests_for_comment/Semi-automatic_Addition_of_References_to_Wikidata_Statements. Hope this helps! --Hjfocs (talk) 13:26, 6 September 2016 (UTC)
Hiring an additional person
From reading the Endorsements it seems like there are a lot of people who are in favor of the project. On the other hand, getting the usability, the data quality of StrepHit and its scope up seems like a lot of work. Have you thought about asking for more money to hire an additional person? ChristianKl (talk) 12:59, 29 August 2016 (UTC)
- @ChristianKl: I agree, and will discuss whether this is possible with the WMF program officers. --Hjfocs (talk) 13:29, 6 September 2016 (UTC)
General policy against providing statements without references
Currently the tool often suggests claims without providing any references for them. I don't think that's appropriate behavior for the Primary sources tool. Often I would also like the tool to provide more information than just a link. It should fill "stated in" and "retrieved on". It might also be very helpful to explicitly state in the references that the reference was created by the tool, to provide better provenance. There could also be a bunch of different categories of how the tool creates claims that are specifically mentioned. I would like to be able to write a Query that's aware of the fact that a statement is created by StrepHit. When working with public domain sources from WikiSource I would appreciate usage of "quote (P1683)" with the sentence that the tool considers to contain the relevant information. ChristianKl (talk) 14:03, 29 August 2016 (UTC)
- @ChristianKl: Totally agree with respect to claims with no references. StrepHit was first conceived to always guarantee at least one reference: Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References#The_Problem. Freebase data contains a lot of unsourced claims, also due to the (collaboratively curated) URL blacklist: wikidata:Wikidata:Primary_sources_tool/URL_blacklist. Cheers --Hjfocs (talk) 14:25, 6 September 2016 (UTC)
- Even for claims from Freebase, Freebase could be given as a source with "stated in : Freebase". I see no reason not to provide that information. ChristianKl (talk) 19:50, 6 September 2016 (UTC)
- I am of the same opinion, and this is absolutely doable from a technical perspective. --Hjfocs (talk) 13:50, 7 September 2016 (UTC)
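To make the request in this section concrete, here is a rough sketch of the provenance a suggested reference could carry: "stated in" (P248), "reference URL" (P854), "quote" (P1683) and "retrieved" (P813). The function and its parameters are hypothetical, and the returned mapping is a simplification for illustration, not the Wikibase JSON serialization nor the tool's actual data format.

```python
from datetime import date


def build_reference(stated_in=None, url=None, quote=None, retrieved=None):
    """Assemble a simplified reference as a property-ID -> value mapping.

    This is NOT the Wikibase JSON serialization, only an illustration of
    which pieces of provenance a reference could carry.
    """
    if stated_in is None and url is None:
        raise ValueError("a reference needs at least a source item or a URL")
    reference = {}
    if stated_in is not None:
        reference["P248"] = stated_in      # stated in
    if url is not None:
        reference["P854"] = url            # reference URL
    if quote is not None:
        reference["P1683"] = quote         # quote: the supporting sentence
    reference["P813"] = (retrieved or date.today()).isoformat()  # retrieved
    return reference
```

Refusing to build a reference without any source mirrors the point made above: a claim with no reference should not be suggested, and even Freebase-derived claims could pass the Freebase item as `stated_in`.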
Feedback #3
Hi Hjfocs. Thanks for rethinking the renewal request and updating it accordingly. Here is some more feedback:
- First off, great to see that you're focusing on Primary Sources Tool. The development of the tool is something the Wikidata community strongly supports and taking the tool to the next level, given how the success of StrepHit at this point relies on it to some extent, is a natural step.
- In Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References/Renewal#Scope you've listed some broad and then very specific tasks to be done at the same time. For example, User Interface cannot be separated from the user experience, and user experience is related to Filtering facility. Although it is great if you can specify exact tasks at this early point in the process, it is also fine if you define the set of requirements the tool should meet, and define tasks that help you achieve those requirements later on.
- If you agree with the statement above, you have at least two high level requirements: Technical requirements and User experience (for those who will approve/disapprove references) requirements. (You may have user experience requirements for the admins of the tool, too, as for example, I imagine that someone should ingest the data to the primary sources tool and the work-flow of that person will be different than that of the user who accepts/rejects references.) Now for each of these requirements, you want to define sub-requirements. For example, for Technical requirements: a healthy developer community, documentation on x, y, and z, etc. can be defined as requirements. (btw, it's important that you have agreement with key stakeholders about the requirements you set, I'm sure you're aware of this. :)
- Re Measures of success: I think building a healthy developer community around the tool should be one of the metrics you measure. At the end of the six months period, you should have at least 2 and ideally 3-4 people in the volunteer community who contribute to your code (not so minimally). This will assure some level of continuity, that will help the tool not rely on 1-2 people which is important for a tool with a mission of Primary Sources Tool.
- Re Measure of success again: While I think it would be great to have 30 new primary source tool users at the end of the six month period, I think you have a lot to achieve in the coming months and your focus should be to develop and expand the tool in ways that are sustainable and user friendly. This is already a lot of work. I would focus heavily on development and user testing, and celebrate if I'd have 30+ users, but imho, that should be a stretch goal.
- Re Dissemination in Budget breakdown: I suggest that you consider going to Wikimedia Developer Summit and even submit a proposal for one of the sessions in here. I can imagine that the venue can help you find other like-minded people to help you with the design and development of the tool. You are also going to be working with one of the very important components of the future of Wikidata, and it's important for the Dev community to have more chances to be aware of it and get involved in it.
- As I mentioned earlier, spending the next six months on development of the tool means you doing less research on StrepHit and I'm really split about that. :) But I understand that this is an important piece of work that needs to be done. Good luck with it, and do ping if you want to brainstorm more about the above.
--LZia (WMF) (talk) 20:53, 14 October 2016 (UTC)
- @LZia (WMF): I am totally grateful for your constructive comments, which are really helpful for the improvement of this proposal. I actually agree with all of them, and have updated the proposal page accordingly.
- One critical point: since the workload seems too high for a 6-month scope, I have redesigned the tasks package to fit into a 1-year plan, which also includes more research on StrepHit (I definitely want to keep up with it!).
- Cheers, Hjfocs (talk) 18:13, 26 October 2016 (UTC)
- @Hjfocs: anytime, and I'm happy to see that you've requested a longer period of time for working on the project. --LZia (WMF) (talk) 20:16, 28 October 2016 (UTC)
[POSTPONED] Review Period through 11/7/16
Hello Hjfocs,
Thank you for submitting this renewal request. We are pleased that you are interested in continuing this project!
Starting now, the Project Grants committee (which has replaced the IEG Committee) and the broader community will have 10 days to share any further thoughts on this request before we move forward with a final decision on whether to renew this IEG under the Project Grants program. I'll be notifying the Project Grants committee and I expect you'll notify any relevant areas of the community that haven't already been made aware of your renewal request.
I've solicited feedback from Dario Taraborelli, Leila Zia and Lydia Pintscher about this proposal and all have indicated they believe it is a strong proposal that they would like to see funded. These are strong endorsements and indicate the value of your project. In the wake of the review period, I will be making a final decision on this renewal on Monday, November 7 (the first business day after the close of the review period).
Thanks again for all your efforts so far!
Warm regards,
Request for postponement of the formal review period
Based on the latest feedback and given the considerable workload, I would like to rethink the whole work package. Hence, I kindly request to postpone the formal review period until my proposal is revised. I will tackle that as soon as possible, and thank the Project Grants Committee in advance for their patience.
Best,
Hjfocs (talk) 13:42, 26 October 2016 (UTC)
Hjfocs, thanks for posting this request. We will postpone the review period until you give us the green light. Let me know if you need any further support. --Marti (WMF) (talk) 18:43, 26 October 2016 (UTC)
Proposal finalized
@Mjohnson (WMF): this is to confirm my proposal has reached its final version. I have redesigned the work package both to fit the primary sources tool workload and to reserve more research effort for StrepHit.
As you requested, I have also notified further relevant communities, namely the strategic partners: Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References/Renewal#Notification.
The notification to the WMF Design Research mailing list is currently awaiting moderator approval: I will post the link as soon as my message gets into the mailing list archives.
Done, Hjfocs (talk) 17:16, 26 November 2016 (UTC)
I will never stop thanking you for your invaluable support.
Best,
Review Period through 12/27/16
Hello Hjfocs
Thank you for submitting this renewal request. We're glad to have the opportunity to continue to partner with you and the StrepHit project.
The response to both your work completed and your work proposed for this project has been very positive. Your outcomes, and your proactive collaboration with stakeholders, have earned the respect of everyone whom I've asked to review your report and your renewal request. Both Dario Taraborelli and Leila Zia have supported funding this request.
Starting now, I'm giving the Project Grants committee (which will review this request in lieu of the disbanded IEG Committee) and the broader community 10 days to share their thoughts on this request before we move forward with a final decision on whether to renew this IEG. I'll be notifying the Project Grants committee and I expect you'll notify any remaining relevant areas of the community you haven't already notified. Thanks again for all your efforts so far!
In the wake of the review period, I will be making a final decision on this renewal on December 28.
Cheers, --Marti (WMF) (talk) 02:27, 18 December 2016 (UTC)
Comments of Ruslik0
The project seems to be an important step in the direction of better utilization of already existing databases in Wikidata. I have not used the Primary Source Tool before but when I tried it I encountered some issues that I think need to be addressed. I cannot call it completely unusable but it needs work. So, I support this extension of the initial IEG grant though I have one question. Currently the front-end on Wikidata is mainly maintained by User:Tpt. Are you planning to employ this user in this project? Ruslik (talk) 17:22, 25 December 2016 (UTC)
- @Ruslik0: thanks for your feedback. I have been working intensively with Tpt during the prototype phase of StrepHit. Tpt - together with Denny, Tomayac and SebastianSchaffert - is part of the team that developed the very first version of the primary sources tool. He is still involved in the project, so yes, I expect to collaborate with him a lot.
- Best, Hjfocs (talk) 13:14, 27 December 2016 (UTC)
Renewal approved
Hello Hjfocs, with apologies for the delay over the holidays, I'm approving this renewal for 45,204 €. Congratulations on your progress to date. I look forward to seeing your project evolve into the next stage of work. Our grants administrator, Jtud (WMF), will be in touch soon to set up your renewed agreement.
Warm regards,
Open Position: Web Developer for StrepHit
About the Wikimedia Foundation
Do you really need a description? We are talking about the organization that hosts Wikipedia!
About StrepHit
StrepHit is an artificial intelligence that reads free text from Web sources, understands it and feeds Wikidata, Wikimedia's knowledge base.
Its development is currently funded by the Wikimedia Foundation, U.S.A.
Have a look here for a detailed explanation:
Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References
Your Role as a Web Developer
You will work closely with Hjfocs, the project leader, and the huge international community of Wikidata users. StrepHit is useless if the tool that displays its data is not adopted. It is called the primary sources tool and was first developed by Google: it's your turn now to get your hands on it.
Responsibilities
You will be the one who takes the tool to the next level, making it really usable for hundreds of thousands of Wiki users.
Here is a list of tasks that you should handle:
Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References/Renewal#Primary_Sources_Tool
Requirements
- You are a B.Sc. or M.Sc. student in Computer Science or related field, or have equivalent experience;
- you still don't know the subject of your thesis and you want to write an amazing one. The project leader will guide you in that;
- you know JavaScript, which doesn't mean you have used jQuery in one project;
- you are a quick-thinker, a problem-solver, who feels comfortable writing code;
- you take code quality seriously: test-driven development, code review, linters, and so on.
Nice to Have
- You are familiar with the machinery behind Wikimedia projects, mainly PHP and JavaScript;
- you really know JavaScript;
- you speak Python in your spare time;
- you are familiar with responsive design and mobile-friendliness;
- you are aware of quirks of modern and less modern browsers;
- you would like to learn, challenge yourself, improve and broaden your skill set.
How to Apply
Send your application to Marco (fossati spaziodati.eu) with subject "strephit application".
Please make sure you include:
- your CV (any format is fine);
- a statement why you'd like to take this position, what you expect and what you think you could bring to the project;
- your GitHub account, some code you've written, an open source project you contributed to, or at least a link to your work that you like (it doesn't matter if it's completely unrelated like a videogame, an art project or whatever else). Priority will be given to applications that meet this criterion.
Can't wait to hear back from you! --Hjfocs (talk) 12:22, 3 February 2017 (UTC)
New tabs
Hello Hjfocs,
I just realized that I failed to set up your tabs for your renewal project after approving your project. I apologize for that oversight. It's done now and so your original request and your renewal request will both show in the same landing page on Meta.
Let me know if you have any questions.
Kind regards,