Research:Identification of Unsourced Statements/API design research

Tracked in Phabricator: Task T228816
Created: 21:59, 31 July 2019 (UTC)
Duration: August 2019 – September 2019
This page documents a completed research project.


We will interview tool developers and subject matter experts to understand how to deliver the output of the Citation Needed and Citation Reason models in a form that makes it easy for them to build tools that help Wikipedia editors identify and address unsourced statements in Wikipedia articles.

Research goals


This research project is focused on identifying design requirements for a service to make the classifications of the Unsourced Statement models available to volunteer tool developers within the Wikimedia Movement. We hope that these tool developers, in turn, develop tools (e.g. bots, gadgets, web applications) that editors can use to identify and correct unsourced statements.

Labelled data like that produced by these models can be made available to developers in a variety of ways, such as

  • regularly scheduled data dumps
  • a live, queryable database
  • a RESTful API endpoint
  • code libraries that allow developers to host and run the models on their own

The results of this work will:

  1. inform the choice of data service we decide to develop and maintain
  2. identify typical use cases for the data service
  3. inform the specifications and documentation of that service, and
  4. identify further directions for research

Regular Wikipedia editors generally only interact with machine learning models via software tools developed by other volunteers or Wikimedia Foundation staff. The majority of these tools are developed by volunteers, not by WMF product teams. Therefore, our key stakeholder for this research is volunteer tool developers.

Among tool developers, we are particularly interested in speaking to:

  1. The developer of the CitationHunt tool
  2. Volunteers who develop tools for Italian Wikipedia
  3. Volunteers who develop tools to support The Wikipedia Library

Timeline

September 9-13
  • Draft interview protocol (for volunteer interviewees) and submit it to WMF Legal for review
  • Meet with Aaron Halfaker to discuss ORES API design and developer needs, and to identify potential interviewees
  • Summarize notes from meeting with Aaron Halfaker
  • Begin reaching out to interviewees (target: 3 interviews)

Collaborators: WMF Legal; WMF Scoring Platform

September 16 - 20
  • finalize interview protocol based on Legal input
  • begin conducting interviews
  • continue reaching out to interview candidates and scheduling interviews as needed
  • summarize notes from interviews conducted

Collaborators: WMF Legal; WMF Scoring Platform

September 23 - 30
  • finish conducting interviews
  • finish summarizing interview notes
  • publish findings and recommendations, and suggest next steps for research and development

Deliverables: Report of findings, recommendations, and implications (wiki page)

November 2019-?
TBD

Policy, Ethics and Human Subjects Research


Interviews and surveys will be conducted according to the Wikimedia Foundation's policies for informed consent. All non-public data gathered during this research (including interview recordings and notes) will be shared and stored in accordance with the Wikimedia Foundation's data retention guidelines.

Desk research


Machine learning as a service API docs


Research on API design

  • Murphy, L., Alliyu, T., Macvean, A., Kery, M. B., & Myers, B. A. (2017). Preliminary Analysis of REST API Style Guidelines. PLATEAU’17 Workshop on Evaluation and Usability of Programming Languages and Tools, 1–9. Retrieved from http://www.cs.cmu.edu/~NatProg/papers/API-Usability-Styleguides-PLATEAU2017.pdf
  • Farooq, U., Welicki, L., & Zirkler, D. (2010). API Usability Peer Reviews: A Method for Evaluating the Usability of Application Programming Interfaces (pp. 2327–2336). ACM. https://doi.org/10.1145/1753326.1753677
  • Rama, G. M., & Kak, A. (2013). Some structural measures of API usability. Software - Practice and Experience. https://doi.org/10.1002/spe.2215
  • Stylos, J., Graf, B., Busse, D. K., Ziegler, C., Ehret, R., & Karstens, J. (2008). A case study of API redesign for improved usability. Proceedings - 2008 IEEE Symposium on Visual Languages and Human-Centric Computing, VL/HCC 2008, 189–192. https://doi.org/10.1109/VLHCC.2008.4639083
  • Watson, R. B. (2012). Development and application of a heuristic to assess trends in API documentation, 295. https://doi.org/10.1145/2379057.2379112
  • Vanschoren, J., van Rijn, J. N., Bischl, B., & Torgo, L. (2014). OpenML: networked science in machine learning. https://doi.org/10.1145/2641190.2641198
  • Petrillo, F., Merle, P., Moha, N., & Guéhéneuc, Y. G. (2016). Are REST APIs for cloud computing well-designed? An exploratory study. Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 9936 LNCS, 157–170. https://doi.org/10.1007/978-3-319-46295-0_10
  • Glassman, E. L., Zhang, T., Hartmann, B., & Kim, M. (2018). Visualizing API Usage Examples at Scale, 1–12. https://doi.org/10.1145/3173574.3174154

Interviews


Between September 11 and 23, I conducted six interviews for this project. Interview participants were a mix of direct stakeholders (i.e. tool developers) and subject matter experts. Information about the interview participants is available in the table below.

Interview participant information

Participant  Notes
p1           WMF Technology Staff; machine learning model and API developer
p2           WMF Community Engagement Staff and English Wikipedia editor; tool maintainer and program officer for The Wikipedia Library
p3           Academic researcher; ORES platform user; tool developer
p4           Wikipedia editor; tool developer
p5           Wikipedia editor; tool developer
p6           Wiki Education Foundation staff; English Wikipedia editor; tool developer; ORES platform user; product manager for the WikiEdu Dashboard

Results



Real-time vs. retrospective classification


Discussions with ORES developers and end users, as well as with potential users of Citation Needed recommendations, highlight two primary use cases for this data service: real-time classification of new edits, and retrospective classification of statements that already exist within articles. Different service architectures have different implications for these use cases.

An event stream architecture that parses and classifies incoming edits would be useful for patrollers and patrolling tools that monitor recent changes to a wiki to guard against vandalism and ensure compliance with relevant policies. However, event streams do not support archiving of content or predictions, so this architecture would not be useful for quality improvement work focused on existing articles. In contrast, a data dump architecture could give consumers classifications for (potentially) any statement that existed on the wiki at the time the dump was generated; however, classifications for statements added to articles since the last dump was generated are not accessible, and the update cadence of dumps can never support true real-time use cases. A production-level REST API, in the model of ORES, can provide (near) real-time classification as well as classification of existing and historical content. Finally, allowing consumers to download and run the citation needed models on their own infrastructure also supports, at least in theory, both real-time and retrospective classification, provided that the consumer connects their locally hosted platform to existing public WMF-provided streams, APIs, and/or dumps.

Pros and cons of different service architectures


Event streams

Pros
high performance real-time access
Cons
requires substantial engineering work to develop, plus dedicated on-demand engineering support and hardware and software resources to maintain; no retrospective classification; the current event streams report edit events, which may not correspond to whole sentences (the unit of analysis for the citation needed models) and do not provide other metadata, such as section headings, that are necessary for delivering predictions.
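
To make that gap concrete, the minimal Python sketch below consumes the public Wikimedia EventStreams recentchange feed. The stream URL is real; the scoring step is only a placeholder comment, because a consumer would still need to fetch and sentence-split the changed text before any citation needed model could score individual statements.

  import json
  import requests

  # Public Wikimedia EventStreams feed of recent changes (server-sent events).
  STREAM_URL = "https://stream.wikimedia.org/v2/stream/recentchange"

  with requests.get(STREAM_URL, stream=True) as response:
      for line in response.iter_lines():
          # SSE data lines are prefixed with "data: "; skip comments,
          # event ids, and keep-alive lines.
          if not line or not line.startswith(b"data: "):
              continue
          event = json.loads(line[len(b"data: "):])
          if event.get("wiki") == "enwiki" and event.get("type") == "edit":
              # A citation needed consumer would have to fetch the new
              # revision, diff it against the old one, and split the added
              # text into sentences before scoring; the stream itself
              # provides none of that.
              print(event["title"], event["revision"]["new"])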

Self-hosting models

Pros
low WMF development and maintenance cost; high flexibility and configurability for consumers with the right skill sets
Cons
consumers must have expertise in implementing, maintaining, testing, and tuning machine learning models; consumers must have access to hardware (servers) capable of running models as well as all external libraries or software components; consumers must develop their own methods for connecting with other public WMF data sources and services.
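
To illustrate the "glue code" burden that falls on self-hosting consumers, here is a minimal sketch of wiring a locally hosted model to a public WMF data source. The article-text request uses the real MediaWiki action API; the predict_citation_needed() function is a placeholder standing in for whatever interface a consumer builds around the published model code.

  import re
  import requests

  API_URL = "https://en.wikipedia.org/w/api.php"


  def predict_citation_needed(sentence):
      """Placeholder for a locally hosted citation needed model."""
      return 0.0  # a real consumer would load and run the published model here


  # Fetch the plain-text extract of one article from the public MediaWiki API.
  params = {
      "action": "query",
      "prop": "extracts",
      "explaintext": 1,
      "titles": "Example",
      "format": "json",
      "formatversion": 2,
  }
  page = requests.get(API_URL, params=params).json()["query"]["pages"][0]

  # Naive sentence splitting; a production consumer would need something
  # more robust than a regular expression.
  for sentence in re.split(r"(?<=[.!?])\s+", page.get("extract", "")):
      score = predict_citation_needed(sentence)
      if score > 0.9:
          print(score, sentence)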

Data dumps

Pros
low WMF development and maintenance cost; relatively accessible to a broad range of data consumers
Cons
no real-time classification; dump size may present data processing challenges; if dumps are updated only infrequently, the utility of the classifications is limited.

REST API

Pros
near real-time and retrospective/historical data access; high performance; may be able to leverage existing scoring platform infrastructure and API design patterns
Cons
requires substantial engineering work to develop, plus dedicated on-demand engineering support and hardware and software resources to maintain; cannot provide real-time classifications as easily as a stream-based service; maintaining performance under high request volume may require extensive caching.
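
As a rough illustration of the request/response pattern such an API could offer, here is a sketch in the spirit of ORES. The endpoint, parameter names, and response fields below are hypothetical; only the general shape (score a revision on demand and get back per-sentence predictions) is borrowed from the ORES design.

  import requests

  # Hypothetical citation needed scoring endpoint (example.org is a placeholder).
  ENDPOINT = "https://citation-needed.example.org/v1/scores/enwiki"

  response = requests.get(
      ENDPOINT,
      params={"revids": "123456789", "models": "citation-needed|citation-reason"},
  )
  response.raise_for_status()

  # Hypothetical response shape: one record per sentence in the revision.
  for sentence in response.json().get("sentences", []):
      print(
          sentence["citation_needed_score"],
          sentence.get("citation_reason"),
          sentence["text"],
      )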

Novel use cases


During the course of the interviews, participants suggested a variety of new use cases for Citation Needed and Citation Reason models.

Bots
  • A bot that adds sentence-level citation needed templates to articles based on model predictions (the template can store prediction metadata, such as score and reason, in its 'reason' field)
  • A bot that adds article- or section-level citation message box templates, based on a cumulative total or average of sentence-level citation needed predictions
  • A bot that flags existing sentence-level citation needed templates for human review, and possible removal, based on low 'citation needed' prediction values
  • A bot that generates on-wiki worklists of sentences especially in need of citations, or of articles with many such sentences, for WikiProjects and contribution campaigns
MediaWiki UI
  • An extension that highlights uncited sentences that are especially likely to need a citation (helps readers make judgements about the credibility of the information they are reading, similar to WikiTrust[1])
  • A VE add-on that highlights newly added sentences within the editing interface to help contributors prioritize where to add citations when writing new content (may be especially useful for new editors who don't know citation norms)
  • A task recommendation pane on the Newcomer home page that surfaces sentences that need citations to new editors who are learning the ropes of Wikipedia
Web applications
  • A dashboard that allows groups of editors and editing campaign organizers to assign and coordinate citation-adding work within a set of articles, monitor contributions, and track progress towards goals
  • A testbed application that allows anyone to paste a sentence (and surrounding section content) into a web form and receive citation needed and citation reason scores for that sentence
Other
  • Integration with assistive editing programs—e.g. Huggle integration to help patrollers triage content for review, AutoWikiBrowser integration to allow adding of citation needed templates to many articles as a batch process


Conclusion


A RESTful web API will probably provide the greatest utility to the greatest number of data consumers, and supports most if not all existing and proposed use cases. The only major drawback of this approach is that it is relatively expensive (in terms of developer time and hardware/software resources) to build and maintain. An event stream approach shares this drawback, but serves only a more limited number of (real-time) use cases. Providing the models plus training and test data for consumers to build on and run themselves is already possible (the code is open source and available on GitHub), but uptake has been low, suggesting that this solution may not be tractable for most consumers.

The data dump-based service appears to support many potential use cases, provided the dumps are easy to process and updated regularly. This may be a productive interim or experimental solution, if it is not feasible or desirable to develop an API endpoint in the near future. Read more on that below.

Recommendation


We should iterate and experiment with lower cost data service options before committing to a more expensive option


By 'low cost' and 'expensive' I mean specifically the cost in terms of WMF developer time dedicated to initial development, refinement, and ongoing maintenance. 'Sunk cost' should also be factored in; for example, if we decide at some point in the future that we no longer want to provide this data service at all, in any form, what is the total cost up to that point? To some extent, overall cost can also reflect the cost of hardware necessary to provide access to the service on an ongoing basis. For example, what is the difference in hardware requirements (CPUs, GPUs, storage) and overall energy expenditure required to provide citation needed predictions in real time (API or stream) vs. via static code and data files (data dump or model self-hosting)? However, hardware, energy, and storage requirements for the different service architectures were not discussed extensively during these interviews, so relative comparisons and tradeoffs are somewhat less certain in this respect.

If we start with a lower cost option that supports enough use cases to encourage some level of adoption among existing tool developers, we will have the opportunity to gather feedback from those developers (and from the end users of their tools). We can use that feedback to iterate and to identify the user and technical requirements that will help us assess both the need for, and the feasibility of, more expensive options down the line. As a result, if or when we decide to invest more in this data service, we will have a stronger business case for doing so.

Our initial data service architecture should be based on regularly scheduled data dumps


There are several reasons to prefer a data dump-based solution to streams, APIs, or model self-hosting, at least initially.

  1. A data dump service has the potential to support a large number and a wide variety of use cases. Many of the identified use cases for the citation needed models, especially among volunteer-developed and volunteer-maintained tools, do not require either on-demand or real-time access to predictions. Many valuable bot-based, assisted-editing-program-based, and campaign/worklist-based use cases for citation needed predictions can be served by data dumps that are updated at regular intervals.
  2. A data dump service is comparatively easy to create and maintain. Developing a dump schema, generating data dumps, and hosting and updating those dumps can likely be done without pushing code to production, and may create fewer dependencies and require less code review than building APIs or streams. This may facilitate easier collaboration among WMF staff engineers and volunteers. Maintaining uptime may also be less of an issue, since the code runs on a fixed schedule.
  3. A data dump service is comparatively easy to modify and extend. Changes to dump schemas, update schedules, file formats and sizes, and documentation, as well as bug fixes, can often be made without creating problematic dependencies. Maintaining backwards compatibility may also be less of an issue with dumps, since the cost of providing continued access to dumps formatted according to legacy schemas (to support tools or workflows built around those schemas) is relatively low.
  4. A data dump service is compatible with existing Citation Hunt workflows. Citation Hunt currently updates its internal databases through a series of scheduled batch query and update operations. Integrating citation needed predictions into Citation Hunt may primarily be a matter of matching the dump's schema and update cadence to Citation Hunt's existing input schemas and data ingestion schedules (one possible record format is sketched below).
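
As a concrete (and purely illustrative) starting point for that conversation, one possible shape for the dump is a compressed JSON Lines file with one record per scored sentence. All field names below are assumptions to be agreed with the Citation Hunt maintainer, not an existing schema:

  {"page_id": 12345, "page_title": "Example", "rev_id": 987654321, "section": "History", "sentence": "The mill closed in 1972.", "citation_needed_score": 0.97, "citation_reason": "historical fact"}

A consumer could then filter such a dump with a few lines of Python:

  import gzip
  import json

  # Keep only high-confidence predictions from a (hypothetical) dump file.
  with gzip.open("citation_needed_enwiki_20190930.jsonl.gz", "rt") as dump:
      for line in dump:
          record = json.loads(line)
          if record["citation_needed_score"] >= 0.9:
              print(record["page_title"], record["sentence"])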


Initial data service architecture should be designed to support easy Citation Hunt integration


Currently, the clearest and most impactful use case for the citation needed models is Citation Hunt. Based on my discussion of use cases, user requirements, and technical requirements with tool developers and subject matter experts, I recommend that we design the initial data service architecture with Citation Hunt in mind.

Citation Hunt is available in 20 language versions of Wikipedia, is used by many editors in the course of their work, and is also frequently used in the context of edit-a-thons, classes, and contribution campaigns like 1lib1ref. This presents us with the opportunity to pilot the citation needed predictions both in contexts where the user is working independently and in contexts where the user is working as part of a coordinated effort to improve a particular area of Wikipedia.

Citation Hunt is under active development[2] by SurlyCyborg, who has indicated a willingness to partner with us on back-end integration and on experimenting with new user-facing features. This presents us with potential opportunities to get feedback and re-label data from within the app. Some UI interventions we could try out include:

  1. providing end users with prediction scores, indicating greater and lesser degrees of model certainty about the prediction
  2. providing end users with information about which words in the statement the model is "paying attention" to, in order to help them understand the prediction
  3. allowing end users to flag false positives (statement does not need a citation after all), and to specify new (or additional) reasons why the citation is needed
  4. providing an in-app link to a dedicated feedback forum where end users can provide richer feedback on the model predictions, report issues, or ask questions
  5. identifying "time consuming judgement calls" (statements that are surfaced to, and repeatedly skipped by, many app users)
  6. triaging which statements are served to the user via the app based on prediction strength (currently the order is somewhat random, as the tool has no mechanism for prioritizing which statements tagged "citation needed" to serve to the end user)
  7. triaging which statements are served to the user based on citation reason, allowing end users to focus on providing verification for particular kinds of factual claims that they enjoy hunting up citations for (for example, "statistics or data", or "historical fact").

Existing functionality within Citation Hunt, such as the ability to search for statements that need citations within a particular category or within an arbitrary set of pre-specified articles, suggests a variety of exciting and productive uses for the citation needed models; for example, improving the quality of Biographies of Living Persons, which are held to a higher standard of verifiability per Movement-wide policy.

Furthermore, as Citation Hunt currently relies on statements that have a "citation needed" tag, our models present an opportunity to dramatically expand the set of potential citation-needing statements that can be surfaced through Citation Hunt—there are many more statements that could benefit from a citation than there are statements that have already been marked as such.

Citation Hunt maintains its own database of statements that are surfaced via the app. This database is refreshed via scheduled jobs that track the addition or removal of citation needed templates. The data dump update cadence should be designed to match the current update cadence of this database. The schema and size of the dumps should be developed in collaboration with Citation Hunt's maintainer, to ensure ease of integration.

It is possible that the schema that works for Citation Hunt will also be useful for bots (e.g. a bot that flags sentences "citation needed") or assisted editing programs (e.g. AutoWikiBrowser, which can ingest dumps and allows editors to perform batch actions across many pages). When developing the data dump service for Citation Hunt, we will attempt to provide metadata, an update cadence, and an overall file structure that accommodate these additional use cases as well. Additional feedback from other tool developers (as well as researchers) on the model predictions and the dump service architecture will help us refine the data service and the models, and potentially generate requirements for more powerful data services in the future.
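
As a sketch of how a template-adding bot might consume such a dump: the file name and field names follow the illustrative schema above, the insertion logic is deliberately naive, and a production bot would of course need community approval and far more careful wikitext handling.

  import gzip
  import json

  import pywikibot

  site = pywikibot.Site("en", "wikipedia")

  with gzip.open("citation_needed_enwiki_20190930.jsonl.gz", "rt") as dump:
      for line in dump:
          record = json.loads(line)
          if record["citation_needed_score"] < 0.95:
              continue  # only act on very confident predictions
          page = pywikibot.Page(site, record["page_title"])
          text = page.text
          sentence = record["sentence"]
          # Naive checks: skip pages that already carry the template anywhere,
          # and only touch sentences we can find verbatim in the wikitext.
          if "{{Citation needed" in text or sentence not in text:
              continue
          # Store prediction metadata (score and reason) in the template's
          # 'reason' field, as suggested in the bot use cases above.
          tag = (
              "{{Citation needed|reason=model score %.2f, %s|date=September 2019}}"
              % (record["citation_needed_score"], record["citation_reason"])
          )
          page.text = text.replace(sentence, sentence + tag, 1)
          page.save(summary="Adding a citation needed template based on a model prediction (trial run)")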

In the long term we should consider developing a RESTful web API to provide citation needed scores


The primary limitations of data dumps are:

  1. they don't support real time predictions ("does this sentence that was just added to Wikipedia need a citation?")
  2. they don't support predictions on demand ("does this sentence I just stumbled upon in an article, or this sentence that I would like to add to Wikipedia, require a citation?")
  3. they aren't particularly easy to process or filter

A RESTful web API can address these limitations. The ORES platform was designed to provide machine learning predictions on demand, at scale. The "damaging", "good faith", "article quality", "draft quality", and "draft topic" models hosted on ORES are currently used to power bots, assisted editing programs, dashboards, and MediaWiki gadgets and extensions, as well as research projects and program/campaign evaluations.
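
For reference, the existing ORES request pattern that a future citation needed endpoint could mirror looks like the sketch below. The endpoint and model names are real; the revision ID is just an example, and the response fields shown reflect my understanding of the current ORES v3 response format.

  import requests

  # Request "damaging" and "articlequality" scores for one English Wikipedia
  # revision from the existing ORES service.
  response = requests.get(
      "https://ores.wikimedia.org/v3/scores/enwiki/",
      params={"models": "damaging|articlequality", "revids": "123456789"},
  )
  scores = response.json()["enwiki"]["scores"]["123456789"]
  print(scores["damaging"]["score"]["probability"])
  print(scores["articlequality"]["score"]["prediction"])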

This high degree of uptake across such a wide range of users and use cases is remarkable and exciting! We believe that, provided they are made accessible, the citation needed models have the potential to provide a degree of broad utility, and to encourage community-driven innovation, similar to the current ORES-hosted models. We would also like to explore the degree to which the citation needed models are compatible with some of the most innovative capabilities of the ORES API, such as the ability to inspect and inject features.

We believe we will be better positioned to assess the feasibility, demand, and design requirements of an API-based service after piloting the citation needed models as data dumps.

See also


References
