Research:Test External AI Models for Integration into the Wikimedia Ecosystem

Tracked in Phabricator:
Task T369281
Duration: 2024-07 – 2024-12

This page documents a research project in progress.
Information may be incomplete and change as the project progresses.
Please contact the project lead before formally citing or reusing results from this page.


As part of our contributions to WMF's 2024-2025 Annual Plan, Research and collaborators are working on identifying which AI and ML technologies are ready for WMF to start testing with (at the feature, product, ... levels), among the many models that are already available and continue to be released.

Hypothesis Text

Q1 Hypothesis

If we gather use cases from products and feature engineering managers about the use of AI in Wikimedia services for readers and contributors, we can determine if we should test and evaluate existing AI models with the goal of integrating them into product features, and if yes, generate a list of candidate models to test.

Q2 Hypothesis

If we test the accuracy and infrastructure constraints of 4 existing AI language models for 2 or more high-priority product use-cases, we will be able to write a report recommending at least one AI model that we can use for further tuning towards strategic product investments.

Methods and Tasks

Define and prioritize existing use cases for AI integration into products

See T370134

Use case definition

  1. Gather documented use cases from Product teams based on past conversations and draft an initial list, organized by task type, intended audience and impact.
  2. Conduct a set of interviews with 7 product leaders to gather their perspectives on AI product needs. These interviews revealed three new use cases: OCR, image vandalism detection, and talk page translation. Most of the previously identified cases were confirmed and refined based on this feedback, and we also gained early insight into high- and low-priority use cases.
  3. Survey Product Managers for additional input. We asked 13 product managers to review the current list of use cases and identify top-priority, low-priority, and any missing cases. The ranked use cases largely aligned with the initial feedback from leaders. The top priorities included edit-check-related tasks, such as automatically assigning categories to articles (useful beyond edit checks) and identifying policy violations. Structured tasks and mobile-friendly features, like automatic article outlines and worklist generation, were also highly ranked, along with automated image tagging and descriptions.

Results after this stage are here

Use case prioritization

After reviewing responses from the above process, and after the model selection phase is completed, we rank and select use cases based on the following criteria:

  1. Priority signaled: Was this use case mentioned during conversations with Product Leadership, or identified as a top-priority use case in the PM survey?
  2. AI Strategy Alignment: Is the use case aligned with WMF's new AI Strategy?
  3. Model availability: Have we identified existing models developed externally during the model selection phase that can be applied to this use case, based on the criteria of effectiveness, multilingualism, infrastructure, and openness?
  4. Data availability: Do we have enough labeled data to test? If not, what will it take to compile the necessary data, for example through crowdsourcing or manual evaluation?
  5. Measurability: Can we, in practice, estimate the effectiveness of existing models on the proposed use cases based on quantitative indicators?
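A minimal sketch of how these five criteria could be combined into a single ranking, assuming illustrative boolean scores and hypothetical weights (nothing here reflects the project's actual scoring scheme):

```python
# Rank candidate use cases by a weighted score over the five
# prioritization criteria. All scores and weights are illustrative.

CRITERIA = ["priority_signaled", "strategy_alignment",
            "model_availability", "data_availability", "measurability"]

# Hypothetical weights: the priority signal counts double here.
WEIGHTS = {"priority_signaled": 2.0, "strategy_alignment": 1.0,
           "model_availability": 1.0, "data_availability": 1.0,
           "measurability": 1.0}

def score(use_case: dict) -> float:
    """Sum of weights for the criteria this use case satisfies."""
    return sum(WEIGHTS[c] for c in CRITERIA if use_case.get(c))

candidates = [
    {"name": "Automatic article categorization", "priority_signaled": True,
     "strategy_alignment": True, "model_availability": True,
     "data_availability": True, "measurability": True},
    {"name": "Talk page translation", "priority_signaled": False,
     "strategy_alignment": True, "model_availability": True,
     "data_availability": False, "measurability": True},
]

ranked = sorted(candidates, key=score, reverse=True)
for uc in ranked:
    print(f"{uc['name']}: {score(uc):.1f}")
```

In practice the answers to several of these questions are judgment calls rather than booleans, so a scheme like this would at most support, not replace, the manual review described above.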

Define a set of criteria to identify existing models to test, and select candidate models for use-cases

We review the literature on existing AI models to find good matches for each use case defined above, based on specific criteria.

Criteria for selecting models to be tested

  1. Effectiveness: Does the model have the potential to perform well for the specific use case based on previous research or similar tasks? Has the model been applied successfully to similar tasks or domains, demonstrating its effectiveness?
  2. Multilingualism: Does the model support the languages required by the use case? In general, is the model designed to handle multiple languages, and has it been intentionally trained and tested on languages other than English?
  3. Infrastructure: Can the model be hosted within our current infrastructure (e.g., LiftWing)? Is the model adaptable within our systems, and have similar models been successfully hosted before? Is there a contingency plan for hosting the model externally if necessary?
  4. Openness: Is the model open-source or available through accessible platforms such as Hugging Face? Does the model's licensing allow for testing, modification, and use in production if needed? Is there sufficient public documentation regarding the model's architecture, training data, and other relevant aspects?
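As a sketch, the four criteria above can be treated as a checklist filter over candidate models; the model names and flags below are purely illustrative:

```python
from dataclasses import dataclass

@dataclass
class CandidateModel:
    name: str
    effective: bool       # evidence of good performance on similar tasks
    multilingual: bool    # trained/tested beyond English
    hostable: bool        # fits current infrastructure (e.g., LiftWing)
    open: bool            # open license and public documentation

def shortlist(models):
    """Keep only the models that meet all four selection criteria."""
    return [m.name for m in models
            if m.effective and m.multilingual and m.hostable and m.open]

models = [
    CandidateModel("model-a", True, True, True, True),
    CandidateModel("model-b", True, False, True, True),   # English-only
    CandidateModel("model-c", True, True, False, True),   # too large to host
]
print(shortlist(models))  # ['model-a']
```

A hard filter like this is the simplest reading of the criteria; a softer variant could score partial matches instead of requiring all four.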

Define a protocol for external model evaluation

Test models on WMF infrastructure

Timeline

[Q1 24-25] Tasks 1 and 2
[Q2 24-25] Tasks 3 and 4

Results

TODO: Add initial results for each task when ready

Provisional List of Defined Product-AI Use-Cases

Each entry below lists the use case and its macro-category, a short description, the intended audience, what it could help our movement learn or achieve, and its expected impacts.

Detect grammar / typos / misspellings (Structured/Edit Tasks)
  Description: Detect errors in text and propose ways to correct them.
  Audience: Contributors
  Impacts: Support Newcomers; Automate Patrolling

Detect valid categories for Wikipedia articles (Structured/Edit Tasks)
  Description: Given an article, recommend the top X categories that the article could be tagged with.
  Audience: Contributors
  Impacts: Support Newcomers; Address Knowledge Gaps

Detect policy violations, e.g., WP:NPOV, WP:NOR (Structured/Edit Tasks)
  Audience: Contributors
  Learn/achieve: Moderators can see "non-neutral language" in an article highlighted automatically; edit checks for newcomers; editors could accept suggested corrections to edits based on policies and norms.
  Impacts: Improve Content Integrity; Support Newcomers

Talk page tone detection (T&S/Moderator Tools)
  Description: Detect negative sentiments and harassment in talk page conversations.
  Audience: Readers and Contributors
  Learn/achieve: Functionaries can see talk page tone and manner issues in contributor stats; editors receive constructive feedback on their tone and manner in talk page discussions; a reader can see if an article has an unusual debate profile on its talk pages.
  Impacts: Automate Workflows; Improve Content Integrity

Source verification (Structured/Edit Tasks)
  Description: Verify that the text in an article is supported by the source specified in its inline citation.
  Audience: Contributors
  Learn/achieve: Use LLMs to find new or better sources for claims on Wikipedia.
  Impacts: Improve Content Integrity; Automate Workflows

Talk page summaries (T&S/Moderator Tools)
  Description: Generate summaries of talk pages highlighting the main points of discussion and the final consensus.
  Audience: Contributors
  Learn/achieve: New editors can generate a summary of a talk page dialog before joining the discussion; moderators can generate a summary of a discussion.
  Impacts: Automate Patrolling; Support Newcomers

Automatic article outlines (Structured/Edit Tasks)
  Description: Generate a structure of sections and subsections for a new article.
  Audience: Contributors
  Learn/achieve: Editors can automatically generate outlines for articles they want to write.
  Impacts: Automate Workflows

Article summaries (Reader Tools)
  Description: Summarize the content of an article in a few sentences.
  Audience: Readers
  Learn/achieve: Readers can browse summaries of articles related to the article they're on; the platform provides an article summary API for first- or third-party use.
  Impacts: Improve Content Discovery; Retain New Readers

Wikipedia text generation from sources (Structured/Edit Tasks)
  Description: Generate sentences or paragraphs for a Wikipedia article based on existing reliable sources.
  Audience: Contributors
  Learn/achieve: Achieve new content; inspire new and existing editors who prefer draft suggestions instead of editing from scratch; use GenAI to generate suggestions for Wikipedia articles based on given source content.
  Impacts: Automate Workflows; Support Newcomers

Text to speech (Reader Tools)
  Description: Audio format for the encyclopedic content.
  Audience: Readers
  Learn/achieve: Readers can access articles or content in audio format; this could be just pronunciation or full article audio.
  Impacts: Accessibility

Automated image metadata tagging (Structured/Edit Tasks)
  Description: Tag images on Wikipedia and Commons with relevant Wikidata items.
  Audience: Readers and Contributors
  Learn/achieve: Commons users can search using intuitive keywords and find images that have been tagged in arcane ways; editors can browse and easily add images related to the topic they're editing; plus semi-automated image description generation for structured tasks.
  Impacts: Improve Search; Automate Workflows

Automated Q/A generation from Wikipedia articles (Reader Tools)
  Description: Generate questions and answers that can help navigate the content of a Wikipedia article.
  Audience: Readers
  Learn/achieve: Use AI to autogenerate quizzes on articles; readers can see all the questions the article they're reading has answers to.
  Impacts: Improve Content Discovery; Retain New Readers

Optical Character Recognition system (WikiSource)
  Description: Digitizing documents requires an OCR system that works for all the languages we support.
  Audience: Contributors
  Learn/achieve: Knowledge-processing tools like OCR help volunteers digitize documents for projects like Wikisource; such tools are hard to find for low-resource languages, and assisting volunteers with the right tools helps them contribute more and save time.
  Impacts: Automate Workflows; Address Knowledge Gaps

Image vandalism detection (T&S/Moderator Tools)
  Description: Detect images that are maliciously added to articles.
  Audience: Contributors
  Learn/achieve: Patrollers can visualize images that appear to be out of context or misplaced in Wikipedia articles.
  Impacts: Automate Workflows; Improve Content Integrity

Automated worklist generation (T&S/Moderator Tools)
  Description: Generate lists of articles that are relevant to an editor and that need improvement.
  Audience: Contributors
  Learn/achieve: Editors have an automatically generated list of articles they've contributed to that need additional work.
  Impacts: Automate Workflows

Edit summaries (T&S/Moderator Tools)
  Description: Given an edit, generate a meaningful summary of what happened in the edit (and why).
  Audience: Contributors
  Learn/achieve: Editors can automatically get an edit summary generated from their edits; patrollers can see a summary of a user's recent edits.
  Impacts: Automate Workflows; Automate Patrolling

Automated reading list generation (Reader Tools)
  Description: Recommend relevant articles to read based on current reader interest.
  Audience: Readers
  Learn/achieve: A reader can access a list of "the next 5 things you might want to read, based on what you've already read this session".
  Impacts: Improve Content Discovery; Retain New Readers

Suggest templates for a given editor (Structured/Edit Tasks)
  Description: Retrieve relevant templates for a Wikipedia article.
  Audience: Contributors
  Learn/achieve: Help us learn whether we can effectively and reliably suggest templates for users who want to insert a template on a page. Measured as: when a user views "suggested templates", they insert a suggested template 20% of the time.
  Impacts: Support Newcomers; Automate Workflows

Policy discovery during editing (Structured/Edit Tasks)
  Description: Retrieve policies that are relevant to the current edit activities.
  Audience: Contributors
  Learn/achieve: Editors can easily find documentation on policies and norms; relevant policies are automatically surfaced to editors in the edit workflow; new editors can ask questions to get help with editing.
  Impacts: Support Newcomers; Automate Workflows

Talk page translations (Wishlist)
  Description: See translations of messages posted in a different language.
  Audience: Readers and Contributors
  Learn/achieve: Discussion venues support multilingualism; this can be helpful for ambassadors posting messages in various wikis, Meta-Wiki discussions, Community Wishlist discussions, strategy discussions, and so on.
  Impacts: Community Inclusivity; Multilingual Discussions and Communication

Improve natural language search (Reader Tools)
  Description: Improve search so that people can ask questions in natural language.
  Audience: Readers
  Learn/achieve: Readers can ask questions in natural language to retrieve answers and articles; can pull out factoids from deep in articles; can navigate directly to relevant anchor links in articles.
  Impacts: Improve Search

De-orphaning articles (Structured/Edit Tasks)
  Description: Suggest articles related to orphans.
  Audience: Contributors
  Impacts: Improve Content Discovery; Support Newcomers

Selected Use Cases

Task 1: Automatic Article Categorization

This is a generic task supporting many product features. The goal is to automatically associate an article with one or more categories from the Wikipedia category network. In the past, this task has been attempted at a relatively small scale (e.g., few categories or a small semantic span [1]). We are attempting it at a much larger scale using LLMs. Early explorations suggest that the recommendations are accurate; however, the prompting strategy needs to include "constrained decoding" to enforce that the output comes from the set of categories that exist on Wikipedia.
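Constrained decoding proper restricts the model's token generation to valid outputs during inference; as a lighter-weight illustration of the same goal, free-text model output can also be snapped post hoc onto the closest valid category name. The category list and matching threshold below are hypothetical:

```python
import difflib

# Hypothetical subset of the Wikipedia category network.
VALID_CATEGORIES = [
    "Category:Machine learning",
    "Category:Natural language processing",
    "Category:Free software",
    "Category:Online encyclopedias",
]

def snap_to_valid(raw_labels, cutoff=0.8):
    """Map free-text model output onto valid categories; drop any
    label with no sufficiently close match."""
    snapped = []
    for label in raw_labels:
        match = difflib.get_close_matches(
            label, VALID_CATEGORIES, n=1, cutoff=cutoff)
        if match:
            snapped.append(match[0])
    return snapped

# Example of raw LLM output deviating slightly from canonical names.
raw = ["Category:Machine Learning", "Category:NLP tools",
       "Category:Online encyclopedia"]
print(snap_to_valid(raw))
```

Post-hoc snapping cannot recover labels the model never produced, which is why constraining generation itself is the preferred strategy here.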

Task 2: NPOV violation detection

The second major area we identified as a pool of tasks for testing is policy violation detection, in support of product features such as Edit Checks. More specifically, we would like to see the extent to which existing LLMs are able to detect violations of the NPOV policy. While recent research [2] has shown that existing LLMs are not necessarily accurate at detecting these kinds of policy violations, we aim to extend those tests to more languages and to compare the models with simpler baselines.
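A minimal sketch of what a prompt-based NPOV check could look like; the prompt wording, label set, and parsing are illustrative assumptions, and no actual model is called here:

```python
# Build a classification prompt for an NPOV check and parse the
# model's one-word verdict. Everything here is illustrative.

def build_npov_prompt(sentence: str) -> str:
    return (
        "You are checking Wikipedia content against the Neutral Point "
        "of View (NPOV) policy.\n"
        f"Sentence: {sentence!r}\n"
        "Answer with exactly one word, VIOLATION or OK."
    )

def parse_verdict(model_output: str) -> bool:
    """True if the model flagged an NPOV violation."""
    return model_output.strip().upper().startswith("VIOLATION")

prompt = build_npov_prompt(
    "The city is home to the world's most beautiful cathedral.")
print(prompt)
print(parse_verdict("VIOLATION"))   # True
print(parse_verdict("OK"))          # False
```

Pinning the model to a fixed label set keeps evaluation simple and makes it straightforward to compute accuracy against labeled data across languages.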

Task 3: Peacock behavior detection

This is a special case of Task 2 that should be more constrained and feasible, given that detecting peacock behavior is strictly a language understanding problem. However, recent experiments showed that small LLMs are not sufficiently well trained to detect this type of behavior. We aim to expand these experiments to more and larger models, and to more languages.
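One such simple baseline to compare the LLMs against is a lexicon match over "peacock" terms; the term list below is an illustrative subset, not the project's actual baseline:

```python
import re

# Illustrative subset of peacock terms; a real baseline would use a
# fuller, per-language lexicon (cf. the English "words to watch" list).
PEACOCK_TERMS = ["legendary", "world-class", "renowned",
                 "award-winning", "iconic", "greatest"]

PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, PEACOCK_TERMS)) + r")\b",
    re.IGNORECASE)

def peacock_baseline(sentence: str) -> bool:
    """Flag a sentence if it contains any peacock term."""
    return bool(PATTERN.search(sentence))

print(peacock_baseline("She is a legendary, world-class performer."))  # True
print(peacock_baseline("She performed in Berlin in 2003."))            # False
```

A lexicon baseline like this is cheap and transparent, so it sets a useful floor: an LLM that cannot beat it on labeled data offers little practical value for this task.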

Selected Models

Evaluation Protocol

Model Test Results

Resources

References

  1. Shavarani, Hassan S., and Satoshi Sekine. "Multi-class Multilingual Classification of Wikipedia Articles Using Extended Named Entity Tag Set." arXiv preprint arXiv:1909.06502 (2019).
  2. Ashkinaze, Joshua, Ruijia Guan, Laura Kurek, Eytan Adar, Ceren Budak, and Eric Gilbert. "Seeing Like an AI: How LLMs Apply (and Misapply) Wikipedia Neutrality Norms." arXiv preprint arXiv:2407.04183 (2024).