Research:Newsletter/2025/January
Vol: 15 • Issue: 01 • January 2025
GPT-4 writes better edit summaries than human Wikipedians
By: Tilman Bayer
GPT-4 is better at writing edit summaries than human Wikipedia editors
A preprint[1] by researchers from EPFL and the Wikimedia Foundation presents
Edisum, which is, to the best of our knowledge, the first solution to automate the generation of highly-contextual Wikipedia edit summaries [given an edit diff] at large scale, [and] achieves performance similar to the human editors
The solution was designed to meet the performance and open-source requirements of a live service deployed on Wikimedia Foundation servers. It consists of a "very small" language model (ca. 220 million parameters) based on Google's LongT5, an extension of the company's T5 model from 2019 that is available under an Apache 2.0 license.
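To make the setup concrete, here is a minimal sketch of generating an edit summary with a small, CPU-friendly LongT5 model. The public google/long-t5-tglobal-base checkpoint (roughly the size quoted above) and the diff serialization are stand-ins of our own choosing; the paper's actual Edisum checkpoint and input format may well differ, and without fine-tuning the output will not be a useful summary:

```python
# Minimal sketch: generating an edit summary with a small LongT5 model.
# Assumptions: the public "google/long-t5-tglobal-base" checkpoint stands in
# for Edisum's fine-tuned model, and the diff serialization below is
# hypothetical -- the paper's exact input format may differ.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "google/long-t5-tglobal-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Hypothetical serialization of an edit: old and new versions of the changed text.
edit_diff = (
    "<old> Boveys Lookout is a hill in the Burton Range. "
    "<new> Boveys Lookout is a hill in the Burton Range, "
    "located east of the town at a higher altitude."
)

inputs = tokenizer(edit_diff, return_tensors="pt", truncation=True, max_length=2048)
summary_ids = model.generate(**inputs, max_new_tokens=32, num_beams=4)
# Without task-specific fine-tuning this prints boilerplate, not a real summary.
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```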
Separately, the authors also tested several contemporaneous large language models (GPT-4, GPT-3.5 and Llama 3 8B). GPT-4's edit summaries in particular were rated as significantly better than those provided by the human Wikipedia editors who originally made the edits in the sample – both using an automated scoring method based on semantic similarity, and in a quality ranking by human raters (where "to ensure high-quality results, instead of relying on the crowdsourcing platforms [like Mechanical Turk, frequently used in similar studies], we recruited 3 MSc students to perform the annotation").
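For readers curious what such embedding-based scoring looks like in practice, a small sketch follows; the choice of embedding model here is our own assumption, not necessarily the one used in the paper:

```python
# Sketch of an automated "semantic similarity" score between a generated edit
# summary and the human reference. The embedding model ("all-MiniLM-L6-v2")
# is an assumption for illustration only.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")

reference = "fixed link to aias awards, it was at 5th, not 7th"
generated = "Corrected the award number from 7th to 5th"

emb = embedder.encode([reference, generated], convert_to_tensor=True)
similarity = util.cos_sim(emb[0], emb[1]).item()
print(f"cosine similarity: {similarity:.3f}")
```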
Examples of summaries written by human editors and generated by Edisum and GPT-4 (Edisum [0%] and Edisum [100%] denote the model variants trained on 0% and 100% synthetic data, respectively):

1. Human editor: added the range • Edisum [0%]: added location • Edisum [100%]: Added location information • GPT-4: Added information about the location and altitude of Boveys Lookout and the orientation of the Burton Range.
2. Human editor: in 1982 • Edisum [0%]: Added date and citation. • Edisum [100%]: Added the year of the speech • GPT-4: Added the year of Fidel Castro's speech.
3. Human editor: added artist • Edisum [0%]: Added a wikilink and a link • Edisum [100%]: Added a new name to the list of artists • GPT-4: Added Stefan Brüggemann to the list of artists whose works are included.
4. Human editor: lede; infobox • Edisum [0%]: infobox (inapplicable); lede; -cats • Edisum [100%]: Clarified the nature of Denis-Auguste Affre's cause and added information about his title • GPT-4: Updated the description of Denis-Auguste Affre's religious status and clarified the stage of his canonization process.
5. Human editor: fixed link to aias awards, it was at 5th, not 7th • Edisum [0%]: 5th • Edisum [100%]: Corrected the award number from 7th to 5th • GPT-4: Corrected the year of the Annual Interactive Achievement Awards nomination.
6. Human editor: Amended lists so they are split at 100 m • Edisum [0%]: "at least", not "at least" • Edisum [100%]: Clarified the range of heights in the list • GPT-4: Added information about two different height categories for the ranking of buildings and structures in Greater Manchester.
7. Human editor: UPD Romelu Lukaku • Edisum [0%]: updated Lukaku • Edisum [100%]: Removed the end of the season • GPT-4: Removed incorrect future date for Romelu Lukaku's tenure at Inter.
8. Human editor: This "however" doesn't make sense here • Edisum [0%]: rmv deprecated, ambiguous "however" per MOS:RELTIME • Edisum [100%]: Removed unnecessary word • GPT-4: Removed the word "However," from the beginning of the sentence.
9. Human editor: Per feedback given in GA review, elaborated on David King Udall and Ella Stewart Udall in the lead and in Ella Udall's first mention; David Udall was already identified • Edisum [0%]: Added link; improved phrasing and punctuation • Edisum [100%]: Added additional information about the individuals' professions • GPT-4: Added professional details for David King Udall, Ella Stewart Udall, and Mary Ann Linton Morgan Udall.
This outcome joins some other recent research indicating that modern LLMs can match or even surpass the average Wikipedia editor in certain tasks (see e.g. our coverage: "'Wikicrow' AI less 'prone to reasoning errors (or hallucinations)' than human Wikipedia editors when writing gene articles").
A substantial part of the paper is devoted to showing that this particular task (generating good edit summaries) is both important and in need of improvements, motivating the use of AI to "overcome this problem and help editors write useful edit summaries":
"An edit summary is a succinct comment written by a Wikipedia editor explaining the nature of, and reasons for, an edit to a Wikipedia page. Edit summaries are crucial for maintaining the encyclopedia: they are the first thing seen by content moderators and they help them decide whether to accept or reject an edit. [...] Unfortunately, as we show, for many edits, summaries are either missing or incomplete."
In more detail:
"Given the dearth of data on the nature and quality of edit summaries on Wikipedia, we perform qualitative coding to guide our modeling decisions. Specifically, we analyze a sample of 100 random edits made in August 2023 to English Wikipedia [removing bot edits, edits with empty summaries and edits related to reverts] stratified among a diverse set of editor expertise levels. Two of the authors each coded all 100 summaries [...] by following criteria set by the English Wikipedia community (Wikimedia, 2024a) [...]. The vast majority (∼80%) of current edit summaries focus on [the] “what” of the edit, with only 30–40% addressing the “why”. [...] A sizeable minority (∼35%) of edit summaries were labeled as “misleading”, generally due to overly vague summaries or summaries that only mention part of the edit. [...] Almost no edit summaries are inappropriate, likely because highly inappropriate edit summaries would be deleted (Wikipedia, 2024c) by administrators and not appear in our dataset."
"Table 1: Statistics on agreement for qualitative coding for each facet and the proportion of how many edit summaries met each criteria. Ranges are a lower bound (both of the coders marked an edit) and an upper bound (at least one of the coders marked an edit). The majority of summaries are expressing only what was done in the edit, which we also expect a language model to do. A significant portion of edits is of low quality, i.e., misleading."

- Summary (what) – attempts to describe what the edit did (e.g. "added links"): % agreement 0.89, Cohen's kappa 0.65; overall (n=100) 0.75–0.86, IP editors (n=25) 0.76–0.88, newcomers (n=25) 0.76–0.84, mid-experienced (n=25) 0.76–0.88, experienced (n=25) 0.72–0.84.
- Explain (why) – attempts to describe why the edit was made (e.g. "Edited for brevity and easier reading"): % agreement 0.80, Cohen's kappa 0.57; overall 0.26–0.46, IP editors 0.20–0.44, newcomers 0.36–0.48, mid-experienced 0.28–0.52, experienced 0.20–0.40.
- Misleading – overly vague or misleading per English Wikipedia guidance (e.g. "updated" without explaining what was updated is too vague): % agreement 0.77, Cohen's kappa 0.50; overall 0.23–0.46, IP editors 0.40–0.64, newcomers 0.24–0.52, mid-experienced 0.16–0.36, experienced 0.12–0.32.
- Inappropriate – could be perceived as inappropriate or uncivil per English Wikipedia guidance: % agreement 0.98, Cohen's kappa -0.01; overall 0.00–0.02, IP editors 0.00–0.08, newcomers 0.00–0.00, mid-experienced 0.00–0.00, experienced 0.00–0.00.
- Generate-able (what) – could a language model feasibly describe the "what" of this edit based solely on the edit diff: % agreement 0.97, Cohen's kappa 0.39; overall 0.96–0.99, IP editors 0.92–0.96, newcomers 0.92–1.00, mid-experienced 1.00–1.00, experienced 1.00–1.00.
- Generate-able (why) – could a language model feasibly describe the "why" of this edit based solely on the edit diff: % agreement 0.80, Cohen's kappa 0.32; overall 0.08–0.28, IP editors 0.04–0.16, newcomers 0.12–0.20, mid-experienced 0.08–0.28, experienced 0.08–0.48.
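The agreement figures in such a coding exercise are straightforward to reproduce; the following sketch computes percent agreement, Cohen's kappa and the lower/upper bounds for a single facet, using made-up labels rather than the paper's data:

```python
# Sketch of the agreement statistics reported in Table 1: raw percent agreement
# and Cohen's kappa between two coders' binary labels for one facet
# (e.g. "misleading"). The label vectors below are invented for illustration.
from sklearn.metrics import cohen_kappa_score

coder_a = [1, 0, 0, 1, 0, 0, 1, 0, 0, 0]
coder_b = [1, 0, 1, 1, 0, 0, 0, 0, 0, 0]

percent_agreement = sum(a == b for a, b in zip(coder_a, coder_b)) / len(coder_a)
kappa = cohen_kappa_score(coder_a, coder_b)

# Bounds as in the table: both coders marked the edit vs. at least one did.
lower = sum(a and b for a, b in zip(coder_a, coder_b)) / len(coder_a)
upper = sum(a or b for a, b in zip(coder_a, coder_b)) / len(coder_a)
print(percent_agreement, kappa, lower, upper)
```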
The paper discusses various other nuances and special cases in interpreting these results and in deriving suitable training data for the "Edisum" model. (For example, "edit summaries should ideally explain why the edit was performed, along with what was changed, which often requires external context" that is not available to the model – or really to any human apart from the editor who made the edit.) The authors' best-performing approach relies on fine-tuning the aforementioned LongT5 model on 100% synthetic data, generated using an LLM (gpt-3.5-turbo) as an intermediate step.
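A bare-bones sketch of that training recipe (fine-tuning a small LongT5 model on synthetic diff-to-summary pairs) might look as follows; the checkpoint, data and hyperparameters are placeholders rather than the paper's actual configuration:

```python
# Sketch: fine-tune a small LongT5 model on (edit diff -> summary) pairs whose
# target summaries were generated synthetically by an LLM. Everything below
# (checkpoint, data, hyperparameters) is a placeholder, not the paper's setup.
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM,
                          Seq2SeqTrainer, Seq2SeqTrainingArguments,
                          DataCollatorForSeq2Seq)

model_name = "google/long-t5-tglobal-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Synthetic pairs: serialized diff as input, LLM-written summary as target.
pairs = Dataset.from_dict({
    "diff": ["<old> ... <new> ... added the year 1982 and a citation"],
    "summary": ["Added the year of the speech and a citation."],
})

def tokenize(batch):
    enc = tokenizer(batch["diff"], truncation=True, max_length=2048)
    enc["labels"] = tokenizer(text_target=batch["summary"],
                              truncation=True, max_length=64)["input_ids"]
    return enc

train = pairs.map(tokenize, batched=True, remove_columns=pairs.column_names)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="edisum-sketch",
                                  per_device_train_batch_size=1,
                                  num_train_epochs=1),
    train_dataset=train,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```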
Overall, they conclude that
while it should be used with caution due to a portion of unrelated summaries, the analysis confirms that Edisum is a useful option that can aid editors in writing edit summaries.
The authors wisely refrain from suggesting the complete replacement of human-generated edit summaries. (It is intriguing, however, to observe that Wikidata, a fairly successful sister project of Wikipedia, has been content with relying almost entirely on auto-generated edit summaries for many years. And the present paper exclusively focuses on English Wikipedia – Wikipedias in other languages might have fairly different guidelines or quality issues regarding edit summaries.)
Still, there might be great value in deploying Edisum as an opt-in tool for editors willing to be mindful of its potential pitfalls. (While the English Wikipedia community has rejected proposals for a policy or guideline about LLMs, a popular essay advises that while their use for generating original content is discouraged, "LLMs can be used for certain tasks (like copyediting, summarization, and paraphrasing) if the editor has substantial prior experience in the intended task and rigorously scrutinizes the results before publishing them.")
On that matter, it is worth noting that the paper was first published (as a preprint) already ten months ago, in April 2024. (It appears to have been submitted for review at an ACL conference, but does not seem to have appeared in peer-reviewed form yet.) Given the current extremely fast-paced developments in large language models, this likely means that the paper is already quite outdated regarding several of the constraints under which Edisum was developed. Specifically, the authors write that
commercial LLMs [like GPT-4] are not well suited for [Edisum's] task, as they do not follow the open-source guidelines set by Wikipedia [referring to the Wikimedia Foundation's guiding principles]. [...Furthermore,] the open-source LLM, Llama 3 8B, underperforms even when compared to the finetuned Edisum models.
But the performance of open LLMs (at least those released under the kind of license that is regarded as open-source in the paper) has greatly improved over the past year, while the costs of using LLMs in general have dropped.
Besides the Foundation's licensing requirements, its hardware constraints also played a big role:
We intentionally use a very small model, because of limitations of Wikipedia’s infrastructure. In particular, Wikipedia [i.e. WMF] does not have access to many GPUs on which we could deploy big models (Wikitech, 2024), meaning that we have to focus on the ones that can run effectively on CPUs. Note that this task requires a model running virtually in real-time, as edit summaries should be created when [an] edit is performed, and cannot be precalculated to decrease the latency.
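For a sense of what this constraint means in practice, here is a minimal latency check for CPU-only generation with a similarly sized model; the checkpoint and input are placeholders of our own choosing, and what counts as acceptable latency is of course a product decision:

```python
# Illustration of the CPU-only, near-real-time constraint: time one summary
# generation on CPU with a small seq2seq model. Checkpoint and input are
# placeholders, not the deployed Edisum configuration.
import time
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "google/long-t5-tglobal-base"  # assumption: a similarly small model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to("cpu").eval()

inputs = tokenizer("<old> ... <new> ...", return_tensors="pt")
start = time.perf_counter()
with torch.no_grad():
    model.generate(**inputs, max_new_tokens=32)
print(f"CPU generation latency: {time.perf_counter() - start:.2f}s")
```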
Here too one wonders whether the situation might have improved over the past year since the paper was first published. Unlike much of the rest of the industry, the Wikimedia Foundation avoids NVIDIA GPUs because of their proprietary CUDA software layer and uses AMD GPUs instead, which are known for having some challenges in running standard open LLMs – but conceivably, AMD's software support and performance optimizations for LLMs might have been improving. Also, given the size of WMF's overall budget, it seems interesting that compute budget constraints would apparently prevent the deployment of a better-performing tool for supporting editors in an important task.
Briefly
- Submissions are open until March 9, 2025 for Wiki Workshop 2025, to take place on May 21–22, 2025. The virtual event will be the 12th in this annual series (formerly part of The Web Conference), and has been extended from one to two days this time. It is organized by the Wikimedia Foundation's research team with other collaborators. The call for contributions asks for 2-page extended abstracts which will be "non-archival, which means that they can be ongoing, completed, or already published work."
- See the page of the monthly Wikimedia Research Showcase for videos and slides of past presentations.
Other recent publications
Other recent publications that could not be covered in time for this issue include the items listed below. Contributions, whether reviewing or summarizing newly published research, are always welcome.
"Scholarly Wikidata: Population and Exploration of Conference Data in Wikidata using LLMs"
From the abstract:[2]
"Several initiatives have been undertaken to conceptually model the domain of scholarly data using ontologies and to create respective Knowledge Graphs. [...] Our main contributions include (a) an analysis of ontologies for representing scholarly data to identify gaps and relevant entities/properties in Wikidata, (b) semi-automated extraction – requiring (minimal) manual validation – of conference metadata (e.g., acceptance rates, organizer roles, programme committee members, best paper awards, keynotes, and sponsors) from websites and proceedings texts using LLMs. Finally, we discuss (c) extensions to visualization tools in the Wikidata context for data exploration of the generated scholarly data. Our study focuses on data from 105 Semantic Web-related conferences and extends/adds more than 6000 entities in Wikidata. It is important to note that the method can be more generally applicable beyond Semantic Web-related conferences for enhancing Wikidata's utility as a comprehensive scholarly resource."
"Migration and Segregated Spaces: Analysis of Qualitative Sources Such as Wikipedia Using Artificial Intelligence"
This study uses Wikipedia articles about neighborhoods in Madrid and Barcelona to predict immigrant concentration and segregation. From the abstract:[3]
"The scientific literature on residential segregation in large metropolitan areas highlights various explanatory factors, including economic, social, political, landscape, and cultural elements related to both migrant and local populations. This paper contrasts the impact of these factors individually, such as the immigrant rate and neighborhood segregation. To achieve this, a machine learning analysis was conducted on a sample of neighborhoods in the main Spanish metropolitan areas (Madrid and Barcelona), using a database created from a combination of official statistical sources and textual sources, such as Wikipedia. These texts were transformed into indexes using Natural Language Processing (NLP) and other artificial intelligence algorithms capable of interpreting images and converting them into indexes. [...] The novel application of AI and big data, particularly through ChatGPT and Google Street View, has enhanced model predictability, contributing to the scientific literature on segregated spaces."
"On the effective transfer of knowledge from English to Hindi Wikipedia"
From the abstract:[4]
"[On Wikipedia, t]here is a significant disparity in the quality of content between high-resource languages (HRLs, e.g., English) and low-resource languages (LRLs, e.g., Hindi), with many LRL articles lacking adequate information. To bridge these content gaps, we propose a lightweight framework to enhance knowledge equity between English and Hindi. In case the English Wikipedia page is not up-to-date, our framework extracts relevant information from external resources readily available (such as English books) and adapts it to align with Wikipedia's distinctive style, including its neutral point of view (NPOV) policy, using in-context learning capabilities of large language models. The adapted content is then machine-translated into Hindi for integration into the corresponding Wikipedia articles. On the other hand, if the English version is comprehensive and up-to-date, the framework directly transfers knowledge from English to Hindi. Our framework effectively generates new content for Hindi Wikipedia sections, enhancing Hindi Wikipedia articles respectively by 65% and 62% according to automatic and human judgment-based evaluations."
References
- ↑ Šakota, Marija; Johnson, Isaac; Feng, Guosheng; West, Robert (2024-04-04), Edisum: Summarizing and Explaining Wikipedia Edits at Scale, arXiv, doi:10.48550/arXiv.2404.03428 Code / models
- ↑ Mihindukulasooriya, Nandana; Tiwari, Sanju; Dobriy, Daniil; Nielsen, Finn Årup; Chhetri, Tek Raj; Polleres, Axel (2024-11-13), Scholarly Wikidata: Population and Exploration of Conference Data in Wikidata using LLMs, arXiv, doi:10.48550/arXiv.2411.08696 Code / dataset
- ↑ López-Otero, Javier; Obregón-Sierra, Ángel; Gavira-Narváez, Antonio (December 2024). "Migration and Segregated Spaces: Analysis of Qualitative Sources Such as Wikipedia Using Artificial Intelligence". Social Sciences 13 (12): 664. ISSN 2076-0760. doi:10.3390/socsci13120664.
- ↑ Das, Paramita; Roy, Amartya; Chakraborty, Ritabrata; Mukherjee, Animesh (2024-12-07), On the effective transfer of knowledge from English to Hindi Wikipedia, arXiv, doi:10.48550/arXiv.2412.05708