Wikimedia Research Newsletter

Vol: 14 • Issue: 08 • August 2024

Simulated Wikipedia seen as less credible than ChatGPT and Alexa in experiment

By: Tilman Bayer

Identical text perceived as *less* credible when presented as a Wikipedia article than as simulated ChatGPT or Alexa output

A paper[1] published in Nature's Scientific Reports presents "the results of two preregistered experiments in which [1222 human] participants rated the credibility of accurate versus partially inaccurate information ostensibly provided by a dynamic text-based LLM-powered agent, a voice-based agent, or a static text-based online encyclopedia". These mock-ups (examples and the full set are available online) "looked or sounded as similar as possible to the respective real applications" ChatGPT, Amazon Alexa and English Wikipedia, respectively. In the first experiment, this included branding ("Wikipedia. The Free Encyclopedia" etc.), which was removed in the second experiment so that the mock-ups "looked and sounded like a generic voice-based agent, a dynamic text-based agent, or a static text-based encyclopedia" instead.

"Screenshots of unbranded application mock-ups" (figure 1 from the paper; the branded versions, "Wikipedia" etc., are available online)

The brief texts presented were identical across the three presentation modes; in the Wikipedia case, they were formatted to resemble the lead section or other parts of a full article. They were generated as answers to

"[...] six questions that related to general knowledge and covered diverse topics: What do I do when I encounter a wolf? What are the risks of hookah smoking? How many people died when the Titanic sank? What is appendicitis? How many bones are in the human body? Tell me something about the country Slovenia!"

Each participant was randomly assigned to one of the three presentation modes, and shown six texts where

"For half the topics [...] the information was entirely accurate, while for the other half, the information contained several factual inaccuracies and/or internal inconsistencies (i.e., a piece of information within a snippet contradicted another piece of information provided within the same snippet); both error types are known to happen regularly during typical usage of LLMs."

For each text, subjects were asked to rate "the extent to which they perceived the information to be accurate, trustworthy, and believable."
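
To make the structure concrete, here is a minimal simulation sketch of this 3 × 2 design (presentation mode varied between subjects, accuracy within subjects) in Python. The topic labels, rating scale and numbers below are invented for illustration; this is not the authors' code or data.

```python
import random
import statistics

# Illustrative 3 x 2 design (not the authors' code): each participant sees
# ONE presentation mode (between-subjects) and rates six texts, half of them
# accurate and half partially inaccurate (within-subjects).
MODES = ["voice_agent", "dynamic_text_agent", "static_encyclopedia"]
TOPICS = ["wolf", "hookah", "titanic", "appendicitis", "bones", "slovenia"]

def run_participant(pid: int) -> list[dict]:
    mode = random.choice(MODES)                        # random assignment to a mode
    accurate_topics = set(random.sample(TOPICS, k=3))  # half the topics accurate
    ratings = []
    for topic in TOPICS:
        accurate = topic in accurate_topics
        # Placeholder for the accurate/trustworthy/believable rating items;
        # real values would come from participants, not a simulation.
        rating = random.uniform(4, 7) if accurate else random.uniform(2, 5)
        ratings.append({"pid": pid, "mode": mode, "topic": topic,
                        "accurate": accurate, "credibility": rating})
    return ratings

data = [row for pid in range(1222) for row in run_participant(pid)]
for mode in MODES:
    for acc in (True, False):
        cell = [r["credibility"] for r in data
                if r["mode"] == mode and r["accurate"] == acc]
        print(f"{mode:22s} accurate={acc}: mean={statistics.mean(cell):.2f}")
```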

The results might come as an unpleasant surprise to Wikipedians and the Wikimedia Foundation, which has consistently sought to present Wikipedia as a more reliable alternative to LLM-based tools like ChatGPT (see e.g. "In the media" in the current issue of the Signpost):

"As expected, credibility assessments were overall higher for accurate than for partially inaccurate information. In line with our predictions, we also found that presentation mode influenced credibility assessments in both experiments, with significantly higher credibility for the voice-based agent than the static text-based online encyclopedia. Additionally, in Experiment 1, credibility assessments were significantly higher for the voice-based agent than for the dynamic text-based agent, whereas this difference was not significant in Experiment 2. [...] Importantly, branding did not significantly moderate the effect of presentation mode on perceived information credibility. [... Overall, we] showed that information provided by voice- or dynamic text-based agents is perceived as more credible than information provided by a static-text based online encyclopedia."

"Information credibility by presentation mode [and] information accuracy" (figure 2a from the paper)

The researchers note that these results might be influenced by the fact that it is easier to discern factual errors on a static text page like a Wikipedia article than when listening to Alexa's spoken audio or watching ChatGPT's streaming, chat-like presentation:

"The most plausible interpretation for the observed pattern of results appears to be that both a modality effect (i.e., reading vs. listening) and an effect of conversational nature (i.e., conversational vs. non-conversational) work in parallel and in partially opposing ways: discernment between accurate and inaccurate information benefitted from reading (vs. listening) and from being presented in a non-conversational (vs. conversational) way. Because dynamic text-based agents combine both, higher discernment through reading and reduced discernment through the conversational nature, they score between voice-based agents (lower discernment through listening and conversational nature) and static text (higher discernment through reading and non-conversational nature)."

They point out that this interpretation is consistent with another recent experiment that found "no differences in perceived credibility of information between Wikipedia, ChatGPT, and an unbranded, raw text interface when the conversational nature of ChatGPT is made less salient" (see our review: "In blind test, readers prefer ChatGPT output over Wikipedia articles in terms of clarity, and see both as equally credible").

The authors offer some consolation in the form of an additional result (not part of the main, preregistered experiment):

"However, exploratory analyses yielded an interesting discrepancy between perceived information credibility when being exposed to actual information and global trustworthiness ratings regarding the three information search applications. Here, online encyclopedias were rated as most trustworthy, while no significant differences were observed between voice-based and dynamic text-based agents."

Besides information credibility, the experiment's main outcome, participants were also asked to rate several other aspects. For example, "Social presence" was gauged using questions such as "How much did you feel you were interacting with an intelligent being while reading the information/listening to the information?" Perhaps unsurprisingly, there was "lower perceived social presence for static text-based online encyclopedia entries compared to both voice-based agents and dynamic text-based agents." On the other hand,

"Contrary to our predictions, people felt higher enjoyment [measured using questions like "I found reading the information / listening to the information entertaining"] when information was presented as static or dynamic text compared to the voice-based agent, while the two text-based conditions did not significantly differ. In Experiment 2, we expected to replicate this pattern of results but found that people also felt higher enjoyment with the dynamic text-based agent than the static text."

Other recent publications

Other recent publications that could not be covered in time for this issue include the items listed below. Contributions, whether reviewing or summarizing newly published research, are always welcome.


Contrary to expectations, higher social integration of a new wiki community does not predict its long-term success

From the abstract:[2]

"We hypothesize that the conditions in which new peer production communities [such as wikis like Wikipedia] operate make communication problems common and make coordination and integration more difficult, and that variation in the structure of project communication networks will predict project success. [...] We assess whether communities displaying network markers of coordination and social integration are more productive and long-lasting. Contrary to our expectations, we find a very weak relationship between communication structure and collaborative performance. We propose that technology [such as wikis] may serve as a partial substitute for communication in coordinating work and integrating newcomers in peer production."

From the paper:

"we test whether early-stage peer production communities benefit from the same sorts of communication network structures as offline groups, using a dataset of 999 wiki communities gathered from Fandom (Wikia) in 2010. We create a network based on communication between members of each wiki and examine how well the structure of these networks predicts (1) how productive community members are in adding content to the wiki and (2) how long the community survives."

"Our findings about the relative unimportance of communication structure, combined with theories of stigmergic communication and coordination, suggest a possible tradeoff between social structure and project structure. When the structure of a project is explicit and tasks are straightforward, as in many early-stage peer production projects, there are few social interdependencies. Many simple coordination tasks can be performed through the wiki itself and thus do not require complex social structures. This theory suggests an explanation for findings in the peer production literature that projects tend to become more structured and hierarchical over time (Halfaker et al., 2013; Shaw & Hill, 2014; TeBlunthuis et al., 2018). In contrast with work groups, the work of a typical peer production project may be simpler in early stages. As projects grow and become more complex, it becomes more difficult to signal needs through the artifact and structured coordination is needed."

"Wikipedia's Race and Ethnicity Gap and the Unverifiability of Whiteness"

From the abstract:[3]

"Although Wikipedia has a widely studied gender gap, almost no research has attempted to discover if it has a comparable race and ethnicity gap among its editors or its articles. No such comprehensive analysis of Wikipedia's editors exists because legal, cultural, and social structures complicate surveying them about race and ethnicity. Nor is it possible to precisely measure how many of Wikipedia's biographies are about people from indigenous and nondominant ethnic groups, because most articles lack ethnicity information. While it seems that many of these uncategorized biographies are about white people, these biographies are not categorized by ethnicity because policies require reliable sources to do so. These sources do not exist for white people because whiteness is a social construct that has historically been treated as a transparent default. [...]. In the absence of a precise analysis of the gaps in its editors or its articles, I present a quantitative and qualitative analysis of these structures that prevent such an analysis. I examine policy discussions about categorization by race and ethnicity, demonstrating persistent anti-Black racism. Turning to Wikidata, I reveal how the ontology of whiteness shifts as it enters the database, functioning differently than existing theories of whiteness account for. While the data does point toward a significant race and ethnicity gap, the data cannot definitively reveal meaning beyond its inability to reveal quantitative meaning. Yet the unverifiability of whiteness is itself an undeniable verification of Wikipedia's whiteness."

"WhatTheWikiFact: Fact-Checking Claims Against Wikipedia"

From the abstract:[4]

"[We present] WhatTheWikiFact, a system for automatic claim verification using Wikipedia. The system can predict the veracity of an input claim, and it further shows the evidence it has retrieved as part of the verification process. It shows confidence scores and a list of relevant Wikipedia articles, together with detailed information about each article, including the phrase used to retrieve it, the most relevant sentences extracted from it and their stance with respect to the input claim, as well as the associated probabilities. The system supports several languages: Bulgarian, English, and Russian."

"Wikipedia context did not lead to measurable performance gains" for LLMs in biomedical tasks

From the abstract:[5]

"We participated in the 12th BioASQ challenge, which is a retrieval augmented generation (RAG) setting, and explored the performance of current GPT models Claude 3 Opus, GPT-3.5-turbo and Mixtral 8x7b with in-context learning (zero-shot, few-shot) and QLoRa fine-tuning. We also explored how additional relevant knowledge from Wikipedia added to the context-window of the LLM might improve their performance. [...] QLoRa fine-tuning and Wikipedia context did not lead to measurable performance gains."

"LLMs consistently hallucinate more on entities without Wikipedia pages"

From the abstract:[6]

"we introduce WildHallucinations, a benchmark that evaluates factuality. It does so by prompting LLMs to generate information about entities mined from user-chatbot conversations in the wild. These generations are then automatically fact-checked against a systematically curated knowledge source collected from web search. Notably, half of these real-world entities do not have associated Wikipedia pages. We evaluate 118,785 generations from 15 LLMs on 7,919 entities. We find that LLMs consistently hallucinate more on entities without Wikipedia pages and exhibit varying hallucination rates across different domains. Finally, given the same base models, adding a retrieval component only slightly reduces hallucinations but does not eliminate hallucinations."

From the "Analysis" section:

"Do models hallucinate more on non-Wikipedia knowledge? We also compare the factuality of LLMs on entities that have Wikipedia pages with those that do not.[...] We observe a significant decrease in WILDFACTSCORE-STRICT when recalling knowledge from sources other than Wikipedia for all eight models, with GPT-3.5 and GPT-4o exhibiting the largest drop. Interestingly, even though [the retrieval-augmented generation-based] Command R and Command R+ models perform web searches, they also exhibit lower factual accuracy when generating information from non-Wiki sources."

"Impact of Generative AI": A "significant decrease in Wikipedia page views" after the release of ChatGPT

From this abstract-only paper presented at last month's Americas Conference on Information Systems (AMCIS):[7]

"Although GenAI tools have made information search more efficient, recent research shows they are undermining and degrading engagement with online question and answer (Q&A)-based knowledge communities like Stack Overflow and Reddit [...]. We extend this stream of research by examining the impact of GenAI on the market value and quality of peer-produced content using [...] Wikipedia, which is different from Q&A-based communities mentioned above. We [...] extend empirical analyses focusing on ChatGPT’s release on November 30, 2022. We collect monthly Wikipedia page views and content (text) data for six months before and after the release date as the treatment group. We then collect data for same months a year before as the control group. The difference-in-difference (DID) analyses demonstrate significant decrease in Wikipedia page views (market value) after the release of ChatGPT. However, we found an increase in the quality of Wikipedia articles as evidenced by a significant increase in verbosity and readability of the articles after ChatGPT release. Our analyses have controlled for betweenness and closeness centrality of the articles, and article, year-month, and article category fixed-effects. We will extend this research by finding the mechanisms underlying the impact of GenAI on online knowledge repositories. Further, we plan to conduct detailed analyses to examine the impact of GenAI on knowledge contributors."

See also our review of a different paper addressing the same question: "ChatGPT did not kill Wikipedia, but might have reduced its growth"

See also in the current issue of the Signpost: "AI policy positions of the Wikimedia Foundation"

References

  1. Anderl, Christine; Klein, Stefanie H.; Sarigül, Büsra; Schneider, Frank M.; Han, Junyi; Fiedler, Paul L.; Utz, Sonja (2024-07-25). "Conversational presentation mode increases credibility judgements during information search with ChatGPT". Scientific Reports 14 (1): 17127. ISSN 2045-2322. doi:10.1038/s41598-024-67829-6. Preregistration and experiment materials.
  2. Foote, Jeremy; Shaw, Aaron; Hill, Benjamin Mako (2023-05-01). "Communication networks do not predict success in attempts at peer production". Journal of Computer-Mediated Communication 28 (3): zmad002. ISSN 1083-6101. doi:10.1093/jcmc/zmad002.
  3. Mandiberg, Michael (2023-03-01). "Wikipedia's Race and Ethnicity Gap and the Unverifiability of Whiteness". Social Text 41 (1): 21–46. ISSN 0164-2472. doi:10.1215/01642472-10174954. Freely available archived version.
  4. Chernyavskiy, Anton; Ilvovsky, Dmitry; Nakov, Preslav (2021-10-30). "WhatTheWikiFact: Fact-Checking Claims Against Wikipedia". Proceedings of the 30th ACM International Conference on Information & Knowledge Management. CIKM '21. New York, NY, USA: Association for Computing Machinery. pp. 4690–4695. ISBN 9781450384469. doi:10.1145/3459637.3481987. Preprint version: Chernyavskiy, Anton; Ilvovsky, Dmitry; Nakov, Preslav (2021-10-10), WhatTheWikiFact: Fact-Checking Claims Against Wikipedia, arXiv, doi:10.48550/arXiv.2105.00826.
  5. Ateia, Samy; Kruschwitz, Udo (2024-07-18), Can Open-Source LLMs Compete with Commercial Models? Exploring the Few-Shot Performance of Current GPT Models in Biomedical Tasks, arXiv, doi:10.48550/arXiv.2407.13511 
  6. Zhao, Wenting; Goyal, Tanya; Chiu, Yu Ying; Jiang, Liwei; Newman, Benjamin; Ravichander, Abhilasha; Chandu, Khyathi; Bras, Ronan Le; Cardie, Claire; Deng, Yuntian; Choi, Yejin (2024-07-24), WildHallucinations: Evaluating Long-form Factuality in LLMs with Real-World Entity Queries, arXiv, doi:10.48550/arXiv.2407.17468 
  7. Singh, Vivek; Velichety, Srikar; Li, Sen (2024-08-16). "Impact of Generative AI on the Value of Peer Produced Content - Evidence from Wikipedia". AMCIS 2024 TREOs.  (abstract only)
