Research:Large Language Models (LLMs) Impact on Wikipedia's Sustainability

Created: 14:53, 23 July 2024 (UTC)
Duration: July 2024 – April 2025
LLMs, AI, Wikipedia

This page documents a research project in progress.
Information may be incomplete and change as the project progresses.
Please contact the project lead before formally citing or reusing results from this page.


Purpose

The study aims to explore how Large Language Models (LLMs) are trained on Wikipedia, and how their use in AI-powered chatbots, such as OpenAI's ChatGPT, Microsoft's Copilot, and Google's Gemini, affects the sustainability of Wikipedia as a crowd-sourced project, while also raising concerns about information literacy and exploitative digital labor.

Brief background

"Wikipedia, as a collaboratively edited and open-access knowledge archive, provides a rich dataset for training Artificial Intelligence (AI) models (Deckelmann, 2023; Schaul et al., 2023; McDowell, 2024) and enhancing data accessibility. However, reliance on this crowd-sourced encyclopedia raises ethical concerns regarding data provenance, knowledge production and curation, and digital labor. This research critically examines the use of Wikipedia as a training set for Large Language Models (LLMs), focusing on the ethical implications for data integrity, information accessibility, and cultural representation. Drawing from critical data studies (Boyd & Crawford, 2012; Iliadis & Russo, 2016), feminist posthumanism (Haraway, 1988, 1991), and recent critiques of Wikidata’s ethics (McDowell & Vetter, 2024; Zhang et al., 2022), this study explores potential biases and power dynamics in Wikipedia’s data curation processes and its use in LLMs. A mixed-methods approach is employed, including content analysis of case studies where LLMs have been trained on Wikipedia and interviews with key stakeholders such as computer scientists, journalists, and Wikimedia Foundation staff.

Methods

This IRB-approved study (Log no: 24-072-IUP) employed the problem-centered expert interview (Döringer 2020) to investigate a complex, current issue: the relationship between Wikipedia and Large Language Models (LLMs) as it pertains to sustainability, information access and literacy, problematic information, and ethical concerns. Problem-centered expert interviews, according to Döringer (2020), combine two long-standing approaches to qualitative research, namely the broader “theory-generating expert interview” (Bogner and Menz 2009, 2018) and the “problem-centered interview” (Murray 2016; Shirani 2015; Witzel 1982, 2000). Döringer (2020) argues that “the combination of these epistemological perspectives serves as a promising starting point for moving beyond the experts’ role as representatives and taking into account their personal opinions and experiences” (p. 269). Important to the problem-centered expert interview, for Döringer (2020), are seven (7) features and/or processes: 1) the definition and discussion of the meaning of “expert,” 2) distinguishing “different types of expert knowledge,” 3) the goal of “inductive theory development,” 4) emphasis on “individual perspective,” 5) the use of a “specific interview design and set of questions,” 6) the capacity for comparing results, and 7) the introduction of “inductive-deductive theory building,” as shown in Table 1.

Table 1: Elements of the problem-centred expert interview (Döringer, 2020)
  • Theory-generating expert interview: defines and discusses the term “expert”; distinguishes different types of expert knowledge; aims at inductive theory development.
  • Problem-centred interview (PCI): highlights the individual perspective; provides a specific interview design and set of questions; enables comparability of gathered data.
  • Combined (problem-centred expert interview): proposes inductive-deductive theory building.

For the purposes of our study, we use the term “expert” to indicate an individual with highly specialized knowledge of, and interest in, the overlapping relationships between Wikipedia (already a very specialized subject in and of itself) and LLMs. Yet we also differentiate between varying types of expert knowledge, acknowledging that both the interaction and relationships of Wikipedia with LLMs may be understood across various domains (computer science and natural language processing, yes, but also law, economics, and new media studies). Accordingly, expert knowledge on the topic may be expressed as it concerns technical, social, cultural, economic, educational, or other dimensions, given the rapid acceleration of LLMs and their impact on a broader number of concerns. The interview instrument (Appendix A) comprised eight (8) interview questions, and the procedure itself was semi-structured to allow for follow-up questions and/or side discussions between interviewee and researchers. In limiting the number of participants to six (6), this study sought to enable comparability of the gathered data, in order to note when and where experts independently converged (or diverged). The methodology overall enabled both deductive and inductive theory building on topics related to the intersection of LLMs and Wikipedia, while providing qualitative data to support and contextualize previous research (Anderl et al. 2024; Ashkinaze et al. 2024; McDowell 2024; Huang and Chang 2023; Huang and Siddarth 2023).
Recruitment

Expert participants for this study (N=6) were invited to participate based on the researchers’ previous knowledge of their professional work and expertise in machine learning, Wikimedia projects, LLMs, and data science. Accordingly, all participants were both highly educated and well versed in the issues at hand, with particular insights and/or insider knowledge. While the majority of experts were emailed directly to solicit an interview, a few responded to a broader call for participants posted to the Wikimedia research listserv (wiki-research-l@lists.wikimedia.org). As part of the informed consent process, and to best accommodate their professional schedules, prospective participants were given the option of a synchronous video-conference interview (conducted in Zoom) or an asynchronous format conducted via email and shared cloud document (Google Docs). Participants were split evenly between the synchronous and asynchronous formats. Participants did not receive any incentive to participate in this study.

Interview

The study will utilize semi-structured interviews conducted via Zoom videoconferencing or email, depending on participant preference. Participants will be asked questions about their understanding of the relationship between Wikipedia and Large Language Models like ChatGPT. Interviews will last approximately 30–60 minutes, depending on the depth of responses. Total participation, including email communication and the IRB consent process, will be under 90 minutes. The IRB consent form will be sent to participants as part of the recruitment email. Zoom video recordings will be stored in the PI's institutional account, password-protected, and will automatically expire after 120 days. Only the interview transcripts will be downloaded, and these will be stored, password-protected, on the PI's personal computer.

Instruments

  1. Informed consent: https://docs.google.com/document/d/1vcO5zZEcZs4a37O1XvVIsSU76SWA_oxG/edit?usp=sharing&ouid=113182016009423657566&rtpof=true&sd=true
  2. Interview questions: https://docs.google.com/document/d/1SKItfnX0MHQHb0sl2N2tJsx2sPWO4LNRzpr5kYi9TkU/edit?usp=sharing

Subject selection

Subject selection is based on the PI's knowledge of individuals who work at or have expertise in the intersection of Wikipedia and Large Language Models. These individuals include data scientists, computer scientists, journalists, and product designers at the Wikimedia Foundation. The PI will email each potential participant individually to ask if they are willing to participate in the interview study. An informed consent document will be included in the same email.

Participant inclusion criteria

Our inclusion criteria include:

1) Computer scientists, researchers, product designers, or journalists with prior experience or insight into Large Language Models (LLMs), machine learning, or Wikipedia/Wikimedia.

2) English-speaking participants.

Timeline

  • Interviews: July 25 – August 15, 2024
  • Analysis: August 15 – September 5, 2024
  • Drafting: August 20 – September 16, 2024
  • Article submission: September 16, 2024
  • Article revision: November 2024 – January 2025

Policy, Ethics and Human Subjects Research

THIS PROJECT HAS BEEN APPROVED BY THE INDIANA UNIVERSITY OF PENNSYLVANIA INSTITUTIONAL REVIEW BOARD FOR THE PROTECTION OF HUMAN SUBJECTS (PHONE 724-357-7730).

Confidentiality and privacy

Research subjects will have the option to remain anonymous or be named in the research article produced as part of this study. If a subject wishes to be named, they will be identified by their name and professional title in the article.

If a subject prefers not to be named, they will have the opportunity to choose a pseudonym and will be described by their profession (e.g., a data scientist working for a major tech company).

During the data collection and analysis process, all subjects' identities will be kept confidential. The PI will know their identities, but this information will not be disclosed to anyone outside the research project. Data will be collected via Zoom, and recordings will be stored in the Zoom cloud for 120 days with password protection. The recordings will not be retained beyond this period. If the IRB requires data to be stored for the typical five years, it will be downloaded to the PI's password-protected personal laptop. The transcript of the Zoom session will be edited to remove participants' names and will also be stored on the PI's password-protected laptop.

For subjects choosing the email interview format, their data will remain in the password-protected email client until the end of the study, at which point it will be deleted.

Results

Key Finding 1: Wikipedia plays a significant role in the training of LLMs, but the exact process and its value remain unclear.

Interviewees unanimously agreed that Wikipedia contributes significantly to the training and fine-tuning of Large Language Models (LLMs) (E1-E6). Many experts emphasized that Wikipedia content is likely given more weight during the training process: participants noted that Wikipedia's open license and perceived quality likely grant it greater influence, and that it is a central component of the datasets that underpin popular models like ChatGPT and Gemini. As one expert stated, “My understanding is that Wikipedia is intentionally given a much higher rate than many other sources. Wikipedia probably unintentionally gets an even higher weight because it’s actually copied inside the web corpus several times” (E1). This prominence, coupled with its widespread availability, could lead to Wikipedia's overrepresentation in LLM training data (E1). Using a vivid metaphor, another interviewee described Wikipedia's integration into the vast corpus of training data for LLMs: “The popular, non-technical analogy is that the training data for an LLM is like a giant hairball. Wikipedia becomes part of the hairball because it is openly licensed content” (E2). This implies that LLMs do not differentiate between information sourced from Wikipedia and other sources, complicating users' ability to trace the origin of generated information.
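To make the experts' point about weighting more concrete, the minimal sketch below shows one way a corpus assembler could up-weight an openly licensed source such as Wikipedia when sampling training documents. The source names, weights, and the sample_training_documents function are illustrative assumptions only; the actual mixtures and weights used by commercial LLM developers are not publicly documented (E1, E3).

    import random

    # Hypothetical source weights for assembling a pre-training mixture.
    # These numbers are assumptions for illustration, not reported values.
    SOURCE_WEIGHTS = {
        "wikipedia": 3.0,   # up-weighted: openly licensed, curated, perceived as high quality
        "web_crawl": 1.0,   # general web text, which may already contain copies of Wikipedia
        "books": 2.0,
    }

    def sample_training_documents(corpora, n_docs):
        """Sample documents for a training mixture, favoring higher-weighted sources."""
        sources = list(corpora)
        weights = [SOURCE_WEIGHTS.get(source, 1.0) for source in sources]
        batch = []
        for _ in range(n_docs):
            source = random.choices(sources, weights=weights, k=1)[0]
            batch.append(random.choice(corpora[source]))
        return batch

    # Example: Wikipedia documents are drawn roughly three times as often as web-crawl documents.
    corpora = {"wikipedia": ["doc_w1", "doc_w2"], "web_crawl": ["doc_c1"], "books": ["doc_b1"]}
    print(sample_training_documents(corpora, 5))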

While Wikipedia is undoubtedly valuable, its predominant use in model training without clear attribution highlights the need for greater transparency regarding how LLMs manage and prioritize various sources. Another expert pointed out how Wikipedia content is processed before being fed into LLMs: “The content of Wikipedia is surely being ‘cleaned’ (of some metadata) and fed into the language models that underlie ChatGPT and Gemini” (E4). As a curated source, Wikipedia can be optimized for language models by removing irrelevant metadata, enhancing its suitability for training. However, the lack of transparency about data processing raises concerns regarding the information fed into LLMs (E4). Other experts noted that, while the exact process remains unclear, a general procedure can be inferred: “In practical terms, generally…what people are doing is they're throwing huge amounts of corpus at these models and then trying to…clean up and redirect it afterwards. So I would suspect that they would throw the entire corpus of Wikipedia at the model, but then they might tune based on…quality assessments. But yeah, it's hard for me to say, because they don't…generally communicate about these things. But in theory, this should be likely and effective” (E3). This comment suggests that Wikipedia's content might be prioritized during later stages of model refinement due to its quality standards. However, the overall opacity surrounding LLM training practices leaves many uncertainties.
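As an illustration of what such “cleaning” might look like in practice, the sketch below strips common wikitext markup and metadata (templates, reference footnotes, link syntax) from an article before it would be added to a plain-text training corpus. This is a minimal, regex-based example written for this page; it is an assumption about the general kind of preprocessing involved, not a description of any company's actual pipeline.

    import re

    def clean_wikitext(wikitext):
        """Roughly reduce raw wikitext to plain prose for a training corpus."""
        text = re.sub(r"\{\{[^{}]*\}\}", "", wikitext)                  # drop (non-nested) templates and infobox metadata
        text = re.sub(r"<ref[^>]*>.*?</ref>", "", text, flags=re.S)     # drop reference footnotes
        text = re.sub(r"<ref[^>]*/>", "", text)                         # drop self-closing reference tags
        text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]+)\]\]", r"\1", text)   # [[target|label]] -> label
        text = re.sub(r"={2,}\s*([^=]+?)\s*={2,}", r"\1", text)         # == Heading == -> Heading
        text = re.sub(r"'{2,}", "", text)                               # remove bold/italic markup
        return re.sub(r"\n{3,}", "\n\n", text).strip()

    sample = "'''Wikipedia''' is a [[free content|free]] online encyclopedia.<ref>citation</ref>"
    print(clean_wikitext(sample))  # -> Wikipedia is a free online encyclopedia.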

Key Finding 2: LLMs act as intermediaries between users and original knowledge sources, often reducing information quality and perpetuating biases, while lacking transparency and proper citation.

Although not all expert interviewees used the term “dis/intermediation,” they discussed how LLMs function as intermediaries between end users and original knowledge sources, negatively impacting information access and literacy (E1-E6). One expert succinctly compared LLMs to Google’s knowledge graph, stating, “LLM applications bring even stronger (dis-)intermediation than the Google Knowledge Panel because they are heavily customized to the question being asked” (E1). This disintermediation can lead to Wikipedia being bypassed altogether, resulting in diminished information quality, whether through oversimplification via shortened summaries or more concerning inaccuracies.

LLMs are particularly prone to “hallucinations,” generating plausible-sounding but inaccurate or unverified information. As one expert noted, “The amount of misinformation coming into the system through this channel is considerably higher than it used to be” (E1). The risks associated with misinformation, compounded by a lack of direct source access, raise questions about the reliability of knowledge produced by LLMs.

Furthermore, this disintermediation widens the gap between users and the original sources of information and research. While LLMs can provide answers to user queries, they often fail to cite their sources: “LLMs often do not cite a source in their responses. Without provenance, it is difficult for the user to determine the veracity of the information” (E2). LLMs trained on Wikipedia might answer a query, but users have no access to the original source of information, the secondary source cited in Wikipedia, or even Wikipedia itself, which acts as a tertiary source. Consequently, LLMs serve as quaternary sources, three steps removed from the original production of knowledge.

One interviewee described this distance between original sources and their consumption: “The distance between the source, both in computing technology and original research, and its consumption is like a concerning gap” (E3). LLMs, especially those trained on publicly available tertiary content like Wikipedia, can negatively impact information accuracy and further disintermediate users from the original knowledge creation process (E3). This lack of citation may threaten users’ ability to critically evaluate the origin or accuracy of the information, creating a barrier between Wikipedia users and knowledge.


Key Finding 3: Wikipedia’s sustainability is threatened by LLMs’ negative impact on the digital commons, Wikipedia discoverability, community engagement, and disintermediation.

If LLMs are acting as intermediaries and directing traffic away from the actual encyclopedia (while relying on training data from the encyclopedia), how might this development affect Wikipedia’s long-term sustainability? To address this, we also asked our interviewees about the challenges that LLMs and their applications might pose to Wikipedia’s long-term sustainability and maintenance.

The interviewees expressed concerns about the sustainability of Wikipedia in the age of LLMs, citing negative impacts on the digital commons, discoverability, community participation and engagement (attracting new editors), and disintermediation. The risk of a shrinking open environment could isolate Wikipedia and hinder its collaborative nature (E5). Users may rely on LLMs for quick consultations, bypassing Wikipedia and reducing opportunities for content improvement and community engagement (E5). All of the interviewees warned that LLMs could diminish the discoverability of Wikipedia, leading to decreased donations and editorial contributions (E1-E6). One expert recommended that Wikipedia position itself as a crucial resource for training LLMs as a way to attract new contributors, but also noted the risk of LLMs overshadowing human-generated content (E2). Additional emphasis was placed on the importance of maintaining Wikipedia’s feedback loop, where readers become contributors, and on caution against tools that replace rather than support Wikipedians (E3). Because disintermediation could undermine the motivation for community engagement, there is a need for targeted outreach via WikiProjects and campaigns to foster a diverse and engaged editor community (E6). The same expert also stressed the necessity of making sources easier to work with to ensure high-quality content and suggested integrating AI-supported content with traditional human-written content to enhance accessibility (E6). Ultimately, the sustainability of Wikipedia depends on continuous experimentation and technical support to adapt to a digital landscape disrupted by emerging generative AI and LLM tools (E1, E6).

A related danger, though only expressed by one participant, is the potential for a competitor to emerge, using LLMs to create personalized content, thereby drawing users away from Wikipedia and undermining its foundational community. Wikipedia’s unique, non-profit model is crucial for its survival, as it deters commercial competitors from attempting to replace it (E1). Once lost, Wikipedia’s collaborative and comprehensive knowledge base would be nearly impossible to recreate, given the historical and communal efforts that built it (E1). This underscores the importance of maintaining Wikipedia’s role as a primary knowledge source to prevent the erosion of its community and the valuable content it provides.

Key Finding 4: The use of Wikipedia as LLM training data involves ethical problems related to contributor expectations, the risk of depleting the commons, and exacerbation of linguistic and cultural inequities.

Interview participants were asked to respond to the following question regarding ethical concerns: “In your opinion, what ethical problems or issues, if any, emerge in terms of the relationship between Wikipedia and its use as training data for LLMs?” All but one interviewee agreed that this relationship constituted an ethical problem, and responses were categorized into the following themes: contributor expectations, risks to the digital commons, and linguistic and cultural inequities. There is agreement among expert interviewees that Wikipedia contributors never intended for their content to be used by machine learning models (E2, E4). “The fundamental problem,” as one expert puts it, “is that users that would have been quite happy to provide their content to other humans, are not necessarily happy to have their content fed to [machine learning] model. That is, when determining licensing rights, it seems that the current body of law makes the glaring omission of not mentioning, in the license, the expected and intended audience, at the time, for the licensing” (E4). Another interviewee echoes this sentiment, noting that many Wikipedians feel it is unfair that their unpaid work is used by big tech companies to generate profit: “The ethical problem that I hear about most frequently from Wikipedians is that the situation doesn’t seem fundamentally fair. The editors produce this content without compensation, it is openly licensed, and then these big tech companies make so much money from LLMs” (E2).

Another central concern among our expert interviewees (EIs) is that the overuse of digital commons content for training LLMs can deplete the commons by exhausting available resources and discouraging contributors who feel their work is exploited without recognition or compensation (E2, E5). The current AI race, with multiple tech companies competing to develop and fine-tune LLMs, further exacerbates this issue, as does the lack of attention to reciprocity with (or giving back to) the commons (E5). There is an ethical obligation to give back to the commons in proportion to what is extracted, stressing the importance of maintaining the sustainability of these shared resources (E5). Other participants concur with the need for giving back, suggesting that human-generated content will become increasingly valuable as it becomes rarer (E2).

Expert interviewees also expressed significant concerns regarding the ethical implications of LLMs for linguistic and cultural (in)equities, especially when it comes to access and representation (E1, E3, E5, E6). Because Wikipedia already relies on and extends English as a dominant language, training LLMs on this data risks exacerbating existing gaps in access to technology and the Internet, particularly for speakers of less dominant languages. LLM support for multiple languages is limited by the high costs of running these models, which raises questions about scalability and inclusivity (E5). LLMs, like Wikipedia, rely heavily on digitized documents, which exist mostly in dominant languages (E3). This reliance can marginalize cultures with less digital documentation, potentially leading to cultural erasure (E3). As one expert states, “[T]here's a concern around equity — leaving people behind or forcing people to [use] languages that [are not their] native languages. They are the languages of the colonizers” (E5). To make matters worse, LLMs perform well with widely documented languages but struggle with less common ones, further entrenching systemic biases (E3). Ultimately, the language modeling community urgently needs to address these challenges to prevent long-term consequences and ensure broader language coverage and representation (E6).

Key Finding 5: Ethical concerns may be partially addressed via systemic changes to market incentives and license models, financial contributions to Wikipedia from big tech, and technical solutions related to data provenance and attribution.

While experts did not unanimously agree that the use of Wikipedia as training data constitutes an ethical issue, a majority both identified ethical issues and proposed possible solutions, suggesting a variety of fixes related to licensing, market incentives, LLM explainability, and data provenance. On a broader scale, there is a need for a radical rethinking of market incentives and licensing models to ensure the sustainability of the digital commons (E5). One expert references Larry Lessig’s work on redesigning market incentives (Lessig, 2022), arguing that profit maximization should not be the sole reward mechanism (E5). Wikipedia has long thrived on the altruism of its volunteer contributors, but that model is endangered by an LLM economy in which digital commons content is extracted and exploited beyond the expectations of its original creators, and without respecting CC BY-SA licensing. In contrast to this emphasis on market incentives, another interviewee calls for immediate financial contributions from big tech as a necessary step to support the commons (E2). “Big tech should contribute to the project,” this expert notes, “but it is very important that big tech does not itself have any editorial influence” (E2).

The role of Wikipedia in this context is also a point of contention. While Wikipedia can contribute to the broader open-source movement, it is not solely responsible for solving the open culture challenge (E5). An online encyclopedia’s primary role is not to address these issues of open source and open culture, although it can play a supportive role (E5). One way that LLMs might address issues related to information literacy loss among users, for example, is the addition of explainability measures, which were frequently referenced by one expert. Such explainability would ensure that LLMs explain to users how and where they retrieved certain information or outputs (E3). Noting the opportunities in training LLMs to express “chain of thought,” this expert described how LLMs might showcase “processes that would probably look very familiar to Wikipedia and information literacy processes” (E3). If developers focused less on the speed of outputs and emphasized “quality and information literacy instead,” we might end up with a model that is “able to talk to you about what it’s doing and what it’s thinking” (E3). Finally, technical solutions related to data provenance and attribution, such as ensuring LLMs include citations, are necessary to maintain the integrity of the commons (E2). Going forward, human-generated content will be considered even more valuable as it becomes increasingly rare (E2). While there is a shared concern about the depletion of the commons and the need for giving back, the participants differ in their approaches to addressing these issues, with some experts advocating for systemic changes to market incentives and licensing models, while others emphasize immediate financial contributions and technical solutions to maintain the integrity of the commons.
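To sketch what provenance and attribution could look like at the application level, the hypothetical example below keeps source metadata (title, URL, license) attached to the passages an answer draws on and appends explicit citations to the output. The Passage structure and answer_with_citations helper are illustrative assumptions, not features of any existing LLM product.

    from dataclasses import dataclass

    @dataclass
    class Passage:
        text: str
        source_title: str   # e.g., the Wikipedia article the passage came from
        source_url: str
        license: str        # e.g., "CC BY-SA 4.0"

    def answer_with_citations(answer, passages):
        """Append a provenance trail so readers can trace claims back to their sources."""
        citations = "\n".join(
            f"[{i}] {p.source_title} ({p.license}) - {p.source_url}"
            for i, p in enumerate(passages, start=1)
        )
        return f"{answer}\n\nSources:\n{citations}"

    # Hypothetical usage: a generated answer paired with the passage it drew on.
    passages = [
        Passage(
            text="Large language models are trained on large text corpora, including Wikipedia.",
            source_title="Large language model - Wikipedia",
            source_url="https://en.wikipedia.org/wiki/Large_language_model",
            license="CC BY-SA 4.0",
        )
    ]
    print(answer_with_citations("LLMs are trained in part on Wikipedia content.", passages))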

Key Finding 6: Systemic biases in LLMs, which can be inherited from sources like Wikipedia, are inevitable, but can be mitigated via proactive efforts to diversify communities and content in the digital commons.

To explore the possibility of systemic biases in LLMs, we asked expert interviewees whether they believe these biases, which have been observed in Wikipedia, could also manifest in LLMs, and if they could provide any examples of such occurrences. EIs collectively highlighted the pervasive issue of systemic biases in LLMs and their potential perpetuation from sources like Wikipedia (E1-E6). As one expert stated, “Yes, there is a risk of these systemic biases being perpetuated in LLMs. To the extent there are systemic biases in Wikipedia, or the broader media landscape, then it is likely that the LLMs will be trained on these same biases” (E6).

While the encyclopedia itself has improved (and can continue to improve) through proactive efforts by Wikipedia editors to address bias via dedicated task forces, there are inherent biases due to limited content in various languages (E5). Such biases are inevitable, reflecting the human biases of contributors, and actively including more diverse communities can mitigate these effects (E5). Additional EIs affirmed the risk of systemic biases in LLMs (E2, E4), with one pointing out that these biases are likely to be inherited from the media landscape (E2). Another expert discussed the dominance of Western documentation practices in Wikipedia, which can marginalize non-Western knowledge systems, and underscored the need for diverse sources to avoid cultural erasure. As noted by this expert, “Western culture has really strongly adopted this whole documentary practice around knowledge, and that fits with Wikipedia. But there’s all sorts of knowledge all around the world that aren’t documented in familiar ways, or maybe aren’t documented” (E3). The question of language diversity further compounds the problem: “A lack of language coverage (and therefore perspectives from these other language communities) is probably the most concerning aspect of bias to me with these models” (E6). Although Wikipedia does better than much of the internet in offering multilingual content, significant linguistic gaps still exist, especially for underrepresented languages and communities. This lack of linguistic diversity in Wikipedia is mirrored in LLMs, which are disproportionately trained on dominant languages and underrepresent non-Western knowledge systems (E6). Finally, one of the most obvious examples of systemic bias in LLMs is found in translation systems (E1). Overall, biases in training data are almost certain to appear in LLMs unless explicit efforts are made to counteract them (E1).

Resources


References

  • boyd d, Crawford K (2012) Critical questions for big data: Provocations for a cultural, technological, and scholarly phenomenon. Information, Communication & Society 15:662–679. https://doi.org/10.1080/1369118X.2012.678878
  • Evenstein Sigalov S, Nachmias R (2017) Wikipedia as a platform for impactful learning: a new course model in higher education. Education and Information Technologies 22(6): 2959–2979.
  • Ford H (2022) Writing the revolution: Wikipedia and the survival of facts in the digital age. The MIT Press, Cambridge, Massachusetts
  • McDowell ZJ (2024) Wikipedia and AI: Access, representation, and advocacy in the age of large language models. Convergence: The International Journal of Research into New Media Technologies 30:751–767. https://doi.org/10.1177/13548565241238924
  • McDowell ZJ, Vetter MA (2021) Wikipedia and the representation of reality, 1st edn. Routledge, New York, NY
  • McDowell Z, Vetter M (2022b) Fast “truths” and slow knowledge; oracular answers and Wikipedia’s epistemology. Fast Capitalism 19(1): 104–112.
  • McDowell Z, Vetter M (2024) The Re-alienation of the commons: Wikidata and the ethics of “free” data. International Journal of Communication 18: 590–608.