Research:Large Language Models (LLMs) Impact on Wikipedia's Sustainability

Created: 14:53, 23 July 2024 (UTC)
Duration: July 2024 – April 2025
Keywords: LLMs, AI, Wikipedia

This page documents a research project in progress.
Information may be incomplete and change as the project progresses.
Please contact the project lead before formally citing or reusing results from this page.


Purpose

The purpose of this study is to learn more about how Large Language Models (LLMs) are trained on Wikipedia, and how their use in AI-powered chatbots such as OpenAI's ChatGPT, Microsoft's Copilot, or Google's Gemini impacts the sustainability of Wikipedia as a crowd-sourced project and introduces issues related to information literacy and exploitative digital labor.

Brief background

Wikipedia, as a collaboratively edited and open-access knowledge archive, provides a vast and rich dataset for training Artificial Intelligence (AI) applications and models (Deckelmann, 2023; Schaul et al., 2023; McDowell, 2024) and for making the data within the encyclopedia more accessible. However, such reliance on the crowd-sourced encyclopedia introduces numerous ethical issues related to data provenance, knowledge production and curation, and digital labor. This research critically examines the use of Wikipedia as a training set for Large Language Models (LLMs) specifically, addressing the implications of this practice for data ethics, information accessibility, and cultural representation. Drawing on critical data studies (boyd & Crawford, 2012; Iliadis & Russo, 2016), feminist posthumanism (Haraway, 1988, 1991), and recent critical interrogations of Wikidata’s ethics (McDowell & Vetter, 2024; Zhang et al., 2022), this study explores the potential biases and power dynamics embedded in the data curation processes of Wikipedia and its subsequent use in LLMs. Our research employs a mixed-methods approach, including content analysis of specific case studies where LLMs have been trained using Wikipedia, and interviews with key stakeholders, including computer scientists, journalists, and Wikimedia Foundation (WMF) staff.

Methods

Interview

The method of study will be a semi-structured interview conducted over Zoom videoconferencing software or via email exchange. The interview subjects will be asked questions related to their understanding of the relationship between Wikipedia and Large Language Models like ChatGPT. Interviews will last approximately 30–60 minutes depending on the subjects' responses. Total participation in the study will be under 90 minutes, including e-mail communication and IRB consent. The IRB consent form will be emailed to the subjects as part of the recruitment email. Zoom video recordings will be stored in the PI's password-protected institutional account and allowed to expire after 120 days. Only the transcript will be downloaded, and it will be password-protected on the PI's laptop computer.

Instruments

  1. Informed consent: https://docs.google.com/document/d/1vcO5zZEcZs4a37O1XvVIsSU76SWA_oxG/edit?usp=sharing&ouid=113182016009423657566&rtpof=true&sd=true
  2. Interview questions: https://docs.google.com/document/d/1SKItfnX0MHQHb0sl2N2tJsx2sPWO4LNRzpr5kYi9TkU/edit?usp=sharing

Subject selection

Subject selection is guided by the PI's knowledge of individuals who either work in or have knowledge of the intersection of Wikipedia and Large Language Models. These individuals are data and computer scientists, journalists, and product designers at the Wikimedia Foundation. Each potential subject will be e-mailed individually by the PI and asked whether they would be willing to participate in the interview study. They will be provided with an informed consent document in the same email.

Participant inclusion criteria

Our inclusion criteria are: 1) computer scientists, researchers, product designers, or journalists with some previous experience and insight into LLMs, machine learning, or Wikipedia/Wikimedia; and 2) English-speaking participants.

Timeline

  • Interviews: July 25 – August 15, 2024
  • Analysis: August 15 – September 5, 2024
  • Drafting: August 20 – September 16, 2024
  • Article submission: September 16, 2024
  • Article revision: November 2024 – January 2025

Policy, Ethics and Human Subjects Research

This project has been approved by the Indiana University of Pennsylvania Institutional Review Board for the Protection of Human Subjects (phone: 724-357-7730).

Confidentiality and privacy

Research subjects will be given the choice to remain anonymous or be named in the research article created as part of this study. If a research subject wishes to be named, they will be identified by their name and professional title in the produced research article.

If the subject does not want to be named in the research article, they will be given the chance to choose a pseudonym and be described by their profession (e.g., a data scientist working for a major tech company).

During the data collection and analysis process, all subjects' identities will be kept confidential: the PI will know their identities, but they will not be identifiable to anyone outside the research project. Data storage will take place via Zoom, and the recording will be stored in the Zoom cloud for 120 days, protected by password. The Zoom recording will not be kept past 120 days. If the IRB requires that data be kept the typical 5 years, it will be downloaded to the PI's personal laptop, which is password protected. The transcript of each Zoom session will be downloaded and edited to remove the participant's name. It will also be stored on the PI's password-protected laptop computer.

If the subject chooses an email interview format, their data will remain in the password-protected email client until the end of the study, at which point it will be deleted.

Results

Key finding 1: Wikipedia plays a significant role in the training of LLMs, but the exact process and the weight it is given are unclear.

There is a clear consensus among the interviewees that Wikipedia plays a significant role in training and fine-tuning LLMs (E1-E6). Many expert participants emphasized that due to its open license and perceived quality, Wikipedia content is likely given more weight (value) during the training process. For instance, the research participants noted that Wikipedia constitutes a central part of the dataset that underpins popular models like ChatGPT and Gemini. Wikipedia may be weighted more heavily than other sources in LLM training, intentionally and unintentionally (E1). According to one expert, “My understanding is that Wikipedia is intentionally given a much higher rate than many other sources. Wikipedia probably unintentionally gets an even higher weight because it’s actually copied inside the web corpus several times likely.” (E1). The prominence of Wikipedia as a training source, combined with its widespread availability across different online platforms, could lead to its overrepresentation in LLM training data (E1). Using a vivid metaphor, another interviewee drew attention to the way in which Wikipedia is integrated into the vast corpus of training data for LLMs: “The popular, non-technical analogy is that the training data for an LLM is like a giant hairball. Wikipedia becomes part of the hairball because it is openly licensed content.” (E2) His comment implies that LLMs do not distinguish whether a piece of information originates from Wikipedia or another source, which complicates the user’s ability to trace the origin of the information generated by the model.
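
The weighting schemes these experts refer to are not publicly documented. As a purely illustrative sketch of the intuition, the following Python snippet samples training examples from a hypothetical mixture in which Wikipedia is assigned a higher weight than other sources; the source names, weights, and proportions are invented and do not describe any actual model.

```python
import random

# Hypothetical training mixture: each source gets a sampling weight. The source
# names and weights are invented for illustration; real LLM data mixtures are
# proprietary and undocumented.
CORPUS_WEIGHTS = {
    "wikipedia": 3.0,   # upsampled: openly licensed, perceived as high quality
    "web_crawl": 1.0,   # may itself contain Wikipedia mirrors (accidental duplication)
    "books": 1.5,
}

def sample_source(weights: dict) -> str:
    """Pick the source of the next training example, proportional to its weight."""
    sources = list(weights)
    return random.choices(sources, weights=[weights[s] for s in sources], k=1)[0]

# Estimate how often Wikipedia would be drawn under these illustrative weights.
draws = [sample_source(CORPUS_WEIGHTS) for _ in range(100_000)]
print(draws.count("wikipedia") / len(draws))  # roughly 0.55 (= 3.0 / 5.5)
```

Under a scheme like this, duplicated copies of Wikipedia inside the web crawl would effectively raise its share further, which is the unintentional overrepresentation E1 describes.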

While Wikipedia is undoubtedly a valuable resource, its predominant use in model training without clear attribution suggests the need for more transparency in how LLMs handle and prioritize various sources. Another expert highlighted how Wikipedia content is processed before being fed into LLMs: “The content of Wikipedia is surely being ‘cleaned’ (of some metadata) and fed into the language models that are at the basis of ChatGPT and Gemini.” (E4). As a curated and structured source, Wikipedia can be optimized for language models by removing irrelevant metadata, which makes it more suitable for training. However, the lack of transparency about how data is being processed raises questions about the information being fed into LLMs (E4). Other experts also noted that although the exact process is unclear, a general procedure can be speculated: “[I]n practical terms, generally…what people are doing is they're throwing huge amounts of corpus at these models and then trying to…clean up and redirect it afterwards. And so I would suspect that they would throw the entire corpus of Wikipedia at the model. But then they might tune based on…quality assessments…. But yeah, hard like hard for me to say, because they don't…generally communicate about these things. But in theory this should, should this…be likely and effective.” (E3). This comment suggests that Wikipedia’s content might be prioritized during later stages of model refinement, due to its quality standards, but the overall opacity surrounding LLM training practices leaves uncertainties.
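
The "cleaning" that E4 speculates about is likewise undocumented, but a minimal sketch of what stripping wiki markup and metadata from article text might look like is given below. It assumes the open-source mwparserfromhell wikitext parser and an invented snippet of wikitext; an actual training pipeline would almost certainly be more elaborate.

```python
import mwparserfromhell  # third-party wikitext parser (pip install mwparserfromhell)

# Invented snippet of raw wikitext; a real pipeline would read from a Wikipedia dump.
RAW_WIKITEXT = """
{{Infobox settlement|name=Example Town|population=1234}}
'''Example Town''' is a [[town]] in [[Example County|the county]].
"""

def clean_wikitext(text: str) -> str:
    """Strip templates (e.g. infoboxes), link markup, and formatting, keeping plain prose."""
    parsed = mwparserfromhell.parse(text)
    return parsed.strip_code(normalize=True, collapse=True).strip()

print(clean_wikitext(RAW_WIKITEXT))
# -> approximately: "Example Town is a town in the county."
```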

Key finding 2: LLMs act as intermediaries between users and original knowledge sources, often reducing information quality and perpetuating biases, while lacking transparency and proper citation.

Although not all of our expert interviewees used the term ‘dis/intermediation,’ they all discussed how LLMs act as intermediaries between end users and original knowledge sources, negatively impacting both information access and information literacy (E1-E6). Referencing the function of Google’s knowledge graph in Google search, one expert made a succinct point in saying that “LLM applications bring even stronger (dis-)intermediation than the Google Knowledge Panel because they are heavily customized to the question being asked.” (E1). Such dis/intermediation can mean that Wikipedia is bypassed altogether due to LLMs, but also that the information quality itself suffers, whether that be in terms of simplification via a shortened summary or more problematic inaccuracy. LLMs are prone to “hallucinations,” where they generate plausible-sounding but inaccurate or unverified information: “It seems that the amount of misinformation coming into the system through this channel is considerably higher than it used to be.” (E1). The risks of misinformation, compounded by the lack of direct access to sources, raise questions about the reliability of knowledge produced by LLMs. On a broader level, this disintermediation also widens the already distant gap between the source of information and the original research. LLMs can provide answers to user queries but fail to offer transparency about their sources: “LLMs often do not cite a source in their responses. Without provenance, it is difficult for the user to determine the veracity of the information.” (E2). LLMs trained on Wikipedia might provide an answer to a query, but the user has no access to the original source of information, the secondary source cited in Wikipedia, or even Wikipedia itself, which would already act as a tertiary source. In this way, LLMs serve as quaternary sources, three times removed from the original production of the information or knowledge. This distance is further explained as a gap between original source and consumption: “The distance between the source, both in the cases of computing technology as well as original research, and the consumption of it is like a concerning gap.” (E3). LLMs, especially those trained on publicly available, tertiary content like Wikipedia, can both negatively impact the accuracy of information and further disintermediate users from the original knowledge creation process (E3). This lack of citation may pose a threat to the user’s ability to critically evaluate the origin or accuracy of the information and thus create a barrier between the Wikipedia user and the knowledge.

Key finding 3: Wikipedia’s sustainability is threatened by LLMs’ negative impact on the digital commons, Wikipedia’s discoverability, and community engagement, and by disintermediation.

If LLMs are acting as intermediaries and directing traffic away from the actual encyclopedia (while relying on training data from the encyclopedia), how might this development affect Wikipedia’s long-term sustainability? To address this, we also asked our interviewees about the challenges that LLMs and their applications might pose to Wikipedia’s long-term sustainability and maintenance.

The interviewees expressed concerns that Wikipedia’s sustainability in the age of LLMs is threatened by negative impacts on the digital commons, discoverability, community participation and engagement (attracting new editors), and disintermediation. The risk of a shrinking open environment could isolate Wikipedia and hinder its collaborative nature (E5). Users may rely on LLMs for quick consultations, bypassing Wikipedia and reducing opportunities for content improvement and community engagement (E5). All of the interviewees warn that LLMs could diminish the discoverability of Wikipedia, leading to decreased donations and editorial contributions (E1-E6). One expert recommended that Wikipedia should position itself as a crucial resource for training LLMs as a way to attract new contributors, but also noted the risk of LLMs overshadowing human-generated content (E2). Additional emphasis was placed on the importance of maintaining Wikipedia’s feedback loop, where readers become contributors, and caution against tools that replace rather than support Wikipedians (E3). Because disintermediation could undermine the motivation for community engagement, there is a need for targeted outreach via WikiProjects and campaigns to foster a diverse and engaged editor community (E6). The same expert also stressed the necessity of making sources easier to work with to ensure high-quality content and suggested integrating AI-supported content with traditional human-written content to enhance accessibility (E6). Ultimately, the sustainability of Wikipedia depends on continuous experimentation and technical support to adapt to the evolving digital landscape as it is disrupted by emerging generative AI and LLM tools (E1, E6). A related danger, though only expressed by one participant, is the potential for a competitor to emerge, using LLMs to create personalized content, thereby drawing users away from Wikipedia and undermining its foundational community. Wikipedia’s unique, non-profit model is crucial for its survival, as it deters commercial competitors from attempting to replace it (E1). Once lost, Wikipedia’s collaborative and comprehensive knowledge base would be nearly impossible to recreate, given the historical and communal efforts that built it (E1). This underscores the importance of maintaining Wikipedia’s role as a primary knowledge source to prevent the erosion of its community and the valuable content it provides.

Key finding 4: The use of Wikipedia as LLM training data involves ethical problems related to contributor expectations, the risk of depleting the commons, and exacerbation of linguistic and cultural inequities.

Interview participants were asked to respond to the following question regarding ethical concerns: “In your opinion, what ethical problems or issues, if any, emerge in terms of the relationship between Wikipedia and its use as training data for LLMs?” All but one interviewee agreed that this relationship constituted an ethical problem, and responses were categorized in the following themes: contributor expectations, risks to the digital commons, and linguistic and cultural inequities. There is agreement among expert interviewees that Wikipedia contributors never intended for their content to be used by machine learning models (E2, E4). “The fundamental problem,” as one expert puts it, “is that users that would have been quite happy to provide their content to other humans, are not necessarily happy to have their content fed to [machine learning] model. That is, when determining licensing rights, it seems that the current body of law makes the glaring omission of not mentioning, in the license, the expected and intended audience, at the time, for the licensing.” (E4). Another interviewee echoes this sentiment, noting that many Wikipedians feel it is unfair that their unpaid work is used by big tech companies to generate profit: “The ethical problem that I hear about most frequently from Wikipedians is that the situation doesn’t seem fundamentally fair. The editors produce this content without compensation, it is openly licensed, and then these big tech companies make so much money from LLMs.” (E2).

Another central concern among our EIs is that the overuse of digital commons content for training LLMs can deplete the commons by exhausting available resources and discouraging contributors who feel their work is exploited without recognition or compensation (E2, E5). The current AI race, with multiple tech companies competing to develop and fine-tune LLMs, further exacerbates this issue, as does the fact that there has been no attention to reciprocity with (or giving back to) the commons (E5). There is an ethical obligation to give back to the commons proportionately to what is extracted, stressing the importance of maintaining the sustainability of these shared resources (E5). Other participants concur with the need for giving back, suggesting that human-generated content will become increasingly valuable as it becomes rarer (E2).

Expert interviewees also expressed significant concerns regarding the ethical implications of LLMs on linguistic and cultural (in)equities, especially when it comes to access and representation (E1, E3, E5, E6). Because Wikipedia already relies on and extends English as a dominant language, training LLMs on this data highlights the risk of exacerbating existing gaps in access to technology and the Internet, particularly for speakers of less dominant languages. LLMs are limited in multiple languages due to the high costs of running these models, which raises questions about scalability and inclusivity (E5). LLMs, like Wikipedia, rely heavily on digitized documents, which exist mostly in dominant languages (E3). This reliance can marginalize cultures with less digital documentation, potentially leading to cultural erasure (E3). As one expert states, “[T]here's a concern around equity — leaving people behind or forcing people to [use] languages that [are not their] native languages. They are the languages of the colonizers.” (E5). To make matters worse, LLMs perform well with widely documented languages but struggle with less common ones, further entrenching systemic biases (E3). Ultimately, the language modeling community urgently needs to address these challenges to prevent long-term consequences and ensure broader language coverage and representation (E6).

Key finding 5: Ethical concerns may be partially addressed via systemic changes to market incentives and license models, financial contributions to Wikipedia from big tech, and technical solutions related to data provenance and attribution.

While experts did not unanimously agree that Wikipedia’s use as training data raises ethical issues, a majority both identified ethical issues and proposed possible solutions to address them, suggesting a variety of fixes related to licensing, market incentives, LLM explainability, and data provenance. On a broader scale, there is a need for a radical rethinking of market incentives and licensing models to ensure the sustainability of the digital commons (E5). One expert references Larry Lessig’s work on redesigning market incentives (Lessig, 2022), arguing that profit maximization should not be the sole reward mechanism (E5). Wikipedia has long thrived on the altruism of its volunteer contributors, but that model is endangered by the LLM economy, in which digital commons content is extracted and exploited beyond the expectations of its original creators, and without respecting CC-BY-SA licensing. In contrast to this emphasis on market incentives, another interviewee calls for immediate financial contributions from big tech as a necessary step to support the commons (E2). “Big tech should contribute to the project,” this expert notes, “but it is very important that big tech does not itself have any editorial influence.” (E2).

The role of Wikipedia in this context is also a point of contention. While Wikipedia can contribute to the broader open-source movement, it is not solely responsible for solving the open culture challenge (E5). An online encyclopedia’s primary role is not to address these issues of open source and open culture, although it can play a supportive role (E5). One way that LLMs might address issues related to information literacy loss among users, for example, is the addition of explainability measures, which was frequently referenced by one expert. Such explainability would ensure that LLMs explain to users how and where they retrieved certain information or outputs (E3). Noting the opportunities in training LLMs to express “chain of thought”, this expert expressed how LLMs might showcase “processes that would probably look very familiar to Wikipedia and information literacy processes.” (E3). If developers focused less on the speed of outputs and emphasized “quality and information literacy instead,” we might end up with a model that is “able to talk to you about what it’s doing and what it’s thinking.” (E3). Finally, technical solutions related to data provenance and attributions, such as ensuring LLMs include citations, are necessary to maintain the integrity of the commons (E2). Going forward, human-generated content will be considered even more valuable as it becomes increasingly rare (E2). While there is a shared concern about the depletion of the commons and the need for giving back, the participants differ in their approaches to addressing these issues, with some experts advocating for systemic changes to market incentives and licensing models, while others emphasize immediate financial contributions and technical solutions to maintain the integrity of the commons.
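
The provenance and attribution measures discussed here (E2, E3) resemble retrieval-style designs in which every generated answer is returned together with the sources it draws on. The sketch below is hypothetical: the toy word-overlap retriever, the Passage structure, and the two-item corpus are invented for illustration, and a production system would use a real search index and an actual LLM in place of the placeholder logic.

```python
from dataclasses import dataclass

@dataclass
class Passage:
    text: str
    url: str  # provenance: where the passage came from (e.g., a Wikipedia article)

def retrieve(query: str, corpus: list, k: int = 2) -> list:
    """Toy retriever: rank passages by word overlap with the query.
    A real system would use a search index or an embedding model."""
    q_words = set(query.lower().split())
    return sorted(corpus, key=lambda p: -len(q_words & set(p.text.lower().split())))[:k]

def answer_with_citations(query: str, corpus: list) -> str:
    """Return an answer that carries its sources rather than an unattributed response."""
    passages = retrieve(query, corpus)
    # A real pipeline would pass `passages` to an LLM as grounding context; here we
    # only demonstrate the attribution step the experts call for.
    cited = "\n".join(f"[{i + 1}] {p.url}" for i, p in enumerate(passages))
    return f"Answer drawn from the retrieved sources below:\n{cited}"

corpus = [
    Passage("Wikipedia is a free online encyclopedia written by volunteer editors.",
            "https://en.wikipedia.org/wiki/Wikipedia"),
    Passage("Large language models are trained on very large text corpora.",
            "https://en.wikipedia.org/wiki/Large_language_model"),
]
print(answer_with_citations("How is Wikipedia edited?", corpus))
```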

Key finding 6: Systemic biases in LLMs, which can be inherited from sources like Wikipedia, are inevitable, but can be mitigated via proactive efforts to diversify communities and content in the digital commons.

To explore the possibility of systemic biases in LLMs, we asked expert interviewees whether they believe these biases, which have been observed in Wikipedia, could also manifest in LLMs, and if they could provide any examples of such occurrences. EIs collectively highlighted the pervasive issue of systemic biases in LLMs and their potential perpetuation from sources like Wikipedia (E1-E6). As one expert stated, “Yes, there is a risk of these systemic biases being perpetuated in LLMs. To the extent there are systemic biases in Wikipedia, or the broader media landscape, then it is likely that the LLMs will be trained on these same biases” (E6).

While the encyclopedia itself has improved (and can continue to improve) through proactive efforts by Wikipedia editors to address bias via dedicated task forces, there are inherent biases due to limited content in various languages (E5). Such biases are inevitable, reflecting the human biases of contributors, and actively including more diverse communities can mitigate these effects (E5). Additional EIs affirm the risk of systemic biases in LLMs (E2, E4), with one pointing out that these biases are likely to be inherited from the media landscape (E2). Another expert discussed the dominance of Western documentation practices in Wikipedia, which can marginalize non-Western knowledge systems, and underscored the need for diverse sources to avoid cultural erasure. As noted by this expert, “Western culture has really strongly adopted this whole documentary practice around knowledge, and that fits with Wikipedia. But there’s all sorts of knowledge all around the world that aren’t documented in familiar ways, or maybe aren’t documented.” (E3). The issue of language diversity further compounds the problem: “A lack of language coverage (and therefore perspectives from these other language communities) is probably the most concerning aspect of bias to me with these models.” (E6). Despite the fact that Wikipedia does better than much of the internet in offering multilingual content, significant linguistic gaps still exist on Wikipedia, especially in underrepresented languages and communities. This lack of linguistic diversity in Wikipedia is mirrored in LLMs, which are disproportionately trained on dominant languages with a lack of representation of non-Western knowledge systems (E6). Finally, one of the most obvious examples of systemic bias in LLMs appears in translation systems (E1). Overall, biases in training data are almost certain to appear in LLMs unless explicit efforts are made to counteract them (E1).

Resources


References

  • boyd d, Crawford K (2012) Critical questions for big data: Provocations for a cultural, technological, and scholarly phenomenon. Information, Communication & Society 15:662–679. https://doi.org/10.1080/1369118X.2012.678878
  • Evenstein Sigalov S, Nachmias R (2017) Wikipedia as a platform for impactful learning: a new course model in higher education. Education and Information Technologies 22(6): 2959–2979.
  • Ford H (2022) Writing the revolution: Wikipedia and the survival of facts in the digital age. The MIT Press, Cambridge, Massachusetts
  • McDowell ZJ (2024) Wikipedia and AI: Access, representation, and advocacy in the age of large language models. Convergence: The International Journal of Research into New Media Technologies 30:751–767. https://doi.org/10.1177/13548565241238924
  • McDowell ZJ, Vetter MA (2021) Wikipedia and the representation of reality, 1st edn. Routledge, New York
  • McDowell Z, Vetter M (2022b) Fast “truths” and slow knowledge; oracular answers and Wikipedia’s epistemology. Fast Capitalism 19(1): 104–112.
  • McDowell Z, Vetter M (2024) The Re-alienation of the commons: Wikidata and the ethics of “free” data. International Journal of Communication 18: 590–608.