
Research:Modeling collective meaning

Created: 21:03, 17 July 2023 (UTC)
Collaborators
Duration: 2020-05 – 2024-06

This page documents a research project in progress.
Information may be incomplete and change as the project progresses.
Please contact the project lead before formally citing or reusing results from this page.


Open collaboration systems with a shared community of practice operate by developing, sharing, and formalizing concepts together -- collective meaning-making -- thereby enabling all their community members to work together effectively. In the context of Wikipedia, these concepts include article quality, vandalism, and other subjective aspects of collective work. As open collaboration systems grow, AI and machine learning have proven to be powerful tools for facilitating collaboration at scale by modeling and applying these shared concepts. In this paper, we examine the processes and practices of collective meaning-making in parallel with efforts to align AI system behavior with this collective meaning. Specifically, this paper describes a case study of modeling the quality of articles in Dutch Wikipedia using an AI model, while engaging in a meaning-making process on what quality is with Dutch Wikipedians. This case study blurs the line between social governance mechanisms and how meaning is reshaped in an AI model used in practice. Based on the case study, we present the collective meaning cycle, a framework that describes the bidirectional relationship between AI modeling and re-forming collective meaning within communities by leveraging a focus on usefulness in community- and developer-led AI audits. We also provide practical insights for designing participatory processes around developing probabilistic algorithmic systems in community contexts.

Introduction[edit]

Figure 1: The collective meaning cycle. Meaning cascades through collective “meaning-making” processes into the formal document genres (policies and guidelines) and into AI models. In turn, each document and model is shaped by the flow of meaning and operates as a mediator to practice. Meaning cascades through policies and guidelines and is reshaped/translated/transformed in the process. Both the community and the AI model straddle the line between meaning and practice and thus provide a conduit in the cycle for reflective processes to reshape the entire cascade. See Figure 7 for a visualization of the reflect/reshape process enabled by community co-development of models.

The development and application of shared concepts is central to the functioning of open collaboration systems like Wikipedia. From Wikipedia’s central pillars (e.g., Verifiability[1]) to shared understandings of what types of articles are welcome and what articles should be deleted, Wikipedians manage broad swaths of collaborative work by discussing, debating, recording, formalizing, and citing shared concepts in the form of essays, guidelines, and policies. This process of developing and capturing shared understandings is well described by the research literature (Forte, Larco, and Bruckman 2009).
As Wikipedia scales, artificial intelligence (AI) and machine learning (ML) technologies have become core infrastructure for managing massive collaboration on the site. AI models are used to detect vandalism[2] (Adler et al. 2011; Kuo et al. 2024), measure the quality of articles[3] (Warncke-Wang, Cosley, and Riedl 2013), route newly created articles to reviewers by topic[4] (Asthana and Halfaker 2018), detect policy[5] and style[6] issues in text (Asthana et al. 2021), and even to generate encyclopedia articles directly from source material (Liu et al. 2018) – as a few examples. Without exception, each of these models is designed to “align” an algorithm’s behavior with a tangible shared concept – often extensively documented in Wikipedia (see corresponding footnotes above). The goal of AI alignment work is to steer AI models’ behaviors toward the documented group norms (Gabriel 2020).

In this paper, we report on a unique case study of building an AI model in a context where no documentation or established norms were available to align the behavior of the AI model. This case study allows us to unpack the relationship between meaning captured in norms and desirable behavior of an AI model. Through this unpacking, we contribute the collective meaning cycle, a framework that describes the bidirectional relationship between AI modeling and collective meaning-making[7] within communities. The framework provides a deeper understanding of what it means to align algorithmic behavior in a social context and positions an AI model as a mediator between meaning and work. It provides implications for how AI developers might consider the design of well-aligned algorithmic systems in social contexts and adds a new thread to the conversations about genre ecologies (Spinuzzi and Zachry 2000) and articulation work (Suchman 1994) within open collaboration communities.
In the rest of the paper, we first review relevant literature and introduce our study method. Next, we describe the context of Dutch Wikipedia and their challenge of defining the article quality scale to model. We then highlight novel themes and insights that emerged throughout the process where we co-developed an AI model and the meaning of article quality with the Dutch Wikipedia community. Finally, we present the collective meaning cycle and discuss its implications for AI alignment and algorithm design in social settings.

Related work[edit]

Collective meaning and mediating documents[edit]

In this paper, when we refer to collective meaning, we intend to draw a connection to past work discussing the collective meaning-making (Reagle 2010) done in Wikipedia, where community members engage in articulation work to build “shared understandings” of how to build and maintain an encyclopedia (Suchman 1994; Forte, Larco, and Bruckman 2009). We intend for collective meaning to represent the fundamental shared understanding about collective practice untransformed, and therefore not distorted (Latour 2007), by the act of translation into a formal document[8].

In order to be more easily shared and re-used, Wikipedians have created a formalized document genre (Spinuzzi and Zachry 2000; Morgan and Zachry 2010) called policies and guidelines as a foundational component of the distributed governance structure in Wikipedia (Forte, Larco, and Bruckman 2009). These documents play a mediating role (Morgan and Zachry 2010) by drawing connections between the practice of editing Wikipedia and collective meaning. Despite their drawbacks (inherent translation/distortion), these policy and guideline documents are useful as cite-able mediators of collective meaning (Beschastnikh, Kriplean, and McDonald 2008). Past work has also called attention to how these formal document genres are themselves mediated by essays – an informal document genre used to reflect on, critique, and interpret policy in specific contexts (Morgan and Zachry 2010). Taken together, the literature paints a clear picture of how collective meaning is made/refined (Reagle 2010), formalized (Forte, Larco, and Bruckman 2009), mediated (Morgan and Zachry 2010), and applied (Beschastnikh, Kriplean, and McDonald 2008) to form a distributed governance system that closely aligns with Ostromian principles (Forte, Larco, and Bruckman 2009; Ostrom 1999).

Aligning AI models to collective meaning[edit]

AI models are increasingly used in community contexts including Wikipedia (Smith et al. 2020; Kuo et al. 2024). As the community grows, Wikipedia increasingly relies on AI models for governance (Müller-Birn, Dobusch, and Herbsleb 2013). For example, ORES, an AI model hosting system, is widely used on Wikipedia for a variety of tasks, including identifying damaging edits in articles, assessing article quality, and routing newly created articles to reviewers based on their topics (Halfaker and Geiger 2020).
These AI models are developed to enact (Introna 2016) the artifacts they are supposed to reflect or express, such as the guidelines, policies, and collective meaning of article quality on Wikipedia. Recent efforts in AI alignment aim to develop models that ensure an AI’s behavior aligns with these artifacts within social and community contexts (Gabriel 2020; Sorensen et al. 2024).

In this paper, we argue that—in contrast to the standard problem formulation adopted in AI alignment research—AI models that are used to support the work of Wikipedians are also acting as mediators of collective meaning in a similar way to Wikipedia essays. Like other mediators, AI models “transform, modify, and distort” (Latour 2007) collective meaning during the translation process. That is to say that “all models are wrong,” and achieving perfect, unidirectional AI alignment with collective meanings is impossible (Sterman 2002). Instead, we argue that the translation between collective meanings and the application of AI models is naturally bidirectional like other mediating genres (policies, guidelines, and essays). In this work, we focus our exploration on this bidirectional relationship between AI models and collective meaning via collective auditing practices.
We are not the first to identify the power of participatory AI to encourage reflection (e.g., Zhang et al. 2023), but we are the first to connect this reflective, collective meaning-making process to formalization within a genre ecology. We are also the first to observe the structure of this reversal of the flow of meaning in situ.

Study Method[edit]

In this project, we adopted a participatory action research approach (Delgado et al. 2023; Kemmis et al. 2014) by working closely with Wikipedia community stakeholders to co-construct research plans and interventions. Specifically, we initiated the project together with the Dutch Wikipedia community to tackle a challenge they faced. Throughout the project, we engaged community stakeholders as co-inquirers and adhered to the community’s best practices, for example, by recording our activities with detailed ledgers using wiki pages for documentation and “talk pages” for discussion. During and after the project, we reflected on the research process together with community participants (Howard and Irani 2019). Through this reflection, we recognized the project as a unique case that offers a new perspective on the discourse around AI alignment, emphasizing the importance of a bidirectional process between AI models and the community’s collective meaning. In collaboration with a community partner (the “tool coach”, as described later) who served as a co-author, we wrote this paper to share our approach and insights in building AI models, policies, and collective meaning alongside communities.

Study Context[edit]

In May of 2019, we attended the Wikimedia Hackathon, a yearly in-person event organized by the Wikimedia Foundation that “brings together developers from all around the world to improve the technological infrastructure of Wikipedia and other Wikimedia projects.” As part of our activities at that event, we met technically inclined Wikipedians from Dutch Wikipedia who had heard about how article quality models were used in English Wikipedia (Anon 2024) and were interested in what it might take to set up such a model for Dutch Wikipedia. We worked together to file a request to build the models in the relevant task tracking system[9] and to populate the request with basic questions that are useful for understanding how a community like Dutch Wikipedia already thinks about article quality. For example, we asked: “How do Dutch Wikipedians label articles by their quality level?” and “What levels are there and what processes do they follow when labeling articles for quality?” The answers to these questions were surprisingly complicated. Many Wikipedia communities have adopted an article quality scale similar to English Wikipedia’s, but Wikipedians from the Dutch language Wikipedia reported that they did not have a complete scale. Instead, they had some processes for tagging the lowest quality articles (“Beginnetje”) and highest quality articles (“Etalage”), but everything in between had no definition, despite community discussions about the quality of the encyclopedia dating back to 2004[10]. This contrasts with English Wikipedia, which has the levels Stub, Start, C, B, GA, and FA (in ascending order), each with a strict definition (Warncke-Wang, Cosley, and Riedl 2013).

At this point, it was clear that setting up an article quality model for Dutch Wikipedia would also require the complicated work of defining a set of guidelines. Participants in the discussion expressed their reluctance to simply adopt a scale from another language Wikipedia, since each community has its own customs[11]. We therefore followed the mechanisms that Wikipedians use to build consensus and shared understanding about their work. In May 2020, our Dutch Wikipedian collaborator posted to De Kroeg[12] (“The cafe”), a central discussion space, about the potential of bringing article quality models to the local wiki and included information about how they had been used in other wikis. The proposal was met with light skepticism – concerns about whether an AI could detect article quality – but an agreement was reached that it was acceptable to start experimenting and to allow people to use the predictions on an opt-in basis. Over the next 1.5 years, we engaged in an iterative sensemaking and engineering process using Wikipedians’ processes for performing articulation work (Suchman 1994) (or “meaning making” (Reagle 2010)) and their online spaces to co-develop an AI model and guidelines for assessing article quality in Dutch Wikipedia. Beyond the discussion in De Kroeg, we created an on-wiki project page for the effort[13] where we described the AI model, hosted technical descriptions of the quality scale (see Table 1), posted prediction sets for auditing, and discussed the ongoing work with whoever was interested. Our Dutch Wikipedian co-author gathered a small community of local Wikipedian collaborators around these documents and discussions in order to iterate with us. In the next section, we describe aspects of this collaboration that make salient the co-development of collective meaning and AI models.

The case: Dutch Wikipedia Article Quality[edit]

The developer-driven development process[edit]

When we first set out to model article quality for Dutch Wikipedians, we wanted to use as much past work as we could before trying to define any new aspects of quality. As mentioned above, Dutch Wikipedians had already developed formal processes and definitions for the top and bottom quality classes (etalage and beginnetje, respectively). Through discussion with our Wikipedian collaborators, we settled on a rough scale that added three quality levels between these two extremes:

  • B-class: Former etalage class articles, along with a community-compiled list of so-called “rough diamonds”, were assumed to be high quality but not quite high enough for etalage.
  • D-class: Articles that had been tagged as beginnetje but had the tag removed later on. We assumed these articles to be slightly higher quality than beginnetje.
  • C-class: Articles between B- and D-class. We ultimately decided to set a formal length criterion for these articles (between 3000 and 5000 bytes of text).

It was apparent to all involved that this scale was overly simplistic but we suspected that, through exploring the limitations, we might elicit the latent shared understanding (Suchman 1994) of the quality of articles from Dutch Wikipedians. Based on past work in aligning model behavior with communities of Wikipedians (Halfaker and Geiger 2020; Asthana et al. 2021), we planned to seek feedback and prompt iteration on the quality scale through the auditing process.
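To make the provenance of this initial training data concrete, here is a minimal sketch (in Python) of how v1 labels could be derived from the heuristics above. It is illustrative only, not the project's actual pipeline: the article record format and field names are our own assumptions, and only the 3000-5000 byte C-class criterion comes from the scale itself.

```python
# A minimal sketch (not the production pipeline) of deriving quality scale v1
# labels from the heuristics described above. The article record format and
# field names are illustrative assumptions.

def v1_label(article):
    """Assign a heuristic v1 label ("A" = etalage ... "E" = beginnetje).

    `article` is assumed to look like:
      {"bytes": 4120, "is_etalage": False, "was_etalage": True,
       "is_beginnetje": False, "was_beginnetje": False, "rough_diamond": False}
    Returns None when no v1 rule applies.
    """
    if article["is_etalage"]:
        return "A"  # current etalage article (highest class)
    if article["was_etalage"] or article["rough_diamond"]:
        return "B"  # former etalage or on the "rough diamonds" list
    if article["is_beginnetje"]:
        return "E"  # current beginnetje article (lowest class)
    if article["was_beginnetje"]:
        return "D"  # beginnetje tag was later removed
    if 3000 <= article["bytes"] <= 5000:
        return "C"  # naive length-based criterion for the middle class
    return None     # not usable as a v1 training example
```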

The initial audit[edit]

Figure 2: The iterative process of co-developing both the quality scale and AI model over time. Table 1 describes the details of a specific version of the scale or model.
v1 – Quality guidelines: A and E class are borrowed from pre-defined, on-wiki concepts. B is defined as “not quite A”, D is defined as “no longer E”, and C is defined by article length.
v1 – AI model: Trained by gathering examples of template introductions/removals and using length constraints. We expected this model to be wrong but to help probe editors to elicit reflection on what their quality scale should look like.
v2 – Quality guidelines: B, C, and D classes are more clearly defined. For example, C-class requires the presence of an Infobox and D-class requires that there is at least one source in the article.
v2 – AI model: Trained using the same data as v1 but with minor technical improvements to the way that Infoboxes and references are tracked.
v3 – Quality guidelines: Refined version of v2 based on reflection when applying the v2 scale to articles. The source requirement was moved to C-class and softened.
v3 – AI model: Trained using a mixture of data sourced from template usage (for A- and E-class) as well as the results of labeling activities and on-wiki re-labeling using the v3 quality scale.

Table 1: Alignment between the model versions and quality guidelines

The first step in our auditing process involved generating article quality predictions for all articles in Dutch Wikipedia and randomly sampling 20 articles from each predicted class for review (5 classes × 20 predictions = 100 articles in the assessment set). We used article text from the June 2021 database dump of Dutch Wikipedia[14] to generate predictions. Since the quality of Wikipedia articles is highly skewed, with the vast majority of articles in the lower quality range, this stratified approach allowed our collaborators to assess performance across the scale. We posted the list of predictions, with links to the specific version of each article we scored, on a wiki page and invited Dutch Wikipedians to leave open-ended comments about each article and its prediction. Some evaluations clearly implied adjustments to the naive quality scale v1 (Table 1). For example, on an article predicted to be C-class, one Wikipedian commented (translated): “Not a ‘good’ article, and I would personally rate it as D because of its focus on a summary, and the lack of further sources beyond the one report. But, strictly speaking, does it seem to meet the criteria?” While this is just one example, there is a lot going on in this comment. First, it directly critiques the model’s prediction and suggests that the article in question should be rated lower (D-class). It also raises concerns about “focus on a summary” with regard to writing quality, and calls out the lack of sources. Finally, it challenges whether the naive C-class criterion we started with (3000-5000 characters) captures what this editor imagines C-class should represent.
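As a minimal sketch of this stratified sampling step, the Python snippet below draws 20 articles per predicted class from a set of model predictions. The data layout (title, predicted class) is an assumption for illustration; in practice the audit lists were assembled and posted on wiki pages.

```python
import random
from collections import defaultdict

def stratified_audit_sample(predictions, per_class=20, seed=0):
    """Draw a fixed-size random sample from each predicted quality class.

    `predictions` is assumed to be an iterable of (article_title,
    predicted_class) pairs. Returns {predicted_class: [titles...]}.
    """
    by_class = defaultdict(list)
    for title, predicted_class in predictions:
        by_class[predicted_class].append(title)

    rng = random.Random(seed)
    return {
        cls: rng.sample(titles, min(per_class, len(titles)))
        for cls, titles in by_class.items()
    }

# With 5 predicted classes and per_class=20, this yields the 100-article
# assessment set described above.
```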

Many such comments were met with follow-up discussion. For example, another Wikipedian left the following comment on an article predicted to be D-class: “Only two sources that are both inaccessible, uninformative, clumsily edited. As far as I’m concerned, at the bottom of E.” While a third Wikipedian challenged the downgraded assessment with: “Please note that E is meant for real beginnetje, unless we find a (measurable) way (and agreement on this) to also include poor quality articles.” In this example, we can see source quality, information quality, and editing quality being raised as important criteria to add to the scale. We can also see another editor ensuring that there is still space in the lower end of the scale for articles that are even more lacking.
Beyond these concerns about the nature of quality and how a scale might be applied to these articles, our collaborators also noticed that our process for collecting articles for their review seemed to miss some critical features of quality, such as the presence of Infoboxes[15]. Others noted that pages that were not seen as articles were included in the set – such as “list articles” like Lijst van spelers van Middlesbrough FC (List of Middlesbrough FC players). We were able to address these issues directly through improved feature engineering and sampling methods, which led to AI model v2. Overall, the initial audit provided substantial new insights into what did and did not belong at each quality level. Quality scale v2 emerged from this meaning-making process as a well-articulated description of the new consensus.

Labeling and re-auditing[edit]

Figure 3: An abridged screenshot of the table we constructed for labeling and re-auditing. The “labels” column represents the original labels. The “def label” column was filled in after a discussion based on consensus.

With a clearer definition, we asked our collaborators to help us build a new version of the model by reaching consensus on labels for a sample of articles. This would allow us to encode this new shared understanding in examples that the quality model could learn from. We set out to build a stratified sample of articles to label. Since the consensus on the criteria for the two extreme classes (beginnetje and etalage) had not changed, we only needed to gather labels for the middle quality classes (B, C, and D). We gathered a sample of likely mid-class articles for labeling and applied model v2 to them. We sampled 25 B-predicted articles, 50 C-predicted articles, and 25 D-predicted articles for labeling. We sampled more C-predicted articles because we expected that the predictions for that quality class were less accurate due to the naive initial specification, and therefore that the actual labels for that group would be distributed across the B and D classes. In order to ensure consensus on each label, we required three labels per article from different Wikipedians.
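The consensus requirement can be illustrated with a short sketch that separates unanimously labeled articles from those needing discussion. The data layout is an assumption for illustration; in practice this bookkeeping happened in a wiki table (Figure 3).

```python
from collections import Counter

def split_by_consensus(labels_per_article, required=3):
    """Separate articles with unanimous labels from those needing discussion.

    `labels_per_article` is assumed to map article title -> list of labels
    from different Wikipedians, e.g. {"Some article": ["B", "B", "C"]}.
    Returns (consensus, disputed).
    """
    consensus, disputed = {}, {}
    for title, labels in labels_per_article.items():
        if len(labels) < required:
            disputed[title] = labels  # not yet fully labeled
            continue
        label, count = Counter(labels).most_common(1)[0]
        if count == len(labels):
            consensus[title] = label  # unanimous agreement
        else:
            disputed[title] = labels  # flag for discussion ("def label" column)
    return consensus, disputed
```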

We observed significant disagreement in labeling, with 56 articles showing discrepancies among labelers. Figure 3 shows the first 8 rows of the table we constructed for re-auditing. We went back to the Wikipedians who performed the labeling work and discussed with them why there might be so much disagreement. Our tool coach started a discussion around the requirement that “Het artikel bevat minstens één bron” (The article contains at least one source). Several editors reported that they applied this source criterion very strictly and observed that old styles of sourcing content (e.g., via a comment associated with an edit) could be the reason that some seemingly high quality articles were getting labeled as lower quality. The discussion quickly turned into reflection about what aspects of quality they wish to capture in their scale. For example:

  • "Kijkend naar deze uitslagen, denken jullie dat de ‘geen bron-voorwaarde’ in de C-versie van de kwaliteitsschaal juist is? Of moet deze misschien versoepeld worden?” (Looking at these results, do you think the ‘no source condition’ in the C version of the quality scale is correct? Or perhaps it should be relaxed?)
  • “Van mij mag de grens ‘bron/geen bron’ wel een niveau hoger” (For me, the boundary ‘source/no source’ may be a level higher)
  • “[...] ik denk dat het bron-criterium wel een goede reflectie is van de kwaliteit.” (I think that the source criterion is a good reflection of the quality.)
  • “Jouw voorstel om de broneis te verplaatsen naar C spreekt me wel aan.” (Your proposal to move the citation requirement to C appeals to me.)

Based on this discussion, our local collaborator updated the quality scale to reflect the consensus to move the requirement to C-class and to soften the language of what can be considered a source (“eventueel als een algemene bron onder een kopje ‘literatuur’ of ‘externe link’”, which translates to “possibly as a general source under a heading ‘literature’ or ‘external link’”). This resulted in quality scale v3. Through this discussion and re-auditing, we were able to get a dataset with labels more aligned with the updated quality scale. Finally, we used this dataset to train AI model v3. We ended up training the model on 32 examples from each quality class (32 × 5 = 160 total articles). Despite this small training set, we achieved 80.8% accuracy across the five quality classes and agreed to deploy the model for testing with our Wikipedian collaborators.
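For readers who want a concrete picture of this final training step, here is a minimal scikit-learn sketch under stated assumptions: the real model was built with the project's own feature-engineering pipeline, so the placeholder features, the particular classifier, and the cross-validation setup below are illustrative, not a description of the deployed system.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Placeholder features and labels: 32 examples for each of the 5 classes
# (32 x 5 = 160 rows). In the real project, X came from article features
# and y from the re-audited labels under quality scale v3.
rng = np.random.default_rng(0)
X = rng.normal(size=(160, 20))
y = np.repeat(["A", "B", "C", "D", "E"], 32)

model = GradientBoostingClassifier(random_state=0)
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(f"cross-validated accuracy: {scores.mean():.1%}")
# The 80.8% accuracy reported above was measured on the real features and
# labels; with the random placeholders here the printed number is meaningless.

model.fit(X, y)  # final model, analogous to AI model v3
```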

The community-driven development process[edit]

In parallel with the developer-driven labeling and auditing process, our Dutch Wikipedia collaborators, led by a “tool coach”, also drove a complementary effort to support the development of quality scales and AI models. The tool coach: one co-author of this paper is an administrator and active contributor on Dutch Wikipedia. She offered to work with the community as a “tool coach,” a term coined by Sumana Harihareswara to describe someone who fills a bridging role between communities and technical contributors, helping out in the bits that maintainers are not great at, or don’t have time for (Harihareswara 2021). When we shared the first version of the article quality model with the Dutch Wikipedia community, she developed an effective strategy for communicating the strange and inconsistent AI behaviors that people would see.

The strange ducks[edit]

Our tool coach came up with a strategy of providing a space – a section on a wiki page – for our Dutch Wikipedia collaborators to share any behaviors they thought were wrong, unexpected, or otherwise worthy of discussion. She named this section “vreemde eenden in de bijt,” which roughly translates to “strange ducks in the pond” – a Dutch euphemism for odd things that don’t belong. She then developed a weekly cadence to review the submissions in discussions with Wikipedians on the associated talk page and to bring a summary of those discussions to the development team. As a local community member, she was able to help answer our developers’ questions about why a behavior was considered strange and what behaviors might be more aligned with Dutch Wikipedians’ expectations. She also discussed these issues with submitters in their native language, situated within their shared cultural context.

The gadget in situ[edit]

Figure 4: Article quality predictions appear on all article pages under the page title.
Figure 5: Article quality predictions appear on the revision history page. The colored rectangles to the left of the revision details denote the predictions. Vandalism that lowers the article’s quality is made apparent by a yellow rectangle that shows a temporary drop in predicted quality.
Figure 6: Article quality predictions next to article links.

A key to making this strategy work was getting the model’s predictions in front of Wikipedians in the course of their regular activities on Wikipedia. To do so, we developed a JavaScript-based gadget that Wikipedians could enable in their Wikipedia account settings. This gadget offers automated article quality predictions while the editor browses and works on Wikipedia, as shown in Figures 4, 5, and 6.

Through the gadget in situ and the repository of strange ducks, our tool coach formed a cross-lingual, cross-cultural bridge between the development team and the Wikipedians to support community-driven reflection about the behavior of AI models and the implications of the quality scales. While the developer-driven audits and labeling campaigns provided focused opportunities for review and reflection among our Wikipedian collaborators, the community-driven process was more continuous and enabled specific concerns to be raised, with specific examples, at any point in time. This combination of technology and social process formed a key component of the co-development effort – enabling and grounding reflection and renegotiation of the collective meaning of article quality among Wikipedians.

Fitting it all together[edit]

Figure 7 illustrates how the developer- and community-driven processes together enable the co-development of AI models and collective meaning. In particular, labeling disagreements and “strange ducks” are great examples of high-value interactions in an active learning sense. If Wikipedians disagree on a data point or flag it as a strange duck, there are three potential reasons, as shown in Figure 7. Deciding which case the data point fits into is a matter of discussion, but regardless of the result of that discussion, the outcome is valuable to the functioning of the entire system. Either the model needs to change, the guidelines need to change, or the guidelines need clarification. Each of these represents an opportunity for meaning-making, reflection, and reshaping. Grounding the discussion on specific examples of “strange ducks” and how they should be labeled seemed to focus the discussion on the usefulness of the rules expressed in the guidelines.

Collective Meaning Cascades but Strange Ducks Swim Upstream[edit]

Figure 7: The workflow for developing AI models and collective meanings in our case study. Data points with labeling disagreements or strange model predictions can prompt valuable discussions and lead to iterations of AI models, guidelines, or shared understandings of collective meaning.

In the case of Dutch Wikipedia’s article quality, we can see the behavior of an AI model at the intersection of several different branches of HCI and CSCW scholarship. The documentation describing the norms and practices around article quality assessment translates the collective meaning into a share-able representation of article quality itself for Wikipedians. Meaning cascades through meaning-making discussions into principles and into best practices that implement those principles (policies and guidelines, respectively). These documents form a genre ecology that captures the formal and informal concepts used by Wikipedians to articulate (work together) in Wikipedia. From concrete work practices to statements of principle, the entire cascade of meaning is intentionally kept in alignment. Wikipedia’s meaning-making processes for maintaining this alignment (also described in policies[16] and guidelines[17]) are all built on top of peer discussion and documentation practices (Reagle 2010). The work of developing and refining an AI model in this context extends the rules from the on-wiki text documentation into the behavior of the model itself. In the same way that one might codify rules in best practice documentation, one can see rules play out in the AI model’s behavior.

As Figure 1 suggests, in an approximate way, a policy document in Wikipedia documents and represents a way of understanding the collective meaning of Wikipedians, and guidelines represent a way of understanding policies. These policies and guidelines correspond to the principle and best practice documentation (Morgan and Zachry 2010) within the genre ecology (Spinuzzi and Zachry 2000). We assert that AI and machine learning models designed to apply guidelines also represent a way of understanding those guidelines in a specific setting. As an algorithm, the AI model represents a set of executable rules that can be applied to any new, valid input. In our case, we can apply these rules in a repeatable and objective way to any Wikipedia article.
“All models are wrong” (Sterman 2002) is a common aphorism that we find useful when considering the implications of this cascade of meaning. Models, as algorithmic mediators of process, “enact the objects they are supposed to reflect or express,” but they are inherently imperfect in that they “transform, translate, distort, and modify the meaning or the elements they are supposed to carry” (Introna 2016; Latour 2007). Rather than trying to develop a “correct” model, our goal is to design a model that is useful. With an AI model, usefulness is often measured through fitness statistics[18], but in our case, the collective auditing pattern (e.g., “strange ducks”) allowed us to go beyond detecting error rates and to ask, “How much does this type of error affect the usefulness of the model?” and “Is this an error in the model; is it an opportunity to reflect on the guidelines; or is it an opportunity to re-make collective meaning?” These questions focus issues of alignment on the intended use of the model and away from the impossible and less actionable idea of correctness.
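To make the contrast concrete, the short sketch below computes the usual fitness statistics (accuracy and a confusion matrix) for a small set of hypothetical predictions. The labels are invented for illustration; the point is that such statistics locate errors but cannot, on their own, say how much each kind of error harms the model's intended use.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

classes = ["E", "D", "C", "B", "A"]                # lowest -> highest quality
y_true = ["E", "D", "C", "C", "B", "A", "D", "C"]  # hypothetical human labels
y_pred = ["E", "C", "C", "B", "B", "A", "E", "A"]  # hypothetical predictions

print("accuracy:", accuracy_score(y_true, y_pred))
print(confusion_matrix(y_true, y_pred, labels=classes))

# The confusion matrix shows *where* the model errs, but not whether an
# adjacent-class confusion (D predicted as C) affects the intended use as
# much as a distant one (C predicted as A). In our case, that judgment came
# from collective auditing ("strange ducks"), not from the statistics.
```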

Further, this is not a special case for AI models. This pattern of error detection, utility assessment, and refinement is consistent across the cascade, from collective meaning to AI model. Just as we can detect modeling bugs through the application of the model and reassessment, we can detect bugs in best practices by exploring whether the model’s “bugs” are failures to accurately represent the guideline, whether the guideline fails to accurately represent the policies (the principles behind guidelines), or whether the policy fails to usefully reflect the shared collective meaning of the members of the community. In effect, strange ducks (and other participatory auditing practices) allowed Dutch Wikipedians to swim upstream in the cascade and make new collective meaning effectively.

For example, in the case of article citation requirements, the question of whether or not a model behavior was a bug brought the discussion all the way up to the level of collective meaning. The original guideline, which required at least one citation for inclusion in D-class, seemed like a reasonable application of the collective meaning and policy around article quality. But in practice, when reviewing the predictions of a model based on that guideline and labeling new articles, it became clear that there was a misalignment between the collective meaning and practice. Through review and discussion, the guidelines were updated to reflect this new shared understanding drawn directly from the context of work. The modeling process was then updated and re-applied to ensure that the new model (v3) would reflect this update to the meaning cascade.

Discussion & Conclusions[edit]

Models as a mediator in participatory governance[edit]

In this work, we observed AI models filling a conceptual role similar to that of other mediators present in Wikipedia’s document genre ecology. We find that considering these AI models as a novel genre in the ecology helps us understand the role they play in conveying meaning to practice and opens new doors for considering how they fit together with the reflective norm-development and refinement strategies of communities of practice.
One topic that came up early in our work was, “Why not just adopt English Wikipedia’s collective meaning and quality model?” After all, it was developed for Wikipedia, by Wikipedians. Our Dutch collaborators were very clear that they wished to come to their own definition and have their own model. We see Ostrom’s principles playing out in model development the way that Forte et al. (2009) observed them playing out in policy and guideline development: the self-determination of the community must be recognized[19] and the appropriation and provision of common resources must be adapted to local conditions[20]. At first, the Dutch Wikipedia community was apprehensive about welcoming AI models into their work. But through our Ostromian processes, Dutch Wikipedians were centered in the model development/meaning-making process. And just as Ostrom observed that rules are more likely to be followed by people who had a hand in writing them, we observe that both models and guidelines are more likely to be appropriated by people who had a hand in developing them.

Community audits as grounded reflection on utility[edit]

As we discuss above, Dutch Wikipedians had long struggled with defining “quality.” Since 2004, the community had attempted several different initiatives to apply their meaning-making processes to build shared understanding and identify collective meaning around what quality is, with each effort failing to build agreement. Our community partner (the tool coach) reflects that the iterative process of auditing the AI model and refining the guidelines described above kept Dutch Wikipedia participants engaged and more focused on the utility of the model/guidelines than on the correctness of either. As Sterman (2002) observed, so do we:

Because all models are wrong, we reject the notion that models can be validated in the dictionary definition sense of ‘establishing truthfulness’, instead focusing on creating models that are useful [...] We argue that focusing on the process of modeling [...] speeds learning and leads to better models, better policies, and a greater chance of implementation and system improvement.

In our words, grounding the discussion of collective meaning in the utility of models applied in practice facilitated the making of meaning that had remained latent despite several previous attempts to bring about a consensus.

Generalizability beyond Wikipedia[edit]

Wikipedia is one of the largest and most successful online communities for collective knowledge building, and there is much for others to learn from it about the development of AI models and collective meaning. Across a wide range of contexts, mediation is a pervasive pattern by which meaning gets applied in practice, whether through AI models (Halfaker and Geiger 2020; Asthana et al. 2021), deterministic algorithms (Introna 2016), or documentary practices (Forte, Larco, and Bruckman 2009; Morgan and Zachry 2010). We suggest that efforts in AI alignment (Gabriel 2020; Sorensen et al. 2024) should consider the crucial role of mediating artifacts in any context, including AI models as mediators themselves. For example, considering the law as a mediator and the spirit of the law as the collective meaning, we encourage developers to promote discussion around how an AI model or algorithm enacting a law also enacts the spirit of that law. In some contexts, having an AI model applied in practice helps ground the discussion in utility rather than correctness (Sterman 2002). In other contexts, an AI model applied in practice might help initiate difficult discussions about collective meaning. The reported case study of Wikipedia and the collective meaning cycle offer valuable insights for AI development in socio-technical contexts.

Building on this work, future research in AI alignment should develop systems and methods that recognize and support the bidirectional flow of meaning between different layers in the collective meaning cascade framework. This suggests that researchers and practitioners should move away from the notion of a fixed, pre-existing collective meaning to which we need to align. Instead, it may be more productive to embrace a deeply conversational, multi-turn approach to AI alignment, which acknowledges that collective meaning is actively co-developed alongside policies, guidelines, AI models, and other mediators. These collective meaning-making processes can be more easily grounded in discussions of usefulness. Model co-development processes are a significant opportunity to make meaning more effectively and to find shared understanding where previous attempts have failed.



References[edit]

  1. WP:VERIF
  2. WP:VANDAL
  3. WP:ASSESS
  4. WP:WPDIR
  5. WP:NPOV
  6. WP:MOS
  7. c.f. (Reagle 2010)
  8. c.f. The “Spirit” of the law: Letter and spirit of the law
  9. ...
  10. [1]
  11. Wikipedia language communities have subsidiarity; local projects create their own rules, norms, and customs. By applying an AI model (that enacts a different set of rules/norms/customs) on another community, this principle would be lost and the AI model would be misaligned with the needs of that community
  12. ...
  13. ...
  14. [2]
  15. H:I
  16. E.g., WP:CON
  17. E.g., WP:BOLD (https://enwp.org/WP:BOLD)
  18. Statistical model validation
  19. c.f. principle #7 from Ostrom (1999)
  20. c.f. principle #2 from Ostrom (1999)
  • Adler, B. T.; De Alfaro, L.; Mola-Velasco, S. M.; Rosso, P.; and West, A. G. 2011. Wikipedia vandalism detection: Combining natural language, metadata, and reputation features. In Computational Linguistics and Intelligent Text Processing: 12th International Conference, CICLing 2011, Tokyo, Japan, February 20-26, 2011. Proceedings, Part II 12, 277–288. Springer.
  • Anon. 2024. Anonymized for review. ANON.
  • Asthana, S.; and Halfaker, A. 2018. With few eyes, all hoaxes are deep. Proceedings of the ACM on Human-Computer Interaction, 2(CSCW): 1–18.
  • Asthana, S.; Tobar Thommel, S.; Halfaker, A. L.; and Banovic, N. 2021. Automatically Labeling Low Quality Content on Wikipedia By Leveraging Patterns in Editing Behaviors. Proc. ACM Hum.-Comput. Interact., 5(CSCW2).
  • Beschastnikh, I.; Kriplean, T.; and McDonald, D. 2008. Wikipedian self-governance in action: Motivating the policy lens. In Proceedings of the International AAAI Conference on Web and Social Media, 1, 27–35.
  • Delgado, F.; Yang, S.; Madaio, M.; and Yang, Q. 2023. The Participatory Turn in AI Design: Theoretical Foundations and the Current State of Practice. In Proceedings of the 3rd ACM Conference on Equity and Access in Algorithms, Mechanisms, and Optimization, EAAMO ’23. New York, NY, USA: Association for Computing Machinery. ISBN 9798400703812.
  • Forte, A.; Larco, V.; and Bruckman, A. 2009. Decentralization in Wikipedia governance. Journal of Management Information Systems, 26(1): 49–72.
  • Gabriel, I. 2020. Artificial intelligence, values, and alignment. Minds and machines, 30(3): 411–437.
  • Halfaker, A.; and Geiger, R. S. 2020. ORES: Lowering Barriers with Participatory Machine Learning in Wikipedia. Proc. ACM Hum.-Comput. Interact., 4(CSCW2).
  • Harihareswara, S. 2021. Sidestepping the PR Bottleneck: Four Non-Dev Ways To Support Your Upstreams – Coaching and Cheerleading. Accessed: 2024-05-27.
  • Howard, D.; and Irani, L. 2019. Ways of Knowing When Research Subjects Care. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, CHI ’19, 1–16. New York, NY, USA: Association for Computing Machinery. ISBN 9781450359702.
  • Introna, L. D. 2016. Algorithms, governance, and governmentality: On governing academic writing. Science, Technology, & Human Values, 41(1): 17–49.
  • Kemmis, S.; McTaggart, R.; Nixon, R.; Kemmis, S.; McTaggart, R.; and Nixon, R. 2014. Introducing critical participatory action research. The action research planner: Doing critical participatory action research, 1–31.
  • Kuo, T.-S.; Halfaker, A. L.; Cheng, Z.; Kim, J.; Wu, M.H.; Wu, T.; Holstein, K.; and Zhu, H. 2024. Wikibench: Community-Driven Data Curation for AI Evaluation on Wikipedia. In Proceedings of the CHI Conference on Human Factors in Computing Systems, CHI ’24. New York, NY, USA: Association for Computing Machinery. ISBN 9798400703300.
  • Latour, B. 2007. Reassembling the social: An introduction to actor-network-theory. Oup Oxford.
  • Liu, P. J.; Saleh, M.; Pot, E.; Goodrich, B.; Sepassi, R.; Kaiser, L.; and Shazeer, N. 2018. Generating wikipedia by summarizing long sequences. arXiv preprint arXiv:1801.10198.
  • Morgan, J. T.; and Zachry, M. 2010. Negotiating with angry mastodons: the wikipedia policy environment as genre ecology. In Proceedings of the 2010 ACM International Conference on Supporting Group Work, 165–168.
  • Müller-Birn, C.; Dobusch, L.; and Herbsleb, J. D. 2013. Work-to-rule: the emergence of algorithmic governance in Wikipedia. In Proceedings of the 6th International Conference on Communities and Technologies, 80–89.
  • Ostrom, E. 1999. Design principles and threats to sustainable organizations that manage commons. In Workshop in Political Theory and Policy Analysis, W99-6. Center for the Study of Institutions, Population, and Environmental Change, Indiana University, USA.
  • Reagle, J. M. 2010. Good faith collaboration: The culture of Wikipedia. MIT press.
  • Smith, C. E.; Yu, B.; Srivastava, A.; Halfaker, A.; Terveen, L.; and Zhu, H. 2020. Keeping community in the loop: Understanding wikipedia stakeholder values for machine learning-based systems. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, 1–14.
  • Sorensen, T.; Moore, J.; Fisher, J.; Gordon, M.; Mireshghallah, N.; Rytting, C. M.; Ye, A.; Jiang, L.; Lu, X.; Dziri, N.; et al. 2024. A Roadmap to Pluralistic Alignment. arXiv preprint arXiv:2402.05070.
  • Spinuzzi, C.; and Zachry, M. 2000. Genre ecologies: An open-system approach to understanding and constructing documentation. ACM Journal of Computer Documentation (JCD), 24(3): 169–181.
  • Sterman, J. D. 2002. All models are wrong: reflections on becoming a systems scientist. System Dynamics Review: The Journal of the System Dynamics Society, 18(4): 501–531.
  • Suchman, L. A. 1994. Supporting Articulation Work: Aspects of a Feminist Practice of Technology Production. In Proceedings of the IFIP TC9/WG9.1 Fifth International Conference on Women, Work and Computerization: Breaking Old Boundaries - Building New Forms, 7–21. USA: Elsevier Science Inc. ISBN 0444819274.
  • Warncke-Wang, M.; Cosley, D.; and Riedl, J. 2013. Tell me more: an actionable quality model for Wikipedia. In Proceedings of the 9th International Symposium on Open Collaboration, 1–10.
  • Zhang, A.; Walker, O.; Nguyen, K.; Dai, J.; Chen, A.; and Lee, M. K. 2023. Deliberating with AI: Improving Decision-Making for the Future through Participatory AI Design and Stakeholder Deliberation. Proceedings of the ACM on Human-Computer Interaction, 7(CSCW1): 1–3.