Research talk:ORES: Facilitating re-mediation of Wikipedia's socio-technical problems

It's alive!

Jtmorgan, User:Staeiou, JtsMN, and Ladsgroup. Please enjoy this forum. I'll start adding some threads here about essays I'll be writing up over the winter holidays. EpochFail (talk) 16:00, 19 December 2016 (UTC)Reply

Neat! Nice work, @EpochFail: Jtmorgan (talk) 23:31, 19 December 2016 (UTC)Reply
classy DarTar (talk) 17:29, 21 December 2016 (UTC)Reply
A lot better! Jeblad (talk) 07:45, 24 December 2016 (UTC)Reply

"Technology is lagging behind social progress"

@Staeiou:, @EpochFail: what do we mean by this statement? Jtmorgan (talk) 17:31, 21 December 2016 (UTC)Reply

Good Q. So here's where I usually show screenshots of Huggle's UI changes as a demonstration in the talks I give about this. I argue that Wikipedia's quality control systems look almost identical to how they looked in the "The banning of a vandal" paper. In that way, this set of technologies hasn't adapted to take on the new, extended standpoint we have that includes the value of newcomer socialization.
In contrast, consider the Teahouse, inspire campaigns, and studies of newcomer socialization. Consider the WikiProjects that have organized around newcomer socialization and specifically about supporting our most at-risk editors/content. When considering all this, Snuggle was just one interesting (yet mostly insubstantial) note in what we'd expect to be a substantial innovation period.
This observation is the basis for the idea of an innovation catalyst like ORES. "Why isn't anyone experimenting with better ways to do quality control?" My answer is that "it's hard!" The only people who build advanced quality control tools are computer scientists. ORES is positioned to change that -- to reduce barriers and to spur technological progress to match our social progress in thinking about newcomers. EpochFail (talk) 18:04, 28 December 2016 (UTC)Reply
Jtmorgan, check out this talk @ 24:45. Here I make the point that "technology should have progressed" and use some screenshots of Huggle to demonstrate that it didn't in places where it ought to have.
Do you think this is convincing? EpochFail (talk) 16:46, 2 January 2017 (UTC)Reply

nudges, A/B tests, and "the ineffectiveness of Growth experiments"

As currently written, it sounds like we are setting up a straw man argument here, and I don't yet see what this review of previous interventions motivates. Where are we going with this? Jtmorgan (talk) 17:34, 21 December 2016 (UTC)Reply

OK so my thoughts are here that "we've tried the obvious solutions" that are dominant in the literature. This is why I point to "nudge". I want to use that as a placeholder for a set of strategies that are dominant in the HCI literature:
  1. Identify a desired behavioral change
  2. Redesign interface to "nudge" behavior
  3. Run an A/B test to see if nudge is effective
  4. Iterate.
In nearly all cases, we have shown a lack of substantial (and maybe even significant) effect from these UI nudges (see Growth Experiments). Jtmorgan and I know that the Teahouse is a notable exception, but in a lot of ways, this is the point. The Teahouse isn't a nudge. It's a self-sustaining sub-system of Wikipedia built to empower Wikipedians to solve their own problems. It's far less removed than an interface change intended to direct behavior. EpochFail (talk) 17:57, 28 December 2016 (UTC)Reply
Agreed, and I think the Teahouse is a great example of a nudge vs. something else I don't have a word for yet. I don't want to make it a strawman, but I do think that there is something pretty different.
This might be a side-rant, but I have really come to find the "nudge" approach problematic and would be happy to challenge that in this paper. And I certainly got caught up in that mindset years ago, wanting to make a few small changes that would fix everything forever. It is a really appealing approach for a lot of reasons. There are a few nudges that do genuinely work, and those get really high-profile publications and TED talks. But we rarely hear about all the nudges people try that don't work (publication bias, etc.). So there is this powerful belief that we can achieve social progress primarily through small, non-controversial technological changes. It's great when you find a nudge that works, but if your goal is changing long-established patterns of behavior, then a designer/developer should probably expect that 95% of their nudges won't work, rather than the opposite. Staeiou (talk) 21:33, 28 December 2016 (UTC)Reply
I see where you're coming from, but if we're going to critique nudges, we should engage Thaler and Sunstein more directly, and be clearer about how what we're proposing is different from a nudge. Because while Snuggle/Teahouse may not be nudges, both systems employ choice architecture in their design. And I expect that tools built on top of ORES will as well, at least based on some of the (probably outdated by now) ReviewStream concept mocks I've seen. Jtmorgan (talk) 21:19, 29 December 2016 (UTC)Reply
I like the point that @Jtmorgan: is making: while a (proposed language in quotes) 'systems-level intervention' might be the only approach shown to be effective, there are lower, 'nudge-level' dimensions that matter _in that systems-level design_.
An analogy that's difficult to delineate, but feels intuitive, is ecology. It's common to talk about systems-level change (e.g. the role of a swamp in cleaning water nearby), but have that outcome fail if one of the 'nudge-level' details is left out (e.g. the correct form of bacteria may not be able to live in certain climates, and thus won't clean the water properly). JtsMN (talk) 15:03, 30 December 2016 (UTC)Reply
That styling is bizarre, and I can't figure out how to fix it. JtsMN (talk) 15:08, 30 December 2016 (UTC)Reply
I don't think it's fair to refer to the Teahouse as a designed thing in the way that "choice architecture" imagines designed things. I see the Teahouse as 5% designed things that formed a bedrock and conveyed a specific set of ideas and 95% what people chose to do with those ideas. The 5% of designed things effect very little direct change while the 95% of what the Teahouse hosts have made the Teahouse into is critical. The behaviors of Teahouse hosts may have been intended but they were not really designed. Instead they emerged. By relying on emergence, the founders of the Teahouse were taking a risk and betting on "hearing to speech" -- if we design a space that is explicitly for a certain type of behavior (with the right nudges), then from that we might see a sustainable community turn our designed things into something that fits their view of newcomer socialization and support.
Also, the Teahouse nudges seem to be once-removed. Teahouse designers aren't nudging the newcomer (except maybe with what questions to ask -- not really nudging newcomers to stick around Wikipedia though). They were lightly nudging the Teahouse hosts, maybe. But it seems to me that a more apt metaphor for the Teahouse designers is that of founders. They made fertile ground for the Teahouse to grow, but what the Teahouse became was largely up to the Hosts who would take over running it.
Jtmorgan, does this jibe? I'm not sure my knowledge of the Teahouse's history is complete enough. EpochFail (talk) 15:33, 2 January 2017 (UTC)Reply
@EpochFail: I'm going to stop using the word "nudge" for a moment in order to draw attention to a variety of small design choices that we made when we created the Teahouse with the goal of setting particular expectations, communicating particular messages, and suggesting particular courses of action for hosts and guests:
  1. the Teahouse welcome message - a "nicer message" intended to contrast with other, less nice messages new editors might receive on their talkpages.
  2. the 5 host expectations - a short list of !rules that communicate the way hosts should interact with guests at the Teahouse
  3. Ask a question OR create a profile - two equally-weighted calls to action on the Teahouse landing page, communicating that guests are still welcome to participate (by creating a profile) even if they don't currently have a question to ask
  4. Host profiles - an auto-sorted list of recently active Teahouse hosts who are willing to share a little about themselves, and will help out if contacted directly
  5. "sign your post with five tidles" prompt - a prompt in the Teahouse Q&A gadget that teaches new editors how to sign their posts on talkpages
  6. Host badges - a series of badges (basically Teahouse-specific barnstars) related to desirable behaviors that hosts can give to one another, and place on their profiles (pretty popular, for a while)
There are a lot more examples, but this makes my point I think. The point is that the Teahouse is in some ways a very designed thing, and that like most designed things it's full of nudges. Thaler and Sunstein didn't invent nudges, they just developed a theoretical framework to describe the phenomenon. The framework may make it seem like behavioral change is easier than it actually is, but that doesn't mean that small-scale interventions can't work.
I think what you're trying to get at in the criticism of nudges is that you can't expect any old small design tweak to work the way you want it to. You have to take a system/ecological perspective and understand the potential impact of that change in context, and the way people other than you will understand the change. You can't just add a "like" button to any old forum and expect that people will use it like they do on Facebook. So if we want to critique past WMF new editor engagement initiatives or any other unsuccessful design interventions in an honest way, we need to talk about how these specific changes were and were not contextually appropriate.
I agree with you that ultimately in the case of the Teahouse and pretty much any other successful, self-sustaining designed system the end users need to be able to appropriate/reshape/reinterpret the system according to their own needs and desires. But initial conditions often have a big impact on that process, and small design decisions make a big difference. If, when we created the Teahouse, we hadn't made "Welcome everyone" !rule #1, it would probably not be the friendly place it is today. Jtmorgan (talk) 23:08, 9 January 2017 (UTC)Reply
I think we're talking about two things here. The Teahouse itself is not a nudge, but design decisions (even those that are part of the Teahouse!) can cause nudges. Sure. It seems that maybe this hit a design-matters nerve? I'm certainly not trying to make the argument that design doesn't matter. Instead, I'm trying to make the argument that nudges/minor-design-changes alone are the wrong strategy for addressing a problematic cultural state like the dominant quality control culture in Wikipedia. We've tried many simple "nudges" directed at newcomers with little effect on retention (see the history of the Growth team). I think that addressing a problem like reduced/biased retention requires more than nudges to encourage newcomers to create profiles for themselves or to make copy edits rather than big contributions. It requires a culture shift. I think your "5 host expectations" is a good example of something that is totally not a nudge, but more of a purposeful cultural statement. By making "Welcome everyone" !rule #1, you weren't implementing a nudge at all. You were implementing a cultural norm. EpochFail (talk) 15:17, 24 January 2017 (UTC)Reply

Progress catalyst: Standpoints that haven't been operationalized now can be

Riffing off of "Technology is lagging behind social progress" is the idea of a "progress catalyst" (better name ideas?) -- that by reducing some barriers to innovation in the space of quality control tools, ORES opens the doors for new standpoints, new objectivities (operationalizations), and therefore the expression of values that have until now been silenced by the difficulty and complexity of managing a realtime prediction model. EpochFail (talk) 18:08, 28 December 2016 (UTC)Reply

One way of thinking about this (and I think there are relationships to the two points above as well) is "what affordances does ORES provide?" As "progress catalyst" ORES affords the leveraging of prediction models to the community. JtsMN (talk) 19:45, 28 December 2016 (UTC)Reply
I'd like to turn this discussion towards the term "conversation" because I have found that it helped explain what I'd hoped to happen when building ORES. I'd like to put forth the idea of a "technological conversation". I see this process as better described by "access" than "affordance". When I say "technological conversation", I imagine the expression of ideas through designed "tools" and that new "tools" will innovate in response to past "tools". (anyone know of any lit comparing innovation markets to a conversation and tracking design/affordance memes between, say, phone apps or something like that?)
Back before ORES, there were affordances that allowed the community to use prediction models, but one needed to engage in a complex discipline around Computer Science to do so effectively. The obvious result of this is that only computer scientists built tools that used prediction models to do useful stuff. Their professional vision was enacted and the visions/values/standpoints of others were excluded because they were not able to participate.
OK. Now looking at this like a conversation... Essentially, the only people who were able to participate at first were the computer scientists who valued efficiency and accuracy -- so they built prediction models that optimized these characteristics of Wikipedia quality control work (cite the massive literature on vandalism detection). We've seen that this has been largely successful -- their values were manifested by the technologies they built. E.g. when ClueBot NG goes down it takes twice as long to revert vandalism (cite Geiger & Halfaker 2013, "When the Levee Breaks"). These technologies have somewhat crystallized and stagnated design-wise -- we have a couple of auto-revert bots and a couple of human-computation systems to clean up what the auto-revert bots can't pick up. (We can see the stagnation in the complete rewrite of Huggle that implemented the same exact interaction design.) Snuggle is a good example of another Computer Scientist trying his hand at moving the technological conversation forward. While full of merits, this was more of a paternal approach of "I'll give you the right tool to fix your problems." While I believe that Snuggle helped push the conversation forward, it didn't open the conversation to non-CS participants.
OK onto the progress catalyst. To me, ORES represents a sort of stepping-back from the problem I want to solve (efficient newcomer socialization and support) and embracing the idea that progress is the goal and that I can't be personally responsible for progress itself. Us CS folk couldn't possibly be expected to bring all of Wikipedians' standpoints to a conversation about what technologies around quality control/newcomer socialization and other social wiki-work should look like. So how do we open up the conversation so that we can expand participation beyond this small set of CS-folk? How about we take out the primary barrier that only the CS-folk had crossed? If we're right, non-machine-learning-CS-folks will start participating in the technological conversation and with them, we'll see values/standpoints that us CS folk never considered.
One thing that makes this really exciting is the risk it entails. You lose control when you open a conversation to new participants. Up until now, I've been a relatively dominant voice re. technologies at the boundaries of Wikipedia. I have a set of things I think are important -- that I'd like to see us do ("Speaking to be heard"). But by opening things up, I enable others to gain power and possibly speak much more loudly than I can. Maybe they'll make newcomer socialization worse! Maybe they'll find newcomer socialization to be boring and they'll innovate in ways that don't help newcomers. That's the risk we take when we "Hear to speech". I'm admitting that I don't know what newcomer socialization & quality control ought to look like and I'm betting that we can figure it out together. EpochFail (talk) 16:07, 2 January 2017 (UTC)Reply
I think there are obvious parallels here to wikis in general--they reduce barriers to designing individual web pages as well as the organization of entire web sites, allowing more people to participate directly in both the creation of content and the way that content is organized and presented. New genres of website emerged--e.g. a collaborative design pattern repository, then a "crowdsourced" encyclopedia, then the specialized open wikis created by interest-based communities. I also like your contrast between paternalistic approaches (Snuggle, and also TWA to be fair) and more open-ended, more hands-off, less explicitly directive approaches. ORES definitely fits that model. Still (and we've talked about this before), isn't it still "libertarian paternalism" in the sense that you're providing defaults (e.g. a good faithiness score, maybe less stringent default vandalism thresholds) that you hope will nudge people towards behaving differently than they might otherwise? Aren't you still embedding values in ORES, albeit somewhat different ones and more loosely? (Jonathan, logged out, on phone) 205.175.119.191 (talk) 20:29, 3 April 2018 (UTC)Reply

Accountability of algorithms

I want to talk about past work on this and how it works for ORES.

Right now, ORES' primary mechanisms for accountability look a lot like the rest of software around Wikipedia. We have public work boards, public processes, public (machine readable) test statistics, and we publish datasets for reuse and analysis. We encourage and engage in public discussions about where the machine learning models succeed and fail to serve their intended use-cases. Users do not have direct power over the algorithms running in ORES, but they can affect them through the same processes by which other infrastructures are affected in Wikipedia.

This may not sound as desirable as a fully automated accountability dream that allows users more direct control over how ORES operates, but in a way, it may be more desirable. I like to think of the space around ORES in which our users build false positive reports and conversations take place as a massive boundary object through which we're slowly coming to realize what types of control and accountability should be formalized through digital technologies and/or rules & policies.

At the moment, it seems clear that the next major project for ORES will be a means to effectively refute ORES's predictions/scorings. Through the collection of false positive reports and observations about the way that people use them, we see a key opportunity to enable users to challenge ORES' predictions and provide alternative assessments that can be included along with ORES' predictions. That means, tool developers who use ORES will find ORES' prediction and any manual assessments in the same query results. This is still future work, but it seems like something we need and we have already begun investing resources in bringing this together. EpochFail (talk) 18:17, 28 December 2016 (UTC)Reply

Here are some notes that JtsMN posted in the outline.
  • Jake's Notes
    • accountability thread for future discussion, with an example
      • models that stop discriminatory practice against anons may have other effects
      • perhaps switching to GradientBoosting from LinearSVC helps anons, and harms e.g. the gender gap
EpochFail (talk) 18:27, 28 December 2016 (UTC)Reply
My thinking on this line is mostly as a discussion point. I think the point you make above is reasonable, and for a given Discussion section, I think subsections of "fully automated accountability dream" and "sociotechnical oversight" are both super interesting. JtsMN (talk) 19:39, 28 December 2016 (UTC)Reply
Also, to be clear, I very much agree that "effective refutation" is a super interesting direction for accountability. JtsMN (talk) 19:45, 28 December 2016 (UTC)Reply
Agreed on the sub-sections. JtsMN, how would you define the "fully automated accountability dream"? Here's what I'd do about "sociotechnical oversight".
Sociotechnical oversight
  • Thinking about boundary objects. We don't yet know what types of oversight will be necessary and how people will want to engage in it.
  • So we designed open channels and employed wiki pages to let others design their means of reporting false positives and other concerns.
    • Public wiki pages, web forum, mailing list, work logs, open workboard for project management and prioritization of work.
    • We also worked with local confederates from different communities to help socialize ORES & ORES-related tools as well as to design a process that would work for their community. These confederates helped us with translations and to iterate on solutions with communities who we could otherwise not effectively work with.
  • Rewards:
    • We learned that humans are pretty good at "seeing" into the black box.
    • We saw effective oversight occur in some interesting cases (anon bias, Italian "ha", etc.)
    • We saw themes emerge in how people want to engage in oversight activities and this has driven the motivation for encoding some of this process in technology -- e.g. a means to review predictions and "effectively refute" them.
    • We learned certain strategies to avoid -- e.g. sending everyone to a "central wiki" to report problems and concerns didn't really work for many communities. EpochFail (talk) 16:40, 2 January 2017 (UTC)Reply
I'm not 100% sure what this sort of thing looks like, but I'm gonna brain-dump, and we can go from there. I think this section has to be more speculative, and less anchored in ORES experiences thus far, but I think there are points at which to tie it back.
Fully Automated Accountability At Scale?
Accountability seems to have three major factors
  • Verification that the system isn't biased along certain dimensions (e.g. protected groups)
  • Effect sizes
  • Ability to raise new dimensions along which bias is 'claimed'/hypothesized to be occurring
As such, the question of automating accountability hinges on these factors
  • There are techniques that would allow achieving the first one
    • it's apparently common in ML circles to treat models as a 'black box', and seek to predict the output along different (hypothesized to be biased) dimensions
      • This is broadly automating the community review process that occurred for anons and false-positives.
    • https://arxiv.org/abs/1611.04967
  • For point 2, at what point is 'a bias that has been shown' /meaningful/?
    • There are clearly meaningful examples (anons)
    • A rule of thumb used elsewhere is a definition of 'disparate impact'
      • could this be operationalized automatically?
    • This could also be one dimension in which ORES could also support standpoints
    • How does addressing one dimension of bias affect the others (the intersectionality question)?
      • e.g. the LinearSVC model is better for anons than GradientBoosting, but may harm gender bias efforts (if anons are almost always male, does enabling better anon participation harm the efficacy of WikiProject Women in Science?)
  • The third is more of an open question:
    • Should it be community driven?
    • Should there be effort to automate recognition of dimensions of bias? How do we distinguish between 'bias against swear words' (statistical signal), and 'bias against anons' (harm from statistical signal), if there is no community involvement?
  • While there's clearly a tension in between full automation and community participation, the question of legibility and scale is really important - as algorithmic infrastructure is formalized, what will failing the community actually look like?
  • A 'fully automated accountability system', like ORES, risks operationalizing the ideologies of the builders.
    • It's not clear that full automation can ever be achieved while supporting standpoints and meaningful community oversight
    • Lowering the barriers to accountability (e.g. proposing new dimensions of hypothesized bias, etc.) at scale may be a fundamentally sociotechnical problem
    • However, automating "due diligence" may be 'good enough automation'. This could mean:

Our role as technologists: Are we just encoding our own ideologies?

I asked this question out of due diligence, but it probably warrants a big discussion. There are probably degrees to which we do and do not encode our own ideologies.

E.g. I think that machine learning is important to quality control. I have somewhat of a techno-centric view of things. I also see value in "efficiency" when it comes to Wikipedia quality control. It's from this standpoint that I saw the technical conversation and the barrier of developing machine learning systems as critical. So, lots of ideology getting encoded there.

On the other hand, by not specifically building user interfaces, we make space -- we "hear to speech" (see http://actsofhope.blogspot.com/2007/08/hearing-to-speech.html). So, maybe we encode our ideologies to an extent, but we do not continue past that extent and instead make space to hear what others want to "say" through their own technological innovation.

I think it is interesting to draw a contrast between this approach and what we see coming out of Facebook/Google/Twitter/etc. and their shrink-wrapped "intelligent" technologies that fully encode a set of values and provide the user with little space to "speak" to their own values. EpochFail (talk) 18:24, 28 December 2016 (UTC)Reply

It's an important question to ask and discuss. A lot of the foundational scholarship in the software studies, politics of algorithms, and values in design literatures involves pointing to systems and saying, "Look! Values! Embedded in design!" Most of those canonical cases are also examples of very problematic values embedded in design. So the literature often comes across as saying that it is a bad thing to encode values in design.
I take the position that it is impossible to not encode values into systems. To say that you aren't encoding values into systems is the biggest ideological dupe of them all (and pretty dangerous). Instead, the more responsible move (IMO) is to explicitly say what your values are, give explanations about why you think they are important and valuable, and discuss how you have encoded them into a system. Then others can evaluate your stated values (which they may or may not agree with) and your implementation of your values (which may or may not be properly implemented).
Even though no traditional GUI user interfaces are built as part of the ORES core project, an API is definitely an interface that has its own affordances and constraints. But I do think it is interesting to draw a parallel to Facebook and maybe Twitter in particular -- Twitter used to be a lot more open about third party clients using their API, and lots of the innovation in Twitter came from users (retweets, hashtags). But they have tightened down the API heavily in recent years, particularly when someone provides a third party tool that they feel goes against what they think the user experience of Twitter should be.
So to wrap this up, I guess there are two levels of values in ORES: 1) the values in an open, auditable API built to let anyone create their own interfaces, and 2) the values encoded in this specific implementation of a classifier for article/edit quality. For example, you could have an open API for quality that uses a single classifier trained only on revert data and doesn't treat anons as a kind of protected class. Staeiou (talk) 21:10, 28 December 2016 (UTC)Reply
I think that there's another angle that I want to concern myself with -- ethically. I think it's far more ethical for me (the powerful Staff(TM) technologist) to try to enable others rather than just use my loud voice to enact my own visions. Staeiou, I wonder what your thoughts are there? It looks like this fits with value (1) and I agree. But I'd go farther than saying I simply value it: there might be some wrongness/rightness involved in choosing how to use power in this case. EpochFail (talk) 16:14, 2 January 2017 (UTC)Reply

Discussion about disabling Flow board

Hey folks, there's a discussion @ Meta:Babel/Archives/2016-11#Flow that, if it passes, would result in this discussion board getting disabled. It's a big mess. Your input would be valuable. I'll answer any questions you have. EpochFail (talk) 19:45, 1 January 2017 (UTC)Reply

Flow

Hello. This is just a note that there is an ongoing discussion to remove Flow (this discussion format) from Meta. You are welcome to comment here. ~ Matthewrbowker Drop me a note 00:33, 9 January 2017 (UTC)Reply

Been away for a while.

In the meantime, I proposed the Wikimedia Foundation Scoring Platform team to get some resources for ORES in the next fiscal year (starting July 1st). The process is going well, so I'm hopeful. I also worked with Ladsgroup to do a bunch of stuff for ORES. Staeiou and I also ran a couple of workshops (Building an AI Wishlist [list] and Algorithmic dangers and transparency) at the dev summit. I got a few great conversations in, but I didn't have much time to focus on this paper.

I'm hoping to start a few new threads by CSCW so maybe we could catch up there. Here's a really loose train of thought I'm going to pick up later:

Discussion of integrating ORES into Draft review

en:Wikipedia:Village_pump_(technical)#How_hard_would_it_be_to_set_up_ORES_for_the_Draft:_space?

en:User:John Cummings:

So I've tried recently to find articles within the Draft: space from Wikipedia:Drafts to work on but I have found it very difficult because I have to go to each one individually to check what is in it. I think that having a page with an ORES rating would make this much easier. E.g. pages with an ORES rating of 0.9 are probably more likely to only need a small amount of work to get them to being publishable than an article with a rating of 0.4.


en:User:Insertcleverphrasehere:

The other WP:RATER has ORES built in for page assessment as well (I have found it to be reasonably accurate).

EpochFail (talk) 18:54, 3 April 2018 (UTC)Reply

Reinterpretation of the meaning of ORES predictions

https://wikiedu.org/blog/2016/09/16/visualizing-article-history-with-structural-completeness/

This shows evidence of appropriation and novel development that was spurred (or at least eased) by the availability of ORES.

It's worth noting that User:Ragesoss was the person who originally proposed that we allow ORES to accept injected features. I just took that to its logical conclusion in ORES and Sage started a project to turn it into something practically useful. See Phab:T160840 and [some work that Arlolra is doing to enable Tor users to contribute] EpochFail (talk) 20:06, 3 April 2018 (UTC)Reply

From Sage Ross's UI, when looking for suggestions for en:Charles Berkeley, you get:
===== Automated Suggestions: =====
  • Cite your sources! This article needs more references.
EpochFail (talk) 20:10, 3 April 2018 (UTC)Reply

Preserving the margins on digital platforms

https://medium.com/@gmugar/preserving-the-margins-on-digital-platforms-c42bdbab8dad

While [Open Online Platforms] welcome and encourage a wide range of participation, they have distinct terms of participation that constrain what we can and cannot do.

My take-away from this is that the "margins" are a source for innovation. The conflict between order and wide participation is negotiated and adapted in the margins. In ORES' case, the intervention originated from the observation that Wikipedia was failing to adapt to a problem. The goal is to expand the margins around IUI tool developers in order to jump-start innovation/adaptation there.

Jmorgan, what would the lit on genre ecologies have to say about this? EpochFail (talk) 21:17, 3 April 2018 (UTC)Reply

http://www.dourish.com/publications/2006/cscw2006-cyberinfrastructure.pdf
http://www.dwrl.utexas.edu/old/content/describing-assemblages
Mediated artifacts change the way that people engage. EpochFail (talk) 20:23, 5 April 2018 (UTC)Reply
I want to mash together the ideas around Genre Ecologies, Successors, Hearing to speech/Boundary preservation and ask what we'd expect to see. I think the answer is that we'd expect to see stuff like what Sage et al. are developing. EpochFail (talk) 20:35, 5 April 2018 (UTC)Reply
By hearing to speech, we're opening the door for re-mediation artifacts that can gain ground/support/anchor.
"By virtue of the way that ORES is designed, it creates more opportunity for different interpretations for what quality control means or how quality control is enacted." --J-Mo
"Can you say that a tool built on ORES fundamentally changes the way that people view quality control process?" --J-Mo
"The system for quality control has taken on a kind of funnel model where ... [lines of defense] ... and that has stabilized."
"JADE is part of the ORES system. It is a hearing-to-speech."
"JADE/ORES-transparency encourages people to articulate what they think quality means." EpochFail (talk) 20:59, 5 April 2018 (UTC)Reply

Title ideas -- what *is* ORES anyway?

Hey folks, I've been working on figuring out what to call ORES -- and in effect, what to title the paper. ORES is an intervention that enables "successors" to the status quo. It "expands/preserves the margin" of quality control tool development. It's a successor system itself -- given that we now value transparency, auditability, and interrogability, ORES has been developed with those values in mind.

I joked on twitter that we call ORES a "successor platform" -- as in a "platform for developing successors". I wonder if there's something from the genre ecologies literature that might give this a name. E.g. what does one call the ecosystem or maybe something that improves viability in the ecosystem? ORES enables an increase in ecological diversity for the purposes of boosting the adaptive capacity of the larger system. EpochFail (talk) 15:59, 4 April 2018 (UTC)Reply

Jmorgan & Staeiou: ^ EpochFail (talk) 15:59, 4 April 2018 (UTC)Reply
Facilitating re-mediation of Wikipedia's socio-technical problems. EpochFail (talk) 20:32, 5 April 2018 (UTC)Reply

ORES system (Dependency injection & Interrogability)

From a technical perspective, ORES is an algorithmic "scorer" container system. It's primarily designed for hosting machine classifiers, but it is suited to other scoring paradigms as well. E.g. at one point, our experimental installation of ORES hosted a Flesch-Kincaid readability scorer. The only real constraints are that the scorer must express its inputs in terms of "dependencies" that ORES knows how to solve and the scores must be presentable in JSON (JavaScript Object Notation).

Dependency injection

One of the key features of ORES that allows scores to be generated in an efficient and flexible way is a dependency injection framework.

Efficiency. For example, there are several features that go into the "damaging" prediction model that are drawn from the diff of two versions of an article's text (words added, words removed, badwords added, etc.). Because all of these features "depend" on the diff, the dependency solver can make sure that only one diff is generated and that all subsequent features make use of that shared data.

ORES can serve multiple scores in the same request. E.g. a user might want to gather a set of scores: "edit type", "damaging", and "good-faith". Again, all of these prediction models depend on features related to the edit diff. ORES can safely combine all of the features required for each of the requested models and extract them with their shared dependencies together. Given that feature extraction tends to take the majority of time in a real-time request (partially for fetching data [IO], partially for computation time [CPU]), this allows the scoring for multiple, similar models to take roughly the same amount of time as scoring a single model.

Flexibility. When developing a model, it's common to experiment with different feature sets. A natural progression in the life of a model in the wild involves the slow addition of new features that prove useful during experimentation. By implementing a dependency system and a dependency solver, a model can communicate to ORES which features it needs and ORES can provide those features. At no point does a model developer need to teach ORES how to gather new data. All of the information needed for the solver to solve any given feature set is wrapped up in the dependencies.
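
To make this concrete, here is a minimal, self-contained sketch (in Python) of a dependency solver of this kind. The class and feature names are hypothetical -- this is not the actual ORES/revscoring API -- but it shows how declaring dependencies lets a solver compute shared data (like the diff) only once across features.

  class Dependent:
      """A named value that declares which other values it depends on."""
      def __init__(self, name, process, depends_on=()):
          self.name = name
          self.process = process          # function of the resolved dependency values
          self.depends_on = tuple(depends_on)

  def solve(dependent, cache=None):
      """Recursively resolve a Dependent, re-using values already in `cache`."""
      cache = {} if cache is None else cache
      if dependent.name not in cache:
          args = [solve(d, cache) for d in dependent.depends_on]
          cache[dependent.name] = dependent.process(*args)
      return cache[dependent.name]

  # Hypothetical root datasources and features for a "damaging"-style model.
  parent_text = Dependent("parent_text", lambda: "the cat sat")
  current_text = Dependent("current_text", lambda: "the cat sat on the mat!!!")
  diff_words = Dependent("diff_words",
                         lambda old, new: set(new.split()) - set(old.split()),
                         depends_on=(parent_text, current_text))
  words_added = Dependent("words_added", len, depends_on=(diff_words,))
  exclaims_added = Dependent("exclaims_added",
                             lambda new: new.count("!"),
                             depends_on=(current_text,))

  # Solving several features with a shared cache resolves each root value once.
  cache = {}
  print(solve(words_added, cache), solve(exclaims_added, cache))  # -> 2 3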

Interrogability

The flexibility provided by the dependency injection framework let us implement a novel strategy for exploring how ORES' models make predictions. By exposing the features extracted to ORES users and allowing them to inject their own features, we can allow users to ask how predictions would change if the world were different. Let's look at an example to demonstrate. Let's say you wanted to explore how ORES judges unregistered (anon) editors differently from registered editors.

https://ores.wikimedia.org/v3/scores/enwiki/34234210/damaging?features

Returns:

        "damaging": {
          "features": {
            ...
            "feature.revision.user.is_anon": false,
            ...
          },
          "score": {
            "prediction": false,
            "probability": {
              "false": 0.938910157824447,
              "true": 0.06108984217555305
            }
          }
        }

This essentially means that the "damaging" prediction model concludes that the edit identified by the revision ID of 34234210 is not damaging with 93.9% confidence. We can ask ORES to make a prediction about the exact same edit, but to assume that the editor was unregistered.

https://ores.wikimedia.org/v3/scores/enwiki/34234210/damaging?features&feature.revision.user.is_anon=true

Returns:

        "damaging": {
          "features": {
            ...
            "feature.revision.user.is_anon": true,
            ...
          },
          "score": {
            "prediction": false,
            "probability": {
              "false": 0.9124151990561908,
              "true": 0.0875848009438092
            }
          }
        }

If this edit were saved by an anonymous editor, ORES would still conclude that the edit was not damaging, but with less confidence (91.2%). By following a pattern like this for a single edit or a set of edits, we can get to know how ORES prediction models account for anonymity.
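
As a rough sketch of how a tool might automate this comparison (written in Python with the requests library; the v3 response is assumed to nest scores under the wiki, revision ID, and model name as in the excerpts above, so adjust the key path if the layout differs):

  import requests

  BASE = "https://ores.wikimedia.org/v3/scores/enwiki/34234210/damaging"

  def damaging_probability(injected=""):
      """Fetch the damaging probability, optionally injecting feature values."""
      response = requests.get(BASE + "?features" + injected, timeout=30)
      response.raise_for_status()
      # Assumed response layout: wiki -> scores -> rev_id -> model -> score.
      score = response.json()["enwiki"]["scores"]["34234210"]["damaging"]["score"]
      return score["probability"]["true"]

  as_saved = damaging_probability()
  as_anon = damaging_probability("&feature.revision.user.is_anon=true")
  print("as saved:", round(as_saved, 3), "if anon:", round(as_anon, 3))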

This is a very powerful tool for interrogating ORES. Imagine being able to ask a law enforcement officer if they feel like they have probable cause for a search and then asking again how their answer would change if the suspect were black.

Interrogability isn't only useful for checking to see where ORES' biases originate and what effects they have on predictions. Some of our users have started using our article quality prediction models (wp10) as a method for suggesting work to new editors. (Cite: Sage Ross) By asking ORES to score a student's draft and then asking ORES to reconsider the predicted quality level of the article with one more header, one more image, or one more citation, they've built an intelligent user interface that can automatically recommend the most productive development to the article -- the change that will most likely bring it to a higher quality level. EpochFail (talk) 20:48, 4 April 2018 (UTC)Reply

Observation: Adoption patterns

When we designed and developed ORES, we were targeting a specific problem -- expanding the set of values applied to the design of quality control tools to include a recent understanding of the importance of newcomer socialization. However, we don't have any direct control over how developers choose to use ORES. We hypothesized that, by making edit quality predictions available to all developers, we'd lower the barrier to experimentation in this space. However, it's clear that we lowered barriers to experimentation generally. After we deployed ORES, we implemented some basic tools to showcase ORES, but we observed a steady adoption of our various prediction models by external developers in current tools and through the development of new tools.

Showcase tools

In order to showcase the utility of ORES, we developed two simple tools to surface ORES predictions within MediaWiki -- the software that powers Wikipedia: ScoredRevisions and the ORES Review Tool.

ScoredRevisions[1] is a JavaScript-based "gadget" that runs on top of MediaWiki. When certain pages load in the MediaWiki interface (e.g. Special:RecentChanges, Special:Watchlist, etc.), ScoredRevisions submits requests to the ORES service to score the edits present on the page. The JavaScript then updates the page with highlighting based on ORES predictions. Edits that are likely to be "damaging" are highlighted in red. Edits that might be damaging and are worth reviewing are highlighted in yellow. Other edits are left with the default background.

While this interface was excellent for displaying ORES' potential, it had limited utility. First, it was severely limited by the performance of the ORES system. While ORES is reasonably fast for scoring a single edit, scoring 50-500 edits (the ranges that commonly appear on these pages) can take 30 seconds to 2 minutes, so a user is left waiting for the highlighting to appear. Also, because ScoredRevisions is only able to score edits after they are rendered, there was no way for a user to ask the system to filter edits ahead of time -- for example, to only show edits that are likely to be damaging. So the user needed to visually filter the long lists based on highlighted rows.

The ORES Review Tool[2] is a MediaWiki extension implemented in PHP. It uses an offline process to score all recent edits to Wikipedia and to store those scores in a table for querying and quick access. This tool implemented similar functionality to ScoredRevisions, but because it had pre-cached ORES scores in a table, it rendered highlights for likely damaging edits as soon as the page loaded, and it enabled users to filter based on likely damaging edits.

We released the ORES Review Tool as a "beta feature" on Wikimedia wikis where we were able to develop advanced edit quality models. The response was extremely positive. Over 26k editors in Wikipedia had manually enabled the ORES Review Tool by April of 2017. For reference, the total number of active editors across all languages of Wikipedia varies around 70k[3], so this means that a large proportion of active editors consciously chose to enable the feature.

Adoption in current tools

Many tools for counter-vandalism in Wikipedia were already available when we developed ORES. Some of them made use of machine prediction (e.g. Huggle[4], STiki, ClueBotNG), but most did not. Soon after we deployed ORES, many developers that had not previously included their own prediction models in their tools were quick to adopt ORES. For example, RealTime Recent Changes[5] includes ORES predictions alongside its realtime interface, and FastButtons[6], a Portuguese Wikipedia gadget, began displaying ORES predictions next to its buttons for quickly reviewing and reverting damaging edits.

Other tools that were not targeted at counter-vandalism also found ORES predictions useful -- specifically, the article quality (wp10) predictions. For example, RATER[7], a gadget for supporting the assessment of article quality, began to include ORES predictions to help its users assess the quality of articles, and SuggestBot[8], a robot for suggesting articles to an editor, began including ORES predictions in its tables of recommendations.

New tools

A screenshot of the Edit Review Filters interface with ORES score-based filters displayed at the top of the list

Many new tools have been developed since ORES was released that may not have been developed at all otherwise. For example, the Wikimedia Foundation developed a complete redesign of MediaWiki's Special:RecentChanges interface that implements a set of powerful filters and highlighting. They took the ORES Review Tool to its logical conclusion with an initiative that they referred to as Edit Review Filters[9]. In this interface, ORES scores are prominently featured at the top of the list of available filters.

When we first developed ORES, English Wikipedia was the only wiki that we were aware of that had a robot that used machine prediction to automatically revert obvious vandalism[10]. After we deployed ORES, several wikis developed bots of their own to use ORES predictions to automatically revert vandalism. For example, PatruBOT in Spanish Wikipedia[11] and Dexbot in Persian Wikipedia[12] now automatically revert edits that ORES predicts are damaging with high confidence. These bots have been received with mixed acceptance. Because of the lack of human oversight, concerns were raised about PatruBOT's false positive rate, but after consulting with the developer, we were able to help them find an acceptable threshold of confidence for auto-reverts.

One of the most noteworthy new tools is the suite of tools developed by Sage Ross to support the Wiki Education Foundation's[13] activities. Their organization supports classroom activities that involve editing Wikipedia. They develop tools and dashboards that help students contribute successfully and help teachers monitor their students' work. Ross has recently published about how they interpret meaning from ORES' article quality models[14] and has integrated these predictions into their tools and dashboards to recommend work that students need to do to bring their articles up to Wikipedia's standards. See our discussion of interrogation in Section [foo].

EpochFail (talk) 15:22, 5 April 2018 (UTC)Reply

TODO: ORES system (Threshold optimizations)

When we first started developing ORES, we realized that interpreting the likelihood estimates of our prediction models would be crucial to using the predictions effectively. Essentially, the operational concerns of Wikipedia's curators need to be translated into a likelihood threshold. For example, counter-vandalism patrollers seek to catch all (or almost all) vandalism before it is allowed to stick in Wikipedia for very long. That means they have an operational concern around the recall of a damage prediction model. They'd also like to review as few edits as possible in order to catch that vandalism. So they have an operational concern around the filter rate -- the proportion of edits that are not flagged for review by the model[1].

By finding the threshold of prediction likelihood that optimizes the filter-rate at a high level of recall, we can provide vandal-fighters with an effective trade-off for supporting their work. We refer to these optimizations in ORES as threshold optimizations and ORES provides information about these thresholds in a machine-readable format so that tools can automatically detect the relevant thresholds for their wiki/model context.

Originally, when we developed ORES, we defined these threshold optimizations in our deployment configuration. But eventually, it became apparent that our users wanted to be able to search through fitness metrics to adapt their own optimizations. Adding new optimizations and redeploying quickly became a burden on us and a delay for our users. So we developed a syntax for requesting an optimization from ORES in realtime using fitness statistics from the model's tests. E.g. "maximum recall @ precision >= 0.9" gets a useful threshold for a counter-vandalism bot or "maximum filter_rate @ recall >= 0.75" gets a useful threshold for semi-automated edit review (with human judgement).

Example:

https://ores.wikimedia.org/v3/scores/enwiki/?models=damaging&model_info=statistics.thresholds.true.'maximum filter_rate @ recall >= 0.75'

Returns:

  {"threshold": 0.299, ..., 
   "filter_rate": 0.88, "fpr": 0.097, "match_rate": 0.12, "precision": 0.215, "recall": 0.751}

This result shows that, when a threshold is set on 0.299 likelihood of damaging=true, then you can expect to get a recall of 0.751, precision of 0.215, and a filter-rate of 0.88. While the precision is low, this threshold reduces the overall workload of vandal-fighters by 88% while still catching 75% of (the most egregious) damaging edits.
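
Here is a sketch of how a tool might request such an optimization and apply the resulting threshold (again in Python with requests; the exact shape of the model_info response is an assumption and may need adjusting):

  import requests
  from urllib.parse import quote

  optimization = "maximum filter_rate @ recall >= 0.75"
  url = ("https://ores.wikimedia.org/v3/scores/enwiki/"
         "?models=damaging&model_info=statistics.thresholds.true."
         + quote("'" + optimization + "'"))

  response = requests.get(url, timeout=30)
  response.raise_for_status()
  model_info = response.json()["enwiki"]["models"]["damaging"]
  # Assumed layout: the optimization query returns a list of matching thresholds.
  stats = model_info["statistics"]["thresholds"]["true"][0]
  threshold = stats["threshold"]  # e.g. 0.299 for the example above

  def flag_for_review(damaging_probability):
      """A review tool would flag edits at or above the optimized threshold."""
      return damaging_probability >= threshold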

EpochFail (talk) 16:19, 5 April 2018 (UTC)Reply

Re-mediating what?

Beyond the obvious re-mediation of the quality control process itself.

Think technological probe:

The transparency encourages people to consider the relationship of their process to the algorithmic predictions that support them. "What is the proper role of algorithmic tools in quality control in Wikipedia?" See discussion of PatruBot in Spanish Wikipedia.

The availability of the algorithm allows people to critique dominant narratives about quality control issues. See ACTRIAL and the draft quality model. EpochFail (talk) 21:08, 5 April 2018 (UTC)Reply

Talk about proto-jade (misclassification pages)

When we first deployed ORES, we reached out to several different wiki-communities and invited them to test out the system for use in patrolling for vandalism. In these announcements, we encouraged editors to install ScoredRevisions -- the only tool that made use of ORES' edit quality models at the time. ScoredRevisions both highlights edits that are likely to be damaging (as predicted by the model) and displays the likelihood of the prediction as a percentage.

It didn't take long before our users began filing false-positive reports on wiki pages of their own design. In this section we will describe three cases where our users independently developed these false-positive reporting pages and how they used them to understand ORES and the role of automated quality control in their own spaces, and to communicate with us.

Case studies

Report mistakes (Wikidata)

ORES report mistakes -- improvements table.

When we first deployed prediction models for Wikidata, a free and open knowledge base that can be read and edited by both humans and machines[1], we were breaking new ground by building a damage detection classifier based on a structured data wiki[2]. So we created a page called "Report mistakes" and invited users to tell us on that page about mistakes that the prediction model made, but we left the format and structure largely up to the users.

Within 20 minutes, we received our first report from User:Mbch that ORES was reporting edits that couldn't possibly be vandalism as potentially damaging. As reports streamed in, we began to respond to them and make adjustments to the model building process to address data extraction bugs and to increase the signal so that the model could differentiate damaging from non-damaging edits. After a month of reports and bug fixes, we decided to build a table to represent the progress that we made in iterations on the model against the reported false-positives. See Figure ?? for a screenshot of the table. Each row represents a mis-classified edit and each column describes the progress we made in not detecting those edits as damaging in future iterations of the model. Through this process, we learned how Wikidata editors saw damage and how our modeling and feature extraction process captured signals in ways that differed from Wikidata editors' understandings. We were also able to publicly demonstrate improvements to this community.

Patrolling/ORES (Italian Wikipedia)

Italian Wikipedia was one of the first wikis where we deployed basic edit quality models. Our local collaborator who helped us develop the language-specific features, User:Rotpunkt, created a page for ORES[3] with a section for reporting false-positives ("falsi positivi"). Within several hours, Rotpunkt and a few other editors started to notice some trends in their false positive reports. First, Rotpunkt noticed that there were several counter-vandalism edits that ORES was flagging as potentially damaging, so he made a section for collecting that specific type of mistake ("annullamenti di vandalismo"). A few reports later, he added a section for corrections to the verb for "have" ("correzioni verbo avere"). Through this process, editors from Italian Wikipedia were essentially performing a grounded theory exploration of the general classes of errors that ORES was making.

Once there were several of these mistake-type sections and several reports within each section, Rotpunkt reached out to us to let us know what he'd found. He explained to us (via our IRC channel) that many of ORES' mistakes were understandable, but there were some general trends in mistakes around the Italian verb for "have": "ha". We knew immediately what was likely to be the issue. In English and many other languages, "ha" signifies laughter -- an example of informal language that doesn't belong in an encyclopedia article. In Italian, however, "ha" is a conjugation of the verb "to have" and is perfectly acceptable in articles.

Because of the work of Rotpunkt and his collaborators in Italian Wikipedia, we were able to recognize the source of this issue (a set of features intended to detect the use of informal language in articles) and to remove "ha" from that list for Italian Wikipedia. This is just one example of many issues we were able to address because of the grounded theory and thematic analysis performed by Italian Wikipedians.

PatruBOT (Spanish Wikipedia)

Soon after we released support for Spanish Wikipedia, User:jem developed a robot to automatically revert damaging edits using ORES predictions (PatruBOT). This robot was not running for long before our discussion pages started to be bombarded with confused Spanish-speaking editors asking us questions about why ORES did not like their work. We struggled to understand the origin of the complaints until someone reached out to us to tell us about PatruBOT and its activities.

We haven't been able to find the source code for PatruBOT, but from what we've been able to gather looking at its activity, it appears to us that PatruBOT was too sensitive and was likely reverting edits that ORES did not have enough confidence about. Generally, when running an automated counter-vandalism bot, the most immediate operational concern is precision (the proportion of positive predictions that are true-positives). This is because mistakes are extra expensive when there's no human judgement between a prediction and a revert (rejection of the contribution). The proportion of all damaging edits that are actually caught by the bot (recall) is a secondary concern to be optimized.

We generally recommend that bot developers who are interested in running an automated counter-vandalism bot use a threshold that maximizes recall at high precision (90% is a good starting point). According to our threshold optimization query, the Spanish Wikipedia damaging model can be expected to have 90% precision and catch 17% of damage if the bot only reverted edits where the likelihood estimate is above 0.959.

We reached out to the bot developer to try to help, but given the voluntary nature of their work, they were not available to discuss the issue with us. Eventually, other editors who were concerned with PatruBOT's behavior organized an informal crowdsourced evaluation of the fitness of PatruBOT's behavior[4] where they randomly sampled 1000 reverts performed by PatruBOT and reviewed their appropriateness. At the time of writing, PatruBOT has been stopped[5] and the informal evaluation is ongoing.

Discussion

These case studies in responses to ORES provide a window into how our team has been able to work with the locals in various communities to refine our understandings of their needs, into methods for recognizing and addressing biases in ORES' models, and into how people think about what types of automation they find acceptable in their spaces.

Refining our understandings and iterating our models. The information divide between us researchers/engineers and the members of a community is often wider than we realize. Through iteration with the Wikidata and Italian models, we learned about incorrect assumptions we'd made about how edits happen (e.g. client edits in Wikidata) and how language works (e.g. "ha" is not laughter in Italian). It's likely we'd never be able to fully understand the context in which damage detection models should operate before deploying the models. But these case studies demonstrate how, with a tight communication loop, many surprising and wrong assumptions that were baked into our modeling process could be identified and addressed quickly. It seems that many of the relevant issues in feature engineering and model tuning become *very* apparent when the model is used in context to try to address a real problem (in these cases, vandalism).

Methods for recognizing and addressing bias. The Italian Wikipedians showed us something surprising and interesting about collaborative evaluation of machine prediction: thematic analysis is very powerful. Through the collection of ORES mistakes and iteration, our Italian collaborators helped us understand general trends in the types of mistakes that ORES made. It strikes us that this is a somewhat general strategy for bias detection. While our users certainly brought their own biases to their audit of ORES, they were quick to discover and come to consensus about trends in ORES' issues. Before they performed this process and shared their results with us, we had no idea that any issue was present. After all, the fitness statistics for the damage detection model looked pretty good -- probably good enough to publish a research paper! Their use of thematic analysis seems like a powerful tool that developers will want to make sure is well supported in any crowd-based auditing support technology.

How people think about acceptable automation. In our case study, Spanish Wikipedians are in the process of coming to agreement about what roles are acceptable for automated agents. By watching PatruBOT work in practice, they decided that its false discovery rate (i.e., 1 - precision) was too high, and they started their own independent analysis to find quantitative, objective answers about what the real rate is. Eventually they may come to a conclusion about an acceptable rate, or they may decide that no revert is acceptable without human intervention.

  1. https://wikidata.org
  2. Sarabadani, A., Halfaker, A., & Taraborelli, D. (2017, April). Building automated vandalism detection tools for Wikidata. In Proceedings of the 26th International Conference on World Wide Web Companion (pp. 1647-1654). International World Wide Web Conferences Steering Committee.
  3. it:Progetto:Patrolling/ORES
  4. es:Wikipedia:Mantenimiento/Revisión_de_errores_de_PatruBOT/Análisis
  5. es:Wikipedia:Café/Archivo/Miscelánea/Actual#Parada_de_PatruBOT

EpochFail (talk) 21:28, 5 April 2018 (UTC)Reply

wikidata:Wikidata:ORES/Report_mistakes & it:Progetto:Patrolling/ORES & pt:Wikipédia:Projetos/AntiVandalismo/Edições_possivelmente_prejudiciais & [The spanish] EpochFail (talk) 21:30, 5 April 2018 (UTC)Reply

ORES system overview

[edit]
The ORES architecture at a high-level view.

ORES can be understood as a machine prediction model container service where the "container" is referred to as a ScoringModel. A ScoringModel contains a reference to a set of dependency-aware features (see discussion of Dependency Injection) and has a common interface method called "score()" that takes the extracted feature-values as a parameter and produces a JSON blob (called a "score"). ORES is responsible for extracting the features and serving the score object via a RESTful HTTP interface. In this section, we describe the ORES architecture and how we have engineered the system to support the needs of our users.
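
To make the "container" metaphor concrete, the following is a minimal Python sketch of what a ScoringModel's contract could look like. It assumes a scikit-learn-style classifier and is illustrative pseudocode, not the actual revscoring API.

  class ScoringModel:
      """Container pairing dependency-aware features with a trained classifier."""

      def __init__(self, features, classifier, version):
          self.features = features      # ordered, dependency-aware feature definitions
          self.classifier = classifier  # e.g. a trained gradient boosting classifier
          self.version = version

      def score(self, feature_values):
          """Take extracted feature values and return a JSON-serializable "score"."""
          prediction = self.classifier.predict([feature_values])[0]
          probabilities = self.classifier.predict_proba([feature_values])[0]
          return {
              "prediction": bool(prediction),
              "probability": {str(label): float(p)
                              for label, p in zip(self.classifier.classes_, probabilities)},
          }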

Horizontal scaling

[edit]

In order to be a useful tool for Wikipedians and tool developers, the ORES system uses distributed computation strategies to serve a robust, fast, high-availability service. To make sure that ORES can keep up with demand, we've focused on two points at which the system implements horizontal scalability: the input-output (IO) workers (uwsgi[1]) and the computation workers (celery[2]). When a request is received, it is split across the pool of available IO workers. During this step of computation, all of the root dependencies are gathered for feature extraction using external APIs (e.g. the MediaWiki API[3]). These root dependencies are then submitted to a job queue managed by celery for the CPU-intensive work. By implementing ORES in this way, we can add or remove IO and CPU workers dynamically in order to adjust to demand.
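
The sketch below illustrates that two-stage split, assuming celery for the job queue. The broker configuration and the helper functions (extract_features, load_model) are hypothetical stand-ins, not the production code.

  import requests
  from celery import Celery

  app = Celery("ores_sketch", broker="redis://localhost:6379/0",
               backend="redis://localhost:6379/1")

  def handle_request(rev_id, model="damaging"):
      """IO worker (uwsgi): gather root dependencies from external APIs."""
      root_data = requests.get("https://en.wikipedia.org/w/api.php", params={
          "action": "query", "prop": "revisions", "revids": rev_id,
          "rvprop": "content|user|timestamp|comment", "format": "json",
      }).json()
      # Hand the CPU-intensive work off to the celery queue.
      return extract_and_score.delay(model, root_data)

  @app.task
  def extract_and_score(model, root_data):
      """CPU worker (celery): feature extraction and model scoring."""
      feature_values = extract_features(model, root_data)  # hypothetical helper
      return load_model(model).score(feature_values)       # hypothetical helper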

Robustness

[edit]

Currently, IO workers and CPU workers are split across a set of 9 servers in each of two datacenters (for a total of 18 servers). Each of these servers runs 90 CPU workers and 135 IO workers. The major limitation on running more workers on a single server is memory (RAM), due to the requirement of keeping several different prediction models loaded into memory. IO and CPU workers draw from a shared queue, so other servers can take over should any individual server go down. Further, should one datacenter go fully offline, our load-balancer can detect this and will route traffic to the remaining datacenter. This provides a high level of robustness and allows us to guarantee a high degree of uptime. Given the relative youth of the ORES system, it's difficult to give a fair estimate of the exact up-time percentage[4].

Batch processing

[edit]

Many of our users' use-cases involve batch scoring of a large number of revisions. E.g. when using ORES to build work-lists for Wikipedia editors, it's common to include an article quality prediction. Work-lists are either built from the sum total of all 5m+ articles in Wikipedia or from some large subset specific to a single WikiProject (e.g. WikiProject Women Scientists claims about 6k articles[5]). Robots that maintain these work-lists will periodically submit large batch scoring jobs to ORES, e.g. once per day. It's also relevant to note that many researchers are making use of ORES for various historical analyses, and their activity usually shows up in our logs as a sudden burst of requests.

The separation between IO and CPU work is very useful because it allows us to efficiently handle multi-score requests. A request to score 50 revisions can take advantage of batched IO during the first step of processing and still extract features for all 50 scores in parallel during the second, CPU-intensive step. This batch processing affords up to a 5X increase in scoring speed for large numbers of scores[6]. We generally recommend that individuals looking to do batch processing with ORES submit requests in blocks of 50 scores using up to two parallel connections. This would allow a user to easily score 1 million revisions in less than 24 hours even in the worst-case scenario that none of the scores were cached -- which is unlikely for recent Wikipedia activity.
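
A minimal sketch of that batch pattern follows, assuming the v3 scores endpoint's pipe-separated revids parameter; the response unpacking is based on the examples later in this document and should be treated as illustrative.

  import requests

  ORES = "https://ores.wikimedia.org/v3/scores"

  def score_in_batches(rev_ids, context="enwiki", model="wp10", batch_size=50):
      """Yield (rev_id, score blob) pairs, requesting 50 revisions per HTTP call."""
      for start in range(0, len(rev_ids), batch_size):
          batch = rev_ids[start:start + batch_size]
          response = requests.get(f"{ORES}/{context}/", params={
              "models": model,
              "revids": "|".join(str(r) for r in batch),
          })
          response.raise_for_status()
          scores = response.json()[context]["scores"]
          for rev_id in batch:
              yield rev_id, scores[str(rev_id)][model]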

Single score processing

[edit]

Many of our users' use-cases involve the request for a single score/prediction. E.g. when using ORES for realtime counter-vandalism, tool developers will likely listen to a stream of edits as they are saved and submit a scoring request immediately. It's critical that these requests return in a timely manner. We implement several strategies to optimize this request pattern.

Single score speed. In the worst case scenario, ORES is generating a score from scratch. This is the common case when a score is requested in real-time -- right after the target edit/article is saved. We work to ensure that the median score duration is around 1 second. Currently our metrics tracking suggests that for the week April 6-13th, our median, 75%, and 95% score response timings are 1.1, 1.2, and 1.9 seconds respectively.

Caching and Precaching. In order to take advantage of the overlapping interests around recency between our users, we also maintain a basic LRU cache (using redis[7]) keyed on a deterministic score naming scheme (e.g. enwiki:1234567:damaging would represent the score from the English Wikipedia damaging model for the edit identified by revision ID 1234567). This allows requests for scores that have recently been generated to be returned within about 50ms via HTTPS.
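
The following sketch shows how that deterministic naming scheme maps onto a cache lookup, assuming a redis backend with a TTL standing in for LRU eviction; it is a simplified illustration, not the service's actual implementation.

  import json
  import redis

  cache = redis.Redis(host="localhost", port=6379)

  def score_name(context, rev_id, model):
      # Deterministic name, e.g. "enwiki:1234567:damaging"
      return f"{context}:{rev_id}:{model}"

  def get_cached_score(context, rev_id, model):
      cached = cache.get(score_name(context, rev_id, model))
      return json.loads(cached) if cached else None

  def put_score(context, rev_id, model, score, ttl=86400):
      # A one-day TTL stands in here for redis' LRU eviction policy.
      cache.setex(score_name(context, rev_id, model), ttl, json.dumps(score))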

In order to make sure that scores for recent edits are available in the cache for real-time use-cases, we implement a "precaching" strategy that listens to a high-speed stream of recent activity in Wikipedia and automatically requests scores for a specific subset of actions (e.g. edits). This allows us to attain a cache hit rate of about 80% consistently.
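
A rough sketch of such a precaching listener is shown below, here consuming the public EventStreams recentchange feed over SSE. The production service uses an internal stream and its own filtering, so the endpoint and field names should be treated as assumptions.

  import json
  import requests

  STREAM = "https://stream.wikimedia.org/v2/stream/recentchange"
  ORES = "https://ores.wikimedia.org/v3/scores"

  def precache():
      """Request a score for each new edit so it is cached before tools ask for it."""
      with requests.get(STREAM, stream=True) as events:
          for line in events.iter_lines():
              if not line.startswith(b"data: "):
                  continue
              event = json.loads(line[len(b"data: "):])
              if event.get("type") != "edit":
                  continue
              wiki, rev_id = event["wiki"], event["revision"]["new"]
              # The response body is discarded; the point is to warm the cache.
              requests.get(f"{ORES}/{wiki}/{rev_id}/damaging")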

There are also secondary caches of ORES scores implemented outside of our service. E.g. the ORES Review Tool (an extension of MediaWiki) roughly mimics our own precaching strategy for gathering scores for recent edits in Wikipedia. Since this cache and its access patterns are outside the metrics gathering system we use for the service, our cache hit rate is actually likely much higher than we're able to report.

De-duplication. In real-time use-cases of ORES it's common that we'll receive many requests to score the same edit/article right after it was saved. We use the same deterministic score naming scheme from the cache to identify scoring tasks and to ensure that simultaneous requests for the same score attach to the same result (or pending result) rather than starting a duplicate scoring job. This pattern is very advantageous in the case of precaching because of our network latency advantage: we can generally guarantee that the precaching request for a specific score precedes the external request for that score. The result is that the external request attaches to the result of a score generation process that started before the external request arrived. So even worst-case scenarios where the score is not yet generated often result in a better-than-expected response speed from the tool developer's/user's point of view.
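
The de-duplication idea can be illustrated with an in-process table of pending results keyed by the same score names. The production service coordinates this across processes through its task queue, so the code below is only a toy model of the behavior.

  import threading
  from concurrent.futures import ThreadPoolExecutor

  executor = ThreadPoolExecutor(max_workers=8)
  in_flight = {}            # score name -> Future of a pending scoring job
  in_flight_lock = threading.Lock()

  def request_score(context, rev_id, model, generate_score):
      """Attach to an in-flight scoring job when one exists; otherwise start one."""
      name = f"{context}:{rev_id}:{model}"
      with in_flight_lock:
          future = in_flight.get(name)
          if future is None:
              future = executor.submit(generate_score, context, rev_id, model)
              in_flight[name] = future
              future.add_done_callback(lambda _: in_flight.pop(name, None))
      return future.result()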

Empirical access patterns

[edit]
External requests per minute. The number of requests per minute is plotted for ORES for the week ending on April 13th, 2018. A 4 hour block is broken out to show the shape of a recent, periodic burst of activity that usually happens at 11:40 UTC.
Precaching requests per minute. The total number of precaching requests per minute is plotted for the week ending on April 13th, 2018.

The ORES service has been online since July of 2015[8]. Since then, we have seen steadily rising usage as we've developed and deployed new models. Currently, ORES supports 66 different models for 33 different language-specific wikis.

Generally, we see 50 to 125 requests per minute from external tools that are using ORES' predictions (excluding the MediaWiki extension, which is more difficult to track). Sometimes these external requests burst up to 400-500 requests per second. Figure ?? shows the periodic and bursty nature of scoring requests received by the ORES service. Note that every day at about 11:40 UTC, the request rate jumps due to some batch scoring job -- most likely a bot.

Figure ?? shows our rate of precaching requests coming from our own systems. This graph roughly reflects the rate of edits happening across all of the wikis that we support, since we start a scoring job for nearly every edit as it happens. Note that the number of precaching requests is about an order of magnitude higher than our known external score request rate. This is expected, since Wikipedia editors and the tools they use will not request a score for every single revision. It is the computational price we pay to attain a high cache hit rate and to ensure that our users get the quickest response possible for the scores that they do need.

  1. https://uwsgi-docs.readthedocs.io/en/latest/
  2. http://www.celeryproject.org/
  3. MW:API
  4. en:High_availability#"Nines"
  5. https://quarry.wmflabs.org/query/14033
  6. Sarabadani, A., Halfaker, A., & Taraborelli, D. (2017, April). Building automated vandalism detection tools for Wikidata. In Proceedings of the 26th International Conference on World Wide Web Companion (pp. 1647-1654). International World Wide Web Conferences Steering Committee.
  7. https://redis.io/
  8. See our announcement in Nov. 2015: https://blog.wikimedia.org/2015/11/30/artificial-intelligence-x-ray-specs/

EpochFail (talk) 17:24, 12 April 2018 (UTC)Reply

Design rationale

[edit]
This is largely adapted from Jmorgan's notes.
Wikipedia as a genre ecology. Unlike traditional mass-scale projects, Wikipedia's structure and processes are not centrally planned. Wikipedia's system functions as a heterogeneous assemblage of humans, practices, policies, and software. Wikipedia is an open system and its processes are dynamic, complex, and non-deterministic.
A theoretical framework that accounts for the totality of factors and their relationships is essential to building a system-level understanding of state and change processes. Genre ecologies[1] give us such a framework. A genre ecology consists of “an interrelated group of genres (artifact types and the interpretive habits that have developed around them) used to jointly mediate the activities that allow people to accomplish complex objectives.”[2].
Morgan & Zachry (2010) used genre ecologies to characterize the relationships between Wikipedia’s official policies and essays--unofficial rules, best practices, and editing advice documents that are created by editors in order to contextualize, clarify, and contradict policies. Their research demonstrated that on Wikipedia, essays and policies not only co-exist, but interact. The “proper” interpretation of Wikipedia’s official Civility policy[3] within a particular context is mediated by the guidance provided in the related essay No Angry Mastodons[4].
In genre ecology terms, performing the work of enforcing civil behavior on Wikipedia is mediated by a dynamic equilibrium between the guidance provided in the official policy and the guidance provided in any related essays, with the unofficial genres providing interpretive flexibility in the application of official rules to local circumstances as well as challenging and re-interpreting official ideologies and objectives.
Algorithmic systems clearly have a role in mediating policy, values, and rules in social spaces as well[5]. When looking at Wikipedia's articulation work through the genre ecology lens, it's clear that robots mediate the meaning of policies (cf. Sinebot's enforcement of the signature policy[6]) and human-computation software mediates the way that Wikipedia enacts quality control (cf. Huggle's vision of quality in Wikipedia as separating good from bad[7]).
Wikipedia's problems in automated mediation. Wikipedia has a long-standing problem with regards to how quality control is enacted. In 2006, when Wikipedia was growing exponentially, the volunteers who managed quality control processes were overwhelmed, and they turned to software agents to help make their process more efficient[8]. But the software they developed and appropriated focused only on reifying quality standards and not on good community management practices[9]. The result was a sudden decline in the retention of new editors in Wikipedia and a threat to the core values of the project.
Past work has described these problems as systemic and related to dominant shared-understandings embedded in policies, processes, and software agents[10]. Quality control itself is a distributed cognition system that emerged based on community needs and volunteer priorities[11]. So, where does change come from in such a system -- where problematic assumptions have been embedded in the mediation of policy and the design of software for over a decade? Or maybe more generally, how does deep change take place in a genre ecology?
Making change is complicated by the distributed nature
Since the publication of a seminal report about declining retention in Wikipedia, knowledge that Wikipedia's quality control practices are problematic -- and at the heart of an existential problem for the project -- has become widespread. Several initiatives have been started that are intended to improve socialization practices (e.g. the Teahouse, a question-and-answer space for newcomers[12], and outreach efforts like the Inspire Campaigns, which elicit ideas from contributors on the margins of the community). Such initiatives can show substantial gains under controlled experimentation[13].
However, the process of quality control itself has remained largely unchanged. This assemblage of mindsets, policies, practices, and software prioritizes quality/efficiency and does so effectively (cite: Levee paper and Snuggle paper). To move beyond the current state of quality control, we need alternatives to the existing mode of seeing and acting within Wikipedia.
While it's tempting to conclude that we just need to fix quality control, it's not at all apparent what a better quality control would look like. Worse, even if we knew, how does one cause systemic change in a distributed system like Wikipedia? Harding and Haraway's concept of successors[14][15] gives us insight into how we might think about the development of new software/process/policy components. Past work has explored developing a successor view that prioritizes the support of new editors in Wikipedia over the efficiency of quality control[16][17], but a single point rarely changes the direction of an entire conversation, so change remains elusive.
Given past efforts to improve the situation for newcomers[18] and the general interest among Wikipedia's quality control workers in improving socialization[19], we know that there is general interest in balancing quality/efficiency and diversity/welcomingness more effectively. So where are the designers who incorporate this expanded set of values? How do we help them bring forward their alternatives? How do we help them re-mediate Wikipedia's policies and values through their lens? How do we support the development of more successors?
Expanding the margins of the ecology
Successors come from the margin -- they represent non-dominant values and engage in the re-mediation of articulation. We believe history suggests that such successors are a primary means of change in an open ecology like Wikipedia. For anyone looking to enact a new view of quality control in the design of a software system, there's a high barrier to entry -- the development of a realtime machine prediction model. Without exception, all of the critical, high-efficiency quality control systems that keep Wikipedia clean of vandalism and other damage employ a machine prediction model for highlighting the edits that are most likely to be bad. For example, Huggle[20] and STiki[21] use machine prediction models to highlight likely damaging edits for human review. ClueBot NG[22] uses a machine prediction model to automatically revert edits that are highly likely to be damaging. Together, these automated tools and their users form a multi-stage filter that quickly and efficiently addresses vandalism[23].
So, historically, the barrier to entry with regards to participating in the mediation of quality control policy was a deep understanding of machine classification models. Without this deep understanding, it wasn't possible to enact an alternative view of how quality control should work while also accounting for efficiency and the need to scale. Notably, one of the key interventions in this area that did so was built by a computer scientist[24].
The result is the dominance of a certain type of individual -- a computer scientist (stereotypically, with an eye towards efficiency and less interest in messy human interaction). This high barrier to entry and peculiar in-group have kept the margins narrow and preserved the authority of quality control regimes that were largely developed in 2006 -- long before the social costs of efficient quality control were understood.
If the openness of this space to the development of successors (the re-mediation of quality control) is limited by a rare literacy, then we have two options for expanding the margins beyond the current authorities: (1) increase general literacy around machine classification techniques or (2) remove the need to deeply understand practical machine learning in order to develop an effective quality control tool.
Through the development of ORES, we seek to reify the latter. By deploying a high-availability machine prediction service and engaging in basic outreach efforts, we intend to dramatically lower the barriers to the development of successors. We hope that by opening the margin to alternative visions of what quality control and newcomer socialization in Wikipedia should look like, we also open the doors to participation of alternative views in the genre ecology around quality control. If we're successful, we'll see new conversations about how algorithmic tools affect editing dynamics. We'll see new types of tools take advantage of these resources (implementing alternative visions).
  1. ?
  2. (Spinuzzi & Zachry, 2000)
  3. en:WP:CIVIL
  4. en:WP:MASTODON
  5. Lessig's Code is Law
  6. Lives of bots
  7. Snuggle paper
  8. Snuggle paper
  9. R:The Rise and Decline paper
  10. Snuggle paper
  11. Banning of a vandal
  12. Teahouse CSCW paper
  13. Teahouse Opensym paper
  14. Haraway, D. 1988. “Situated Knowledges: The Science Question in Feminism and the Privilege of Partial Perspective.” Feminist Studies, Vol. 14, No.3. (Autumn, 1988), pp. 575-599.
  15. Harding, S. 1987. The Science Question in Feminism. Ithaca: Cornell University Press.
  16. Snuggle paper
  17. Geiger, R.S. (2014, October 22-24). Successor systems: the role of reflexive algorithms in enacting ideological critique. Paper presented at Internet Research 15: The 15th Annual Meeting of the Association of Internet Researchers. Daegu, Korea: AoIR. Retrieved from http://spir.aoir.org.
  18. Teahouse CSCW paper
  19. Snuggle paper
  20. en:WP:Snuggle
  21. en:WP:STiki
  22. en:User:ClueBot NG
  23. When the levee breaks
  24. Snuggle paper

EpochFail (talk) 23:39, 16 April 2018 (UTC)Reply

    @EpochFail: This is excellent. I made two very small textual changes. There's one additional piece of argument that you might want to add. Starting in the 4th paragraph from the end, you start to describe barriers to participation in quality control. You discuss the technical/expertise barrier around implementing machine learning systems, and I agree that is very important. I think it would also be useful to discuss the ADDITIONAL barrier created by the systems and practices that have developed around the use of these models. Could you argue, for example, that the existing models prioritize recall over precision in vandalism detection, and ignore editor intent, and that this is because those design decisions reflect a particular set of values (or a mindset) related to quality control? People who don't share that mindset--people who are more interested in mentoring new editors, or who care about the negative impacts of being reverted on new editor retention--won't use these tools because they don't share the values and assumptions embedded in the tools. By creating alternative models that embed different values--through interpretability, adjustable thresholds, and "good faith" scores--you provide incentives for folks who were previously marginalized from participating in quality control. Thoughts? Jmorgan (WMF) (talk) 17:19, 17 April 2018 (UTC)Reply
    I’m trying to catch up with the genre ecologies reading, and a first impression is that genre diagrams have a lot in common with data flow diagrams.  The edges contain a process, and the nodes might contain multiple data stores.  I appreciate that the genre theory is giving us a more zoomed-out perspective, in which human behaviors like habits and culture begin to emerge.  From my quick browsing of the background work on genre ecology, I think you’re breaking ground by suggesting that machines mediate in this space as well, in other words considering the data flows which become invisible because they don’t generate genres.  For example, editors will read the genre of ORES scores via a UI, and their administrative actions create a record of reverts, but we must account for the mostly automatic process of training a ML model on the reverts and updating scores, which changes the network topology into a feedback loop.  I’d appreciate help freeing myself of my data flow interpretation on genre ecologies, at some point.  If machine mediation is something new in genre ecologies, then I’m curious about what we gain by bringing in this theory.
    Great to see the focus on effecting change!  I personally agree wholeheartedly that “successors come from the margin”, that we could design interventions all day and the results might even be quite positive, but that the most just, lasting, and visionary change will come from empowering our stakeholders to “let a hundred algorithms bloom”, and we may be able to catalyze this by creating space at the margins.
    Not sure we need to present a stereotypical computer programmer who prefers determinism and logic to messy humans.  It feels like a straw dog, although I won’t deny I’ve heard those exact words at the lunch table…  Maybe better to just point out how simplistic solutions are seductive, and are encouraged by techie culture.
    I want to hear more about how we’re opening the margins.  So far, I’m left with the suggestion that JADE will allow patrollers to push our models in new directions without ML-expert mediation.  This won’t be the obvious conclusion for most readers, I’m guessing, and I’d love to see this conclusion expanded. Adamw (talk) 05:38, 17 April 2018 (UTC)Reply
    First, I'm not sure I can address your thoughts re. process diagrams. I'm personally not as interested in actually modeling out the ecology as much as using the framework to communicate effectively about general dynamics. Maybe Jmorgan has some thoughts.
    I love how you put this:
    we could design interventions all day and the results might even be quite positive, but that the most just, lasting, and visionary change will come from empowering our stakeholders to “let a hundred algorithms bloom”, and we may be able to catalyze this by creating space at the margins.
    When I'm thinking about margins, I'm imagining the vast space for re-mediation of quality control process without pushing the prediction models at all -- just making use of them in novel ways. I think that one does not have to fully open the world in order for effective openness to happen in a marginal sense. Though still, I do think there's going to be some interesting future work potential around making the prediction models more malleable. In the end, if there's a single shared model for "damaging" then that model will represent an authority and not a marginal perspective. We'd instead need to allow multiple damaging models if we were to support marginal activities at that level. EpochFail (talk) 14:09, 17 April 2018 (UTC)Reply

    ORES system: Open, transparent process

    [edit]

    Our goal in the development of ORES and the deployment of models is to keep the process -- the flow of data from random samples to model training and evaluation -- open for review, critique, and iteration. In this section, we'll describe how we implemented transparent replayability in our model development process and how ORES outputs a wealth of useful and nuanced information for users. By making this detailed information available to users and developers, we hope to enable flexibility and power in the evaluation and use of ORES predictions for novel purposes.

    Gathering labeled data

    [edit]

    There are two primary strategies for gathering labeled data for ORES' models: found traces and manual labels.

    Found traces. For many models, there is already a rich set of digital traces that can be assumed to reflect a useful human judgement. For example, in Wikipedia, it's very common that damaging edits will be reverted and that good edits will not be reverted. Thus the revert action (and its remaining traces) can be used to assume that the reverted edit is damaging. We have developed a re-usable script[1] that, when given a sample of edits, will label each edit as "reverted_for_damage" or not based on a set of constraints: the edit was reverted within 48 hours, the reverting editor was not the same person, and the edit was not restored by another editor.
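
    A condensed sketch of that heuristic appears below; the real autolabel utility[1] works from full revision histories and handles more edge cases, so the dict fields used here are illustrative.

        from datetime import timedelta

        REVERT_WINDOW = timedelta(hours=48)

        def reverted_for_damage(edit, reverts):
            """Label an edit as damaging-by-proxy under the constraints described above."""
            if edit.get("restored_later"):
                return False  # another editor re-instated the content
            for revert in reverts:
                if revert["user"] == edit["user"]:
                    continue  # self-reverts don't imply damage
                if revert["timestamp"] - edit["timestamp"] > REVERT_WINDOW:
                    continue  # reverted too late to count as routine quality control
                return True
            return False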

    However, this "reverted_for_damage" label is problematic in that many edits are reverted not because they are damaging but because they are involved in some content dispute. Also, the label does not differentiate damage that is a good-faith mistake from damage that is intentional vandalism. So in the case of damage prediction models, we'll only make use of the "reverted_for_damage" label when manually labeled data is not available.

    Another case of found traces is article quality assessments -- named "wp10" after the Wikipedia 1.0 assessment process that originated the article quality assessment scale[2]. We follow the process developed by Warncke-Wang et al.[3] to extract the revision of an article that was current at the time of an assessment. Many other wikis employ a similar process of article quality labeling (e.g. French Wikipedia and Russian Wikipedia), so we can use the same script to extract their assessments with some localization[4]. However, other wikis either do not apply the same labeling scheme consistently or at all, in which case manual labeling is our only option.

    The Wiki labels interface embedded in Wikipedia

    Manual labeling. We treat manual labels as a gold standard for training a model to replicate a specific human judgement. This contrasts with found data, which is much easier to come by when it is available; manual labeling is expensive up front in terms of human labor hours. In order to minimize the investment of time among our collaborators (mostly volunteer Wikipedians), we've developed a system called "Wiki labels"[5]. Wiki labels allows Wikipedians to submit judgments of specific samples of wiki content using a convenient interface after logging in via their Wikipedia account.

    To supplement our models of edit quality, we replace the models based on found "reverted_for_damage" traces with models trained on manual judgments, where we specifically ask labelers to distinguish "damaging"/good from "good-faith"/vandalism. Using these labels, we can build two separate models that allow users to filter for edits that are likely to be good-faith mistakes[6], to focus just on vandalism, or to focus on all damaging edits broadly.

    We've managed to complete manual labeling campaigns for article quality (wp10) in Turkish and Arabic Wikipedia as well as for item quality in Wikidata. We've found that, when working with manually labeled data, we can attain relatively high levels of fitness with 150 observations per quality class.

    Explicit pipelines

    [edit]

    One of our openness goals with regards to how prediction models are trained and deployed in ORES involves making the whole data flow process clear. Consider the following code that represents a common pattern from our model-building Makefiles:

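    # Step 1: fetch the human judgements collected via the Wiki Labels campaign.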
    datasets/enwiki.human_labeled_revisions.20k_2015.json:
            ./utility fetch_labels \
                    https://labels.wmflabs.org/campaigns/enwiki/4/ > $@
    
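    # Step 2: extract features for each labeled revision (cached alongside the labels).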
    datasets/enwiki.labeled_revisions.w_cache.20k_2015.json: \
                    datasets/enwiki.human_labeled_revisions.20k_2015.json
            cat $< | \
            revscoring extract \
                    editquality.feature_lists.enwiki.damaging \
                    --host https://en.wikipedia.org \
                    --extractor $(max_extractors) \
                    --verbose > $@
    
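    # Step 3: cross-validate, train, and serialize the damaging model.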
    models/enwiki.damaging.gradient_boosting.model: \
                    datasets/enwiki.labeled_revisions.w_cache.20k_2015.json
            cat $^ | \
            revscoring cv_train \
                    revscoring.scoring.models.GradientBoosting \
                    editquality.feature_lists.enwiki.damaging \
                    damaging \
                    --version=$(damaging_major_minor).0 \
                    (... model parameters ...)
                    --center --scale > $@
    

    Essentially, this code helps someone determine where the labeled data comes from (manually labeled via the Wiki Labels system). It makes it clear how features are extracted (using the revscoring extract utility and the enwiki.damaging feature set). Finally, the dataset with extracted features is used to cross-validate and train a model predicting the damaging label, and a serialized version of that model is written to a file. A user could clone this repository, install the set of requirements, run "make enwiki_models", and expect that the whole data pipeline would be reproduced.

    By explicitly using public resources and releasing our utilities and Makefile source code under an open license (MIT), we have essentially implemented a turn-key process for replicating our model building and evaluation pipeline. A developer can review this pipeline for issues knowing that they are not missing a step of the process, because all steps are captured in the Makefile. They can also build on the process (e.g. add new features) incrementally and restart the pipeline. In our own experience, this explicit pipeline is extremely useful for identifying the origin of our own model building bugs and for making incremental improvements to ORES' models.

    At the very base of our Makefile, a user can run "make models" to rebuild all of the models of a certain type. We regularly perform this process ourselves to ensure that the Makefile is an accurate representation of the data flow pipeline. Performing a complete rebuild is essential when a breaking change is made to one of our libraries. The resulting serialized models are saved to the source code repository so that a developer can review the history of any specific model and even experiment with generating scores using old model versions.

    Model information

    [edit]

    In order to use a model effectively in practice, a user needs to know what to expect from model performance. E.g. how often is it that an edit predicted to be "damaging" actually is damaging? (precision) Or, what proportion of damaging edits should I expect will be caught by the model? (recall) Which metric matters operationally depends strongly on the intended use of the model. Given that our goal with ORES is to allow people to experiment with the use and reflection of prediction models in novel ways, we sought to build a general model information strategy.

    https://ores.wikimedia.org/v3/scores/enwiki/?model_info&models=damaging returns:

          "damaging": {
            "type": "GradientBoosting",
            "version": "0.4.0",
            "environment": {"machine": "x86_64", ...},
            "params": {center": true, "init": null, "label_weights": {"true": 10},
                       "labels": [true, false], "learning_rate": 0.01, "min_samples_leaf": 1,
                       ...},
            "statistics": {
              "counts": {"labels": {"false": 18702, "true": 743},
                         "n": 19445,
                         "predictions": {"false": {"false": 17989, "true": 713},
                                         "true": {"false": 331, "true": 412}}},
              "precision": {"labels": {"false": 0.984, "true": 0.34},
                            "macro": 0.662, "micro": 0.962},
              "recall": {"labels": {"false": 0.962, "true": 0.555},
                         "macro": 0.758, "micro": 0.948},
              "pr_auc": {"labels": {"false": 0.997, "true": 0.445},
                         "macro": 0.721, "micro": 0.978},
              "roc_auc": {"labels": {"false": 0.923, "true": 0.923},
                          "macro": 0.923, "micro": 0.923},
              ...
            }
          }
    

    The output captured in Figure ?? shows a heavily trimmed JSON (human- and machine-readable) output of model_info for the "damaging" model in English Wikipedia. Note that many fields have been trimmed in the interest of space with an ellipsis ("..."). What remains gives a taste of what information is available. Specifically, there's structured data about what kind of model is being used, how it is parameterized, the computing environment used for training, the size of the train/test set, the basic set of fitness metrics, and a version number so that secondary caches know when to invalidate old scores. A developer using an ORES model in their tools can use these fitness metrics to make decisions about whether or not a model is appropriate and to report to users what fitness they might expect at a given confidence threshold.
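
    As a sketch of how that might be consumed, the snippet below pulls a few headline metrics out of model_info. The outer response nesting (the wiki name and a "models" key) is assumed from the API's general shape rather than shown in the trimmed example above.

        import requests

        def damaging_fitness(context="enwiki"):
            url = f"https://ores.wikimedia.org/v3/scores/{context}/?model_info&models=damaging"
            model = requests.get(url).json()[context]["models"]["damaging"]
            stats = model["statistics"]
            return {
                "version": model["version"],
                "precision_true": stats["precision"]["labels"]["true"],
                "recall_true": stats["recall"]["labels"]["true"],
                "roc_auc": stats["roc_auc"]["micro"],
            }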

    The scores

    [edit]

    The predictions made through ORES are also, of course, human- and machine-readable. In general, our classifiers will report a specific prediction along with a set of probabilities (likelihoods), one for each class. Consider the article quality (wp10) prediction output in figure ??.

    https://ores.wikimedia.org/v3/scores/enwiki/34234210/wp10 returns

            "wp10": {
              "score": {
                "prediction": "Start",
                "probability": {
                  "FA": 0.0032931301528326693,
                  "GA": 0.005852955431273448,
                  "B": 0.060623380484537165,
                  "C": 0.01991363271632328,
                  "Start": 0.7543301344435299,
                  "Stub": 0.15598676677150375
                }
              }
            }
    

    A developer making use of a prediction like this may choose to present the raw prediction "Start" (one of the lower quality classes) to users or to implement some visualization of the probability distribution across predicted classes (75% Start, 16% Stub, etc.). They might even choose to build an aggregate metric that weights the quality classes by their predicted probability (e.g. Ross's student support interface[7] or the weighted_sum metric from [8]).
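
    For example, a weighted aggregate over the distribution above might look like the following sketch; the 0-5 class weights are an illustrative choice rather than the published weighted_sum definition.

        # Ordered quality classes mapped to numeric weights (Stub lowest, FA highest).
        CLASS_WEIGHTS = {"Stub": 0, "Start": 1, "C": 2, "B": 3, "GA": 4, "FA": 5}

        def weighted_quality(probability):
            """Collapse a wp10 probability distribution into a single 0-5 score."""
            return sum(CLASS_WEIGHTS[cls] * p for cls, p in probability.items())

        # With the distribution above: 0.754*1 + 0.156*0 + 0.061*3 + ... ~= 1.02,
        # i.e. just above "Start".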

    Threshold optimization

    [edit]

    (import from other thread/essay)

    1. see "autolabel" in https://github.com/wiki-ai/editquality
    2. en:WP:WP10
    3. Warncke-Wang, Morten (2017): English Wikipedia Quality Asssessment Dataset. figshare. Fileset. https://doi.org/10.6084/m9.figshare.1375406.v2
    4. see the "extract_labelings" utility in https://github.com/wiki-ai/articlequality
    5. Wiki labels
    6. see our report Research_talk:Automated_classification_of_edit_quality/Work_log/2017-05-04
    7. Sage Ross, Structural completeness
    8. Keilana Effect paper

    EpochFail (talk) 15:38, 17 April 2018 (UTC)Reply

    Simplifying occurrences of "reification"

    [edit]

    I had a question about the use of “reification”, whether there’s a specific background or reason for using that word?  If not, I find it distracting and maybe wrong… AFAICT, reification is more about ideas becoming a thing in people’s minds, rather than ideas actually turning “real”/“physical”.  I’m thinking we can say “values are embodied in technologies” rather than reified…. "embedded" as you have elsewhere in the paper works for me as well. Or maybe “brought to life”. Adamw (talk) 17:28, 19 April 2018 (UTC)Reply