Grants talk:PEG/Grant Advisory Committee/Workroom
Please leave your questions or comments here!
Evaluation rubric
Since I was the person to push for a trial of this kind of system, in the GAC phone conversation several weeks ago, I suppose I should offer some comments. I should say that I told Asaf Bartov on the phone two days ago that I've since had a number of doubts about whether it will work: he has his own set of cautions about it, as well he might, but is willing to trial it.
There are several problems with the table, mostly related to a clash between fine- and coarse-grained detail at various points in the pipeline, and to the related issue of an unnecessarily long and complex table.
Rubric 1: Impact potential
- Does it fit with Wikimedia's strategic priorities?
- Does it have potential for impact on Wikimedia projects?
- Can we measure success?
"Measures of success" (third bullet) is a black spot in a high proportion of proposals. It's very important for spelling out the expected added value of a project, for judging post-project reports, and for deriving lessons learned. But a section that is largely done poorly in applications is mixed in with "fit with Wikimedia's strategic priorities" (shouldn't it be "PEG strategic priorities"?). Most projects will fit naturally with the priorities (without the explicit and largely redundant pleading the current form forces them into); the priorities are cast in wide terms, even though not as wide as those of the parent priorities.
So the all-important measures of success will be buried in a composite score, the other components of which will normally be rated relatively high. I don't believe this is useful; indeed, it's going to lead to misleading messages.
The "evaluation criteria" that work best here are "Ability to execute" and "Community engagement": their three points are more homogenous. Single numbers for "Budget quality" and "Impact potential" lose critical internal information through a single number, and I find it unfortunate that the staff will have to guess what some reviewers really think in these cases, unless the reviewers are explicit in writing.
This is a more general problem in the table (see below): explicit narrow-scoped "may" descriptors risk being misleading, and the job of making sense of them will be pretty overwhelming for applicants (the budget may be "significantly over- or under-budgeted" ... well, which is it, over or under, or does that number represent a cluster of different issues with our budget? Head is spinning).
PS I've just copy-edited the PEG priorities, inter alia improving their second-language-speaker friendliness. Are they finished? They look a little raw. Could the points be numbered so that applicants and reviewers can refer to them more easily?
- Thanks. Yes, we are currently working on this page. Alex Wang (WMF) (talk) 22:27, 30 July 2014 (UTC)
Scoring descriptors
This is based on the FDC's rubric text and structure. I don't mind the huge amount of text the FDC faces: theirs is a more elaborate system with much more funding at stake, fewer applicants, and an intensive twice-yearly in-person meeting.
But here, in some ways the hundreds of words in the alignment descriptors hinder the task of both reviewers and applicants. Perhaps they're intended to feed possible cues to reviewers (they do), but these "may" details are themselves cherry-picked from within the three "considerations" of each criterion—sometimes from more than one consideration—and are likely to distort the meaning of a single number. Reviewers should not need these arbitrary cues, and remember that they're being forced to "average" several criteria in the first place. Frustrating for them.
I also find just two numerical anchors (1 and 10) inadequate. A 5 might mean "pretty bad" to one person, "bare pass" to another, and "pass, reasonably OK" to a third; it's a wide spectrum to leave vacant. Since the alignment descriptors are so arbitrary, I'd prefer to dump them and make the table a lot simpler for everyone (the details will have to be set out in writing, by individual reviewers or by the staff in their summaries, to interpret the numerical averages anyway, so what is lost?).
I suggest two changes. First, replacing the two-anchor numerical scale with a five-anchor one:
- 1 = very weak or no alignment
- 3 = weak alignment
- 5 = passable alignment; significant improvements still required
- 7 = reasonably good alignment
- 10 = excellent alignment
That, to me, clarifies the numerical spectrum with sufficient granularity, so that at least everyone knows where they stand on the value of the numbers. It might also encourage reviewers to use the full numerical range, which would be a good outcome.
Second, going vertically, I would have reviewers write more numbers. That would at least allow them to be more explicit in important ways without writing out comments (I do not believe GAC members will, by and large, write a medium-sized essay as well as inserting numbers). Here are my suggestions, which at least separate out the very distinctive considerations currently mixed into the soup:
- Fit with the PEG strategic priorities
- Potential for impact on Wikimedia projects
- Ability to measure success
- Ability to execute: appropriate scope, skills, and external factors
- Budget match to the program scope
- Strategic justification for material costs (printing, merchandise, etc.)
- Budget reflects responsible growth for repeat grantees, or a reasonable first investment for new grantees
- Community engagement, support, and promotion of diversity
I thought about "promotion of diversity" as a separate one, but on balance it might be better folded into community engagement. I don't know. This is more like what reviewers want to express, and what applicants need to know. It would make the explanatory task easier, too. It might have a chance of working. Perhaps you think it's important to keep the upper thematic tier intact ("Evaluation criteria"). Maybe, but it's yet more complication.
Some of the bullets in the weak and strong cells for "Budget quality" seem to have been mislocated in the table.
Tony (talk) 04:19, 24 July 2014 (UTC)
I see that no one has bothered to fix the bullets in the evaluation rubric (see my comment above); in addition, the feedback template letters are ABBC. I cannot see how the scores can be anything but misleading, given that they will silently account for three different bullets, without written comments. Tony (talk) 04:41, 10 August 2014 (UTC)
Hi Tony. We have been quite busy with Wikimania, but appreciate your comments above. I'll be reviewing the comments and considering changes to the rubrics next week. Alex Wang (WMF) (talk) 10:15, 10 August 2014 (UTC)
- Hi Tony. Please note we have reviewed your recommendations and made a number of changes to the evaluation rubric, including the following:
- Pulled out "measures of success" to be its own evaluation criterion.
- Deleted descriptors.
- Replaced two-anchor numerical scale with five-anchor one.
- Added comment boxes for each evaluation criterion.
Thanks for your help. Alex Wang (WMF) (talk) 18:25, 18 August 2014 (UTC)
- Alex, the revised table is much better.
You might consider rewording the first bullet in A from "Does it focus on one (not all) of Wikimedia's strategic priorities?" to "Does it focus on at least one of Wikimedia's strategic priorities?". This removes the implications that focusing on one is a particular plus, and that it might be a disadvantage to focus on all.
PEG criterion 2 starts: "The potential for impact in the targeted Wikimedia projects (e.g. Spanish Wikipedia, Wikimedia Commons)." This seems to be a significant omission from the rubric. Tony (talk) 02:53, 20 August 2014 (UTC)
- We actually want to have grantees focus on one, not all, strategic priorities. This will be made clear in the new application form, which will be ready the week of September 8th (unfortunately we've had some delay due to the conversation around Global Metrics). The potential for impact on Wikimedia projects is addressed in the rubric -- Section A, second bullet. Alex Wang (WMF) (talk) 22:28, 20 August 2014 (UTC)
Subsequent reflections
Alex, as you've probably noticed, since writing the long post above I've tried a scoring table on the talkpages of three current PEG applications, here, here, and here. These were designed to overcome the problem of the "silent" bullet points in the current rubric overleaf, which is fine when the same people do both summation scoring and textual interpretation (as in the FDC staff assessments), since they themselves can explain how the composite score was arrived at. But here, different people would perform each of these two functions, which I think is not going to work. So in my experimental table all of the criteria are top-ranked, and each score points to a narrower range of criteria (or a single criterion).
I have some comments from this experience:
- It's unexpectedly hard work for the reviewer, who is forced (for the first time, in my case) to knuckle down and judge against precise criteria. This requires a big-picture purview, even when judging individual criteria; it's far easier just to make individual comments. It will probably not suit the casual fly-in, fly-out approach, a technique I've used before myself, where you just read bits of the application and comment narrowly, half-intending to revisit and read and comment more thoroughly.
- This higher level of discipline is an excellent thing in terms of critical output, but how many GAC members are prepared to do it? Members are likely to scan through and quickly write in scores, with no way of judging their engagement unless they write explicit comments, either on the gdrive spreadsheet or the public talkpage.
- Even without the problem of the second-ranked "consideration" bullet points (overleaf), I've found it hard to score without writing explanatory notes. That's partly to make the scores more actionable by applicants, and partly to encourage applicants and to contextualise the potentially negative impact of low-to-medium scores.
- My scoring table is intended to appear early in the life of the application, as part of the improvement process, something that probably wouldn't work for a committee-driven process (unless there were a shiny blue tech system for displaying ongoing averaged scores on the page; that would be great, but unlikely without an IEG project to develop it). As currently envisaged, GAC scores would come towards the end of the process, so they won't help with improvement by showing applicants where their application is strong and weak: presumably, the GAC scores will inform staff and be communicated to applicants after the fact in information-degraded form (averaged, anonymous, unclear judgements of silent considerations).
I don't have answers, which is not to say that there aren't answers. The four points I've made above might assist in the design. May I suggest that GAC members be explicitly able to write in scores for only some of the rubrics if they wish; that might stop members from either not contributing to an application at all or contributing only superficially. As at FAC on en.WP, it's perfectly fine for a reviewer to comment on just one of the many criteria. But you'd need to tell the applicants how many scores are being averaged for each criterion. If I were an applicant, I'd also want to know the standard deviation (see the small worked example below), although this would require a little understanding of stats and preferably more than just a few scores.
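To make that concrete, here is the arithmetic I have in mind, as a minimal sketch (the scores below are invented for illustration, not taken from any real application): for the n scores x_1, ..., x_n submitted against a criterion, report both the mean and the sample standard deviation:

\[
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i,
\qquad
s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}.
\]

For example, five hypothetical scores of 4, 5, 5, 6 and 10 average 6, but a standard deviation of about 2.3 would tell the applicant that one reviewer saw the application very differently from the rest; a bare "6" hides that.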
I'm not sold on the notion that scores need to be anonymous. Tony (talk) 00:09, 11 August 2014 (UTC)