Distributed Peer Review
A new method for evaluating research project proposals shows us how much we don't know about collective decision making.
Self-promotion alert: this is my commentary from the perspective of a judgement and decision researcher on work we’ve been doing recently at the Research on Research Institute (RoRI). It is not a summary of the results or an expression of the shared view of the research team or of our funder partner.
Deciding who gets research funding is a high-stakes problem. Not only do you direct what topics are researched, and how, but you affect research careers. People who win funding gain promotions, influence, and better chances of future funding. People who miss out may drop out of research altogether.
The decision is not just high-stakes, it is also fabulously difficult:
Different people want different things from research (make breakthroughs, address current societal issues, train researchers and more).
It’s a forecasting task - you’re trying to predict the future value the research will have if the project is attempted.
It’s a long-term forecasting task - the payoff for research projects may be decades in the making.
There’s uncertainty. If you knew the outcome it wouldn’t be true research. Some projects will fail. Some will produce unexpected benefits. In effect, you have to take bets.
There’s ambiguity, even beyond the randomness of project success: for many research projects their value isn’t clear even after they are completed. There are legitimate disagreements over what the results mean, disagreements which may last a long, long time, or may even be reopened long after the consensus was that they’d been resolved.
Projects are both specialised and incommensurable. Not only is it hard for an outsider to judge the value of a project on particle physics, or the sociology of medieval medicine, or immunohistochemistry, but there is no common unit of value which allows you to simply align projects from these specialisms on a common scale and easily decide which has the higher importance.
Previously on Reasonable People I’ve written about topics like the power of deliberation, the wisdom of crowds and decision biases. To study these things experimentally, psychologists often recruit volunteers (who may or may not be more invested in being paid to participate than in sincerely trying hard at the tasks set), and use artificial problems which often test some isolated aspect of reasoning or knowledge (and which usually have a right answer). Our own work on the Wason Selection Task is an example of this.
The task of deciding what research to fund is a million miles away from these artificial lab tasks, for all the reasons I’ve given above. That’s what makes it interesting to me as a decision scientist, and what excites me about studying it - can we apply the experimental method, and lessons from decision science, to improve funding decisions?
This question inspires the work I do with RoRI, and for the last year and a bit we’ve been working with the Volkswagen Foundation, a private research-funding foundation in Germany (and nothing to do with the car manufacturer), to evaluate their trial of an innovative method of funding evaluation: Distributed Peer Review.
Under Distributed Peer Review the evaluation of research proposals is done in parallel by all those who applied at the same time for funding. The principle of reciprocal reviewing is established for academic conference papers, particularly in computer science, but it isn’t established practice for research funding proposals, and nobody - to our knowledge - has done a formal side-by-side comparison of distributed peer review with the standard procedure. This comparison is what we report in our new working paper.
It has been a rewarding project and I think it clearly demonstrates the feasibility of Distributed Peer Review. We report the results and our interpretations in full in the working paper, and have summarised them elsewhere, so I’m not going to dwell on them here. Rather, I’m going to give some personal reflections on the project, and say what I think the unanswered questions and opportunities for future research on research are.
Reviewers complete the impossible task
Given how I outlined the task of evaluating research proposals we might fairly class the task as impossible. Reviewers, however, routinely complete this task, and distributed peer review asks an even larger and wider set of reviewers than normal to look at an application pool and complete the impossible task. And they do, providing comments and scoring each proposal they see (4 or 5 per reviewer in our case). The fundamental self-organising principle of research prioritisation is respected, and given a democratic boost. The in-principle impossibility of reviewing doesn’t seem to be a problem for our host of reviewers, some of whom are being initiated into proposal evaluation for the first time. In a time when self-governance seems less and less respected, I find the experiment heartening.
Selection for funding is fundamentally unstable
Our comparison makes explicit what everyone in research funding already knew, or should have known: there’s a lot of luck in what gets funded. Not only did the Distributed Peer Review process not completely agree with the traditional selection process, we can show - analytically - that if we were to run the Distributed Peer Review process again it would be very unlikely to select the same winning projects, just due to the random allocation of projects to reviewers and the variation in how reviewers score projects. Researchers working with the Dutch research council NWO showed the same thing for the traditional approach last year: the agreement between two panels who see the same proposals is better than random chance, but not by much.
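To make the analytical point tangible, here is a minimal simulation sketch - my own illustration, not the analysis from the working paper, with invented proposal counts, panel sizes and noise levels. Each proposal has a latent quality, is scored by a handful of reviewers whose judgements add individual noise, and the top-scoring proposals are funded. Running the whole process twice and comparing the two funded sets shows how much the outcome can churn on noise alone:

```python
# Illustrative sketch (not the working paper's analysis): how much do two
# runs of a noisy selection process agree with each other?
import numpy as np

rng = np.random.default_rng(0)
n_proposals, reviews_per_proposal, k_funded = 100, 5, 15   # invented numbers
quality = rng.normal(0, 1, n_proposals)    # assumed latent proposal quality
reviewer_noise_sd = 1.0                    # assumed reviewer disagreement

def run_selection():
    # each proposal gets `reviews_per_proposal` independent noisy scores
    scores = quality[:, None] + rng.normal(
        0, reviewer_noise_sd, (n_proposals, reviews_per_proposal))
    return set(np.argsort(scores.mean(axis=1))[-k_funded:])   # fund the top k

overlaps = [len(run_selection() & run_selection()) / k_funded
            for _ in range(1000)]
print(f"average overlap between two funded sets: {np.mean(overlaps):.0%}")
```

In the real process, the random allocation of proposals to reviewers adds a further source of variation on top of the score noise modelled here.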
Some have read this to mean we should give up efforts to discriminate the best proposals and just hold a lottery for all those above some quality threshold (so-called “partial randomisation”). I think heading in the other direction is just as legitimate: we should review and improve the mechanisms by which we do funding evaluation.
What is certain is that we should valorise winning funding less, and funders should anticipate that their current processes could be improved (for what it is worth, the funders I speak to do anticipate this; it is senior academics - those who have benefitted most from the current process for allocating funding - who are perceived to be most resistant to experimentation).
Reviewers are probably doing different things
The instability in the outcome of the whole process has its root in the variability of how reviewers score proposals. It is to be expected that different people make different judgements, but our analysis showed that variability between reviewers reflected an irreducible level of disagreement.
Let me explain. If we ask a bunch of people to estimate some quantity - “how high is the Eiffel Tower?”, for example - we will get different answers, but in many cases those answers will vary around the true answer. By averaging more and more individual answers we can get arbitrarily close to the true answer (if, and only if, each individual answer is independent and only differs from the true answer by some random variation which is equally likely to be wrong in either direction).
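As a minimal sketch of that condition (with an invented error spread, and taking the tower’s height as roughly 330 metres), averaging independent, unbiased guesses does home in on the true value as the crowd grows:

```python
# Wisdom-of-crowds sketch: unbiased, independent guesses about one true
# quantity. The crowd mean approaches the truth as more guesses are averaged.
import numpy as np

rng = np.random.default_rng(1)
true_height = 330.0                                # metres, roughly
for n in (5, 50, 500, 5000):
    guesses = true_height + rng.normal(0, 40, n)   # unbiased individual error
    err = abs(guesses.mean() - true_height)
    print(f"n={n:5d}  crowd mean = {guesses.mean():6.1f}  error = {err:4.1f}")
```

Drop either the independence or the unbiasedness and that guarantee disappears.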
This ‘Wisdom of Crowds’ effect may work for simple estimation problems, but it is not what is occurring with the scoring of the funding proposals we analysed. Here we were able to show that no level of additional judgements would ever produce a stable, final, average score for the proposals such that it would be possible to consistently rank them best to worst (or even achieve consistency in which were ranked in the top ten). Reviewers’ scores for each proposal varied too much.
What this means isn’t entirely clear, but since we’re here and it’s now I will speculate. One possibility is that, underlying it all, the proposals do line up on a single scale of quality, but that the differences between them are so narrow that reviewers have a hard time discriminating the best from the very best.
The other possibility, and my bet, is that reviewers are bringing different standards to bear on their scores. Funding proposals are complex objects. However reviewers are instructed, it is natural for there to be variation between them in what they think is important - some emphasise methods, some possible impacts, some the reasons a project might fail. They deliver a single score, but the very criteria informing those scores vary between individuals, not just the numerical scores.
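A hypothetical two-criterion example makes the point (the proposals, criteria and weights here are invented for illustration): each reviewer collapses the same proposal into a single number, but with their own weighting of the criteria, so which proposal comes out on top depends on who happens to review it.

```python
# Invented example: two proposals, two criteria, reviewers who weight the
# criteria differently. The ranking flips with the reviewer's weighting.
proposals = {"A": {"methods": 9, "impact": 5},
             "B": {"methods": 5, "impact": 9}}

for w_methods in (0.8, 0.6, 0.4, 0.2):            # reviewer's weight on methods
    scores = {name: w_methods * c["methods"] + (1 - w_methods) * c["impact"]
              for name, c in proposals.items()}
    winner = max(scores, key=scores.get)
    print(f"weight on methods = {w_methods:.1f} -> prefers proposal {winner}")
```

Averaging over more reviewers then converges on whatever mix of weightings the reviewer pool happens to contain, which is a fact about the reviewers as much as about the proposals.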
It’s an open question how much this could be reduced by reviewer training and/or instruction. My guess is: a lot, given how little training reviewers get and how much ambiguity there is in funding agency instructions.
Another open question is whether it would be more realistic to focus on weeding out the most infeasible projects. The traditional view is that proposals are evaluated to select the best proposals (which get funding). Looking at the details shows that there isn’t a clear line between the best and the rest, and many of the rest, which happen not to get funding, could equally well have been funded. Maybe the design problem of funding evaluation would be more tractable if it aimed just to discriminate the proposals that can’t work - it might be the bottom 10% - from the rest. Asking reviewers to say which proposal is most excellent, innovative or impactful seems to introduce a fundamental difficulty in the very specification of the task to be achieved. Asking them to identify projects with critical flaws seems like something it would be easier to find agreement on.
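Here is a purely hypothetical sketch of why the weed-out framing might be more stable. It builds in the assumption the argument rests on - that fatally flawed proposals sit well below the pack, while the fundable proposals are tightly bunched relative to reviewer noise - and compares how consistently two independent scoring runs pick the same top 15 versus flag the same bottom 10:

```python
# Hypothetical comparison (invented parameters): stability of picking the
# best vs. flagging the clearly flawed, under noisy review scores.
import numpy as np

rng = np.random.default_rng(2)
n_flawed, n_ok, reviews, noise_sd = 10, 90, 5, 1.0
quality = np.concatenate([rng.normal(-3.0, 0.5, n_flawed),  # assumed: clear flaws
                          rng.normal(0.0, 0.5, n_ok)])      # assumed: bunched rest
n_total = len(quality)

def mean_scores():
    # each proposal gets `reviews` independent noisy scores; average them
    noise = rng.normal(0, noise_sd, (n_total, reviews))
    return (quality[:, None] + noise).mean(axis=1)

top_overlap, bottom_overlap = [], []
for _ in range(1000):
    s1, s2 = mean_scores(), mean_scores()
    top_overlap.append(len(set(np.argsort(s1)[-15:]) & set(np.argsort(s2)[-15:])) / 15)
    bottom_overlap.append(len(set(np.argsort(s1)[:10]) & set(np.argsort(s2)[:10])) / 10)

print(f"top-15 'fund these' overlap across runs:  {np.mean(top_overlap):.0%}")
print(f"bottom-10 'weed out' overlap across runs: {np.mean(bottom_overlap):.0%}")
```

Under these assumptions the flagged bottom set is far more stable than the selected top set; whether real review data behave this way is exactly the open question.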
We showed it can work, but not that it always works
Distributed Peer Review for research proposals originated in the telescope community, where researchers apply for limited time on a shared telescope, rather than limited funding. I know very little about astronomy, but I presume that this means the proposals form a homogeneous group - by definition they are all from the same field, and are all proposing research using the same instrument. UKRI trialled Distributed Peer Review on a call for fellowships to look at the social science of AI. Again, I would expect that the researchers applying to this would be relatively close in terms of expertise, and so equipped to provide reviews of each other’s work. The trial at Volkswagen Foundation was an interdisciplinary call in the social sciences, so inherent in the call was the expectation that researchers should be part of an interdisciplinary team and able to present their project in a way that was accessible across disciplines.
These two cases - disciplinary homogeneity and cross-disciplinary projects - seem like situations where Distributed Peer Review can work.
What about other cases, such as a large “standard call” for projects, which could be as broad as “economic and social research” or “medical research” or even “everything”? Here the question of how well you can map the expertise of reviewers against proposals looms very large. If I am the only researcher of ancient Egyptian hieroglyphics applying for funding, who will review my proposal?
At this point, I note that the traditional route of sending proposals out to external experts solves the expertise problem, but only at the cost of intensifying the potential conflict of interest (all the other ancient Egyptian hieroglyphics researchers know you, and probably either love or hate your work), as well as doing nothing to solve the incommensurability problem. If you have one proposal on ancient Egyptian hieroglyphics and another on historical landscape perspectives on snail evolution, say, and the two proposals are reviewed by a completely different set of people, then you have just delayed the decision about how to weigh them against each other, rather than solved it by sending the proposals to external review.
Following this line of thought, I think Distributed Peer Review shines a light on a set of really interesting questions about what expertise we want from review of research proposals. Can you be too expert to provide a review? I would argue that you can. Some topics may only have a handful of researchers working on them, and they are sure to be embedded in a deep collaboration or rivalry. There may be topics which are too specialised to allow evaluation by non-experts, but allowing experts to declare that their research topic is one of them seems too easy.
At the other end of the scale, having some expertise in a broad discipline, and in how research and research projects proceed, does seem like it would help you form a better judgement of a research proposal. Total ignorance of a discipline is not a good foundation (even the non-research stakeholders that some funders bring in to contribute to proposal evaluation have some expertise, such as being patients affected by a medical condition which is the topic of research).
As far as I can see, it is an open question what the optimal disciplinary distance from a proposal a reviewer should have - close enough not to be blinded by technical details, but not so close as to be too conflicted to judge the wider significance. For the process as a whole, you could imagine some kind of mix is optimal. With traditional review, funders often have to rely on who is already on the panel, or whichever external reviewers they can get. Distributed Peer Review, by expanding the reviewer pool, gives more options for determining this mix and testing the effects on evaluation.
(For what it is worth, in the Volkswagen Foundation trial, there was no association between reviewer expertise and project scores. Reviewers who were disciplinarily distant didn’t give higher or lower average scores than reviewers who were disciplinarily close.)
Strategic behaviour is an issue even if it is not a problem
One thing we heard again and again was concern that Distributed Peer Review opened up funding evaluation to a range of strategic behaviours, with researchers able to game the system to advantage themselves. Like everything about Distributed Peer Review, the issue needs to be judged relative to the standard practice, where gaming is also possible (although perhaps less intensely incentivised). There are things a funder can do to mitigate gaming (see our Guide to Distributed Peer Review for the briefest of summaries), but none of these is straightforward to implement, and none provides a guarantee against gaming. Because of this, mechanisms for discouraging gaming are an open research area in my opinion. An additional reason is that I have come to the conclusion that people’s perceptions about funding evaluation matter as much as the reality. By this I mean that it is important to be seen to be fair, as well as to be fair. For gaming, what this means is that even an evaluation design which prevented most forms of gaming could still be perceived as allowing gaming - some applicants might try to trick the system by awarding low scores, even if it couldn’t actually benefit them, and other applicants might worry they had been disadvantaged even if they hadn’t. This seems as real a problem as the presence of some actual gaming.
We couldn’t detect any evidence of strategic behaviour in the Volkswagen Foundation trial, and saw plenty of contrary evidence, such as reviewers giving the highest scores to rival proposals. I came to the conclusion that strategic behaviour needs the motivation (to be willing to cheat), the incentive (the possibility it will work), and the information (you need to know how to award scores to advantage yourself without getting caught). The system used for the Volkswagen Foundation trial could, in theory, have been gamed (the incentive existed), but I believe most applicants completed the process in good faith (the motivation did not exist), and that all applicants lacked the right information on how and when to game the scores to reap any advantage.
Gaming wasn’t raised as a major concern by applicants in our surveys and interviews, creating the paradoxical situation that although gaming may have been theoretically possible here, it wasn’t a worry, whereas there could be other schemes where gaming was theoretically harder but more of a concern (because applicants believed it was possible).
Lots to do
I’ve tried to make clear why funding evaluation is a hard, interesting problem, and also one which the psychology of judgement and decision making has something to contribute to, alongside more expected fields such as metascience, the study of innovation, and organisational research. There are other aspects, such as the treatment of bias, where psychology also has a lot to say, which I didn’t touch on here.
Bringing theoretical perspectives from decision making could have huge pay-offs, making funding allocation faster, more efficient, more effective and fairer. Along with this optimism, I recognise that there is no single obvious solution to the complex problem of how to fund research; the only way forward will require experimentation. Funders will need courage to choose to do that experimentation, and researchers will need courage to support them as they do it.
Our working paper on Distributed Peer Review:
Butters, A., Marshall, M. B., Pinfield, S., Stafford, T., Bondarenko, A., Neubauer, B., Nuske, R., Schwidlinski, P., & Denecke, H. (2025). Applicants as Reviewers: Evaluating the Risks, Benefits, and Potential of Distributed Peer Review for Grant Funding Allocations (RoRI Working Paper No. 17). https://doi.org/10.6084/m9.figshare.29994841.v2
Guide for Funders: Applicants as reviewers - a Guide to Distributed Peer Review
For a masterclass in funder experimentation and evaluation of its own processes, and one that is deeply informed by psychology (because the author has a PhD in it), please see the report
"Improved application processing" from Stiftelsen Dam by their Chief Programme Officer Jan-Ole Hesselberg (2025)
For something recent from the UKRI’s Metascience Unit, see this interesting theoretical analysis which treats proposal selection as a signal detection problem, and one in which structural biases (e.g. against certain applicants) are confounded with the quality signal:
Hulkes, A., Brophy, C., & Steyn, B. (2025, August 28). Reliability, bias and randomisation in peer review: a simulation. https://doi.org/10.31235/osf.io/4gqce_v1
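The basic setup is simple to sketch - this is a toy version of the general idea with invented numbers, not a reproduction of the authors' simulation. If the observed score is true quality plus noise plus a structural penalty applied to one group of applicants, then selecting on observed scores funds that group at a lower rate even when the underlying quality distributions are identical:

```python
# Toy sketch of proposal selection as signal detection with a confounded
# structural bias (invented parameters, not the Hulkes et al. simulation).
import numpy as np

rng = np.random.default_rng(3)
n_per_group, k_funded = 500, 100
quality_a = rng.normal(0, 1, n_per_group)   # group A true quality
quality_b = rng.normal(0, 1, n_per_group)   # group B true quality (same distribution)
bias_against_b = 0.3                        # assumed penalty on observed scores

observed_a = quality_a + rng.normal(0, 1, n_per_group)
observed_b = quality_b + rng.normal(0, 1, n_per_group) - bias_against_b
observed = np.concatenate([observed_a, observed_b])

funded = np.argsort(observed)[-k_funded:]   # fund the top k by observed score
funded_from_b = int(np.sum(funded >= n_per_group))
print(f"funded from group A: {k_funded - funded_from_b}, from group B: {funded_from_b}")
```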
Community Notes update
In other news on distributed mechanisms of collective intelligence, the Meta Community Notes system has had some updates:
Techcrunch: Meta adds new features to Community Notes fact checks, including alerts for corrected posts
More, from me, on Community Notes: The Making of Community Notes
PODCAST: Philosophize This! Kafka and Totalitarianism (Arendt, Adorno)
Not a discussion - just host Stephen West considering the different interpretations Hannah Arendt and Theodor Adorno put on Kafka’s work. As well as drawing out the enduring value of Kafka’s work, it also puts a mirror up to more contemporary issues. In particular I’m thinking of Hannah Arendt’s comment that the core issue with refugees is that they are outside the political community, rather than that they demand support, and the similarity between bureaucratic liberalism and totalitarianism (which normally I would scoff at), in that both, inadvertently or deliberately, induce a feeling of voicelessness in the managed subjects.
Link: Episode #229 - Kafka and Totalitarianism (Arendt, Adorno)
PODCAST: Very Bad Wizards: Episode 312: MechaSkeptic
David and Tamler return to David Hume’s somewhat slippery brand of skepticism, this time focusing on Chapter 12 of his Enquiry Concerning Human Understanding. Plus, speaking of things to be skeptical about, we dive into a recent paper called “Your Brain on ChatGPT” – does neuroscience show that LLM users incur a “cognitive debt”?
tl;dr - folks, it does NOT show that LLM users incur a cognitive debt. Come for the debunk of this paper, but stay for the discussion of Hume. I particularly enjoyed that Hume advances a powerful argument for skepticism and then confesses that all philosophising about skepticism is overcome by our native instinct to act:
Nature is always too strong for principle. And though a Pyrrhonian [skeptic] may throw himself or others into a momentary amazement and confusion by his profound reasonings; the first and most trivial event in life will put to flight all his doubts and scruples… When he awakes from his dream, he will be the first to join in the laugh against himself, and to confess, that all his objections are mere amusement, and can have no other tendency than to show the whimsical condition of mankind, who must act and reason and believe [even though they cannot justify the foundations of those actions, reasons and beliefs]
More: Quote #327 : Nature is always too strong for principle
Link: Very Bad Wizards, Episode 312 MechaSkeptic
NEWSLETTER: Optimally Irrational, from Lionel Page
Start with Epstein files: how arguments really make people change political side
The discussion above highlights a key condition for public debate to be effective in uncovering truth and discarding bad ideas: contributors must be rewarded with social recognition for the quality of their arguments, not for their coalitional loyalty.
QUOTE: “An LLM is designed to generate statistically likely responses to the question ‘What would an answer to this query sound like?’”
Robert McNees very succinctly captures something important:
But that's not the only problem. Interactions with LLMs feel like a dialog, so it's natural to think the usual rules of conversation apply. You ask a question and expect the response will be an answer to that question. It's important to understand that this is not what's happening. An LLM is designed to generate statistically likely responses to the question "What would an answer to this query sound like?" This is not the same thing as answering the question. It might produce what you are looking for, or it might not. This is one reason why output from an LLM will sound authoritative even when it's wrong, and apologetic when mistakes are pointed out. It isn't authoritative or apologetic, and it isn't "thinking" about the question. These are just the sorts of responses that best fit a very complicated set of likelihood criteria
From here
More: my series on hooks for thinking about AI
Mike Caulfield: Is the LLM response wrong, or have you just failed to iterate it?
Mike argues that you get better factuality from LLMs if you press them to check their answers, and this is normal and good - just how we ourselves check our own research. Amongst other things, he identifies this kind of probing as very rare in the general population (and very teachable).
Link: Is the LLM response wrong, or have you just failed to iterate it?
Catch-up
How to outsmart a crowd of 5000 people in 4 minutes
Decoy effects in the purchase of 3,649,027 bottles of wine
Algorithmically-enhanced consensus seeking
… And finally
The Tanum rock carvings, a UNESCO world heritage site, are Bronze Age art in Sweden

Andrew Brown shared his translation of a Harry Martinson poem inspired by the rocks, please enjoy here: The silence of the rocks.
END
Comments? Feedback? Gripes about peer review? I am tom@idiolect.org.uk and on Mastodon at @tomstafford@mastodon.online