How to outsmart a crowd of 5000 people in 4 minutes
An ingenious experiment shows the secret sauce needed to improve on the wisdom of crowds
Many heads are better than one, they say.
A famous example was given by Francis Galton, Victorian polymath and eugenicist, who in 1907 reported to the journal Nature the “estimates of the dressed weight of a particular living ox, made by 787 different persons” at the West of England Fat Stock and Poultry Exhibition, held at Plymouth. He showed that the estimates of individuals varied around the true value, so by averaging you could remove the noise of individual judgement and reach an improved estimate. For situations like this, the average estimate of the group can even beat the estimate of the best individual.
It’s important to note the kinds of situations where this ‘Wisdom of the Crowd’ effect holds. If people in the crowd all share the same bias, for example, averaging their answers averages the bias as well as whatever true information they hold, meaning the collective answer will be wrong. If the crowd copy each other in the answers they give - as you can imagine happening if people shout out answers or can otherwise observe each other - then you reduce the contribution of each individual’s unique knowledge, and introduce ‘herding’ behaviour, where everyone’s answer swings towards one particular person’s answer (which is unlikely to be right).
So the wisdom of crowds effect isn’t magic. The circumstances where it holds - and where it doesn’t - can be enumerated. That said, simple averaging is often surprisingly effective at getting a good estimate. In the study of collective judgement, the simple average is the baseline any proposed improvement must beat.
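To make that baseline concrete, here is a toy simulation - my own sketch, nothing from the paper - of both points above: independent noise averages away, but a shared bias survives any amount of averaging.

```python
import numpy as np

rng = np.random.default_rng(0)
true_value = 324.0   # say, the height of the Eiffel Tower in metres
n_people = 5000

# Independent, unbiased guesses: the noise of individual judgement averages away.
unbiased = true_value + rng.normal(0, 80, n_people)
print(abs(unbiased.mean() - true_value))   # error of around a metre, shrinking like 1/sqrt(n)

# Guesses that all share the same bias: averaging leaves the bias untouched.
biased = (true_value - 100) + rng.normal(0, 80, n_people)
print(abs(biased.mean() - true_value))     # error stays near 100m however big the crowd
```

The second number is essentially the Roman emperors situation we’ll meet below: a bigger crowd doesn’t help when everyone is wrong in the same direction.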
Which brings us to today’s paper, Aggregated knowledge from a small number of debates outperforms the wisdom of large crowds by Joaquin Navajas and coauthors, published in 2018.
And what a crowd they had access to!

Set up
The participants in this experiment were 5180 people attending a TEDx event in Buenos Aires. Each person was given an A4 piece of paper and a pen. The speaker at the event, one of the paper’s coauthors, gave instructions from the stage.
First, everyone in the crowd was asked to estimate the answer to 8 questions. These are things which have a known correct answer, but which no individual is likely to know. Things like "What is the height of the Eiffel Tower?". Here’s the full set:
Next, the crowd divided into groups of 5 and discussed, for just 1 minute each, half of this set - four questions (the discussed questions were GOALS, ROULETTE, ALEGRIA, OIL BARREL).
Here are a couple of shots of what that looked like (both taken from the companion video the team released along with the paper).
After the discussion, a consensus answer was recorded for each group, and then revised individual answers were collected. The revised individual answers tell us whether people really changed their minds following discussion with others, and whether they dissented from the consensus answer.
Results
The results of the initial, individual estimates show how good the wisdom of crowds is - for some questions at least. The true height of the Eiffel Tower is 324m. The average of all individual guesses was 344.4m. Pretty close! For other questions the average answer is pretty bad. There were 134 Roman emperors. The average of all individual guesses was 19.5. I assume this is a situation of shared bias - people rely on their memory of famous emperors, which causes them to dramatically underestimate. Averaging across many people with a shared bias can’t improve the score. Someone in the 5000-strong audience must have known the correct answer, or at least been close, but the correct information in their answer is swamped by the biased information from the majority.
But what of the effect of deliberation?
The final analysis looked at 1400 people - 280 groups of five - because the full data was missing for many groups from the crowd of 5180. This sort of issue seems very hard to avoid when you’re doing an experiment with a live crowd at an event.
If the answers from averaging individuals were good, the average of the consensus answers was even better. Analysis showed that averaging just 4 group answers provided better estimates than the straight average of the answers from all 1400 people. The implications for collective decision making are startling. Adding just 1 minute of discussion per question means you can make a better estimate with approximately 1.4% of the people (20 out of 1400). A large efficiency gain!
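Here, roughly, is how you could run that comparison on data like theirs. This is my own sketch of the procedure, not the paper’s analysis code: the function names are made up, and the relative-error normalisation is just one common way to compare across questions on very different scales.

```python
import numpy as np

rng = np.random.default_rng(1)

def relative_error(estimate, truth):
    # Lets you compare questions measured in metres, emperors, barrels...
    return abs(estimate - truth) / truth

def crowd_vs_debates(individual_guesses, consensus_answers, truth, k=4, n_draws=10_000):
    """Error of the mean of every individual guess, versus the average error
    of the mean of k randomly sampled group consensus answers."""
    crowd_error = relative_error(np.mean(individual_guesses), truth)
    debate_errors = [
        relative_error(np.mean(rng.choice(consensus_answers, size=k, replace=False)), truth)
        for _ in range(n_draws)
    ]
    return crowd_error, float(np.mean(debate_errors))
```

The striking result is that, in their data, the second number beats the first even with k as small as 4.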
The benefits of deliberation didn’t stop there. Analysis of individuals’ post-discussion scores showed that individuals updated their answers, making the new mean of all individuals’ answers an even better estimate of the right answer than the mean of all group consensus answers.
Mechanism
Skeptics might think that the benefits of deliberation stem merely from having had a chance to reflect on the first answer. The test for this is the contrast between answers for discussed and non-discussed questions. Answers for both kinds of question were given by individuals at the beginning and at the end, so both had an equal chance for solitary reconsideration.
The comparison is clear: it is discussion which improves the second answer:
As a way into testing what it was that the group discussion actually did, the researchers compared the group answers to 7 different averaging rules: taking the straight average, taking the median, excluding outliers before averaging, and so on. The idea was that maybe group discussion just performed a statistical function which could be matched by an averaging rule. No averaging rule tested gave answers as good as the group consensus answer, suggesting that the group - even in that single minute - is able to integrate and evaluate information in a more sophisticated way.
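For a sense of what an ‘averaging rule’ means here, the sketch below shows the kind of statistical recipe being compared against - the specific rules and the trimming threshold are my illustration, not necessarily the seven the paper tested.

```python
import numpy as np

def mean_rule(guesses):
    return float(np.mean(guesses))

def median_rule(guesses):
    return float(np.median(guesses))

def trimmed_mean_rule(guesses, prop=0.2):
    # Drop the most extreme guesses at both ends, then average what's left.
    g = np.sort(np.asarray(guesses, dtype=float))
    k = int(len(g) * prop)
    return float(g[k:len(g) - k].mean()) if len(g) > 2 * k else float(g.mean())

group = [150, 280, 300, 320, 900]   # a made-up group's five guesses at the Eiffel Tower
print(mean_rule(group), median_rule(group), trimmed_mean_rule(group))
```

The paper’s point is that no recipe of this purely statistical kind matched the consensus answers the groups actually produced.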
Finally, to gain some insight into what people did during that minute, the researchers ran the same procedure with a smaller group back in the lab - and this time were able to ask them what they did when they discussed. The single most endorsed answer was “We shared arguments and reasoned together”. So it seems that when asked to discuss, people really did deliberate, rather than just follow the most confident group member or mechanically discard outlier answers.
The lab version also let the researchers confirm that averaging consensus answers was a shortcut to error reduction. Averaging across just four groups from the discussions in the lab beat the average answer from all 5180 people in the TEDx crowd.
What to make of this
This is what the authors say:
“Our simple-yet-powerful idea is that pooling knowledge from individuals who participated in independent debates reduces collective error.”
“This study opens up clear avenues for optimizing decision processes through reducing the number of required opinions to be aggregated”
There’s lots to like about this experiment: it was done live, with a very large audience, and that means the procedure must be robust. With people on a night out, even more than with volunteer experiment participants, you need to make the instructions straightforward, and in this set-up there isn’t time for questions or mistakes.
The live format also means there is limited insight into what groups really did (although the follow-up run in the lab helps with this). 1 minute isn’t long for a group of five to discuss anything.
Ultimately, the result is compelling and fits with the other work this newsletter is preoccupied with about the power of reasoned argument. As with all good studies, I’d really like to see it replicated by another research team - both to fully trust that the result holds, and to better understand why it does.
Citation
Navajas, J., Niella, T., Garbulsky, G., Bahrami, B., & Sigman, M. (2018). Aggregated knowledge from a small number of debates outperforms the wisdom of large crowds. Nature Human Behaviour, 2(2), 126-132. https://doi.org/10.1038/s41562-017-0273-4
See also the preprint: https://arxiv.org/abs/1703.00045 (free access for all)
A video describing the experimental procedure and showing the crowd performing the experiment is available on YouTube.
See also
Barrera-Lemarchand, F., Balenzuela, P., Bahrami, B., Deroy, O., & Navajas, J. (2024). Promoting erroneous divergent opinions increases the wisdom of crowds. Psychological Science, 35(8), 872-886.
Catch up
Check out what I’m doing with my career break here
Some personal news (tl;dr writing more)
Dive into series on how to think straight about the new AI here
And catch up on recent psychology posts
Read on for other things I’ve noticed this week
LLM giveaways
Kobak et al. (2025) downloaded all PubMed abstracts and give us the empirically observed changes in word frequency - suggesting that some words are now in the scientific literature because they are disproportionately likely to be produced by LLMs:
Surely there are also changes in word fashion? It would be nice to see a comparable plot for words which were surprisingly popular in 2014 too.
Kobak, D., González-Márquez, R., Horvát, E. Á., & Lause, J. (2025). Delving into LLM-assisted writing in biomedical publications through excess vocabulary. Science Advances, 11(27), eadt3813.
… And finally
This photo was coughed up from my personal archives, in a folder labelled “Spring 2012”. No other details, but a bit of searching suggests it is a piece of (now gone) street art from Sheffield by faunagraphic & rocket01.
END
Comments? Feedback? Good examples of estimation questions we could ask in experiments where it isn’t easy to Google the answers? I am tom@idiolect.org.uk and on Mastodon at @tomstafford@mastodon.online

That's a really good reminder of what the rules are to make the wisdom of the crowd produce a better estimate than an expert guess.
While I bet someone in the audience might have known the actual number of Roman emperors, and the sum of the numbers on a roulette wheel is well-known enough trivia that some people *had* to know it, it's a different game entirely when we talk about unknowable answers.
Folks in Galton's study had no means of knowing the ox's weight upfront. A famous wisdom of crowds story is the search for the location of the USS Scorpion after it was lost at sea. No one had the answer.
And it's a similar story with the estimates we make about future work: we don't have a way of knowing upfront how long it will take.
Yet, all the wisdom of the crowd's caveats still apply:
a) People need to have relevant information
b) They need to act independently (which in many estimation contexts doesn't work)
c) There needs to be diversity of opinions
Just publicly throwing around numerical guesses about features we have little understanding of *is not* the wisdom of crowds. Even worse if one person in the room has more decision leverage than others (yup, planning poker, I'm looking at you).
I can't get over the fact that the median person thought there were only 10 Roman emperors. I would have underestimated the number, but just 10 is insane, and almost half of people had to have thought it was even less than that.