Superforecasting Notes
4) Superforecasters
After the intelligence failure that led to the Iraq War, the intelligence community was shaken to the core. In 2006 IARPA was created to do cutting-edge research for the United States intelligence community. Acting on the advice of the investigatory report into the debacle, IARPA set up a forecasting tournament to figure out which methods are best for predicting the future. The tournament pitted several teams, ranging from intelligence community groups to university-backed operations, against a control group. IARPA wanted teams to beat the control group's 'wisdom of the crowd' by 20% in the first year and by 50% in the fourth. Each team could also run internal experiments to figure out what worked and what didn't, furthering the research possibilities.
Philip Tetlock assembled 3200 volunteers, had them take an array of psychometric tests, and enrolled them in an effort he dubbed the Good Judgment Project. He then divided them into experimental groups: some worked in teams and some worked alone, some received training on how to make a good forecast and some did not. The Good Judgment Project performed so well that IARPA ended up dropping the other teams. One group of forecasters stood out above the rest, giving remarkably accurate forecasts. A trained statistician would of course expect this to be, in all likelihood, a fluke, the inevitable tail of the bell curve over a large data set. By tracking these forecasters across multiple seasons and testing for regression to the mean, Tetlock was able to separate the lucky from those with genuine skill; all told he was still left with dozens of 'superforecasters' who performed far above average.
The superforecasters are good: they outperform prediction markets, trained intelligence analysts with access to classified information, and of course the other forecasters in the Good Judgment Project. But why?
5) Supersmart?
How smart do you need to be to predict like a superforecaster? Before participating in the Good Judgment Project, forecasters took a Raven's Progressive Matrices test. Compared to the general United States population, forecasters scored in the 70th percentile and superforecasters in the 80th. That is to say, the biggest leap is between an average member of the population and a forecaster; between forecaster and superforecaster there is only a ten percentile-point gap. You need to be smart, but exceptional intelligence is not required.
It's rare to have all the information you need to make a complete prediction. Instead, use Fermi estimates to build a bounded model somewhere in the ballpark of the true probability. Ask yourself what needs to be true for something to happen, and try inverting the statement of the prediction to correct for confirmation bias. Take the outside view first, getting numbers from general demographics and non-situation-specific statistics, before moving to an inside-view, narrative-based evaluation.
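To make the Fermi approach concrete, here's a minimal sketch (in Python) of the classic "how many piano tuners are there in Chicago?" decomposition the book walks through; every number below is a rough guess invented for illustration, and the point is the decomposition, not the inputs.

```python
# A toy Fermi estimate of "How many piano tuners are there in Chicago?"
# Every number is a rough, made-up guess; the decomposition is the point.

population = 2_500_000             # people in Chicago (guess)
people_per_household = 2.5         # average household size (guess)
piano_owning_fraction = 1 / 20     # fraction of households with a piano (guess)
tunings_per_piano_per_year = 1     # a piano gets tuned about yearly (guess)
tunings_per_tuner_per_year = 1000  # ~4 tunings/day * 250 work days (guess)

pianos = population / people_per_household * piano_owning_fraction
tuners = pianos * tunings_per_piano_per_year / tunings_per_tuner_per_year
print(f"estimated piano tuners: {tuners:.0f}")  # ~50, right order of magnitude
```

Each input can be wildly off, but because errors tend to partially cancel, the product usually lands within an order of magnitude of the truth.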
Superforecasters are very open to new experiences; a high Openness score in the Big Five personality model is an asset to making accurate predictions. It is important to consider the viewpoints of others. One superforecaster, Doug Lorch, created a database of information sources tagged by ideological bent, geographic location, and other factors so that he could actively optimize his information intake for diversity.
6) Superquants?
Do you need complicated mathematical models to make accurate forecasts? Tetlock found that his superforecasters were almost universally numerate, as measured by a numeracy test given during signup. In spite of this, they did not usually use 'hardcore' mathematical models to make their predictions, relying instead on subjective judgment applied with precision. Superforecasters would consider questions in fine detail, going so far as to distinguish single-percentage-point differences in the probability of a scenario.
Empirically, forecasters who put in the effort to drill down on whether it's 75% or 79% do better than people who think in fives (e.g. 75 versus 80), who in turn do better than those who think in tens (70 versus 80), who do better still than those who think only in 100, 0, and 50/50. In a further analysis of the data, rounding all predictions to one level coarser granularity cost the superforecasters accuracy, showing that their extra precision carries real signal, not noise.
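As a sanity check on the idea, here's a toy Python simulation (not the GJP's data or method, just an illustration) of how rounding a well-calibrated forecaster's predictions to coarser granularity degrades the Brier score, the squared-error accuracy measure the tournament used.

```python
import random

def brier(forecast, outcome):
    """Brier score for a binary question on the 0..2 scale the book uses:
    squared error summed over both outcomes (0 = perfect, 2 = worst)."""
    return (forecast - outcome) ** 2 + ((1 - forecast) - (1 - outcome)) ** 2

def coarsen(p, step):
    """Round a forecast to the nearest multiple of `step`."""
    return round(p / step) * step

random.seed(0)
# A fake, well-calibrated forecaster: forecasts sit close to the true
# probability, and outcomes are sampled from that true probability.
questions = []
for _ in range(10_000):
    truth = random.random()
    forecast = min(max(truth + random.gauss(0, 0.02), 0.0), 1.0)
    outcome = 1 if random.random() < truth else 0
    questions.append((forecast, outcome))

# Coarser granularity -> worse (higher) mean Brier score.
for step in (0.01, 0.05, 0.10, 0.50):
    mean = sum(brier(coarsen(f, step), o) for f, o in questions) / len(questions)
    print(f"granularity {step:.2f}: mean Brier score {mean:.4f}")
```

The rounding error only hurts if the fine-grained forecast was tracking something real, which is exactly what the rounding analysis showed for the superforecasters.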
7) Supernewsjunkies?
In the Good Judgment Project forecasters were allowed to update their forecasts in response to new events, and superforecasters updated much more often on average than regular forecasters. This raises the question of whether superforecasters' success can mostly be attributed to following the news closely and updating accordingly. If so, it would be easy to dismiss their success as the fairly trite recommendation to keep your predictions up to date.
However, this is not the case. Superforecasters' initial predictions were at least 50% more accurate than those of regular forecasters; even if the tournament had not allowed updating at all, superforecasters would have come out ahead with a huge lead. Updating is a subtle art that can be just as challenging as the initial prediction, if not more so. You can be underconfident in your updates and fail to see an outcome coming when the evidence should have made it seem more likely, or you can be overconfident and think things are more likely than they are.
The most common cause of underconfidence in updates is commitment to the original belief. Tetlock illustrates this with the example of Japanese internment: even after sabotage failed to materialize for years, those responsible refused to admit that their decision to imprison Japanese Americans was mistaken. It pays not to get too attached to any one belief. Tetlock further speculates that superforecasters, as non-domain-experts, have little reputation staked on the outcome of any individual question, which probably helps them avoid this bias.
Overconfidence in updates is also dangerous. Stock traders who see meaning in ephemeral day-to-day price moves, or who react to information that is inconsequential in the long run, end up posting worse returns than those who simply buy and hold. Superforecasters usually make many small updates that creep incrementally closer to the true answer. This is very similar in spirit to Bayes' theorem (touted as the prediction gold standard by outlets like LW and co., which go unmentioned in the book), and a natural question is whether you need to know the exact mechanics of Bayes to get good results. The answer is no. Most superforecasters updated in a Bayes-like way using subjective judgment but did not crunch the numbers of the actual theorem. This includes the top forecaster in the Good Judgment Project's third season, Tim Minto, who finished with a Brier score of 0.15.
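For reference, here's a minimal sketch of the odds form of Bayes' theorem that this incremental updating resembles; the starting probability and the likelihood ratios are made up for illustration.

```python
def bayes_update(prior, likelihood_ratio):
    """One update step in the odds form of Bayes' theorem:
    posterior odds = prior odds * likelihood ratio."""
    posterior_odds = (prior / (1 - prior)) * likelihood_ratio
    return posterior_odds / (1 + posterior_odds)

# Made-up example: start at 30% and fold in three pieces of news, each
# somewhat more likely if the event is coming than if it isn't (LR > 1).
p = 0.30
for lr in (1.5, 1.2, 2.0):
    p = bayes_update(p, lr)
    print(f"evidence with likelihood ratio {lr}: forecast is now {p:.2f}")
```

The superforecasters do this implicitly: each piece of news nudges the odds a little rather than resetting the forecast wholesale.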
8) Perpetual Beta
Superforecasters have a growth mindset. They are constantly on the lookout for ways to improve and run post-mortems to figure out what went wrong and how to do better next time. In retrospect it is often very difficult to reconstruct the mindset you had when you first made a forecast, so superforecasters often write long explanations of their original reasoning to make it easier to analyze what they did later.
In order to improve you must get clear feedback on what you did wrong. Studies of police officers' ability to distinguish truths from lies find that officers are generally overconfident and become more so as they gain experience: their confidence grows faster than their skill. Because they do not get clear feedback on whether they are actually telling truth from falsehood, they improve only marginally while getting the mistaken impression that experience has made them much better.
Superforecasters have grit and are not dissuaded by a string of bad luck or setbacks. They will go the extra mile: Tetlock gives the example of a woman (not a superforecaster) who asked a UN agency to send her data relevant to a question ahead of its scheduled publication. When it arrived in French, she asked if they could resend it in English. (They could, and did.)
9) Superteams
How well can superforecasters work in teams? Do teams help or hinder? Even before Tetlock had identified a single superforecaster he was considering whether teams make a forecaster more or less effective. Whatever he and his project partners felt about the question (for the record: they were unsure and very curious), it would be necessary to test how forecasters work in teams, because most real-world forecasts are made in a team setting and the findings would be vital for applying the research in the field.
As mentioned earlier in these notes, forecasters were assigned to various experimental groups: some worked in teams and some worked alone. Who performed better on average? At the end of year one, teams had performed 23% better than individuals, which made it clear that teams were the way for the GJP to go. But teams can also fail badly: the same set of Kennedy advisors bungled the Bay of Pigs invasion and then handled the Cuban Missile Crisis admirably.
Why the sudden difference in performance? Kennedy kept the same advisors but changed the process by which they made decisions. He brought in outsiders to ask pointed questions, temporarily suspended hierarchy so that it didn't stop junior members of the group from questioning senior ones, and would leave the room to let the team debate among itself without worrying about what he thought. Kennedy also kept his preconceived notion of what the plan should be close to his chest, so the group had a chance to come up with something better.
These are all great ways to improve the performance of a forecasting team. Tetlock's superforecasters had some initial difficulties when put into teams, mostly around giving criticism and figuring out who was and wasn't open to discussing their forecasts. These barriers were overcome with explicit requests for criticism and lengthy comments about why a forecast was made, so that others could discuss and critique it.
Superforecasting teams used a process that made them actively open-minded, cooperating toward a shared purpose so that correcting mistakes mattered more than individual egos. You want an opinionated team whose members engage each other in the pursuit of truth.
Pro-social giving behavior on the part of team members inspires better behavior in others. 'Givers', who give more than they take from teammates, often come out on top: the best individual score was held by a giver in year two and by another giver in year three. Diversity of team members' worldviews contributed significantly to performance, to the point where a colleague conjectured that diversity has more impact than raw ability.
10) The Leader's Dilemma
Superforecasting seems to require the opposite of the traits expected of good leadership. How many humble leaders can you name besides Gandhi? Tetlock argues that good leadership cleanly divides the decision-making phase from the action phase. Using the running example of the German Wehrmacht, he illustrates how decentralized leadership is more flexible and makes better decisions in the field than rigid command-and-control structures. The Wehrmacht ran on the principle of mission command, in which a commander tells subordinates the goal of an order but leaves the exact mechanics of accomplishing it to them, down through colonels, captains, sergeants, and privates. (This filters down the command chain, so that the orders a colonel gives are sub-goals of the general's orders.)
The Wehrmacht insisted on mental flexibility in study and in the field. During academy lessons, criticism of plans was expected from all levels of the organization, not just esteemed generals. The Wehrmacht was frighteningly effective, able to execute orders even when plans went awry or traditional leadership was unavailable for consultation. Its command manual was very clear on the need for decisive action, but balanced this with the need for sound decision making: consider a course of action, then execute it wholeheartedly, and switch only when it is clear the action in progress is doomed to failure, since switching carries high opportunity and morale costs. Mission command is now the doctrine of the United States military.
Humility should not be a lack of confidence in one's ability so much as a recognition of the complexity of the game one is playing. In war, poker, and many other human endeavors there is mind-boggling complexity to deal with, and even geniuses struggle. Tetlock also emphasizes that he deliberately used the example of the Wehrmacht because it makes people squirm; being able to look past that and see qualities worth emulating even in a hated enemy is important.
11) Are They Really So Super?
Daniel Kahneman and Nassim Taleb are skeptical of superforecasters. Kahneman thinks human bias is pretty much unavoidable, and was involved in a test of whether superforecasters suffer from scope insensitivity. They did better than regular forecasters on it: at least, they gave different probabilities for different timeframes of the Syrian regime collapsing. Taleb's criticism is that the kind of forecasting superforecasters do is trivial; history is ruled by very-low-probability, high-impact black swans, and if you can predict something it's probably not that important.
Tetlock's objection is that most 'black swans' are actually unpredictable events with predictable follow-on consequences. If intelligence analysts had been able to foresee the disasters of the wars in Iraq and Afghanistan, those disasters wouldn't have happened. 9/11 was a 'black swan' only because of its larger consequences, which were theoretically within human control. Furthermore, 9/11 was a scenario defense analysts had worried about for years beforehand; it was not completely unanticipated.
The rest is just an explanation of the concept of antifragility, why it makes sense, etc.
12) What's Next?
What's the ideal long-term impact of this research? Tetlock hopes that pundits and consumers of forecasts pay attention, because good forecasting can mean the difference between success and failure in a wide variety of human endeavors. He sees signs that a handful of pundits are already paying attention and expects more to follow. Tetlock also examines why we should expect change at all when forecasters have every incentive to avoid being held accountable. He uses the example of evidence-based medicine, where the old order of unaccountable physicians who hated the idea was overthrown, with the public's support, by those willing to be empirical.
Whether forecasting can be made more rigorous depends on how the informed public reacts to Tetlock's research. Right now forecasts are often judged on how well they support the tribal affiliation of the forecaster; Lenin would probably say that forecasts which are lies but serve the right political purpose accomplish their aims just fine. This does not mean change is impossible, but it does mean change will be difficult. We're already seeing an evidence-based revolution in legislation and charity. (He does not mention EA by name, but EA would certainly be considered part of it, I'm sure.)
Tetlock uses these final pages to consider a few more possible objections to his work. One is that the questions forecasters answer in the IARPA tournament aren't really important: we don't care whether North Korea attempts to launch a nuke so much as the big-picture outcome of what happens if it does. Of course, we can get at much of that big picture by asking many small questions. You can approach a big question like "Will there be a new war on the Korean Peninsula?" by considering all the factors that go into one and asking questions about each of them.
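As an illustration of how sub-questions might roll up into a big one, here's a toy Python sketch; the trigger list, the probabilities, and the independence assumption are all invented, and a real forecaster would argue over each of them.

```python
# A toy roll-up of sub-questions into a big question. The triggers,
# their probabilities, and the independence assumption are made up.

triggers = {
    "artillery exchange escalates": 0.04,
    "naval incident escalates": 0.02,
    "collapse of talks starts a provocation spiral": 0.05,
}

# P(at least one trigger fires) = 1 - P(none fire), treating the
# triggers as independent for simplicity.
p_none = 1.0
for p in triggers.values():
    p_none *= 1 - p
print(f"P(new war on the Korean Peninsula) = {1 - p_none:.2f}")
```

The decomposition also makes disagreements productive: two forecasters who differ on the big question can find exactly which sub-question they actually disagree about.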
Finally, Tetlock thinks his research may help resolve disputes that are unnecessarily partisan. He gives the example of Keynesian versus Austerian economics: many countries implemented policies closer to one philosophy or the other, so we should be able to tabulate the outcomes and update our beliefs about what did and didn't happen. We need to get serious about keeping score.