An Unpredicted Outcome of Our Exchange: Testable Hypotheses

Each of the three commentaries on our target essay highlights obstacles to distinguishing real from pseudo-expertise in the political realm—and raises issues that we welcome the chance to address. Robin Hanson poses a puzzle: Why are thoughtful people so reluctant to embrace technologies, such as prediction markets, that have a track record of improving predictive accuracy? John Cochrane defends beleaguered hedgehogs by proposing that the comparative advantage of hedgehogs lies in their deep understanding of causal linkages and their ability to generate accurate conditional forecasts, something that we failed to capture in the forecasting exercises in Expert Political Judgment. And Bruce Bueno de Mesquita brings into focus the dangers of treating the fox and hedgehog approaches to knowledge as mutually exclusive. He suggests that hybrid approaches, such as his own, are best positioned to optimize forecasting performance in a noisy and densely interdependent world.

Readers hoping for polemical sparks to fly will be disappointed. To be sure, we have our disagreements with the commentators. But we see the arguments they raise as reasonable—and, far more important, testable. We would not be upset if subsequent work found support for each. And we see no reason why reasonable people cannot come to agreement on what would count as fair tests of these hypotheses. In a nutshell, we see wonderful opportunities to advance knowledge—and let the ideological chips fall where they may—which is the spirit of both the Expert Political Judgment project and its IARPA sequel.

Robin Hanson and Forecasting Tournaments: Threatening to Elites and Enticing to Their Challengers

Hanson’s reply brought to mind the scene in the movie A Few Good Men in which Jack Nicholson, the colonel in charge of Guantánamo, is badgered by a young idealistic attorney, Tom Cruise, in a murder investigation. In response to the attorney’s earnest entreaties that he just wants the “truth,” Nicholson launches into a tirade: the idealistic attorney has no inkling of the ugly trade-offs that the defense of liberty requires. He wouldn’t recognize, much less tolerate, the truth if it stared him in the face.

That scene came to mind because Hanson finds so little real-world interest in the “truth.” Excellent econo-detective that he is, he explores the implications of this finding via deductive logic: (1) simple inexpensive tools, such as prediction markets, are already easily available for assessing accuracy and for improving it; (2) many people in business and government are aware of the efficacy of these tools but still do not embrace them; (3) so there must be either hidden costs to these tools beyond their nominal price tags or fewer benefits than advocates are supposing.

To appreciate these hidden costs, imagine that you are a renowned opinion guru, or possibly a lesser luminary in the same field. Multitudes (or just a few) read your commentaries. You are moderately famous and wealthy (or not). You may have a hard core of fans and a much larger group of quite regular readers who think you know what you’re talking about. You might be Tom Friedman or Paul Krugman or Holman Jenkins or David Brooks, or someone we’ve never heard of yet. How would you react if upstart academics—such as Robin Hanson or Phil Tetlock—pressed you to participate in a level-playing field exercise for assessing your relative forecasting accuracy: relative, that is, to rival pundits, relative to your readers, and relative to crude extrapolation algorithms or random-guess generators?

If you’re a big fish, you would be less than thrilled. What’s in it for you? If you perform poorly, you take a big reputation hit. Presumably your fans read your columns because they think they are getting a more accurate bead on where events are heading than they could obtain by spinning off predictions of their own or randomly throwing darts at a futures board, and they might be disappointed if disabused of that notion. And if you perform relatively well, you might take a hit anyway because you’re unlikely to do much better than your fans and quite likely to do a bit worse than simple extrapolation algorithms. You break even only in the extremely unlikely event that your forecasting performance matches the idealized image that your followers have of you. There just aren’t many six-sigma forecasters out there, by statistical definition. The likelihood of Tom Friedman delivering a performance that justifies his speaking fees is extremely low. The sensible thing is not to rock the boat: continue offering opinions that presuppose you must have deep insights into the future but at all costs never get pinned down.

The same logic applies to the smaller fish, whose opinions are not distributed daily to millions but may still carry cachet inside their companies. Open prediction contests will reveal how hard it is to outperform their junior assistants and secretaries. Insofar as technologies such as prediction markets make it easier to figure out who has better or worse performance over long stretches, prediction markets create exactly the sort of transparency that destabilizes status hierarchies. From this standpoint, it is impressive that Robin Hanson has gotten as wide a hearing as he has for his subversive agenda inside organizations—although that might be because elites must at least feign an interest in the truth.

If these hypotheses are correct, the prognosis for prediction markets (and for transparent competitions of relative forecasting performance) is grim. Epistemic elites are smart enough to recognize a serious threat to their dominance.

A powerful countervailing force needs to come into play to make the marketplace of ideas more vigorously competitive. But who or what should that be?

The IARPA forecasting exercise is a tentative step toward institutionalizing open prediction forums that allow testing the predictive power of a wide range of individuals, techniques, and methods of blending individual and technical approaches (we count prediction markets as one of the most promising of the technologies). The more widely known and respected such competitions become, the harder it will be for pundits to continue business as usual, which means appearing to make bold claims about future states of the world but never really doing so (because their “predictions” are typically vague and cushioned from disconfirmation by complex conditionals, as we will discuss below).

One desirable result would be the emergence of a Consumer Reports for political-economic punditry that tracks the relative accuracy in different policy domains of representatives of clashing perspectives. But if wishes were horses, beggars would ride. The challenges will be finding sponsors ready to bankroll the operation (government sponsorship is obviously problematic, so volunteers, please) and finding pundits ready to make clear claims that can be tested for accuracy (passing “the clairvoyance test”). These challenges are formidable, but in our view the potential long-term benefits are large indeed. You don’t need to reduce the likelihood of multi-trillion-dollar mistakes by much to make a multi-million-dollar investment pass the expected-value test.
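The expected-value test can be made concrete with a little arithmetic. The sketch below is only an illustration: the function name and the dollar figures are hypothetical placeholders, not estimates from our research. It computes the smallest reduction in the probability of a costly mistake that would repay a given investment.

```python
def break_even_risk_reduction(investment_cost: float, mistake_cost: float) -> float:
    """Smallest drop in the probability of the mistake that pays for the investment.

    The investment passes the expected-value test when
    (reduction in probability) * mistake_cost >= investment_cost.
    """
    if investment_cost < 0 or mistake_cost <= 0:
        raise ValueError("investment_cost must be >= 0 and mistake_cost must be > 0")
    return investment_cost / mistake_cost


# Hypothetical figures chosen only for illustration:
# a $10 million investment and a $2 trillion policy mistake.
investment = 10e6
mistake = 2e12
threshold = break_even_risk_reduction(investment, mistake)
print(f"break-even reduction in probability: {threshold:.6%}")
```

Under these assumed figures, the investment breaks even if it lowers the chance of the mistake by one part in 200,000.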

John Cochrane in Defense of Hedgehogs

Cochrane rises to the defense of hedgehogs—our shorthand designation for the style of thinking associated with the worst predictive performance in the forecasting tournaments reported in Expert Political Judgment. Our treatment of the pros and cons of the different thinking styles—foxes versus hedgehogs—was somewhat cursory in our target article, so we would like to offer some reassurance to Cochrane, who might be surprised to discover that each of his defenses has already been given a hearing either in the 2005 book or in the 2010 fifth anniversary symposium in Critical Review. And he might be even more surprised to learn that we believe there is at least potential merit in each of the defenses that he raises.

Let’s start by noting how easy it is to put a negative or positive political spin on both the fox and the hedgehog styles of reasoning. The negative spin, on which we admittedly focus in our target article, is that hedgehogs are both simpleminded and overconfident. They fall in love with one big idea and overapply that idea to a messy, exception-riddled empirical world. The positive spin on the hedgehog style of reasoning, which we largely ignored in our target article, is that hedgehogs can be bold visionaries who come up with incisive, parsimonious explanations and solutions. The positive spin on foxes is that they are creative eclectics who manage to avoid the greatest blunders of hedgehogs and often succeed in crafting viable compromise policies. The negative spin on foxes is that they are confused, indecisive, and unprincipled. For the record, we believe there are good historical illustrations of each phenomenon.

But it is one thing to find examples and quite another to keep systematic score of relative accuracy. And here hedgehogs simply do not fare well, at least so far. And that is why we doubt Cochrane’s hypothesis that hedgehogs lost to foxes in the initial round of forecasting tournaments because we asked for crude unconditional rather than subtle conditional forecasts.

That said, Cochrane’s hypothesis still deserves to be systematically tested. Unfortunately, it is very hard to test.

Consider a prototypic conditional forecast: if policy x is adopted, expect outcome y (within this range of subjective probabilities) but if policy v is adopted, expect outcome w. The problem is that assessing the accuracy of conditional forecasts requires first assessing whether the conditionals were realized or implemented. That means that policy x or v—exactly as the forecaster intended it—was enacted. When something close to x or v was enacted but something not all that close to outcome y followed, the forecasters have a lot of wiggle room. They can always insist—as forecasters did in chapter 4 of Expert Political Judgment—that the conditionals underlying their forecasts were not satisfied.
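To make the scoring problem concrete, here is a minimal sketch of how conditional forecasts might be scored. It assumes that a neutral judge has already ruled on whether each antecedent was realized, which is precisely the step that invites dispute. The class and function names are hypothetical, and the Brier score is used only as one familiar accuracy measure.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class ConditionalForecast:
    probability: float      # stated probability of outcome y, given policy x
    condition_met: bool     # judge's ruling: was policy x enacted as specified?
    outcome_occurred: bool  # did outcome y follow?


def conditional_brier(forecasts: Sequence[ConditionalForecast]) -> Optional[float]:
    """Mean Brier score over forecasts whose antecedent was realized.

    Forecasts whose condition was not met are left unscored. Returns None
    when no forecast can be scored at all.
    """
    scorable = [f for f in forecasts if f.condition_met]
    if not scorable:
        return None
    errors = [(f.probability - float(f.outcome_occurred)) ** 2 for f in scorable]
    return sum(errors) / len(errors)


# Hypothetical example: the second forecast drops out because its condition failed.
example = [
    ConditionalForecast(0.9, True, False),
    ConditionalForecast(0.8, False, False),
    ConditionalForecast(0.7, True, True),
]
score = conditional_brier(example)
print(score)
```

The wiggle room lives entirely in the `condition_met` field: a forecaster who can flip that ruling after the fact can remove any embarrassing miss from the record, which is why the ruling would need to be fixed by agreed criteria in advance.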

Putting the enormous practical difficulty of the task to the side, we also have two empirical reasons for doubting Cochrane’s hypothesis: (a) there is virtually zero evidence in Expert Political Judgment, or anywhere else, that experts are any better or worse at predicting policy shifts within regimes, or changes in regimes (the most common antecedents of conditionals), than they are at predicting economic, geopolitical, or military outcomes (the most common consequences of the conditionals); (b) there is considerable evidence that, if anything, experts become decreasingly accurate in judging the probabilities of increasingly complex conjunctions of events (see the scenario-induced inflation of subjective probabilities in chapter 7 of EPJ and “the conjunction fallacy” in Tversky and Kahneman 1983). For instance, forecasters, sometimes even really smart ones, often perversely assign a higher probability to a plausible-sounding conjunction of events (such as a dam collapse in California causing mass drowning) than they do to a single link in that argument that they judge less likely on its own (mass drowning in California, with or without a dam collapse). Of course, such a ranking is logically incoherent: a conjunction can never be more probable than either of its components, and the dam collapse is not the only path to mass drowning in California.
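The coherence requirement at stake here is simple enough to check mechanically. The sketch below is illustrative only: the function name and the probabilities are hypothetical, not data from any study. It flags a pair of judgments that violates the rule that a conjunction cannot be more probable than one of its components.

```python
def violates_conjunction_rule(p_conjunction: float, p_component: float,
                              tolerance: float = 1e-9) -> bool:
    """True if the judged probability of 'A and B' exceeds that of B alone.

    Probability theory requires P(A and B) <= P(B), so any judgment pair
    for which this returns True is incoherent.
    """
    for p in (p_conjunction, p_component):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie between 0 and 1")
    return p_conjunction > p_component + tolerance


# Hypothetical judgments from a single forecaster:
p_dam_collapse_and_drowning = 0.05  # vivid scenario, judged fairly plausible
p_drowning_any_cause = 0.02         # bare outcome, judged less likely

print(violates_conjunction_rule(p_dam_collapse_and_drowning, p_drowning_any_cause))
```

A forecaster who gives these two numbers has ranked the vivid story above the bare outcome that the story itself implies, and the check reports the violation.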

A more plausible defense of hedgehogs, in our view, is that relative to the foxes (in this context maybe better dubbed “chickens”), the hedgehogs are bold forecasters who account for a disproportionate number of grand slam home runs. When something really surprising happens (a dramatic decline or expansion of economic growth, or a dramatic increase or decrease in international tension), there are usually hedgehogs who can claim credit for the predictions. Of course, there is no free lunch. The price of these occasional spectacular hits is a high rate of false positives, which hedgehogs must be willing to defend on the grounds that the “hits” are worth it. And that is a value judgment. Hedgehogs tend to be overrepresented at the extremes of opinion distributions, and to make more extreme forecasts that reflect their distinctive value priorities.

Bruce Bueno de Mesquita and the Predictive Power of Game Theory

Bruce Bueno de Mesquita’s reaction essay reframes the hedgehog-fox distinction. He raises the plausible and quite testable hypothesis that the optimal approach to forecasting will prove over the long run to be one that combines the distinctive strengths of the more opportunistic foxes and the more theory-driven, deductive style of reasoning of hedgehogs. Bueno de Mesquita’s methodology arguably does this by drawing on the raw knowledge of area studies experts (who he thinks are more likely to be “foxes”) and then integrating this knowledge using basic principles of decision theory and game theory to arrive at predictions about what elite decision-making groups are likely to do in various scenarios.

He might be right. He is a brilliant and prolific political scientist. But we are from Missouri. We want to discover the answers to three categories of questions:

What happens when we pit Bueno de Mesquita’s approach against formidable competition on a level playing field under the control of a neutral, respected arbiter (such as IARPA) that is committed to making all data public and to laying out the scoring rules in advance?
Can we decompose his approach into its components? Decomposition of his multistage prediction-generation machine is crucial for identifying where the greatest value is added. For instance, it is one thing to say that his method outperforms 80% of the individual area-study experts who provide the informational inputs to the algorithms. It is quite another to show that the machine can outperform the average forecast of the same experts, the infamous wisdom-of-the-crowd effect. After all, we know that averaged forecasts are often better than 80% or more of the individuals from whom the averages were derived.
Can we ensure that Bueno de Mesquita’s system is indeed pitted against the most formidable competitors in academia and the private sector? As noted above, simple averaging of experts is often a very tough benchmark to beat. And it is easy to imagine refinements that might be tougher still, such as giving greater weight to forecasters with a track record of being better calibrated, to forecasters who have been given various forms of training in probabilistic reasoning, or to forecasters who are interacting in prediction markets or other types of social systems for information aggregation.
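The averaging benchmark mentioned in these questions can be sketched in a few lines. Everything below is simulated and hypothetical: the noise model, the number of experts, and the function names are assumptions made for illustration, and the numbers it prints say nothing about real experts. It scores an unweighted average forecast against each of the individuals who contributed to it.

```python
import random
from statistics import mean


def brier(forecasts, outcomes):
    """Mean squared error between probability forecasts and 0/1 outcomes."""
    return mean((p - o) ** 2 for p, o in zip(forecasts, outcomes))


def crowd_versus_individuals(forecasts_by_expert, outcomes):
    """Score the unweighted average forecast against each contributing expert."""
    # zip(*...) regroups the forecasts question by question.
    crowd = [mean(column) for column in zip(*forecasts_by_expert)]
    crowd_score = brier(crowd, outcomes)
    individual_scores = [brier(f, outcomes) for f in forecasts_by_expert]
    beaten = sum(s > crowd_score for s in individual_scores)
    return crowd_score, individual_scores, beaten / len(individual_scores)


# Simulated world (assumption): each expert sees the true probability plus
# independent noise, and forecasts are clipped to the interval [0, 1].
rng = random.Random(0)
true_probs = [rng.random() for _ in range(200)]
outcomes = [1 if rng.random() < p else 0 for p in true_probs]
experts = [
    [min(1.0, max(0.0, p + rng.gauss(0, 0.25))) for p in true_probs]
    for _ in range(20)
]

crowd_score, individual_scores, share_beaten = crowd_versus_individuals(experts, outcomes)
print(f"crowd Brier score: {crowd_score:.3f}")
print(f"mean individual Brier score: {mean(individual_scores):.3f}")
print(f"share of experts beaten by the average: {share_beaten:.0%}")
```

One property holds regardless of the data: because squared error is convex, the Brier score of the averaged forecast can never be worse than the mean of the individual Brier scores. How many individuals the average actually beats is an empirical matter, which is why any method that draws on expert inputs should be compared against the average of those same inputs and not only against the individuals.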

Again, we do not know which hypotheses will fare well and which will bite the dust. But we are glad to share our current best guess. For many categories of forecasting problems, we are likely to bump into the optimal forecasting frontier quite quickly. There is an irreducible indeterminacy to history and no amount of ingenuity will allow us to predict beyond a certain point. But no one knows exactly where that point lies. And it is worth finding out in as dispassionate and non-ideological a fashion as we human beings are capable of mustering.
