Overcoming Our Aversion to Acknowledging Our Ignorance

Each December, The Economist forecasts the coming year in a special issue called The World in Whatever-The-Next-Year-Is. It’s avidly read around the world. But then, like most forecasts, it’s forgotten.

The editors may regret that short shelf-life some years, but surely not this one. Even now, only halfway through the year, The World in 2011 bears little resemblance to the world in 2011. Of the political turmoil in the Middle East—the revolutionary movements in Tunisia, Egypt, Libya, Yemen, Bahrain, and Syria—we find no hint in The Economist’s forecast. Nor do we find a word about the earthquake/tsunami and consequent disasters in Japan or the spillover effects on the viability of nuclear power around the world. Or the killing of Osama bin Laden and the spillover effects for al Qaeda and Pakistani and Afghan politics. So each of the top three global events of the first half of 2011 was as unforeseen by The Economist as the next great asteroid strike.

This is not to mock The Economist, which has an unusually deep bench of well-connected observers and analytical talent. A vast array of other individuals and organizations issued forecasts for 2011 and none, to the best of our knowledge, correctly predicted the top three global events of the first half of the year. None predicted two of the events. Or even one. No doubt, there are sporadic exceptions of which we’re unaware. So many pundits make so many predictions that a few are bound to be bull’s eyes. But it is a fact that almost all the best and brightest—in governments, universities, corporations, and intelligence agencies—were taken by surprise. Repeatedly.

That is all too typical. Despite massive investments of money, effort, and ingenuity, our ability to predict human affairs is impressive only in its mediocrity. With metronomic regularity, what is expected does not come to pass, while what isn’t, does.

In the most comprehensive analysis of expert prediction ever conducted, Philip Tetlock assembled a group of some 280 anonymous volunteers—economists, political scientists, intelligence analysts, journalists—whose work involved forecasting to some degree or other. These experts were then asked about a wide array of subjects. Will inflation rise, fall, or stay the same? Will the presidential election be won by a Republican or Democrat? Will there be open war on the Korean peninsula? Time frames varied. So did the relative turbulence of the moment when the questions were asked, as the experiment went on for years. In all, the experts made some 28,000 predictions. Time passed, the veracity of the predictions was determined, the data analyzed, and the average expert’s forecasts were revealed to be only slightly more accurate than random guessing—or, to put it more harshly, only a bit better than the proverbial dart-throwing chimpanzee. And the average expert performed slightly worse than a still more mindless competition: simple extrapolation algorithms that automatically predicted more of the same.
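How does one grade such predictions? Probability forecasts are conventionally scored with a rule such as the Brier score, the average squared gap between the probability a forecaster states and what actually happens, with lower scores being better. The sketch below is purely illustrative; every number is invented rather than taken from the study.

```python
# A purely illustrative sketch of scoring probability forecasts with the Brier
# score and comparing them against simple baselines. All numbers are invented;
# they are not data from the Tetlock study.

def brier(p, outcome):
    """Squared error between a stated probability and an outcome coded 1 or 0."""
    return (p - outcome) ** 2

def mean_brier(probs, outcomes):
    return sum(brier(p, o) for p, o in zip(probs, outcomes)) / len(outcomes)

# Five hypothetical questions, coded 1 if the status quo changed, 0 if it held.
outcomes = [0, 0, 0, 0, 1]

expert_probs = [0.6, 0.3, 0.4, 0.2, 0.3]     # one expert's stated probabilities of change
chimp_probs = [0.5] * len(outcomes)          # "dart-throwing chimp": 50/50 on everything
extrapolation_probs = [0.2] * len(outcomes)  # crude "more of the same" baseline

for name, probs in [("expert", expert_probs),
                    ("chimp", chimp_probs),
                    ("extrapolation", extrapolation_probs)]:
    print(name, round(mean_brier(probs, outcomes), 3))

# With these made-up numbers the expert edges out the chimp while the
# extrapolation baseline scores best of all, echoing the pattern described above.
```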

Cynics resonate to these results and sometimes cite them to justify a stance of populist know-nothingism. But we would be wrong to stop there, because Tetlock also discovered that the experts could be divided roughly into two overlapping yet statistically distinguishable groups. One group would actually have been beaten rather soundly even by the chimp, not to mention the more formidable extrapolation algorithm. The other would have beaten the chimp and sometimes even the extrapolation algorithm, although not by a wide margin.

One could say that this latter cluster of experts had real predictive insight, however modest. What distinguished the two groups was not political ideology, qualifications, access to classified information, or any of the other factors one might think would make a difference. What mattered was the style of thinking.

One group of experts tended to use one analytical tool in many different domains; they preferred keeping their analysis simple and elegant by minimizing “distractions.” These experts zeroed in on only essential information, and they were unusually confident—they were far more likely to say something is “certain” or “impossible.” In explaining their forecasts, they often built up a lot of intellectual momentum in favor of their preferred conclusions. For instance, they were more likely to say “moreover” than “however.”

The other lot used a wide assortment of analytical tools, sought out information from diverse sources, were comfortable with complexity and uncertainty, and were much less sure of themselves—they tended to talk in terms of possibilities and probabilities and were often happy to say “maybe.” In explaining their forecasts, they frequently shifted intellectual gears, sprinkling their speech with transition markers such as “although,” “but,” and “however.”

Using terms drawn from a scrap of ancient Greek poetry, the philosopher Isaiah Berlin once noted how, in the world of knowledge, “the fox knows many things but the hedgehog knows one big thing.” Drawing on this ancient insight, Tetlock dubbed the two camps hedgehogs and foxes.

The experts with modest but real predictive insight were the foxes. The experts whose self-concepts of what they could deliver were out of alignment with reality were the hedgehogs.

It’s important to acknowledge that this experiment involved individuals making subjective judgments in isolation, which is hardly the ideal forecasting method. People can easily do better, as the Tetlock experiment demonstrated, by applying formal statistical models to the prediction tasks. These models outperformed all comers: chimpanzees, extrapolation algorithms, hedgehogs, and foxes.

But as we have surely learned by now—please repeat the words “Long-Term Capital Management”—even the most sophisticated algorithms have an unfortunate tendency to work well until they don’t, which goes some way to explaining economists’ nearly perfect failure to predict recessions, political scientists’ talent for being blindsided by revolutions, and fund managers’ prodigious ability to lose spectacular quantities of cash with startling speed. It also helps explain why so many forecasters end the working day with a stiff shot of humility.

Is this really the best we can do? The honest answer is that nobody really knows how much room there is for systematic improvement. And, given the magnitude of the stakes, the depth of our ignorance is surprising. Every year, corporations and governments spend staggering amounts of money on forecasting and one might think they would be keenly interested in determining the worth of their purchases and ensuring they are the very best available. But most aren’t. They spend little or nothing analyzing the accuracy of forecasts and not much more on research to develop and compare forecasting methods. Some even persist in using forecasts that are manifestly unreliable, an attitude encountered by the future Nobel laureate Kenneth Arrow when he was a young statistician during the Second World War. When Arrow discovered that month-long weather forecasts used by the army were worthless, he warned his superiors against using them. He was rebuffed. “The Commanding General is well aware the forecasts are no good,” he was told. “However, he needs them for planning purposes.”

This widespread lack of curiosity—lack of interest in thinking about how we think about possible futures—is a phenomenon worthy of investigation in its own right. Fortunately, however, there are pockets of organizational open-mindedness. Consider a major new research project funded by the Intelligence Advanced Research Projects Activity, a branch of the intelligence community.

In an unprecedented “forecasting tournament,” five teams will compete to see who can most accurately predict future political and economic developments. One of the five is Tetlock’s “Good Judgment” Team, which will measure individual differences in thinking styles among 2,400 volunteers (e.g., fox versus hedgehog) and then assign volunteers to experimental conditions designed to encourage alternative problem-solving approaches to forecasting problems. The volunteers will then make individual forecasts which statisticians will aggregate in various ways in pursuit of optimal combinations of perspectives. It’s hoped that combining superior styles of thinking with the famous “wisdom of crowds” will significantly boost forecast accuracy beyond the untutored control groups of forecasters who are left to fend for themselves.
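The essay leaves those aggregation methods unspecified, but a toy sketch conveys the basic idea behind combining many individual forecasts into one. Both the numbers and the accuracy-weighting scheme below are hypothetical illustrations, not the tournament’s actual procedure.

```python
# Toy illustration of aggregating individual probability forecasts into a single
# "crowd" forecast. The numbers and the weighting scheme are invented; they do
# not describe the tournament's actual statistical methods.

forecasts = [0.65, 0.80, 0.55, 0.70, 0.60]   # five volunteers' probabilities for one question

# 1. Unweighted average: the plain "wisdom of crowds".
simple_average = sum(forecasts) / len(forecasts)

# 2. Accuracy-weighted average: volunteers with better track records
#    (lower past Brier scores) get proportionally more influence.
past_brier = [0.20, 0.10, 0.30, 0.15, 0.25]  # hypothetical track records
weights = [1.0 / b for b in past_brier]
weighted_average = sum(w * f for w, f in zip(weights, forecasts)) / sum(weights)

print(round(simple_average, 2), round(weighted_average, 2))   # 0.66 0.69
```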

Other teams will use different methods, including prediction markets and Bayesian networks, but all the results will be directly comparable, and so, with a little luck, we will learn more about which methods work better and under what conditions. This sort of research holds out the promise of improving our ability to peer into the future.

But only to some extent, unfortunately. Natural science has discovered in the past half-century that the dream of ever-growing predictive mastery of a deterministic universe may well be just that, a dream. There increasingly appear to be fundamental limits to what we can ever hope to predict. Take the earthquake in Japan. Once upon a time, scientists were confident that as their understanding of geology advanced, so would their ability to predict such disasters. No longer. As with so many natural phenomena, earthquakes are the product of what scientists call “complex systems,” or systems which are more than the sum of their parts. Complex systems are often stable not because there is nothing going on within them but because they contain many dynamic forces pushing against each other in just the right combination to keep everything in place. The stability produced by these interlocking forces can often withstand shocks, but even a tiny change in some internal condition at just the right spot and just the right moment can throw off the internal forces just enough to destabilize the system—and the ground beneath our feet that has been so stable for so long suddenly buckles and heaves in the violent spasm we call an earthquake. Barring new insights that shatter existing paradigms, it will forever be impossible to make time-and-place predictions in such complex systems. The best we can hope to do is get a sense of the probabilities involved. And even that is a tall order.

Human systems like economies are complex systems, with all that entails. And bear in mind that human systems are not made of sand, rock, snowflakes, and the other stuff that behaves so unpredictably in natural systems. They’re made of people: self-aware beings who see, think, talk, and attempt to predict each other’s behavior—and who are continually adapting to each other’s efforts to predict each other’s behavior, adding layer after layer of new calculations and new complexity. All this adds new barriers to accurate prediction.

When governments the world over were surprised by this year’s events in the Middle East, accusing fingers were pointed at intelligence agencies. Why hadn’t they seen it coming? “We are not clairvoyant,” James R. Clapper Jr., director of national intelligence, told a hearing of the House intelligence committee. Analysts were well aware that forces capable of generating unrest were present in Tunisia, Egypt, and elsewhere. They said so often. But those forces had been present for years, even decades. “Specific triggers for how and when instability would lead to the collapse of various regimes cannot always be known or predicted,” Clapper said.

That is a considerable understatement. Remember that it was a single suicidal protest by a lone Tunisian fruit seller that set off the tumult, just as an infinitesimal shift can apparently precipitate an earthquake. But even after the unrest had begun, predicting what would follow and how it would conclude was a fool’s errand because events were contingent on the choices of millions of people, and those choices were contingent on perceptions that could and did change constantly. Say you’re an Egyptian. You’re in Cairo. You want to go to the protest but you’re afraid. If you go and others don’t, the protest will fail. You may be arrested and tortured. But if everyone goes, you will have safety in numbers and be much likelier to win the day. Perhaps. It’s also possible that a massive turnout will make the government desperate enough to order soldiers to open fire. Which the soldiers may or may not do, depending in part on whether they perceive the government or the protestors to have the upper hand. In this atmosphere, rumors and emotions surge through the population like electric charges. Excitement gives way to terror in an instant. Despair to hope. And back again. What will people do? How will the government react? Nothing is certain until it happens. And then many pundits declare whatever happened was inevitable. Indeed, they saw it coming all along, or so they believe in hindsight.

So we are not blind but there are serious limits to how far we can see. Weather forecasting is a useful model to keep in mind. We joke about weather forecasters but they have some good mental habits we should all practice: making explicit predictions and revising them in response to clear, timely feedback. The net result is that weather forecasters are among the best calibrated of all professional groups studied—up there with professional bridge players. They have a good sense for what they do and do not know.
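Calibration has a concrete meaning here: across all the occasions on which a forecaster says “70 percent,” the event should occur roughly 70 percent of the time. Here is a minimal sketch of such a check, with invented forecasts standing in for the real thing.

```python
# Minimal sketch of a calibration check: group forecasts by the probability the
# forecaster stated, then compare that stated probability with how often the
# event actually occurred. The records below are invented for illustration.
from collections import defaultdict

# (stated probability of rain, did it rain? 1 = yes, 0 = no)
records = [(0.1, 0), (0.1, 0), (0.3, 0), (0.3, 0), (0.3, 1),
           (0.7, 1), (0.7, 1), (0.7, 0), (0.9, 1), (0.9, 1)]

buckets = defaultdict(list)
for stated, happened in records:
    buckets[stated].append(happened)

for stated in sorted(buckets):
    observed = sum(buckets[stated]) / len(buckets[stated])
    print(f"said {stated:.0%}: happened {observed:.0%} of the time")

# A well-calibrated forecaster's stated and observed frequencies track each
# other closely; large, systematic gaps are the signature of overconfidence.
```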

But well calibrated does not mean omniscient. As weather forecasters well know, their accuracy extends out only a few days. Three or four days out, their forecasts are noticeably less accurate. Beyond a week, you might as well flip a coin. As scientists learn more about weather, and computing power and sophistication grow, this forecasting horizon may be pushed out somewhat, but there will always be a point beyond which meteorologists cannot see, even in theory.

We call this phenomenon the diminishing marginal predictive returns of knowledge.

In political and economic forecasting, we reach the inflection point surprisingly quickly. It lies in the vicinity of attentive readers of high-quality news outlets, such as The Economist. The predictive value added of Ph.D.s, tenured professorships, and Nobel Prizes is not zero, but it is disconcertingly close to zero.

So we should be suspicious of pundits waving credentials and adopt the old trust-but-verify mantra: test the accuracy of forecasts and continually be on the lookout for new methods that improve results. We must also accept that even if we were to do this on a grand scale, and our forecasts were to become as accurate as we can possibly make them, there would still be failure, uncertainty, and surprise. And The World in Whatever-The-Next-Year-Is would continue to look quite different from the world in whatever the next year is.

It follows that we also need to give greater consideration to living with failure, uncertainty, and surprise.

Designing for resiliency is essential, as New Zealanders discovered in February when a major earthquake struck Christchurch and killed 181 people. When a somewhat larger earthquake struck Haiti in 2010, it killed hundreds of thousands. The difference? New Zealand’s infrastructure was designed and constructed to withstand an earthquake, whenever it might come. Haiti’s wasn’t.

Earthquakes are among the least surprising surprises, however. The bigger test is the truly unexpected shock. That’s when the capacity to respond is critical, as Canada demonstrated following the financial meltdown of 2008. For a decade prior to 2008, Canada’s federal government ran budgetary surpluses and used much of that money to pay down accumulated debt. When the disaster struck, the economy tipped into recession, and the government responded with an array of expensive policies. The budget went into deficit, and the debt-to-GDP ratio rose, but by both measures Canada continued to be in far better shape than most other developed countries. If further shocks come in the immediate future, Canada has plenty of capacity to respond—unlike the United States and the many other countries that did not spend a decade strengthening their fiscal foundations.

Accepting that our foresight will always be myopic also calls for decentralized decision-making and a proliferation of small-scale experimentation. Test the way forward, gingerly, one cautious step at a time. “Cross the river by feeling for the stones,” as the wily Deng Xiaoping famously said about China’s economic liberalization. Only madmen are sure they know what the future holds; only madmen take great leaps forward.

There’s nothing terribly controversial in this advice. Indeed, it’s standard stuff in any discussion of forecasting and uncertainty. But critical caveats are seldom mentioned.

There’s the matter of marginal returns, for one. As with most things in life, the first steps in improving forecasting are the easiest and cheapest. It doesn’t take a lot of analysis to realize that goats’ entrails and tea leaves do a very poor job of weather forecasting, and it takes only a little more analysis to discover that meteorologists’ forecasts are much better, and that switching from the former to the latter makes sense even though the latter costs more than the former. But as we make further advances in weather forecasting, we are likely to find that each incremental improvement will be harder than the last, delivering less benefit at greater cost. So when do we say that further advances aren’t worth it?

The same is true of resiliency. Tokyo skyscrapers are built to the highest standards of earthquake resistance because it is close to certain that in their lifespan they will be tested by a major earthquake. Other skyscrapers in other cities not so prone to earthquakes could be built to the same standards but that would raise the cost of construction substantially. Is that worth doing? And if we accept a lower standard, how high is enough? And what about all the other low-probability, high-impact events that could strike? We could spend a few trillion dollars building a string of orbital defenses against killer asteroids. If that seems like a waste, what about the few hundred million dollars it would take to spot and track most asteroids? That may seem like a more reasonable proposition, but remember that some asteroids are likely to escape our notice. Not to mention comets. Or the many other shocks the universe could conceivably hurl at us. There’s no limit to what we can spend preparing for unpleasant surprises, so how much is enough?

And notice what we have to do the moment we try to answer a question like, “Is it worth constructing this skyscraper so it is more resistant to major earthquakes?” The answer depends on many factors but the most important is the likelihood that the skyscraper will ever have to resist a major earthquake. Happily, we’re good at determining earthquake probabilities. Less happily, we’re far from perfect. One reason why the Japanese disaster was so devastating is that an earthquake of such magnitude wasn’t expected where it occurred. Even less happily, we’re far better at determining earthquake probabilities than at determining the probabilities of countless other important phenomena we want and need to forecast. Energy supplies. Recessions. Revolutions. There’s a very long list of important matters about which we really have no choice but to make probability judgments even though the evidence suggests our methods aren’t working a whole lot better than goats’ entrails and tea leaves.
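Behind that question sits a simple expected-value comparison, which is why the probability estimate carries so much weight. A toy calculation, with every figure invented purely for illustration:

```python
# Toy expected-cost comparison for a single resiliency decision. Every figure is
# invented; the point is only that the answer hinges on the probability estimate.

extra_construction_cost = 50_000_000    # added cost of quake-resistant design
loss_if_unreinforced = 2_000_000_000    # expected damage if a major quake hits an ordinary building
loss_if_reinforced = 200_000_000        # expected damage if it hits a reinforced one

def expected_costs(p_quake):
    """Return (expected cost if we reinforce, expected cost if we do not)."""
    reinforce = extra_construction_cost + p_quake * loss_if_reinforced
    do_nothing = p_quake * loss_if_unreinforced
    return reinforce, do_nothing

for p in (0.001, 0.01, 0.1):
    reinforce, do_nothing = expected_costs(p)
    print(f"p={p}: reinforce ${reinforce:,.0f} vs. do nothing ${do_nothing:,.0f}")

# The sensible choice flips as the assumed probability rises, so a bad
# probability estimate quietly decides the question for us.
```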

The optimist thinks that’s fabulous because it means there’s lots of room for improvement. The pessimist stockpiles dry goods and ammunition. They both have a point.

The optimists are right that there is much we can do at a cost that is quite modest relative to what is often at stake. For example, why not build on the IARPA tournament? Imagine a system for recording and judging forecasts. Imagine running tallies of forecasters’ accuracy rates. Imagine advocates on either side of a policy debate specifying in advance precisely what outcomes their desired approach is expected to produce, the evidence that will settle whether it has done so, and the conditions under which participants would agree to say “I was wrong.” Imagine pundits being held to account. Of course arbitration only works if the arbiter is universally respected and it would be an enormous challenge to create an analytical center whose judgments were not only fair, but perceived to be fair even by partisans dead sure they are right and the other guys are wrong. But think of the potential of such a system to improve the signal-to-noise ratio, to sharpen public debate, to shift attention from blowhards to experts worthy of an audience, and to improve public policy. At a minimum, it would highlight how often our forecasts and expectations fail, and if that were to deflate the bloated confidence of experts and leaders, and give pause to those preparing some “great leap forward,” it would be money well spent.

But the pessimists are right, too, that fallibility, error, and tragedy are permanent conditions of our existence. Humility is in order, or, as Socrates said, the beginning of wisdom is the admission of ignorance. The Socratic message has always been a hard sell, and it still is—especially among practical people in business and politics, who expect every presentation to end with a single slide consisting of five bullet points labeled “The Solution.”

We have no such slide, unfortunately. But in defense of Socrates, humility is the foundation of the fox style of thinking and much research suggests it is an essential component of good judgment in our uncertain world. It is practical. Over the long term, it yields better calibrated probability judgments, which should help you assign more realistic odds than your competitors do to policy bets panning out.

Humble works. Or it is at least superior to the alternative.
