On April 20, 2010, mud began gushing out of the well onto the drilling floor at Deepwater Horizon. Seconds later a geyser of water and mud sprayed up inside the derrick at the center of the giant rig; gas sensors went off everywhere; and the lights went out. One explosion was followed by a second, bigger blast, and a fireball, hundreds of feet high, enveloped the rig. Eleven workers died in the accident that night, and the blowout that caused the explosions sent 200 million gallons of oil gushing into the Gulf of Mexico in one of the worst environmental disasters in history. BP’s costs associated with the spill: more than $50 billion.
Other catastrophes aren’t physical but digital. When the markets opened on August 1, 2012, the Knight Capital Group was one of Wall Street’s largest traders, but less than an hour later the company was on the brink of collapse. A software glitch had caused the firm’s trading system to go haywire and flood the market with four million unintended orders, resulting in Knight Capital acquiring several billion dollars in unwanted positions. When the firm eventually sold back those stocks, it had lost $460 million, roughly $200,000 per second of the trading meltdown. By the next day, three-quarters of Knight Capital’s market value had been erased. The firm scrambled to avoid collapse and was eventually acquired by a competitor.
Disasters like BP’s oil spill and Knight Capital’s trading meltdown can threaten the very existence of even the largest of corporations. And such failures aren’t limited to high-stakes, exotic domains like deep-water drilling and electronic trading. From food-safety accidents in restaurant chains to defect recalls affecting car manufacturers, failures abound in ordinary industries and can devastate profits, trigger legal actions, and cause lasting reputational damage.
To make matters worse, these dangers are increasing, according to many business leaders. In a recent survey of more than 1,000 executives in a wide range of industries, nearly 60 per cent reported that the volume and complexity of the risks their organizations face have increased substantially in the past half-decade. At the same time, only a minority reported that their organization had implemented a complete firm-wide process for enterprise risk management.
So, what can executives do to reduce the risk of catastrophic failures in their organizations? Traditional risk management steps — such as instituting rules and controls, scenario planning, and bringing in additional experts — can be quite helpful, but they have their limitations as the complexity increases. For example, a rule-based approach — identifying the things that could go wrong, instituting procedures to prevent them, and enforcing those procedures through monitoring — often fails to capture the breadth of potential risks and may instead foster a punitive culture that causes people to conceal risks. The use of scenario planning to identify risks is a more sophisticated approach, but it has problems of its own, sometimes leading decision makers to focus on a potentially narrow set of risks and responses based on scenarios that are vivid and easy to imagine. All too often, scenario planning also fails to capture the messy complexity of interconnected systems and organizations, as well as the chaos and fallibility of crisis responses. Research in numerous industries likewise reveals fundamental limits of relying on expert ability. For example, teams dominated by subject-matter experts are often vulnerable to group overconfidence and might suppress valuable input from non-expert skeptics. Such group dynamics are especially likely to yield bad outcomes in complex and uncertain environments. At the same time, researchers are increasingly uncovering other interventions that can improve decisions, strengthen complex systems, and reduce catastrophic risks. In our book, Meltdown: Why Our Systems Fail and What We Can Do About It, Chris Clearfield and I discuss several best practices in depth; here’s a summary of a few of them:
- Learn from incidents. In complex systems, it’s impossible to predict all of the possible paths to catastrophe. But even so, there are often emerging signals that can bring to light any interactions and risks that might otherwise be unexpected and hidden. Indeed, a timeline of the weeks and months leading up to a major failure is often a history of smaller failures, near misses, glaring irregularities, and other indications that something might be amiss. Incident tracking is a powerful way to learn from such signals, and there are notable success cases. In healthcare and aviation, for example, effective incident reporting systems help managers sort through the overwhelming haystack of possible warning signs to identify sources of potentially catastrophic errors. In recent years, such systems have proliferated in other industries as well. But these systems are effective only if employees feel safe enough to report issues and if the output is actually used to generate insights and effect change. To do so, it’s essential to designate a specific group, with sufficient understanding of operational concerns, to sort through, analyze, and prioritize incoming information. In the absence of this, insights can be lost even when critical data are available. Moreover, once information is recorded and analyzed, people must use it to generate insights about the root causes of those incidents and to fix problems without delay, rather than simply relegating it to a risk report. Emerging insights can then be disseminated throughout the organization. When used in this way, incident reporting systems can enable decision makers to anomalize, that is, to treat minor errors and lapses as distinctive and potentially significant details rather than as normal, familiar events.
- Encourage dissent. Insiders often have serious reservations about the decisions or procedures in place well before a major accident, but they either fail to share these concerns or are ignored. Many of those who observe warning signs – typically, employees on the front lines – feel uncomfortable disclosing errors, expressing dissenting views, and questioning established procedures. To counter these tendencies, it’s important for leaders to cultivate what researcher Amy Edmondson calls psychological safety: a shared belief among team members that the group will not admonish or penalize individuals for speaking up and challenging established procedures or widely held views. Psychological safety requires a climate in which team members trust and respect one another regardless of differences in their formal status. Research has shown that, through their words and actions, executives can do a great deal to foster psychological safety in a team or even within an entire organization. This requires that leaders credibly signal that they are willing to consider and address challenging questions and dissenting voices openly and productively, rather than defensively. These kinds of leadership behaviours help demonstrate that it’s safe to raise questions, to admit mistakes, and to disagree with the team’s consensus – critical steps in understanding where hidden dangers might be lurking in a complex system.
- Use structured decision tools. One way to reduce the number of small errors that might cascade into larger failures is to mitigate the effect of cognitive biases in decision making. The use of structured decision tools, rather than intuitive thinking, can lessen the influence of some of those biases. Cognitive psychologists, for example, have proposed a list of questions that executives can use to detect and minimize the effect of cognitive biases when making major decisions based on a recommendation from their team. For example, is the worst case bad enough? Were dissenting opinions adequately explored? And could the diagnosis of the situation have been overly influenced by salient analogies? Many of these questions are quite straightforward and seemingly obvious but, in practice, they are rarely raised explicitly. A checklist ensures that these questions are actually considered, thus helping executives to apply quality control to their decisions. Similarly, decision tools can also reduce the effect of cognitive biases in predictions. For instance, a simple tool called Subjective Probability Interval EStimates (SPIES) has been shown to produce less overconfident estimates than do unstructured, intuitive forecasting approaches.
- Diversify teams. Building teams of individuals with diverse professional backgrounds and expertise can be an effective risk management strategy. Research on bank boards, for example, suggests that banks with some non-expert directors – those with a background in other fields such as law, the public sector, or the military – tend to be less likely to fail than banks with directors who all come from a banking background. It seems that having a mix of industry experts and non-experts can serve as an effective safeguard against overconfidence on a board. These outsiders often raise inconvenient questions and force bankers on the board to justify their proposals and explain why formerly unacceptable risks might have become acceptable. In addition, even surface-level diversity – diversity in team members’ visible characteristics like sex, age, and race – might help reduce the overconfidence of decision makers. Recent research, for example, suggests that the mere presence of ethnic diversity can reduce overconfidence in the actions of others, thus fostering greater scrutiny and more deliberate thinking.
- Conduct risk reviews. A risk review is a structured audit of an organization by external investigators who gather qualitative and quantitative data to uncover hidden and unexpected risks to the organization. The investigators, who are typically independent experts on risk management in complex systems and organizations, begin the review by conducting confidential interviews with a variety of personnel at different levels in the organizational hierarchy, from higher-level executives to junior employees working on the front lines. The goal of these interviews is to reveal potential risks that might not be visible at a given hierarchical level or within a particular organizational silo. The interviews can also provide an indication of the willingness of employees to share their concerns and dissenting opinions with supervisors. Next, to examine the most important issues raised in the confidential interview process, the investigators gather additional qualitative or quantitative data from surveys, additional interviews, or the organization’s archives. Because a risk review leverages independent generalist experts and cuts across hierarchical and bureaucratic boundaries within the organization, it’s particularly suitable for uncovering risks that are created by internal decision-making processes and organizational structures.
It’s also an effective guard against risk creep. Although the gradual slide toward increasingly risky practices tends to be imperceptible to insiders, outsiders can often recognize it and help ensure that unacceptable risks are challenged and mitigated. Of course, a risk review will only be effective if executives are open to the investigators’ conclusions, even if that information might occasionally be uncomfortable, disconcerting, and perhaps painful to hear. Otherwise the investigators’ main advantage – their independent external perspective, allowing them to question industry and company assumptions and conventional practices, to poke holes in arguments, and to disagree with the existing consensus – can easily be lost.
- Develop more realistic contingency plans. It’s essential for organizations to develop robust crisis planning and response capabilities. During that process, executives need to recognize that estimates for worst-case scenarios are often explicitly or implicitly built from information that is biased by observations of recent orderly behaviour and the assumption that the mitigations outlined in a crisis response plan will actually work. To identify possible planning failures, decision makers can rely on independent outsiders to stress-test critical estimates in plans, to explore extreme scenarios, and to challenge optimistic assumptions about organizational performance during a crisis. This can lead to more realistic worst-case scenarios and the development of crisis response plans that are more robust. To avoid the pitfall of illusory redundancy, managers should carefully assess whether their backup plans are susceptible to the same risks as their regular operations. Rather than quickly narrowing their focus to the technical merits and challenges of a particular solution, executives should define the broad goals of the intended redundancy and identify counterexamples for which backup measures might also be vulnerable. The goal is for people to shift their perspective and see redundancy as a vulnerable part of the system rather than as an invincible panacea.
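To make the idea of a structured forecasting tool such as SPIES more concrete, here is a minimal sketch in Python. It assumes the forecaster has divided the full range of possible outcomes into contiguous bins and assigned a probability to each one. The function names, the example bins, and the rule of spreading each bin's probability evenly across the bin are illustrative assumptions, not the published SPIES procedure.

```python
from typing import List, Tuple

# Each bin is (lower bound, upper bound, probability assigned by the forecaster).
Bin = Tuple[float, float, float]


def _quantile(bins: List[Bin], q: float) -> float:
    """Value below which a fraction q of the probability mass lies.

    Assumption for this sketch: each bin's probability is spread
    evenly between its lower and upper bound.
    """
    cumulative = 0.0
    for low, high, p in bins:
        if p > 0 and cumulative + p >= q:
            return low + (high - low) * (q - cumulative) / p
        cumulative += p
    # Reached only through rounding error; fall back to the top of the range.
    return bins[-1][1]


def spies_interval(bins: List[Bin], confidence: float = 0.9) -> Tuple[float, float]:
    """Central interval implied by the probabilities assigned to the bins."""
    total = sum(p for _, _, p in bins)
    if total <= 0:
        raise ValueError("at least one bin needs a positive probability")
    # Rescale so the probabilities sum to one even if the inputs do not.
    normalized = [(low, high, p / total) for low, high, p in bins]
    tail = (1.0 - confidence) / 2.0
    return _quantile(normalized, tail), _quantile(normalized, 1.0 - tail)


# Hypothetical example: weeks of delay on a project, as judged by one forecaster.
example_bins = [(0, 2, 0.10), (2, 4, 0.30), (4, 8, 0.40), (8, 16, 0.20)]
low, high = spies_interval(example_bins, confidence=0.9)
```

With these hypothetical numbers, the 90 per cent interval runs from roughly 1 to 14 weeks. The point of the exercise is that the forecaster has to consider the entire range of outcomes before an interval is derived, rather than naming a narrow range directly.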
These recommendations are not rocket science. They also don’t require large financial investments or expensive technologies. That, however, does not mean that they are easy to implement. Indeed, getting organizations to heed dissenting voices, learn from small anomalies, and open themselves to independent scrutiny can be a difficult leadership challenge. And it’s often extremely hard to change deeply ingrained routines for planning and decision making.
The good news is that these interventions don’t necessarily clash with other key organizational priorities. Although it may seem that paying more attention to risk reduction, accident prevention, and safety will necessarily undermine a firm’s focus on innovation and profits, the above-described solutions can actually enhance multiple organizational objectives. Team psychological safety, for example, is not only an effective safeguard against catastrophic risks but also a critical factor in the effectiveness and creativity of teams, as recent research at Google has revealed. Similarly, interventions that minimize the effect of cognitive biases in decision-making can not only reduce catastrophic risks but might also increase investment returns. Better management of catastrophic risks, it seems, can also lead to better management more generally.
About the author
András Tilcsik, who is Hungarian-born, holds the Canada Research Chair in Strategy, Organizations, and Society at the Rotman School of Management and is a faculty fellow at the Michael Lee-Chin Family Institute for Corporate Citizenship. In 2015, he and Chris Clearfield won the Bracken Bower Prize from McKinsey and the Financial Times, given to the best business book proposal by scholars under 35. The book, Meltdown: Why Our Systems Fail and What We Can Do About It, is forthcoming (Penguin, 2018). Tilcsik was shortlisted for the 2017 Thinkers50/Brightline Initiative Strategy Award.
This is an excerpt from Strategy@Work, a Brightline and Thinkers50 collaboration bringing together the very best thinking and insights in the field of strategy and beyond. A version of this article originally appeared as “Managing the Risk of Catastrophic Failure” in Survive and Thrive: Winning Against Strategic Threats to Your Business (edited by Joshua Gans and Sarah Kaplan), Dog Ear Publishing, 2017. Reprinted by permission.