Many of us have, at some point, heard about Diane Vaughan’s Normalization of Deviance, a theory elaborated after the Challenger disaster in 1986. Deviance stands for a departure from established norms, usually towards less safety. Normalization refers to the organization-wide acceptance of that departure, which establishes new norms. Other related models developed at different times include the concepts of drift or practical drift, which all aim to describe that behaviour.
I recently attended Lund’s Human Factors and Systems Safety Learning Lab as part of their MSc program. The week covered a lot of theory, and one of the surprising bits I learned about was the theory of Fine-Tuning. The term was coined by William H. Starbuck and Frances J. Milliken in Challenger: Fine-Tuning the Odds Until Something Breaks, also written in the aftermath of Challenger but published in the Journal of Management Studies in 1988.
I found it interesting because, among the many models concerning themselves with the history of organizations, it is the one that most accurately reflects the dynamics I’ve encountered in the software industry, particularly in the startup space where safety is not necessarily a priority but an openly negotiable property.
At a high level, fine-tuning is described as the process that results from engineers and managers pursuing partially inconsistent goals while trying to learn from experience. It builds on three theories about how past outcomes relate to future ones, any of which people might adopt:
- Past successes or failures do not impact future successes or failures: this is the statistical view of independent processes. The previous coin flip has no impact on the next one. If your risk analysis predicts certain rates of failure, you should expect outcomes in line with them.
- Past successes make future successes less probable, past failures make future failures less likely: this is an approach where you assume that being successful causes people to let up and stop making continuous adjustments. Failures, on the other hand, encourage people to make adjustments to prevent recurrence.
- Past success makes future success more likely, past failures make future failures more likely: this theory aligns with the idea that success comes from competence, and failures reveal deficiencies.
Each of these theories can vary over time and across parts of an organization, and each has criticisms against it. For example, Theory 1 only holds in sociotechnical systems if the hardware, procedures, and people’s knowledge remain mostly unchanged; generally, they change. Theory 2 is often adopted after failures, but less so after successes, where tweaks are seen as improving efficiency rather than eroding safety. Theory 3 relies on learning mechanisms (which are not guaranteed to provide good or bad safety results):
These learning mechanisms – buffers, slack resources, and programs – offer many advantages: they preserve some of the fruits of success, and they make success more likely in the future. They stabilize behaviors and enable organizations to operate to a great extent on the basis of habits and expectations instead of analyses and communications.
[...]
But these learning mechanisms also carry disadvantages. In fact, each of the advantages has a harmful aspect. People who are acting on the basis of habits and obedience are not reflecting on the assumptions underlying their actions. People who are behaving simply and predictably are not improving their behaviors or validating their behaviors’ appropriateness. Organizations that do not pay careful attention to their environments’ immediate demands tend to lose track of what is going on in those environments. Organizations that have discretion and autonomy with respect to their environments tend not to adapt to environmental changes; and successful organizations want to keep their worlds as they are, so they try to stop social and technological changes.
In short, for Theory 3, despite the ability to learn and adjust, the same mechanisms often result in people not seeing problems, threats or opportunities.
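To make the contrast between the three theories concrete, here is a toy sketch of my own (not from the paper, and with entirely arbitrary numbers): each theory can be read as a different rule for how the risk of the next flight responds to the last outcome.

```python
import random

def run(update_risk, initial_risk=0.05, flights=50, seed=42):
    """Simulate a sequence of flights where the failure risk of the next
    flight evolves according to one of the three theories."""
    random.seed(seed)
    risk, failures = initial_risk, 0
    for _ in range(flights):
        failed = random.random() < risk
        failures += failed
        risk = update_risk(risk, failed)
    return failures, round(risk, 4)

def theory1(risk, failed):
    # Theory 1: outcomes are independent; the risk never moves.
    return risk

def theory2(risk, failed):
    # Theory 2: success lets people let up (risk creeps upward),
    # while failure prompts adjustments (risk drops back down).
    return max(0.01, risk - 0.03) if failed else min(0.5, risk + 0.01)

def theory3(risk, failed):
    # Theory 3: success reflects competence that carries forward (risk shrinks),
    # while failure reveals deficiencies that compound (risk grows).
    return min(0.5, risk * 1.5) if failed else max(0.001, risk * 0.9)

for name, rule in [("Theory 1", theory1), ("Theory 2", theory2), ("Theory 3", theory3)]:
    print(name, "-> (failures, final risk):", run(rule))
```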
This is where the paper takes multiple pages describing the history of Challenger’s o-rings, and how they have been handled and re-engineered over time. I’ll skip it for the sake of focusing on fine-tuning itself. One thing the authors mention is that it appears people at NASA followed Theory 3 (their own past successes were seen as signs of ongoing successes), which ironically makes observers see Theory 2 (past successes increase risk of failure) as more realistic:
The organization’s members grow more confident, of their own abilities, of their managers’ skill, and of their organization’s existing programmes and procedures. They trust the procedures to keep them apprised of developing problems, in the belief that these procedures focus on the most important events and ignore the least significant ones.
In gaining confidence with their technology, NASA went from an experimental to an operational mindset, reduced testing and maintenance, while increasing payloads and efficiency. This is where fine-tuning is introduced.
Fine-tuning is an optimization process, based on negotiating tradeoffs:
Although an organization is supposed to solve problems and to achieve goals, it is also a conflict-resolution system that reconciles opposing interests and balances countervailing goals. [...] Further, every serious problem entails real-world contradictions, such that no action can produce improvement in all dimensions and please all evaluators.
[...]
Opposing interests and countervailing goals frequently express themselves in intraorganizational labour specializations, and they produce intraorganizational conflicts. An organization asks some members to enhance quality, some to reduce costs, and others to raise revenue; and these people find themselves arguing about the trade-offs between their specialized goals. The organization’s members may seek to maintain internal harmony by expelling the conflicts to the organization’s boundary, or even beyond it. [...] But conflicts between organizations destroy their compatibility, and an organization needs compatibility with its environment just as much as it needs internal cohesion. Intraorganizational conflict enables the organization to resolve some contradictions internally rather than letting them become barriers between the organization and its environment.
Basically, as conflicting goals get assigned to distinct people, these people adopt their goals and end up embodying the related conflict, which then needs to be resolved. The authors assert that NASA’s issues around Challenger showed up both across organizations (between NASA and Thiokol) and within them, as managers and engineers opposed each other: the engineers valued safety (with wide margins) while the managers sought efficiency.
Fine-tuning is what happens when organizations navigate these conflicts and resolve them over time. Safety factors are wasteful if they turn out to be genuinely unneeded (four spare tires might be overkill in a car), and they can also be a source of hazards and complexity that taxes other components. So if a design ships with large safety factors, pressure builds to reduce them:
An initial design is only an approximation, probably a conservative one, to an effective operating system. Experience generates information that enables people to fine-tune the design: experience may demonstrate the actual necessity of design characteristics that were once thought unnecessary; it may show the danger, redundancy, or expense of other characteristics; and it may disclose opportunities to increase utilization. Fine-tuning compensates for discovered problems and dangers, removes redundancy, eliminates unnecessary expense, and expands capacities. Experience often enables people to operate a sociotechnical system for much lower cost or to obtain much greater output than the initial design assumed.
They note that since engineers expect managers to cut costs, they pad their numbers further in anticipation; and since managers tend to bear the responsibility of resolving goal conflicts, it is unsurprising that they often overrule their engineers. The authors add:
Formalized safety assessments do not resolve these arguments, and they may exacerbate them by creating additional ambiguity about what is truly important. Engineering caution and administrative defensiveness combine to proliferate formalized warnings and to make formalized safety assessments unusable as practical guidelines.
For example, Challenger included 8,000 critical components, 277 of which were involved at launch time. Paying "exceptional attention" to this many items is difficult. This list is also not stable. Over time, elements such as these were in play:
- The Thiokol o-ring design was adapted from the Titan solid rocket booster’s and had shown no serious problems
- Thiokol nevertheless added a second o-ring for redundancy
- The criticality rating of the joints was reduced over time, since by then managers thought a failure was essentially impossible
- Changes were made to the system to trim the booster’s weight by 2% and increase its thrust by 5%
- Shuttle flights kept succeeding even when they showed imperfect seals or o-ring damage
- Improvements were planned for the following years, but managers thought the o-rings were less dangerous than engineers assumed
Some changes from this list might have gone back as far as 1982, meaning their contribution to an erosion of safety took more than four years to become undeniable.
However, the authors warn us not to jump to conclusions:
The most important lesson to learn from the Challenger disaster is not that some managers made the wrong decisions or that some engineers did not understand adequately how O-rings worked: the most important lesson is that fine-tuning makes failures very likely.
Fine-tuning changes always have plausible rationales, so they generate benefits most of the time. But fine-tuning is real-life experimentation in the face of uncertainty, and it often occurs in the context of very complex sociotechnical systems, so its outcomes appear partially random. [...]
Fine-tuning changes constitute experiments, but multiple, incremental experiments in uncontrolled settings produce confounded outcomes that are difficult to interpret. Thus, much of the time, people only discover the content and consequences of an unknown limitation by violating it and then analysing what happened in retrospect.
Fine-tuning can be seen as a series of experiments that probe the limits of knowledge, which means they will keep going so long as the consequences are acceptable. It is difficult, and sometimes impossible, to detect the full effects of this process beyond its clear benefits (it is obvious when you increase thrust and reduce weight, but far less visible how this may affect the o-rings, even with testing). These discoveries happen across complex sociotechnical systems where goal conflicts are actively negotiated from varied perspectives. As such, only larger-scale failures can suspend the process and bring it to a temporary halt.
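Since the paper frames fine-tuning as incremental experiments against limits nobody fully knows, a tiny simulation (again my own sketch, with made-up numbers such as `true_limit` and `step`) captures the shape of it: every step has a plausible rationale and pays off, right up until the one that reveals the limit, which is only identified in retrospect.

```python
import random

def fine_tune(true_limit=0.62, load=0.30, step=0.03, seed=7):
    """Each cycle trims margin / raises load a little because every previous
    cycle went fine; the real limit is unknown to the operators, and noise
    makes outcomes near the limit look partly random."""
    random.seed(seed)
    for cycle in range(1, 200):
        observed = load + random.uniform(-0.05, 0.05)  # noisy, confounded signal
        if observed > true_limit:
            return cycle, load  # the limit is only discovered in retrospect
        # Success: the change had a plausible rationale and paid off,
        # so the next increment looks just as reasonable.
        load += step
    return None, load

cycle, load_at_failure = fine_tune()
print(f"failed on cycle {cycle} at load {load_at_failure:.2f}; "
      f"every earlier cycle looked like a pure efficiency win")
```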
The authors conclude that we must learn from disasters:
We may need disasters in order to halt erroneous progress. We have difficulty in distinguishing correct inferences from incorrect ones when we are making multiple, incremental experiments with incompletely understood, complex systems in uncontrolled settings; and sometimes we begin to interpret our experiments in erroneous, although plausible frameworks. Incremental experimentation also produces gradual acclimatization that dulls our sensitivities, both to phenomena and to costs and benefits. [...]
[...] We benefit from disasters only if we learn from them. Dramatic examples can make good teachers. They grab our attention and elicit efforts to discover what caused them, although few disasters receive as much attention as Challenger. In principle, by analysing disasters, we can learn how to reduce the costs of failures, to prevent repetitions of failures, and to make failures rarer.
But learning from disasters is neither inevitable nor easy. Disasters typically leave incomplete and minimal evidence. [...] Retrospection often creates an erroneous impression that errors should have been anticipated and prevented. [...] Effective learning from disasters may require looking beyond the first explanations that seem to work, and addressing remote causes as well as proximate ones
There are a lot of posts on here about what can help make learning effective, but as a personal comment, I found it very interesting to be introduced to this concept of fine-tuning. As mentioned earlier, the dynamics of ongoing refinement, adding to or removing from a system to make it more efficient, economical, or reliable based on experiments, very much line up with my experience of the tech industry. Even the idea of error budgets in typical SRE speak fits this definition much better than parallel concepts such as Normalization of Deviance, Drift, or Practical Drift. This doesn’t mean that none of them can coexist within organizations or parts of the industry, but clearly stating that fine-tuning is how systems negotiate their conflicts, and that it may be behind disasters, was quite interesting. We may often start from a minimal version (an MVP) that improves over time, but the process is similar.
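As an aside, here is a rough sketch of the error-budget arithmetic I have in mind (my own example, not anything from the paper): by turning a reliability target into a spendable budget, the tradeoff between safety and change becomes explicitly negotiated, which is fine-tuning done in the open.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability for a given SLO over a window."""
    return (1.0 - slo) * window_days * 24 * 60

slo = 0.999                       # "three nines" availability target
budget = error_budget_minutes(slo)
spent = 12.0                      # hypothetical minutes of downtime so far this window

print(f"budget: {budget:.1f} min, spent: {spent:.1f} min, "
      f"remaining: {budget - spent:.1f} min")
# Teams then negotiate: remaining budget gets spent on risky deploys,
# experiments, or trimmed redundancy; an exhausted budget halts them.
```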