I was also a Physics student at the time and had studied the work of Richard Feynman in both high school and university, so I was even more interested when he was appointed to the Rogers Commission that investigated the root causes of Challenger's loss. I remember very well watching video clips of the proceedings, and at one point there was some debate about the pliability of the O-rings at low temperatures. While experts were opining about this and that, Dr. Feynman famously put a sample of the O-ring material in his glass of ice water, let it sit for a few moments, then took it out and showed how it had lost pliability. Once again, one experiment is worth a thousand expert opinions.
Fast forward to 2004, and I was a presenter at DPI Canada's Professional Development Week in Ottawa. After I had given my session on Transitioning to Agile, I attended one of the keynote talks, given by Mike Mullane. Mike is a former Shuttle astronaut who knew the Challenger crew (and some of the Columbia crew who died in 2003). He gave a talk called Countdown to Teamwork, which was funny and inspiring. In that talk, I was introduced to a term that has stuck with me for 7 years, and one that I believe the software world needs to learn and heed.
That term was Normalization of Deviance.
In the ensuing years I've poked around at the term, discussing it with others in the software and Agile community, and often speaking about it with clients. After some quick research I found that the term had been coined by sociologist Diane Vaughan, while writing her book The Challenger Launch Decision. Ms. Vaughan had spent many years investigating the culture of NASA and attempting to find the root cause or causes of what led to the loss of Challenger.
She wrote of how the culture at NASA had become so focused on hitting launch dates that once-unacceptable situations or conditions had become acceptable risks, mainly because nothing bad had happened yet. As originally designed, the O-rings on the solid rocket boosters were to suffer no erosion at all from the hot gases inside the combustion chamber. However, some erosion was occurring on each flight. The engineers made some changes and the erosion, while it still occurred, was stable.
In other words, a once-unacceptable condition - erosion of the O-rings - was now deemed acceptable. The deviance had been normalized. Management even applied spin to the process... an O-ring that had been eroded by 1/3 of its diameter was deemed to have a "safety factor" of 3!
Twenty-five years ago, in 1986, it was unseasonably cold in the Cape Canaveral area of Florida, with temperatures dropping below freezing during the night of January 27th and into the morning of January 28th. The Challenger sat on the launch pad, receiving a "cold soak". Remember Dr. Feynman's experiment? Well, in the cold temperatures the O-rings didn't flex like they were supposed to, and what had been partial erosion of the O-rings became a complete breach, leading to the destruction of the Challenger 73 seconds after liftoff and the loss of the astronauts on board.
So, what does all of this have to do with software development? How does Normalization of Deviance apply?
Ask yourself this question: when did anything more than 0 defects in software become acceptable, and then expected?
We have normalized the deviance of improperly built software to the point that people are actually nervous if no defects are found. There's a saying among the test community: No program has 0 defects - it only has ones we haven't found yet!
I know what you're thinking... "But, Dave, you're being naive. We only need to ship something good enough to market in order to be successful. Look at Windows 95!! It simply isn't cost effective to write perfect or near perfect software."
Or, possibly... "But, Dave, you're being naive. We aren't building web sites here - we build (insert their product here)."
I may have bought those arguments back when I was first starting to get into the software development profession, which was about the time Challenger was destroyed. My experience since, and certainly over the past 10 years that I've been using XP and Agile methods, says that near-perfect software isn't only achievable and cost effective, but that it may become necessary for society's sake.
From a cost perspective, look at your team or organization. How much time did you spend fixing defects in the last 12 months? Did you miss any deadlines or have to remove promised features from a release at the last minute because the massive testing effort was still finding defects days before release? How many field issues do you get from customers? Do you need a support team as big as your development team?
All in the name of short term cost savings.
The issues with the O-rings in the Space Shuttle's solid rocket boosters were known in 1977, 9 years before Challenger was lost. The segmented design of the boosters was a problem from the start. So why was this design used? NASA had proposed a single segment design in the first place, but Congress balked at the cost. The problematic multi-segmented design was the lower cost option.
Endeavour, the shuttle that replaced Challenger, cost about 2 billion US dollars to build. The Shuttle program was halted for 2.5 years, and many design changes were made to the shuttle fleet to improve safety. The halt and these changes also cost hundreds of millions of dollars.
If Congress had approved the higher funding in the first place, it would have cost a fraction of that. Seven astronauts likely would still be alive today as well.
So, think about all this when someone tells you that you can't possibly use Test-Driven Development because writing all those tests takes too long. Think about it when someone questions why you're wasting company time by Pair Programming. Think about it when it's suggested that automating Acceptance Tests will be expensive because it takes so long and you won't be able to ship as many features. Think about it when someone tells you not to waste time Refactoring because the code is good enough already. Think about it when someone doesn't want you to spend time automating a build because it's complex and will eat a few days of your time. Think about it when someone says that defect free software is a fallacy.
We know how to write code that is near defect free, and we know that it makes us go faster and thus costs less in the medium and certainly the long term. My own experience is that in time periods of anything greater than a week or two it's faster and thus less costly to just "do it right" than to cut corners and hope for the best.
So much of our society today relies on software that we can no longer afford to think that we can't afford to write defect-free software. We have normalized the deviance of that single first defect because nothing really, really bad has happened yet. All it took was a cold day in January, 25 years ago, to prove what happens when we become complacent with those sorts of risks.