I’ve just watched Sidney Decker’s “System Failure, Human Error: Who’s to Blame” talk from DevOpsDays Brisbane 2014. It’s a very nice and worthwhile talk, though there is some noise.
You can watch it here:
It covers a number of interesting stories and publications from the last 100 years of history related to failures and disasters, their causes and prevention.
Very quick summary from memory (but the video surely has more depth):
- Shit happens. Why?
- Due to human physical, mental or moral weaknesses – claim from early XX century, repeated till today.
- One approach (equated to MBA): these weak and stupid people need to be told what to do by the more enlightened elites.
- Bad apples – 20% people are responsible for 80% accidents. Just find them and hunt them down? No, because it’s impossible to account for different conditions of every case. Maybe the 20% bus drivers with the most accidents drive in busy city center? Maybe the 20% doctors with most patient deaths are infant surgeons – how can we compare them to GPs?
- Detailed step-by-step procedures and checklists are very rarely possible. When they are, though, they can be very valuable. This happens mostly in industries and cases backed by long and thorough research – think piloting airplanes and space shuttles, surgery etc.
- Breakthrough: Maybe these humans are not blame? Maybe the failures are really a result of bad design, conditions, routine, inconvenience?
- Can disasters be predicted and prevented?
- Look for deviations – “bad” things that are accepted or worked around until they become the norm.
- Look for early signs of trouble.
- Design so that it’s harder to do the wrong thing, and easier and more convenient to do the right thing.
A number of stories follows. Now, this is a talk from DevOps conference, and there are many takeaways in that area. But it clearly is applicable outside DevOps, and even outside software development. It’s everywhere!
- The most robust software is one that’s tolerant, self-healing and forgiving. Things will fail for technical reasons (because physics), and they will have bugs. Predict them when possible and put in countermeasures to isolate failures and recover from them. Don’t assume your omnipotence and don’t blame the others, so it goes. See also the Systems that Run Forever Self-heal and Scale talk by Joe Armstrong and have a look at the awesome Release It! book by Michael Nygard.
- Make it easy for your software to do the right thing. Don’t randomly spread config in 10 different places in 3 different engines. Don’t require anyone to stand on two toes of their left foot in the right phase of the moon for doing deployment. Make it mostly run by itself and Just Work, with installation and configuration as straightforward as possible.
- Make it hard to do the wrong thing. If you have a “kill switch” or “drop database” anywhere, put many guards around it. Maybe it shouldn’t even be enabled in production? Maybe it should require a special piece of config, some secret key, something very hard to happen randomly? Don’t just put in a red button and blame the operator for pressing it. We’re all in the same team and ultimately our goal is to help our clients and users win.
The same principles apply to user interface design. Don’t randomly put in a bunch of forms and expect the users to do the right thing. If they have a workflow, learn it and tailor the solution that way. Make it harder for end users to make mistakes – separate opposite actions in the GUI, make the “negative” actions harder to execute.
Have a look at the above piece of Gmail GUI. Note how the “Send” button is big and prominent, and how far away and small the trash is. There is no way you could accidentally press one when you meant the other.
Actually, isn’t all that true for all the products that we really like to use, and the most successful ones?