I’ve just watched Sidney Dekker’s “System Failure, Human Error: Who’s to Blame” talk from DevOpsDays Brisbane 2014. It’s a very nice and worthwhile talk, though there is some noise.
It covers a number of interesting stories and publications from the last 100 years of history related to failures and disasters, their causes and prevention.
Very quick summary from memory (but the video surely has more depth):
- Shit happens. Why?
- Due to human physical, mental or moral weaknesses – a claim from the early 20th century, still repeated today.
- One approach (equated with the MBA mindset): these weak and stupid people need to be told what to do by more enlightened elites.
- Bad apples – 20% of people are responsible for 80% of accidents. Just find them and hunt them down? No, because it’s impossible to account for the different conditions of every case. Maybe the 20% of bus drivers with the most accidents drive in a busy city center? Maybe the 20% of doctors with the most patient deaths are infant surgeons – how can we compare them to GPs?
- Detailed step-by-step procedures and checklists are very rarely possible. When they are, though, they can be very valuable. This happens mostly in industries and cases backed by long and thorough research – think piloting airplanes and space shuttles, surgery etc.
- Breakthrough: Maybe these humans are not to blame? Maybe the failures are really a result of bad design, conditions, routine, inconvenience?
- Can disasters be predicted and prevented?
- Look for deviations – “bad” things that are accepted or worked around until they become the norm.
- Look for early signs of trouble.
- Design so that it’s harder to do the wrong thing, and easier and more convenient to do the right thing.
A number of stories follow. Now, this is a talk from a DevOps conference, and there are many takeaways in that area. But it clearly is applicable outside DevOps, and even outside software development. It’s everywhere!
- The most robust software is tolerant, self-healing and forgiving. Things will fail for technical reasons (because physics), and they will have bugs. Predict failures where possible and put in countermeasures to isolate them and recover from them. Don’t assume omnipotence and don’t blame others. See also the Systems that Run Forever Self-heal and Scale talk by Joe Armstrong and have a look at the awesome Release It! book by Michael Nygard.
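As a toy illustration (not from the talk), one of the smallest such countermeasures is a retry wrapper that recovers from transient failures instead of crashing the caller. The names here are made up for the sketch:

```python
import time

def with_retries(operation, attempts=3, delay=0.01):
    """Run operation(), retrying on failure instead of propagating it at once.

    A tiny countermeasure for transient faults; a real system would add
    backoff, jitter and a circuit breaker (see Release It!).
    """
    last_error = None
    for _ in range(attempts):
        try:
            return operation()
        except Exception as exc:
            last_error = exc
            time.sleep(delay)     # brief pause before trying again
    raise last_error              # give up after the configured attempts
```

A flaky dependency that fails twice and succeeds on the third call is then invisible to the caller.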
- Make it easy for your software to do the right thing. Don’t randomly spread config across 10 different places in 3 different engines. Don’t require anyone to stand on two toes of their left foot under the right phase of the moon to do a deployment. Make it mostly run by itself and Just Work, with installation and configuration as straightforward as possible.
- Make it hard to do the wrong thing. If you have a “kill switch” or “drop database” anywhere, put many guards around it. Maybe it shouldn’t even be enabled in production? Maybe it should require a special piece of config, some secret key, something very hard to happen randomly? Don’t just put in a red button and blame the operator for pressing it. We’re all in the same team and ultimately our goal is to help our clients and users win.
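A minimal sketch of such a guard, with entirely hypothetical names: the destructive action only fires in production when both an environment flag and an explicit confirmation phrase are supplied, so it is hard to trigger by accident but still possible on purpose.

```python
import os

class GuardedSwitch:
    """Hypothetical guard around a destructive action ('drop database' etc.)."""

    def __init__(self, environment, action):
        self.environment = environment
        self.action = action

    def fire(self, confirmation=None):
        if self.environment == "production":
            # Guard 1: a deliberate out-of-band flag must be set.
            if os.environ.get("ALLOW_DESTRUCTIVE") != "yes":
                raise PermissionError("destructive actions disabled in production")
            # Guard 2: the caller must type an explicit confirmation phrase.
            if confirmation != "I know what I am doing":
                raise PermissionError("missing explicit confirmation phrase")
        return self.action()
```

Neither guard is clever on its own; the point is that pressing the red button now takes two independent, intentional steps.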
The same principles apply to user interface design. Don’t randomly put in a bunch of forms and expect the users to do the right thing. If they have a workflow, learn it and tailor the solution that way. Make it harder for end users to make mistakes – separate opposite actions in the GUI, make the “negative” actions harder to execute.
Have a look at the above piece of Gmail GUI. Note how the “Send” button is big and prominent, and how far away and small the trash is. There is no way you could accidentally press one when you meant the other.
Actually, isn’t all that true for all the products that we really like to use, and the most successful ones?
Here’s a summary of a few interesting technical details from a very good presentation on designing REST APIs by Les Hazlewood. Many interesting, elegant and not so obvious solutions here.
Keep resources coarse-grained, especially if you’re designing a public API and you don’t know the use cases.
- Resources are either collections or single instances. They’re always an object, a noun. Don’t mix the URLs with behavior. /getAccount and /createDirectory can easily explode to /getAllAccounts, /searchAccounts, /createLdapDirectory…
- Collection is at /accounts, instance at /accounts/a1b2c3. Create, read, update and delete with GET/POST/PUT/PATCH/DELETE… Everyone knows, but anyway.
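The whole noun-based surface reduces to two URL shapes and the standard methods. A sketch with made-up names (not from the talk):

```python
# Illustrative routing table: two URL shapes cover the uniform interface.
ROUTES = {
    ("GET",    "/accounts"):      "list the collection",
    ("POST",   "/accounts"):      "create a new instance in the collection",
    ("GET",    "/accounts/{id}"): "read one instance",
    ("PUT",    "/accounts/{id}"): "replace an instance",
    ("PATCH",  "/accounts/{id}"): "partially update an instance",
    ("DELETE", "/accounts/{id}"): "delete an instance",
}

def describe(method, url_template):
    """Look up what a (method, URL shape) pair means in this scheme."""
    return ROUTES.get((method, url_template), "not part of the uniform interface")
```

Verb-style URLs like /getAccount simply have no slot in this table, which is the point.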
- Media types:
- application/json – regular JSON.
- application/foo+json – JSON of type foo. In my understanding, roughly corresponding to XML schema or DTD.
- application/foo+json;application – JSON of type foo, where the response type (entity format?) is application.
- application/json, text/plain – JSON preferred, text acceptable.
- Versioning can ride in the media type too, e.g. application/foo+json;v=1 – the version travels in the Accept/Content-Type header rather than in the URL.
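Such parameterized media types can be split with the standard library alone; this is just one way to do it, shown on a made-up type:

```python
from email.message import Message

def parse_media_type(value):
    """Split e.g. 'application/foo+json; v=1' into the bare type and its parameters."""
    msg = Message()
    msg["Content-Type"] = value
    params = msg.get_params()            # first entry is the bare media type
    return params[0][0], dict(params[1:])
```

A server can then branch on the version parameter without inspecting the URL at all.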
- Use hrefs (links) instead of raw IDs to reference resources.
- Sample use: every resource carries its own href; an instance reference is a small object holding just that link, and references inside a collection work the same way.
- Reference (link) expansion: let the client ask for a linked resource to be embedded inline instead of being returned as a bare link.
- Partial representations: let the client ask for only the fields it needs.
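The original slides had concrete payloads here; since those didn’t survive my notes, here is a hedged reconstruction with made-up URLs and field names, sketching href references, link expansion and partial representations:

```python
# Toy "database" keyed by href; all URLs and fields are invented for the sketch.
DIRECTORIES = {
    "https://api.example.com/directories/d1": {
        "href": "https://api.example.com/directories/d1",
        "name": "Avengers",
    },
}

account = {
    "href": "https://api.example.com/accounts/a1b2c3",
    "givenName": "Tony",
    # Instance reference: just a link, no embedded data.
    "directory": {"href": "https://api.example.com/directories/d1"},
}

def expand(resource, link_name, store):
    """Replace a bare link with the full linked resource (?expand=... behaviour)."""
    out = dict(resource)
    out[link_name] = store[resource[link_name]["href"]]
    return out

def partial(resource, fields):
    """Return only the requested fields (?fields=... behaviour)."""
    return {k: v for k, v in resource.items() if k in fields}
```

The client decides per request how fat the representation should be; the server never needs separate URL variants for it.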
- Many-to-many – represent as another, special resource, e.g. /groupMembership. Not so obvious potential benefits: If you make it a resource, it’s easy to reference from either end of the association. If you want, you can easily add data to the association (e.g. creation date, responsible user, whatever).
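A hypothetical shape for such an association resource (all names and URLs invented for illustration):

```python
# The membership is a first-class resource: addressable from either end,
# and with room for its own data about the association.
membership = {
    "href": "https://api.example.com/groupMemberships/m42",
    "account": {"href": "https://api.example.com/accounts/a1b2c3"},
    "group": {"href": "https://api.example.com/groups/g7"},
    # Extra association data fits naturally:
    "createdAt": "2014-08-01T12:00:00Z",
    "addedBy": {"href": "https://api.example.com/accounts/admin1"},
}
```

Had the membership been a bare array of IDs inside the group, the creation date and responsible user would have nowhere to live.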
- Error handling – credit to Twilio.
```
{
  "code": 40924, // because HTTP has too few error codes
  // Ready-to-use user-friendly message:
  "message": "A directory named Avengers already exists",
  // Dev-friendly message, can be a stacktrace etc.:
  "developerMessage": "A directory named Avengers already
    exists. If you have a stale local cache, please expire
    it now."
}
```
- Security – many interesting points here, but one particularly valuable. Authorize on content, not URL. URLs can change and may diverge from security configuration…
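A minimal sketch of that idea, assuming a made-up ownership field: the decision looks at data recorded on the loaded resource, so URL renames or aliases can’t bypass it.

```python
def can_modify(user, resource):
    """Authorize against the resource's own data, not the URL it was fetched from.

    Toy rule: the recorded owner, or anyone with the admin role, may modify.
    """
    return user["id"] == resource["ownerId"] or "admin" in user["roles"]
```

If /dirs/42 later grows an alias /directories/42, nothing here needs to change.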
It’s too bad that the presenter doesn’t really leave the happy CRUD path. I’d really like to hear his recommendations or examples for more action-oriented use cases (validate or reset a user password, approve an order, what have you).
The entire presentation is a bit longer than these notes. It talks quite a lot about why you would want to use REST and JSON, as well as about security, caching and other issues.
I recently saw a great presentation by Joe Armstrong called “Systems that run forever self-heal and scale”. Joe Armstrong is the inventor of Erlang and he does mention Erlang quite a lot, but the principles are very much universal and applicable with other languages and tools.
The talk is well worth watching, but here are a few quick notes for a busy reader or my future self.
If you want to run forever, you have to have more than one instance of everything. If anything is unique, then as soon as that service or machine goes down your system goes down. This may be due to unplanned outage or routine software update. Obvious but still pretty hard.
There are two ways to design systems: scaling up or scaling down. If you want a system for 1,000 users, you can start with design for 10 users and expand it, or start with 1,000,000 users and scale it down. You will get different design for your 1,000 users depending on where you start.
The hardest part is distributing data in a consistent, durable manner. Don’t even try to do it yourself, use known algorithms, libraries and products.
Data is sacred, pay attention to it. Web services and such frameworks? Whatever, anyone can write those.
Distributing computations is much easier. They can be performed anywhere, resumed or retried after a failure etc. The talk has some more hints on how to do it.
Six rules of a reliable system
Isolation – when one process crashes, it should not crash others. This naturally leads to better fault-tolerance, scalability, reliability, testability and comprehensibility. It also makes code upgrades much easier.
Concurrency – pretty obvious: you need more than one computer to make a non-stop system, and that automatically means they will operate concurrently and be distributed.
Failure detection – you can’t fix it if you can’t detect it. Detection has to work across machine and process boundaries, because an entire machine or process can fail. You can’t heal yourself when you have a heart attack; the healing has to come from an external force.
It implies asynchronous communication and message-driven model.
Interesting idea: supervision trees. Supervisors on higher levels of the tree, workers in leaves.
Fault identification – when it fails, you also need to know why it failed.
Live code upgrade – an obvious must-have for zero downtime. Once you start the system, never stop it.
Stable storage – store things forever in multiple copies, distributed across many machines and places etc.
With proper stable storage you don’t need backups. Snapshots, yes, but not backups.
Others: fail fast, fail early, let it crash. Don’t swallow errors and don’t continue unless you really know what you’re doing. Better to crash and let the higher-level process decide how to deal with the illegal state.
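The supervision-tree idea can be sketched in miniature. This is an assumption-laden toy, nothing like Erlang/OTP’s real supervisors, but it shows the shape: the worker is free to crash, and the supervisor restarts it from a clean state a bounded number of times.

```python
def supervise(worker, max_restarts=3):
    """One-for-one supervision in miniature: restart a crashed worker.

    Real supervisors sit in a tree and apply restart strategies; this toy
    just re-runs a callable, keeping the crash handling out of the worker.
    """
    crashes = []
    for _ in range(max_restarts + 1):
        try:
            return worker()
        except Exception as exc:   # "let it crash", then restart cleanly
            crashes.append(exc)
    raise RuntimeError(f"worker failed {len(crashes)} times, giving up")
```

Note the division of labor: the worker contains no error-handling policy at all; deciding what a crash means lives one level up.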
Actor model in Erlang
We’re used to two notions of running things concurrently: processes and threads. The difference? Processes are isolated and live in different places in memory, so one can’t screw up the other. Threads can.
Answer from Erlang: Actors. They are isolated processes, but they’re not the heavy operating system processes. They all live in the Erlang VM, rely on it for scheduling etc. They’re very light and you can easily run thousands of them on a computer.
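To make the idea concrete outside Erlang, here is a toy actor built on a thread and a queue. It is only an analogy (Erlang processes are far lighter and truly isolated), but it shows the core contract: private state, a mailbox, and communication only by messages.

```python
import queue
import threading

class Actor:
    """Toy actor: a private mailbox drained by a single thread.

    Nothing else touches the handler's state, so there is no shared-memory
    interference between actors; interaction happens only via send().
    """

    def __init__(self, handler):
        self.mailbox = queue.Queue()
        self.handler = handler
        self.thread = threading.Thread(target=self._loop, daemon=True)
        self.thread.start()

    def send(self, msg):
        self.mailbox.put(msg)          # asynchronous: never blocks on the receiver

    def _loop(self):
        while True:
            msg = self.mailbox.get()
            if msg is None:            # None is our shutdown signal
                break
            self.handler(msg)

    def stop(self):
        self.mailbox.put(None)
        self.thread.join()
```

Each message is processed sequentially inside the actor, which is what makes its state safe without locks.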
Much of this is very natural in functional programming. Perhaps that’s part of what makes functional programming so popular nowadays – in this paradigm it’s so much easier to write reliable, fault-tolerant, scalable and comprehensible systems.