A while ago I wrote a post on Learning to Fail inspired largely by Michael T. Nygard’s book titled “Release It”. Now it’s time to review the book itself.
As the sub-title says, the book is all about designing and deploying production-ready software. It opens with a great introduction on why it really matters: Because software often is critical to business. Because its reliability and performance is really our job and matter of professionalism. Finally (if that’s not enough), because its behavior in production will have huge impact on our quality of life as well – matter of choosing between panic attacks and phone ringing at 4 AM, or software Just Working by itself, letting you enjoy healthy life and doing more fun stuff at work. That’s the center of mass here, by the way: More on development and operations, less on management and business.
The book is divided into four main areas. Each starts with a bit of theoretical introduction and/or an anecdote, followed by discussion of concrete phenomena, problems and solutions. Even though it might appear as a collection of patterns and antipatterns, it’s much more than that. Patterns and antipatterns are just a form, but it’s really about setting the focus for a few pages and naming the problem. Anyway, the “pattern” and “antipattern” concept is gone by the middle of the book.
The first part talks about stability, and how it’s impacted by error propagation, lack of timeouts, all kinds of poor error handling, weaker links etc. Then it shows solutions: How to stop errors from propagating. How to be paranoid, expect failure in each integration point (with 3rd party and not), and deal with them. How to fail fast. And so on.
The second part talks about capacity: Dealing with load, understanding constraints and making predictions. Impact from seasonal phenomena or ad campains. Strange and not obvious usage patterns – hitting “refresh” button, web scrapers etc. Finally, dealing with those issues with proper use of caching, pooling, precomputing content and tuning garbage collection.
The third part is a bag with all kinds of design issues: networking, security, availability (understanding and defining requirements, followed by load balancing and clustering), testing and administration.
The last part is all about operations: logging, monitoring, transparency, releasing, that kind of stuff. How to organize it so that routing maintenance will be less pain, monitor will let us detect issues early, and finally after or during an issue we will have enough information to diagnose it.
Some problems are discussed from bird’s eye view. Most problems are more down-to-earth, providing detailed discussion of an issue with a sketch for solution with its weak and strong points. Finally, when applicable, author rolls up his sleeves and is ready to talk about concrete code, SQL, heapdumps, scripting etc.
The book is actually full of real war stories, anecdotes, code samples, tool descriptions, case studies, and all kinds of concrete content. There are a few larger stories that go on like this: On this project the team did this, this and that in order to migate such and such risks. When marketing sent an advert, or when the system was launched, or during routine maintenance, this and this broke and started causing problems. We did heapdumps, monitored traffic and contents, read or decompiled the code etc. and discovered problems there and there. Finally, we solved them with this and that. And here comes the detailed list of trouble spots and ways to mitigate them. It’s really a complete view – from business perspective and needs, down to nitpicking about particular piece of code or discussing popular tools.
Apart from being a great collection of real problems and tricks, there is one longer lasting, recurring aspect that may be the most valuable lesson here. Michael T. Nygard regularly shows (and makes you feel it deep in your guts, especially if you did some maintenance in production) that you really should be expecting failure everywhere and every time. You should try and predict, and mitigate, as many issues as possible, as early as possible. You should be paranoid. More than that, embrace the fact that you will fail to predict everything, and design so that even random unpredictable failures won’t take you down and may be easier to solve.
All the time it’s very concrete and complete. It also feels very professional, genuine and even inspiring.