Planning for and Responding to Software Failures

You’ve probably heard Murphy’s law: “Whatever can go wrong, will go wrong.” That sounds hyperbolic, perhaps because it’s a watered-down version of the actual quote: “Things will go wrong in any given situation, if you give them a chance.” That sounds about right!

With proper testing, simulation, and release procedures, your team should be able to avoid most common software system failures. However, despite your best efforts, something important is going to break, and it’s probably going to happen at the worst possible time.

There are plenty of excellent resources available on how to engineer stable, fault-tolerant, resilient software systems (like Google’s awesome free SRE book). This article will focus more on the human side of things, including how to strengthen your development team by helping them learn from failures.

There are a multitude of ways in which software systems can fail. Teams that are scaling up and are resource-constrained often take shortcuts, which may come back to bite them when volume suddenly increases. On teams that grow too quickly, developers may unknowingly break each other’s code unless their testing is rock solid. Nobody wants to write buggy code or build brittle systems, but in the never-ending balancing act that is agile development, mistakes inevitably get made.

Anticipate Failures with a Pre-Mortem

The best defense can sometimes be a thorough analysis of your weaknesses along with a review of your release plan. A technique called the “pre-mortem” can be very effective at uncovering potential failures before you’ve even pushed the code. To hold a pre-mortem, gather your team in a room and block off at least thirty minutes. You’ll need to set the stage with an introduction to get everyone thinking in the right way. The prompt goes like this:

“I want everyone to imagine that we’re in the near future: the release we had planned went out, and it was a complete and total disaster. Our customers are furious, we’re losing tons of money, and our CEO is pulling her hair out. Now, what went wrong?”

Basically, the point is to give your team permission to be as cynical and paranoid as possible, so they can imagine anything and everything that could go wrong. It can be a little uncomfortable at first, but once ideas start to fly you can uncover some really useful information. Make sure to write everything down and send out a summary of the findings to the team afterward.

How to Lead During an Outage

One quality of an excellent engineering manager is the ability to act quickly, decisively, and without blame when the shit hits the fan. It can be terrifying when things break and everyone around you starts to panic. Your job is to get to the root of the problem and communicate a plan of action, ASAP.

Resist the urge to point fingers or say “I knew we shouldn’t have deployed to production at four p.m. on a Friday!” Instead, spring into action with these steps:

  • Survey the damage: Can you figure out what part of your stack is failing? If not, what steps can you take to dig in and look for the source of the errors? It’s very helpful to split up the work here and have one person look at server logs, another at the database, etc. Don’t be over-reliant on anecdotal accounts from users or other departments. Try to reproduce the errors for yourself.
  • Estimate the impact: Is everything completely on fire, or is it just that one dashboard the Ops guy is screaming about? Issues that affect your customers or cost you lots of money should take precedence over internal tools. Calibrate the urgency of your communication with your team based on the scale of the outage.
  • Communicate: By now, if it’s a serious outage, a lot of your coworkers will know about it. They’ll probably be rightfully freaked out, given that they have no context and little ability to help. Send out a quick message, or, for more complex situations, quickly draft an email explaining the damage and impact, and letting everyone know that you’re on it and that things will be resolved ASAP. And, as you work on resolving the outage, keep your company in the loop.
  • Strategize a plan of attack: What is the shortest path you can take to getting things back online? It’s fine for this to involve a hack or two. This is not the time or place to try to refactor anything. Brainstorm, pick the best ideas, and delegate.
  • Heads down and fix it: Set your ego aside and get it done.
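The “survey the damage” step often starts with log triage. Here’s a minimal sketch of that idea; the log format and component names are invented for illustration, so adapt the parsing to whatever your systems actually emit:

```python
# Tally ERROR entries per component so you can see where to dig first.
# Assumes a hypothetical log format: "<timestamp> <LEVEL> <component> <message...>"
from collections import Counter

def error_counts_by_component(log_lines):
    """Return a Counter mapping component name -> number of ERROR lines."""
    counts = Counter()
    for line in log_lines:
        parts = line.split()
        if len(parts) >= 3 and parts[1] == "ERROR":
            counts[parts[2]] += 1
    return counts

logs = [
    "2024-01-05T16:02:11 ERROR payments timeout talking to db",
    "2024-01-05T16:02:12 INFO  web request served",
    "2024-01-05T16:02:13 ERROR payments timeout talking to db",
    "2024-01-05T16:02:14 ERROR search index unavailable",
]

# The worst-hit component is a good place to start digging.
print(error_counts_by_component(logs).most_common(1))  # → [('payments', 2)]
```

A crude tally like this won’t find root causes, but it turns “everything is broken” into “payments is throwing the most errors,” which is exactly the kind of concrete starting point a panicked team needs.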

Learn from Failures

When the dust has cleared and things are back to operating at an acceptable level, you can breathe a sigh of relief. Make sure to thank everyone who helped out, and recognize anyone who did something particularly heroic or creative. Draft another message to your company explaining the cause of the outage and the steps that were taken to resolve it. If it was a really emotionally draining outage (for example, late at night or long-lasting), you can wait until the next day to meet with your team and let everyone sleep on it.

The final step is to have a calm, rational discussion of what went wrong. Again, avoid the urge to blame individuals for their failures, and make sure your team follows suit. You can’t undo what happened, even if your outage cost you $66k per minute, but this time, when the trauma is still fresh, is the perfect opportunity to put better processes in place so that it doesn’t happen again.

Whether you hold a formal post-mortem or just have a quick sync, the key here is to determine which concrete steps you can take in the short term to reduce the chance of something similar happening again.

It can be tempting to say “This whole system needs to be rewritten from the ground up,” but that won’t really get you anywhere. A better action item would be “Research a better database monitoring solution next sprint.” If you identify some changes that can be made right away, it may be necessary to adjust your current sprint timeline to accommodate the new work. That’s OK! Stability and resilience work is critical and can take the place of existing feature work if the latter isn’t super time sensitive.

Software failures are inevitable, and despite your best efforts, things will go wrong. As a manager, the way you help your team prepare for failures, lead them when things go wrong, and learn from failures will set the tone for how they operate under pressure.

The hottest fires make the hardest steel, and a team that can learn to recover gracefully from a bad outage is basically unstoppable.