At Statuspage, we believe it’s never too early for a team to start thinking about an incident communications plan. Waiting until your first big incident strikes is far too late. Unplanned downtime can cause customer churn and unmanageable inbound support volume. Just one hour of unplanned downtime can cost organizations more than $100,000—and often much more—according to the latest annual downtime survey from Information Technology Intelligence Consulting Research.
Some downtime is inevitable; even massive organizations experience outages from time to time. The good news is that the harm from downtime can be mitigated by deploying reassuring context and information in a timely fashion. You may hope to never need an incident communications plan but, as any good Site Reliability Engineer (SRE) will tell you, hope is not a strategy.
Mapping out your team’s first incident communications strategy doesn’t have to be overly complex or resource-draining. In fact, you can accomplish it fairly quickly using these four steps:
Sometimes it’s hard to know what exactly to label as an “incident.” Here’s a set of guidelines Google SREs use, where if any one of the following is true the event is considered an incident:
Feel free to adopt these exact guidelines, adjust them, or write your own. “If any one of the following is true” is a good format. (Another helpful resource for mapping incident severity is this Severity Definitions guide from VMware.)
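To make the “if any one of the following is true” format concrete, here is a minimal sketch of how it could be encoded. The criteria below are illustrative placeholders, not Google’s actual guidelines; substitute your own.

```python
# Hypothetical sketch: the criteria below are placeholders, not an official
# list. The point is the "incident if ANY criterion is true" format.

def is_incident(event: dict) -> bool:
    """Treat the event as an incident if any one criterion is true."""
    criteria = [
        event.get("customer_visible", False),           # users can see the impact
        event.get("data_at_risk", False),               # data loss or corruption possible
        event.get("needs_second_team", False),          # one team alone can't fix it
        event.get("unresolved_after_minutes", 0) > 60,  # dragging on too long
    ]
    return any(criteria)

print(is_incident({"customer_visible": True}))           # True  -> open an incident
print(is_incident({"unresolved_after_minutes": 30}))     # False -> keep monitoring
```

The `any()` structure mirrors the guideline exactly: a single true criterion is enough, so there is no debate in the heat of the moment about whether “enough” boxes are ticked.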
A note on playing it safe: in our experience, it’s better to overcommunicate in situations where you’re uncertain. The minor inconvenience of closing the loop on a suspected incident that never materialized is far outweighed by the pain of playing catch-up on incident comms hours into a real incident.
“I’ll just fix this quickly before anyone notices,” is a slippery slope. You might gamble and win the first time you try that, but play the game enough and eventually you’ll lose.
Define key roles and expectations for incident responders. Clear labels and expectations can prevent a lot of damage in the heat of an incident. While large teams and complex SRE organizations have a web of roles and responsibilities, we see two roles as a good starting point.
The incident commander is in charge of the incident response, making sure everyone is working toward resolution and following through on their tasks. They also are in charge of setting up any communications and documentation channels for the incident. That could be chat rooms, shared pages for documenting the incident, and even physical spaces in the office. This person also drives the post-incident review.
The communicator is in charge of translating the technical information into customer communications and getting those communications out via the right channels. They also monitor incoming customer communications and notify the incident commander if new groups of customers become impacted. After the incident, they ensure the post-mortem gets sent out.
Our recommendation: make it clear from the beginning who has what role in an incident. Even if these people have the bandwidth to help with other areas of the incident, they should respond to these primary objectives first and delegate other tasks where necessary.
With a lean team, any time saved during an incident means a lot. Figuring out the right way to wordsmith an announcement can take up precious time in the heat of an incident.
Decide on boilerplate language ahead of time and save it in a template somewhere. Use it to plug in the relevant details during an incident when you need it.
Here is one incident template we use for our own status page:
"The site is currently experiencing a higher than normal amount of load, and may be causing pages to be slow or unresponsive. We're investigating the cause and will provide an update as soon as possible.”
This language is very simple and generic, and can be deployed as-is in a lot of cases where this is all we know. We can also amend the language to add more relevant details if we have them. For example:
“The site is currently experiencing a higher than normal amount of load due to an incident with one of our larger customers. This is causing about 50% of pages to be unresponsive. We're investigating the cause and will provide an update as soon as possible.”
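The plug-in-the-details approach above can be captured in a simple format string. This is just an illustrative sketch; the field names are ours, not part of any Statuspage template format.

```python
# Illustrative sketch of a reusable incident message template.
# The placeholder names ("symptom", "cause") are hypothetical.
TEMPLATE = (
    "The site is currently experiencing {symptom}{cause}. "
    "We're investigating the cause and will provide an update "
    "as soon as possible."
)

# Generic version, deployable as-is when that's all we know:
print(TEMPLATE.format(
    symptom="a higher than normal amount of load",
    cause="",
))

# Amended with relevant details when we have them:
print(TEMPLATE.format(
    symptom="a higher than normal amount of load",
    cause=" due to an incident with one of our larger customers",
))
```

Keeping the template in code (or in a shared doc) means the communicator only fills in facts during an incident instead of wordsmithing from scratch.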
You should also define your communications channels during an incident. While we obviously recommend Statuspage, there are a lot of tools you can use: Twitter, email, and company blog, as examples. Just make sure you’re clear where you will be posting messages.
Once the incident begins, we recommend keeping these three “golden rules” in mind.
It’s important to communicate as soon as there is any sign that the incident is impacting customers. Get a message posted as early as possible. It doesn’t have to be perfect. This message serves to reassure users that you’re aware of the issue and actively looking into it. This will also slow down the flood of support tickets and inbound messaging you’re sure to receive during incidents.
When you’re heads-down working on an incident, it can be easy to let regular updates slide. But these long gaps between updates can cause uncertainty and anxiety for your customers. They can start to expect the worst. Even if you’re just updating to say that you’re still investigating the matter, that’s better than no communication. Bonus points if you give an estimate on when next comms will be (and stick to it).
Here’s an example from a 2016 HipChat incident.
In your messaging during the incident, be as precise as you can be without guessing or giving non-committal answers.
Instead of: “We think we know what’s going on but we need more time.”
Say: “We’re still working to verify the root cause.”
Instead of: “The problem seems to be database related.”
Say: “We’re continuing to investigate the problem.”
At first glance this second example may seem counterintuitive. Why leave out the fact that the issue could be database related? Because you aren’t sure yet. Avoid hedging words like “we think.” Don’t say you “think” you found the root cause. Either you have actually found the cause or you haven’t.
Once you’ve confirmed the cause, then clearly state as much detail as you’re able to.
“We’ve identified a corruption with our database related to our last deploy. We are currently rolling back that deploy and monitoring results.”
Some of the biggest opportunities for your team come in the moments after the dust settles from an incident. Your team ideally will run a Post Incident Review session to unpack what happened on the technical side. It’s also a great time to build customer trust by letting them know that you’re taking the incident seriously and taking steps to ensure it doesn’t happen again.
An incident post-mortem is meant to be written after the incident and give a big picture update of what happened, how it happened, and what steps the team is taking to ensure it isn’t repeated. Here are our post-mortem rules.
Apologize for the inconvenience, thank customers for their patience, and ensure you’re working on a fix.
We see it all the time: teams depersonalize their writing in an effort to seem professional or official. This leads to a cold, distant tone in post-mortems that doesn’t build trust.
Use active voice and “we” pronouns to tell your story. Steer away from words that are overly academic or corporate sounding when simple ones will do.
Instead of: “Remediation applications on the new load balancer configurations are finalized.”
Say: “We’ve completed the configuration on the new load balancer.”
People have a good sense for when you’re using a lot of words but not really saying anything. Details are the way to keep your post-mortem from sounding like a lot of hot air.
Here’s an example from a post-mortem Facebook engineers posted after a 2010 incident.
Consider this paragraph:
“Today we made a change to the persistent copy of a configuration value that was interpreted as invalid. This meant that every single client saw the invalid value and attempted to fix it. Because the fix involves making a query to a cluster of databases, that cluster was quickly overwhelmed by hundreds of thousands of queries a second.”
That’s likely a more technical explanation than most readers will need. The ones who want this level of detail will appreciate it. The ones who don’t will at least recognize that you’re going above and beyond to explain what happened. A lot of teams worry about being too technical in their messaging and wind up sending watered-down communications instead. Opt for specific details.
The post-mortem is your chance to have the last word in an incident. Leave the reader with a sense of trust and confidence by laying out clearly what you’re doing to keep this from happening again.
“In the process of resolving the incident, we replaced the original redis cluster that triggered the incident. The incorrect configuration for redis-master was identified and corrected. As a further preventative measure, Redis restarts on redis-master are disabled and future redis-master recoveries will be accomplished by pivoting a slave.
The simultaneous loss of in-flight balance data and the ability to update balances also exposed a critical flaw in our auto-recharge system. It failed dangerously, exposing customer accounts to incorrect charges and suspensions. We are now introducing robust fail-safes, so that if billing balances don’t exist or cannot be written, the system will not suspend accounts or charge credit cards. Finally, we will be updating the billing system to validate against our double-bookkeeping databases in real-time.”
Notice how specific this is with outlining what went wrong and exactly what the team is putting in place to keep the problem from repeating.
Even though users today expect 24/7 services that are always up, people are tolerant of outages. We’ve heard a lot of stories about outages over the years at Statuspage, and nobody ever went out of business by being too transparent or communicative during an incident. Consider the kind of information and transparency you’d like to receive from the products and vendors you use, and treat your users the way you’d like to be treated.