On-calls and Incident Response
Before starting in my current role, I worked on design systems for years. One of the perks of that job was that we didn’t have to staff a 24/7 on-call rotation. If we released a version with issues, the product team would simply roll back the changes so we could patch them the next day.
Today, I am still leading platform teams, except with one significant change in scope: we have Tier 0 production systems that need a 24/7 on-call rotation. The transition back to operating mission-critical systems led me to summarize my thoughts on what makes or breaks engineering on-calls, for anyone finding themselves facing these same challenges.
One of the main goals of staffing your on-call rotation is to ensure customer satisfaction. It’s not just a means to make sure services work, but to make sure customers can use your product effectively.
This is where Service Level Objectives (SLOs) come in. SLOs are built from Service Level Indicators (SLIs) and their associated goals. SLIs measure the level of service provided by any customer-facing operation. For most services, the following SLIs might serve as inspiration:
- Latency SLI: the percentage of requests that are handled faster than a given threshold.
- Availability SLI: the percentage of requests that returned a successful response.
You can turn the SLIs listed above into SLOs by assigning a goal for each of them:
- Latency SLO: 95% of the requests are completed in less than 200ms.
- Availability SLO: 99.99% of the requests return a successful response.
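Both SLIs can be computed directly from request logs. Here is a minimal sketch; the request records and the 200ms threshold are illustrative, not from any particular monitoring system:

```python
# Hypothetical request records: (duration_ms, status_code) pairs
# as you might extract them from a load balancer's access log.
requests = [(120, 200), (180, 200), (250, 200), (90, 500), (150, 200)]

LATENCY_THRESHOLD_MS = 200  # assumed threshold from the latency SLO

# Latency SLI: percentage of requests handled faster than the threshold.
latency_sli = 100 * sum(1 for ms, _ in requests if ms < LATENCY_THRESHOLD_MS) / len(requests)

# Availability SLI: percentage of requests that returned a successful
# (non-5xx) response.
availability_sli = 100 * sum(1 for _, code in requests if code < 500) / len(requests)

print(f"latency SLI: {latency_sli:.1f}%")        # 80.0%
print(f"availability SLI: {availability_sli:.1f}%")  # 80.0%
```

With the SLOs above, this sample window would be failing the 95% latency target and the 99.99% availability target, which is exactly the signal an alert should fire on.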
Defining SLOs is a collaborative process between your product counterparts and your engineering team. This is an exercise to complete early, so your services have an error budget-based approach rather than the unspoken goal of 100% availability. This allows the engineering team to make design decisions based on their error targets.
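An availability SLO translates directly into an error budget: the share of requests (or time) that is allowed to fail. A quick back-of-the-envelope calculation, using the hypothetical 99.99% target from above over a 30-day window:

```python
slo = 99.99                      # availability SLO, in percent
error_budget = 100 - slo         # 0.01% of the window may be "bad"

# Expressed as allowed downtime over a 30-day window:
minutes_in_window = 30 * 24 * 60             # 43,200 minutes
allowed_downtime_min = minutes_in_window * (error_budget / 100)

print(f"{allowed_downtime_min:.1f} minutes of downtime per 30 days")  # 4.3 minutes
```

Seeing the budget as "about four minutes a month" makes the design trade-off concrete: every extra nine multiplies the engineering cost, which is why agreeing on the target with product early matters.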
Once you’ve defined the goals of your on-call rotation, it’s time to staff it. Most articles and books on the topic recommend at least eight engineers for a two-tiered, 24/7 on-call rotation, since that ensures no engineer remains on-call for more than two weeks in any given two-month period.
A two-tiered, 24/7 on-call rotation is when an on-call shift has a primary and a secondary responder. If the primary doesn’t acknowledge the pager after a certain amount of time, the secondary responder is paged automatically. Most on-call rotations implement a two-tiered system for redundancy.
I prefer to have thirteen engineers on any given on-call rotation, so each individual serves as primary and as secondary responder once per quarter. I avoid adding more to the rotation so the engineers can keep their services in context.
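With thirteen engineers and week-long shifts, a simple offset scheme yields exactly that cadence: each engineer is primary once and secondary once per 13-week quarter. A sketch, with placeholder names:

```python
# Placeholder roster of thirteen engineers.
engineers = [f"eng-{i}" for i in range(13)]

def shift(week: int) -> tuple[str, str]:
    """Return the (primary, secondary) pair for a given week number.

    The secondary trails the primary by one slot, so over 13 weeks
    every engineer is primary exactly once and secondary exactly once.
    """
    primary = engineers[week % len(engineers)]
    secondary = engineers[(week + 1) % len(engineers)]
    return primary, secondary

for week in range(3):
    print(week, shift(week))
```

Real schedulers (PagerDuty, Opsgenie, and the like) handle swaps, holidays, and time zones for you; the point of the sketch is just the arithmetic behind the "once a quarter" cadence.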
Shared on-call rotations are when engineers are pooled from multiple teams to staff an on-call rotation for a single service. While this can bridge staffing gaps and create a safety net for services in the short term, it is not a feasible strategy for long-term success. Engineers who don’t work on the service daily might miss out on changes required to deliver quality service, which can lead to escalations to the engineers who understand the services better, or at worst, longer service restoration times.
Before an employee is fit to be an on-call responder, they should complete on-call training and familiarize themselves with your service’s playbook. This is a critical step, since you don’t want on-call engineers figuring out their first mitigation steps during a live incident.
Playbooks contain essential information for on-call scenarios, such as:
- Links to monitoring systems with crucial information about service health.
- A short explanation of common error scenarios and how to mitigate them. This includes information like adding extra capacity, blocking and rate-limiting traffic, or rolling back a bad deployment.
- Service architecture and dependencies on other services, with contact information to reach people from those teams, if necessary.
- Protocols to follow during an incident. I’ve added more about this subject in the incident response section below.
As playbooks can become obsolete with each new deployment, the on-call engineer is responsible for keeping them updated.
I recommend practicing on-call training with a cadence, even if no new on-call engineer joins the rotation. This ensures that everyone’s understanding of the systems they are responsible for is up-to-date.
Along with building your responders’ understanding of the playbook, on-call training also helps with:
- Incident dry runs: Turn off certain parts of the system in a sandbox environment, observe which alerts are triggered, and see if your mitigation steps from the playbook help. Update the playbook as needed.
- Pager setup checks: Try changing your phone’s clock to a different time and sending a test page to make sure it’s not blocked. I’ve seen a few pages disappear this way, so it’s important to check.
An incident is an event that can lead to a negative impact on customers (or one that already has). To scope an incident, you should define its severity level, which determines how you prioritize mitigation.
To measure severity levels, I recommend starting with 0 as the most impactful of incidents, and counting upward from there. Priorities will follow your definitions of severity levels. Using the example below, S0 translates into a priority where impacted teams need to stop everything and start working on the incident’s resolution immediately. S1 will be similar, though depending on the size of the impacted population, you might have more time to address it. S2 may still require attention, but it becomes a “let’s take a look tomorrow morning” situation instead of preparation for an all-nighter. The following list might serve as inspiration:
- S0: An outage that impacts every customer. These are the problems that could have a negative impact on your entire consumer base, so it’s all hands on deck.
- S1: An outage that impacts a subset of customers. Your business might partition customers based on where they live, or which applications they use to access your products and/or services.
- S2: Any incident that needs attention, but doesn’t impact customers. These may be scenarios which could potentially become worse, but allow for a little time to correct before events affect your bottom line.
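Encoding those definitions as a small triage helper means every responder applies the same rules under pressure. A sketch; the two boolean inputs are a deliberate simplification of a real blast-radius assessment:

```python
from enum import IntEnum

class Severity(IntEnum):
    S0 = 0  # outage impacting every customer: all hands on deck
    S1 = 1  # outage impacting a subset of customers
    S2 = 2  # needs attention, but no customer impact yet

def triage(customer_impact: bool, all_customers: bool) -> Severity:
    """Map an incident's blast radius onto the severity scale above."""
    if customer_impact and all_customers:
        return Severity.S0
    if customer_impact:
        return Severity.S1
    return Severity.S2

print(triage(customer_impact=True, all_customers=False).name)  # S1
```

Using an `IntEnum` keeps the "lower number means more severe" ordering comparable, so alerting rules can express conditions like `severity <= Severity.S1`.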
During an incident, it’s crucial to have a single person overseeing mitigation steps. This ensures a focused effort to restore site health and clear communication about what to do next.
The technical details of mitigating incidents are outside the scope of this article, but I do want to cover some best practices for communicating during an incident.
First, you must enforce a company-wide standard regarding how to communicate during an incident. Consider creating a public Slack room (or equivalent) for sharing links, images, and additional context, such as the steps you’ve already tried to mitigate an incident. This will create an incident timeline, which serves as a source of truth for the post-mortem. In most situations, it is more effective to use a real-time video app to resolve the incident with direct conversation. The incident commander takes the lead in both scenarios — asking the right questions, and gathering information that will inform the next step.
Second, partner with your marketing/PR team to devise a plan to communicate the problem to customers without incurring liability or speaking off-brand.
After each incident, it is important to use the understanding you’ve gained to prevent the same issues from happening again. One of the best ways to accomplish that is through post-mortems. In business, post-mortem documents and meetings define the incident’s root cause and seek to create action items that address identifiable gaps in service.
A post-mortem document should contain the following sections:
- Timeline of events: This allows everyone reading the post-mortem to follow the original team’s logic.
- Root cause: This defines the root cause of the incident, which can be anything from an employee mistake to a systemic malfunction.
- Action items: These are the action items the team uses to address the incident’s root cause. Each item should have a clear owner and timeline, so future participants know who to contact with any questions.
Once you have a draft of the post-mortem, it’s time to call for a post-mortem meeting (or present it during your company incident reviews, if any). During this meeting, your team should present the post-mortem and answer any additional questions. Focus on identifying any action items that you may have missed.
Once the post-mortem meeting concludes, you can share any findings with the rest of the engineering organization. Sharing this information allows other teams to figure out their own gaps and prioritize accordingly.
The processes outlined above may or may not work for your organization. The only way to assess improvements to your on-call rotation is by capturing metrics with clear definitions. These metrics should focus on two different aspects of on-calls: service health and employee satisfaction.
For service health, start with ‘mean time to detect’ (MTTD) and ‘mean time to restore’ (MTTR). MTTD is the time it takes to discover an incident from the moment it first occurs; MTTR is the time it takes to restore service once the incident is discovered.
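Both metrics fall out of per-incident timestamps, assuming you record when each incident started, was detected, and was restored. The record shape and field names here are illustrative:

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records with start/detect/restore timestamps,
# as you might export them from your incident-tracking tool.
incidents = [
    {"started": datetime(2024, 5, 1, 10, 0),
     "detected": datetime(2024, 5, 1, 10, 8),
     "restored": datetime(2024, 5, 1, 10, 53)},
    {"started": datetime(2024, 5, 9, 2, 30),
     "detected": datetime(2024, 5, 9, 2, 34),
     "restored": datetime(2024, 5, 9, 3, 19)},
]

# MTTD: mean minutes from incident start to detection.
mttd_min = mean((i["detected"] - i["started"]).total_seconds() / 60 for i in incidents)
# MTTR: mean minutes from detection to restoration.
mttr_min = mean((i["restored"] - i["detected"]).total_seconds() / 60 for i in incidents)

print(f"MTTD: {mttd_min:.0f} min, MTTR: {mttr_min:.0f} min")  # MTTD: 6 min, MTTR: 45 min
```

Tracked per quarter, these two numbers tell you whether your alerting (MTTD) or your playbooks and mitigation tooling (MTTR) need the next investment.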
Regarding employee satisfaction, run retros with the on-call participants at least once a quarter. These are how you find the necessary changes to the rotation or its protocols. Ideally, retros create a consistent feedback loop through regular one-on-ones and reporting, so you can start preventing problems instead of reacting to them.
On-call procedures are critical for engineering teams, and they can have a significant impact on the company’s bottom line. A poor on-call operation can quickly become a source of frustration and tension that poisons customers’ impression of your company. The practices outlined above should help you create healthier on-call rotations.
Thanks go to Michael Sitter for upleveling my expertise in incident response!
Originally published at https://nemethgergely.com.