On-call and Incident Response

Service level indicators (SLIs) measure how a service behaves; service level objectives (SLOs) set targets for those measurements. For example:

  • Latency SLI: the percentage of requests that are handled faster than a given threshold.
  • Availability SLI: the percentage of requests that return a successful response.
  • Latency SLO: 95% of requests are completed in less than 200 ms.
  • Availability SLO: 99.99% of requests return a successful response.
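
To make these definitions concrete, here is a minimal TypeScript sketch that computes both SLIs over a batch of request records and checks them against the example SLOs above. The RequestRecord shape, the 200 ms threshold, and the rule that anything below a 500 status counts as a success are illustrative assumptions, not a prescribed implementation.

    // Minimal sketch: compute latency and availability SLIs from raw request records.
    // In practice these numbers usually come from a metrics system rather than raw logs.
    interface RequestRecord {
      durationMs: number;  // how long the request took
      statusCode: number;  // HTTP status returned to the caller
    }

    // Latency SLI: percentage of requests handled faster than the threshold.
    function latencySli(requests: RequestRecord[], thresholdMs: number): number {
      const fastEnough = requests.filter((r) => r.durationMs < thresholdMs).length;
      return (fastEnough / requests.length) * 100;
    }

    // Availability SLI: percentage of requests that returned a successful response
    // (here, anything below 500; an assumption, not a rule).
    function availabilitySli(requests: RequestRecord[]): number {
      const successful = requests.filter((r) => r.statusCode < 500).length;
      return (successful / requests.length) * 100;
    }

    const recentRequests: RequestRecord[] = [
      { durationMs: 120, statusCode: 200 },
      { durationMs: 90, statusCode: 200 },
      { durationMs: 340, statusCode: 503 },
    ];

    console.log(`latency SLO (95% under 200 ms) met: ${latencySli(recentRequests, 200) >= 95}`);
    console.log(`availability SLO (99.99% successful) met: ${availabilitySli(recentRequests) >= 99.99}`);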

On-call Playbooks

Playbooks contain essential information for on-call scenarios, such as:

  • Links to monitoring systems with crucial information about service health.
  • A short explanation of common error scenarios and how to mitigate them, such as adding extra capacity, blocking or rate-limiting traffic, or rolling back a bad deployment.
  • Service architecture and dependencies on other services, with contact information to reach people from those teams, if necessary.
  • Protocols to follow during an incident. I’ve added more about this subject in the incident response section below.
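
One way to keep a playbook actionable is to store each entry as structured data next to the service it covers, so links and mitigation steps get reviewed alongside code changes. The TypeScript sketch below illustrates that idea; the field names, dashboard URL, and contact channel are hypothetical.

    // Hypothetical shape for a single playbook entry; every name below is illustrative.
    interface PlaybookEntry {
      scenario: string;                                      // the error scenario this entry covers
      dashboards: string[];                                  // monitoring links with crucial health information
      mitigations: string[];                                 // ordered steps to try, safest first
      dependencies: { service: string; contact: string }[];  // downstream teams and how to reach them
    }

    const elevatedErrorRate: PlaybookEntry = {
      scenario: "Elevated 5xx rate on the public API",
      dashboards: ["https://grafana.example.com/d/api-health"],
      mitigations: [
        "Add extra capacity if the cause is saturation",
        "Block or rate-limit abusive traffic at the edge",
        "Roll back the latest deployment if errors started right after it",
      ],
      dependencies: [{ service: "payments", contact: "#payments-oncall" }],
    };

    console.log(`Playbook loaded for: ${elevatedErrorRate.scenario}`);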

On-call Training

I recommend running on-call training on a regular cadence, even if no new on-call engineer joins the rotation. This keeps everyone's understanding of the systems they are responsible for up to date. Useful exercises include:

  • Incident dry runs: Turn off certain parts of the system in a sandbox environment, observe which alerts are triggered, and see if your mitigation steps from the playbook help. Update the playbook as needed.
  • Ensure your pager is set up correctly: Try changing your phone's clock to a different time and sending a test page to make sure it isn't blocked. I've seen a few pages disappear this way, so it's important to check.

Incident Severity Levels

Not every incident is equally urgent. Classifying incidents by severity makes it clear how quickly, and with how many people, you need to respond:

  • S0: An outage that impacts every customer. These are the problems that could negatively affect your entire customer base, so it's all hands on deck.
  • S1: An outage that impacts a subset of customers. Your business might partition customers based on where they live, or which applications they use to access your products or services.
  • S2: Any incident that needs attention but doesn't impact customers. These are scenarios that could become worse, but allow a little time to correct before they affect your bottom line.
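
If you want these levels to drive behavior rather than just documentation, one option is to encode them near your paging logic. The TypeScript sketch below assumes a simple mapping from severity to response expectations; the expectations themselves are illustrative, not a standard.

    // The S0-S2 labels follow the list above; the response expectations are assumed examples.
    type Severity = "S0" | "S1" | "S2";

    const responsePolicy: Record<Severity, { impact: string; expectedResponse: string }> = {
      S0: { impact: "every customer", expectedResponse: "all hands on deck, page immediately" },
      S1: { impact: "a subset of customers", expectedResponse: "page the owning team's on-call" },
      S2: { impact: "no customer impact yet", expectedResponse: "handle during working hours, before it gets worse" },
    };

    function describeSeverity(severity: Severity): string {
      const { impact, expectedResponse } = responsePolicy[severity];
      return `Impact: ${impact}. Expected response: ${expectedResponse}.`;
    }

    console.log(describeSeverity("S1"));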

Commanding Incidents

The term incident commander is borrowed from the Incident Command System (ICS), a standardized approach to the command, control, and coordination of emergency response.

Post-mortems

After each incident, it is important to use the understanding you've gained to prevent the same issues from happening again. One of the best ways to do that is through post-mortems. In business, post-mortem documents and meetings define the incident's root cause and create action items that address identifiable gaps in service. A good post-mortem typically includes:

  • Timeline of events: This allows everyone reading the post-mortem to follow the original team's logic.
  • Root cause: This identifies the root cause of the incident, which can be anything from an employee mistake to a systemic malfunction.
  • Action items: The concrete steps the team will take to address the root cause. Each item should have a clear owner and a timeline, so anyone following up later knows who to contact with questions.
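
If you track post-mortems as structured data, for example to report on overdue action items, the three parts above map naturally onto a record like the sketch below. The field names and example values are hypothetical.

    // Illustrative shape mirroring the three post-mortem sections above.
    interface ActionItem {
      description: string;
      owner: string;    // a clear owner, so future readers know who to contact
      dueDate: string;  // the timeline for completing the item
    }

    interface PostMortem {
      timeline: { timestamp: string; event: string }[];  // lets readers follow the original team's logic
      rootCause: string;                                 // from a simple mistake to a systemic malfunction
      actionItems: ActionItem[];                         // each one addresses a gap the incident exposed
    }

    // Made-up example values, purely for illustration.
    const example: PostMortem = {
      timeline: [{ timestamp: "2021-06-01T10:04Z", event: "Error-rate alert fired for the public API" }],
      rootCause: "A configuration change disabled the health-check timeout",
      actionItems: [
        { description: "Add a canary stage for configuration changes", owner: "team-platform", dueDate: "2021-07-01" },
      ],
    };

    console.log(`${example.actionItems.length} action item(s) to follow up on`);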
