How to Design a Good On-Call 🚨
Everything you need to know about rotations, with lessons from Netflix, Dropbox, Intercom, and Google.
On-call is a divisive topic in engineering, and for good reason. People hate being on call because it's stressful and disruptive to their personal lives — even when they don’t actually get paged.
I know this from up close.
As a founder & CTO, I feel I've spent enough time on-call for this life and the next three or four. In the worst cases, it disrupted my sleep and my morale, and left me not wanting to be anywhere close to a computer again.
But it doesn't have to be this way.
If people hate being on call, chances are you are doing it wrong. In the best teams, being on call actually improves the team’s morale. In fact, it can bring several benefits, like:
Strengthening the relationship between engineers and customers
Developing better ownership by engineers
Maintaining better docs
Enforcing good instrumenting / observability
In this article, we will explore the key elements that make an on-call process successful, and we’ll cover how to design a great one. This is drawn from my own experience and that of successful companies like Netflix, Dropbox, Honeycomb, Intercom, and Google.
We will cover:
🏅 Ownership — the (non) difference between engineers and ops people.
📏 Scope — what goes into an on-call shift.
✏️ Designing rotations — everything you should take care of.
📉 Reducing effort — best practices to make things sustainable.
📊 Metrics — how to measure your on-call process.
Let’s dive in!
🏅 Ownership
The foundation for this whole article is that, as an engineer, you should fully own your code. This means your duties do not stop when the code is in production — you are still responsible for it.
This is the essence of DevOps and, in modern engineering, it is simply the most sensible choice.
If you forget about your code right after you deploy it, and pass the torch to ops, you are in for a bad experience: the feedback loop doesn’t work, and devs and ops people will simply hate each other over the long run.
In fact, the very divide between development and ops is blurry, and ultimately wrong. We are all engineers. People just work on different things and will eventually be on call for different things.
It is engineering’s responsibility to be on call and own their code.
It is management’s responsibility to make sure that on call does not suck.
This is a handshake, it goes both ways, and if you do not hold up your end they should quit and leave you.
— Charity Majors, CTO at Honeycomb
So, with good docs and instrumentation, all engineers can be included in on-call rotations, but it’s up to management to create a process that works.
But what exactly do engineers do when on call?
📏 Scope
During an on-call shift, you can get paged anytime by alerts that report some issue in the system. When this happens, you follow a three-step process:
🔍 Root-cause analysis — figure out the issue. Enabled by good alerting and instrumentation.
🔧 Remediation — put the system back up. Enabled by good playbooks.
📋 Follow-up activities — update docs and take care of a few other tasks. Enabled by good process.
Let’s see all three.
1) 🔍 Root-cause analysis
Root-cause analysis is about figuring out what’s wrong. This is made possible by good alerting and instrumentation. In fact, the alerts you receive are of two kinds:
Customer-based — e.g. slow response time.
Engineering-based — e.g. memory full, resource exhaustion, unresponsive machines, etc.
In a perfect world, people would only be paged by customer alerts. In reality, you may end up paging people for customer alerts plus the most serious engineering ones, in case your coverage has gaps and an engineering failure does not also trigger a customer alert.
Andrew Twyman, former Staff Engineer at Dropbox, weighed in on this 👇
We had both.
The customer-facing alerts were the ones on which we defined our SLA/SLO, and had stated policies about response times, incident severities, etc. We also had alerts on things like resources, unresponsive machines, etc. Those were often more informational, and provided secondary signals to help the oncall debug what was going on. E.g. if you're paged for a high overall error rate, and you also get an alert that the number of machines in the pool is lower than normal, it gives you a place to start investigating.Some but not all of our internal alerts were configured to page I could imagine us evolving to a scenario where they didn't, given the argument that if a problem isn't affecting customer-facing metrics it's not important. To do that we'd really need to trust our customer-facing alerts, and we weren't confident that they covered all possible issues. E.g. performance is trickier to alert on than error rates, particularly since many of our requests operated on variable-sized batches. Note that we were a backend team, so our "customer-facing" alerts were about RPC responses, not true "customer experience".
The true customer-facing alerts in the Dropbox backend were similar, though, focused on responses to web requests, plus mobile and desktop client requests. Client-side metrics like crashes might be high-priority, but generally didn't page since you can't roll out a new client version in 5 minutes anyway. Client teams did have "oncall" but it didn't page, unless it was for backend services they owned.
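To make the customer vs. engineering distinction concrete, here is a minimal paging-policy sketch in Python. This is not how Dropbox or anyone else actually wires things up — the Alert class, the should_page helper, and the alert names are all illustrative, and in practice this logic usually lives in your alerting tool's configuration rather than in code.

```python
from dataclasses import dataclass

@dataclass
class Alert:
    name: str
    kind: str      # "customer" (e.g. error rate, latency) or "engineering" (e.g. memory, pool size)
    severity: str  # "critical" or "warning"

# Engineering alerts that still page, because we don't fully trust the
# customer-facing alerts to catch every failure mode (the "gaps" above).
SERIOUS_ENGINEERING = {"instance_pool_shrunk", "disk_almost_full"}

def should_page(alert: Alert) -> bool:
    """Customer-facing alerts always page; engineering alerts page only when
    they are critical AND on the 'serious' list. The rest stay informational."""
    if alert.kind == "customer":
        return True
    return alert.severity == "critical" and alert.name in SERIOUS_ENGINEERING

# A latency SLO breach pages the on-call; a memory warning just shows up on a dashboard.
print(should_page(Alert("p99_latency_slo_burn", "customer", "critical")))  # True
print(should_page(Alert("memory_high", "engineering", "warning")))         # False
```

The useful part of the exercise is deciding, explicitly and in advance, which engineering alerts are allowed to wake someone up — and treating everything else as a debugging signal, not a page.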
2) 🔧 Remediation
Remediation is about putting the system back to an acceptable level of service. This is important to understand because it shouldn’t be confused with fixing bugs.
You can remedy an incident without fixing a bug, and without even knowing what the bug is.
In fact, this is the ideal course of action: you restore the system first, and fix the bug later. It is not always possible, of course. Sometimes, fixing the bug == restoring the system, like with many frontend issues.
But sometimes you can split the two. Think of a memory leak — you may be fine just restarting the instance, while you take more time to figure out where the issue actually is.
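As a sketch of what that split can look like, here is a hypothetical remediation step, assuming a systemd-managed service. The service name and the helper are made up for illustration; the point is only that the restart restores service now, while hunting down the leak happens later, during working hours.

```python
import subprocess

LEAKY_SERVICE = "billing-worker"  # hypothetical service name

def remediate_memory_leak(service: str) -> None:
    # Remediation: restarting clears the leaked memory and restores service right away.
    subprocess.run(["systemctl", "restart", service], check=True)
    # The bug itself is untouched — finding and fixing the leak is a follow-up,
    # not something to do at 3am.
    print(f"Follow-up: open a ticket to track down the leak in {service}")

if __name__ == "__main__":
    remediate_memory_leak(LEAKY_SERVICE)
```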
3) 📋 Follow-up activities
After remediation, you have a number of things to do to leave things in a clean state. These include: