How to Manage On-Call Duties
Mastering the Art of On-Call Management with Some Insights from My Own Experience
I still remember joining Namecheap ten years ago. Together with my small team, after delivering our first cloud-based product, we found ourselves on call every day, including weekends.
As an always-connected nerd who loves action, I found every PagerDuty call an adrenaline rush. However, the schedule soon became unsustainable for the team, which pushed me to look for more effective ways to organize on-call duties.
Today, my department has a fully dedicated SRE team that handles a significant portion of the on-call duties. Looking back, I've learned many valuable lessons over the years, and I'll try to condense them in this article.
Here's what we will cover:
❓ What is on-call, and why should you care about it
🛠️ How to manage on-call duties within your team
💡 Tips coming directly from my experience
Let’s begin!
❓ What is On-Call Duty?
Generally speaking, on-call duty refers to the practice of being available and ready to respond to work-related issues as needed. In the world of system and software engineering, however, this often means being available during emergencies and critical situations.
On-call duty can involve:
👨‍💻 During working hours: a person or a part of the team is designated to respond to critical situations, emergencies, or support requests as they arise during the workday.
🛌 Outside working hours: one or more individuals must be reachable and respond to critical situations within an agreed timeframe outside of regular working hours.
The main purpose of having on-call responsibilities is to ensure that critical systems and services remain operational, and that any critical issues can be addressed promptly, regardless of whether they occur during regular business hours or not.
Benefits and Downsides of On-Call
As you might imagine, especially outside of working hours, there are many downsides to being on-call, but I believe it also comes with certain benefits.
❌ DOWNSIDES
Work-Life Balance: if not managed properly, on-call duties can undermine work-life balance.
Unpredictable Schedule: it's challenging to predict when you'll be paged for an issue, yet you must ensure your availability. This makes it difficult to plan your personal life.
Complex Problems: on-call often involves addressing challenging issues under pressure.
Feeling Isolated: especially in smaller teams, there might be only one designated person for on-call. This can significantly increase stress during critical situations.
Risk of Burnout: the continuous demands of being on-call, coupled with so-called "alert fatigue," can lead to exhaustion.
✅ BENEFITS
Reliable Systems: having people on-call contributes to maintaining stable and trustworthy systems, with a clear benefit for your customers.
Skill Growth: it provides opportunities to learn and solve a wide variety of problems.
Career Boost: it offers chances to demonstrate reliability and expertise, which can enhance career growth.
Extra Compensation: on-call often includes additional compensation or benefits.
Ownership: being on-call can enhance a sense of ownership over the systems or solutions your team has built.
How On-Call Works
In the engineering field, on-call duties can vary widely, but they generally revolve around a few central concepts:
🌀 Rotation: a schedule that determines which team members are on-call at any given time, ensuring that responsibility is shared and that no one person is overwhelmed.
⬆️ Escalation Policies: guidelines that outline how issues are escalated from one level of support to the next if they cannot be resolved within a certain timeframe or if they exceed the current responder's expertise.
⏱ Response Time: the expected time within which an on-call engineer should acknowledge and start addressing an issue. This can vary based on the urgency and nature of the problem.
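To make these three concepts a bit more concrete, here is a minimal Python sketch of a round-robin rotation and a time-based escalation policy. Everything in it is an assumption made up for illustration: the team names, the weekly shift length, the rotation start date, and the 15 and 30 minute escalation thresholds. It is not how PagerDuty or any specific tool models these ideas, just one simple way to reason about them.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class EscalationLevel:
    after_minutes: int  # minutes without acknowledgement before this level is paged
    target: str         # who gets paged at this level


# Hypothetical team and policy, invented for illustration only.
TEAM = ["alice", "bob", "carol"]
ROTATION_START = date(2024, 1, 1)  # a Monday; the first shift belongs to TEAM[0]
SHIFT_DAYS = 7                     # weekly rotation

# Response-time expectations expressed as escalation thresholds (assumed values).
ESCALATION_POLICY = [
    EscalationLevel(after_minutes=0, target="primary on-call"),
    EscalationLevel(after_minutes=15, target="secondary on-call"),
    EscalationLevel(after_minutes=30, target="engineering manager"),
]


def on_call_for(day: date) -> str:
    """Return the engineer on call on `day` under a simple round-robin rotation."""
    if day < ROTATION_START:
        raise ValueError("day is before the rotation started")
    shift_index = (day - ROTATION_START).days // SHIFT_DAYS
    return TEAM[shift_index % len(TEAM)]


def targets_to_page(minutes_unacknowledged: int) -> list[str]:
    """Return every escalation target that should have been paged by now."""
    return [
        level.target
        for level in ESCALATION_POLICY
        if minutes_unacknowledged >= level.after_minutes
    ]
```

With these made-up values, on_call_for(date(2024, 1, 8)) falls in the second weekly shift and returns "bob", and an alert left unacknowledged for 20 minutes would have reached both the primary and the secondary on-call. Real setups usually add overrides, time zones, and per-severity policies on top of this.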
These core elements are influenced by various factors, including:
👥 Team Size: larger teams may have more flexibility in creating rotations and managing on-call duties, while smaller teams might face more challenges in spreading out the workload.
🏗 Team Structure: the way a team is organized (e.g., by function, product, or service) can influence how on-call responsibilities are assigned and managed.
🚨 Criticality of Services: the importance of the services being supported can dictate how strict the on-call policies need to be, particularly in terms of response times and escalation procedures.
💰 Budget: financial resources can affect the ability to provide incentives for on-call duties, invest in supporting tools, and hire additional staff to share the on-call load.
🤝 Team Culture: the values, expectations, and practices within a team or organization can greatly influence how on-call work is perceived and managed, impacting everything from rotation schedules to how incidents are handled during off-hours.
🛠️ How to Create Effective On-Call Practices
As mentioned, on-call duties vary significantly across companies because of the many variables involved. However, in my experience, there are several principles you can follow to establish effective on-call practices.