Nilesh Mahajan

How to bring on-call load under control, and keep it that way?

Updated: Oct 30, 2023



Is your team drowning in on-call alerts, risking customer satisfaction, and struggling with its well-being due to never-ending fire-fighting? Does your team need occasional heroic sprints to just focus on fixing on-call issues? Read on.


If you're responsible for mission-critical software systems, on-call responsibilities are a reality you can't avoid!


While you may want a "zero-incident" culture, completely eliminating alerts is rarely realistic. A rule of thumb is to aim for an average of 1-5 alerts per on-call shift.


The goal is to get to a sustainable on-call state. But how?


In this blog, we'll introduce the Lean AAA methodology to help you achieve just that.



Step 1: Annotate - Collect the Right Set of Data

Reducing the number of alerts your team receives during each shift starts with prioritizing the most critical issues. To do that effectively, you need accurate and relevant data.


This data should answer questions such as whether the alerts were genuine issues or just noise, whether they required immediate action, how they were resolved, and what follow-ups are needed.


Annotating your alerts with this information and consolidating it at the end of every shift is a crucial first step.
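As a minimal sketch of what such an annotation could look like, the Python below records the four questions above for each alert and consolidates them at the end of a shift. The `AlertAnnotation` fields and the `consolidate_shift` helper are hypothetical names chosen for illustration, not part of any particular tool.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List


@dataclass
class AlertAnnotation:
    """One alert plus the on-call engineer's notes (hypothetical schema)."""
    alert_name: str
    fired_at: datetime
    was_real_issue: bool            # genuine issue, or just noise?
    needed_immediate_action: bool   # did it require immediate action?
    resolution: str                 # how it was resolved
    follow_ups: List[str] = field(default_factory=list)  # what is still needed


def consolidate_shift(annotations: List[AlertAnnotation]) -> Dict[str, object]:
    """Roll one shift's annotations up into a small end-of-shift record."""
    return {
        "total": len(annotations),
        "noise": [a.alert_name for a in annotations if not a.was_real_issue],
        "immediate": [a.alert_name for a in annotations
                      if a.needed_immediate_action],
        "follow_ups": [f for a in annotations for f in a.follow_ups],
    }
```

Keeping the record this small is a deliberate choice in the sketch: the easier an annotation is to fill in during a shift, the more likely it is to be filled in at all.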


Step 2: Analyse - Gain Insight from Data

With the right data at your fingertips, you can conduct in-depth analysis and discuss the findings during on-call handoff meetings with your team.


Your discussions can cover questions such as:

- How many alerts were not real issues? Why?

- How many alerts were really actionable for your team? How can such alerts be avoided?

- How many alerts took a long time to resolve? Why?

- What is the status of runbooks?

- What can be done to bring the overall alert count down?

- What are the immediate action items?


This will help you identify both the big rocks and the low-hanging fruit. With this data, you can prioritise efficiently, justify extra bandwidth if required, and predict the outcomes.
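The handoff questions above lend themselves to a few simple counts. The sketch below is one possible way to compute them in Python, assuming each alert is a plain dictionary with the hypothetical keys `name`, `real_issue`, `actionable`, `minutes_to_resolve`, and `has_runbook`. The 30-minute "slow" threshold is an arbitrary assumption, and the upper bound of 5 comes from the rule of thumb mentioned earlier.

```python
from collections import Counter


def summarize_shift(alerts, slow_minutes=30, max_alerts=5):
    """Answer the handoff questions with simple counts (illustrative only)."""
    noise = [a for a in alerts if not a["real_issue"]]
    actionable = [a for a in alerts if a["actionable"]]
    slow = [a for a in alerts if a["minutes_to_resolve"] > slow_minutes]
    missing_runbook = [a for a in alerts if not a["has_runbook"]]
    return {
        "total": len(alerts),
        "noise": len(noise),
        "actionable": len(actionable),
        "slow": len(slow),
        "missing_runbook": len(missing_runbook),
        # The alert names that most often turned out to be noise.
        "noisiest": Counter(a["name"] for a in noise).most_common(3),
        # Single-shift check against the 1-5 alerts rule of thumb.
        "within_target": len(alerts) <= max_alerts,
    }
```

The numbers only start the conversation: the "why" behind each count still has to come from the discussion in the handoff meeting.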


Step 3: Act - Implement Improvements

Once you have your action items figured out, it's time to implement them, and then repeat the cycle every shift.


Wrap-up


We have personally applied this method as engineers and in leadership roles, at large tech companies as well as start-ups, and have clearly seen the result in reduced on-call stress. Our further research across 40+ companies, large and small, suggests that this is an effective model and that engineering teams want it, provided they have the right tooling. By following this simple Lean AAA methodology, you'll make your on-call process significantly more manageable, improving both your team's morale and your system's reliability. What you learn from these discussions will also teach your team how to get better at operations and build more sustainable systems.


To streamline the implementation of the Lean AAA methodology and achieve results faster, we have now built Next9, our Slack bot designed to eliminate friction and automate this entire process.


The Next9 bot saves your team valuable time that would otherwise be spent on data gathering and manual report creation. It also eliminates the need for costly PagerDuty packages.


Please give it a try and feel free to reach out at info@next9.ai if you have any questions or feedback.



