Your on-call doesn’t have to suck!
- Sudeep Chaudhari
- Jun 21, 2023
- 6 min read

In today's fast-paced and interconnected world, where technology drives businesses forward, maintaining reliable and responsive software services is paramount. That's why nearly every engineering team has an on-call process.
The on-call process gives the team a framework for how to react when a system fails and how to proactively prevent future failures. At the same time, it places an additional burden on engineers and engineering leads, who must stay on top of everything under extremely tight timelines. It can leave them feeling unproductive and constantly stressed, spilling over into their personal lives outside working hours.
The Problem
As part of the on-call process, almost every on-call team holds a weekly on-call review meeting. Usually, the on-call engineers are responsible for resolving all the alerts and incidents and for documenting a handoff summary for the next on-call engineer. Engineering leads and managers also often use this information to gauge the overall on-call process and reliability measures.
Let's walk through the following experience:
Imagine your phone rings with an alert at 2am. You log in to your computer to check the alert, only to find a summary with very little information, so you log into another portal for the full details. After forming an initial understanding of the issue, you try to find out whether there is a runbook for this kind of problem; runbooks generally live in documents or a wiki. After a lot of searching, you discover that either there is no runbook or it is out of date. Next, you realise that similar alerts may have been triggered in the past, and you hope the on-call engineers at the time noted how they were resolved. You go through previous alerts one by one and do find similar ones, but none of them record clear mitigation steps. Finally, you start debugging code you have little familiarity with. And all of this is happening at 2am!
Let's go even further. You have found a temporary fix for the issue, but you have questions for the expert engineer who owns that service or that part of the code. You don't know who to reach out to, so you start asking around in wider groups on the internal messenger. Being a good engineer, at the end of your rotation you document the issues and alerts. That takes a lot of time, because you have to manually gather the links for all the issues, consolidate your notes, and aggregate the counts and metrics. You get through the hand-off call and feel relieved about your work! But because the issue happened at an unfriendly hour under stress, you forgot to tell your team to update the runbook, and the command you ran to investigate never made it off your sticky notes. A few weeks later, when another engineer is on-call, a similar issue fires at 3am!
Let's break down this situation into the actual problems we have in on-call engineering today:
1. Fragmented information across multiple tools
As seen in the example above, an engineer has to navigate at least 3-4 tools just to get to the crux of the problem. This is highly inefficient and time-consuming, especially during a high-severity issue where every minute counts.
2. Hard to discover useful historical data from previous incidents
Alerts and incidents recur most of the time, so chances are someone on your team has already solved the problem at hand. However, with information spread across multiple tools, it is very difficult to find the similar alerts triggered in the past. Even when you find them, they are only useful if they were updated with the information that helps the engineer referring to them resolve the issue. Teams therefore often miss the opportunity to leverage rich historical data in which a resolution could be readily available.
3. Stale mitigation documentation
Most teams write runbooks for the most important issues in their system. However, those runbooks are only beneficial if they can be easily discovered and are kept up to date. Today, they are often maintained in external repositories such as docs or wikis without appropriate linkage to the alerts they cover, and there is no easy way to find out which runbooks need updates.
4. Writing the summary document is time-consuming and lossy
In most companies, we observed that on-call engineers write the entire alert summary into a doc/wiki/Excel template by hand. For every alert, they try to capture a brief description, the resolution steps, and follow-up items, and tag them. We see the following problems with this (a short automation sketch follows this list):
This can take an hour or more, depending on the volume and severity of the issues.
Because the document is exhaustive, most engineers prefer to write it at the end of their shift, so team leads and management have very little opportunity to intervene on a day-to-day basis.
The process is also highly error-prone: in our experience, engineers miss alerts, misclassify them, or lose follow-ups.
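To make the manual effort concrete, here is a minimal Python sketch of the kind of aggregation a handoff summary needs. The Alert record, its field names, and the output format are illustrative assumptions rather than any particular alerting tool's schema; a real script would fetch this data from your alerting tool's API instead of constructing it by hand.

from collections import Counter
from dataclasses import dataclass

@dataclass
class Alert:
    """Hypothetical alert record; real schemas vary by alerting tool."""
    id: str
    title: str
    severity: str   # e.g. "SEV1".."SEV4"
    resolved: bool
    notes: str      # mitigation notes, if the engineer wrote any
    link: str       # permalink back to the alerting tool

def build_handoff_summary(alerts: list[Alert]) -> str:
    """Aggregate a rotation's alerts into a plain-text handoff summary."""
    by_severity = Counter(a.severity for a in alerts)
    unresolved = [a for a in alerts if not a.resolved]
    no_notes = [a for a in alerts if not a.notes.strip()]

    lines = [f"Total alerts: {len(alerts)}"]
    lines += [f"  {sev}: {n}" for sev, n in sorted(by_severity.items())]
    lines.append(f"Unresolved ({len(unresolved)}):")
    lines += [f"  - {a.title} ({a.link})" for a in unresolved]
    lines.append(f"No mitigation notes ({len(no_notes)}) -- runbook follow-ups?")
    lines += [f"  - {a.title} ({a.link})" for a in no_notes]
    return "\n".join(lines)

Even this toy version produces the counts, links, and gaps automatically; done by hand, that is exactly the hour of copy-paste described above.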
5. It is difficult to build high-level summaries and insights for management
Summary documents lack relevant visualizations and deep integrations with alerting/monitoring tools, so it is quite hard to extract insights or reports from them. This makes it difficult to establish a common understanding of the operational situation and to set the right priorities across the organization.
The Research
Here are some stats based on market research and our own survey:
90% of individual contributor engineers say they spend time on nights and weekends monitoring and resolving issues, negatively impacting their professional and personal lives
More than 40% of engineers say there is not enough context in the alerts for them to triage incidents quickly
More than half of engineers say they are frequently stressed out during on-call rotations
1 out of 3 engineers say that some of their on-call experiences make them want to quit
The Solution
We have personally faced these issues and often hear about them from friends and colleagues. We strongly felt there should be a tool that makes the on-call experience robust and manageable while improving engineers' quality of life. Although there are widely used tools for reliability management today, we didn't find that 'one' tool that can solve all these problems.
That's why we are starting Next9, with the mission to "Bring the next 9 to your reliability management".
To solve these problems, we are starting with an on-call dashboard that will unlock the following value for your organization:
1. Develop a common understanding
A central place to track and coordinate all on-call activities and quickly establish a common understanding of the situation at any time. This way, engineers do not have to go through multiple tools to resolve incidents.
2. The right information at your fingertips
The on-call dashboard will organise and manage your information in a streamlined manner, making it easy to find what you need quickly. For example, similar past incidents, along with the actions taken to resolve them, show up linked to newly triggered incidents.
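As a rough illustration of that linking (a sketch, not necessarily how Next9 implements it), the snippet below ranks historical alerts by title similarity to a newly triggered one using Python's standard-library difflib. The history data, field layout, and threshold are illustrative assumptions.

from difflib import SequenceMatcher

# Hypothetical history of (alert title, resolution notes) from past rotations.
HISTORY = [
    ("DB connection pool exhausted on orders-svc", "Increased pool size; see runbook RB-12"),
    ("High p99 latency on checkout", "Rolled back deploy #4821"),
    ("Disk usage > 90% on kafka-broker-3", "Purged old log segments"),
]

def similar_past_alerts(new_title: str, threshold: float = 0.5):
    """Return (score, title, notes) for past alerts resembling the new one, best first."""
    scored = (
        (SequenceMatcher(None, new_title.lower(), title.lower()).ratio(), title, notes)
        for title, notes in HISTORY
    )
    return sorted((s for s in scored if s[0] >= threshold), reverse=True)

for score, title, notes in similar_past_alerts("DB connection pool exhausted on payments-svc"):
    print(f"{score:.2f}  {title}  ->  {notes}")

A production system would match on richer signals (service name, error fingerprint, stack trace), but even crude text similarity surfaces the "someone has solved this before" cases.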
3. Actionable insights to keep the documents up to date
We will provide detailed insights into how useful documents such as runbooks are, and which of them need updates, so that they can be relied on for incident mitigation.
4. Complement your existing tools
The on-call dashboard will also integrate with your existing collaboration, monitoring, task management and incident management tools so that you can have a streamlined on-call process.
5. Data-driven decisions
The on-call dashboard will help you build insights and reports in minutes, so that management can make the right decisions.
The Impact
We believe Next9 delivers the following value as you adopt it:
Short term
Get everyone on the same page
Quicker time to resolution so that engineers can focus on innovation
Minimise on-call burnout, improving engineers' personal lives
Long term
Improve reliability by reducing recurring issues and outages
Build robust on-call and reliability culture
Thank you
If you have read this far, thank you so much! We have been working on an initial version of this product over the past few months. If your team is battling on-call chaos and is interested in establishing best practices faster, we would be happy to partner with you.
Feel free to reach out at info@next9.ai if you have any questions or are interested in becoming a partner in our journey.