Mastering Incident Reports

Introduction

In the fast-paced world of Internet services, network incidents such as outages can strike at any time of the day or night, disrupting services and frustrating customers. As leaders in the industry, it’s crucial to have a solid grasp of how to create effective incident reports, so that customers remain confident that the services will continue to run smoothly.

Crafting a narrative

Let’s delve into the art of crafting incident reports that cut through the noise and pave the way for higher customer satisfaction.

Whenever an incident occurs in an IT environment, the affected users or customers become anxious and wary. They will call and email their service provider to find out whether the problem lies on the provider’s end or within their own IT infrastructure or network. Once the provider confirms that the problem is on its end and is being worked on, customers will usually follow up more frequently and keep tabs on the issue.

Types of incident reports:

1) Common Incident report

This is a general incident report prepared for all affected customers. It is mostly used for major or P1 network incidents, or for those relating to the overall IT infrastructure. It contains no customer-specific details, so the same report can be sent to every affected customer or user.

2) Customer Specific Incident report

This report is intended for one or more customers whose services were impacted. It is tailored to the customer in question and addresses their concerns. It is mostly created for premium customers, or for those whose SLA requires a report for every incident. Sometimes there are also legal reasons for such customer-specific reports.

Drafting the incident report

Drafting an incident report, or finding the root cause (RCA), is much like a crime investigation. It is the process of understanding in depth how events unfolded and how they gave rise to the incident in question. More often than not, the root cause is unclear at the start and a detailed diagnosis is required to find it. It may also be necessary to ask engineers, other teams and other providers/vendors for information and timestamps. Then it comes down to jotting down the facts. Most organizations start from a standard template that they have refined over time. This has three great advantages:

1) It makes the process simpler,
2) It keeps reports consistent with previous ones,
3) It saves the time otherwise spent redoing the structure of the report.

The prerequisite for drafting the report is that the person preparing it has adequate knowledge of the network/infrastructure, its history and its geography (span/reach), and knows the incident in great detail. The drafter should take into account small but valuable pieces of information from various sources. These sources can include the following:

Emails
Internal chats
CRM comments/updates
Device logs and diagnostic outputs
SNMP traps from devices
Syslogs
Phone call/meeting logs
Traffic statistics
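Entries gathered from sources like these can be merged into a single chronological record. The following is a minimal sketch in Python, assuming each entry has already been reduced to a (timestamp, source, note) tuple with an ISO 8601 timestamp. The sample entries, the `build_timeline` helper and the tuple layout are all hypothetical and only illustrate the idea.

```python
from datetime import datetime, timezone

# Hypothetical entries from three sources; all values are illustrative.
email_events = [
    ("2024-01-10T09:05:00+00:00", "email", "Customer reports service down"),
]
syslog_events = [
    ("2024-01-10T09:01:30+00:00", "syslog", "Interface link down on edge router"),
    ("2024-01-10T11:40:00+00:00", "syslog", "Interface link up on edge router"),
]
chat_events = [
    ("2024-01-10T09:20:00+00:00", "chat", "NOC engineer confirms fiber alarm"),
]


def build_timeline(*sources):
    """Merge (timestamp, source, note) tuples into one list sorted by time."""
    merged = []
    for source in sources:
        for stamp, origin, note in source:
            # Normalize every timestamp to UTC so sources can be compared.
            when = datetime.fromisoformat(stamp).astimezone(timezone.utc)
            merged.append((when, origin, note))
    return sorted(merged, key=lambda entry: entry[0])


timeline = build_timeline(email_events, syslog_events, chat_events)
for when, origin, note in timeline:
    print(f"{when:%Y-%m-%d %H:%M:%S} UTC  [{origin}]  {note}")
```

Normalizing to one time zone before sorting matters in practice, because devices, mail servers and chat tools often record time in different zones.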

With experience, a good drafter will also glean timestamps from the above sources and correlate the facts to establish continuity. The following sections are found in most incident reports, but depending on the specific case, sections may be added, removed or modified.

Summary

This section mentions the following:

Ticket/case reference of the incident
Customer ID / account
Customer name
Date of report creation

Impact Type

Impact type describes the nature of the impact: a full outage (hard down) of the circuit, intermittency, a fluctuating/flapping service, slow/low speeds, an application not working, etc.

Impact Duration, Start and End time

As the name suggests, the incident start time, end time and total duration should be mentioned. The total duration is usually what is taken into account for SLA calculation.
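As a rough illustration, the duration and its effect on a monthly availability figure can be computed as below. This is only a sketch: the timestamps are made up, both function names are hypothetical, and the availability formula (uptime divided by total minutes in the month) is an assumption, since actual SLA terms vary from contract to contract.

```python
from datetime import datetime

# Illustrative values only; real times come from the incident record.
start = datetime.fromisoformat("2024-01-10T09:01:30+00:00")
end = datetime.fromisoformat("2024-01-10T11:40:00+00:00")


def impact_duration_minutes(start, end):
    """Return the outage duration in minutes."""
    if end < start:
        raise ValueError("end time is before start time")
    return (end - start).total_seconds() / 60


def monthly_availability(downtime_minutes, days_in_month=30):
    """Assumed formula: percentage of the month the service was up."""
    total_minutes = days_in_month * 24 * 60
    return 100 * (total_minutes - downtime_minutes) / total_minutes


downtime = impact_duration_minutes(start, end)
availability = monthly_availability(downtime, days_in_month=31)
print(f"Impact duration: {downtime:.1f} minutes")
print(f"Availability for the month: {availability:.3f}%")
```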

Sequence of Events / Timeline

This section should include a date-wise and time-wise record of the events that happened and the activities carried out to resolve the incident. It forms the major part of any detailed incident report, and its timestamps should match those of the interactions with the customer.

It is also imperative that no gap between consecutive entries exceeds 1 hour. This shows that prompt action was taken, that high priority was given to resolving the incident and reducing downtime, and that the customer was kept updated throughout. Any ETA/ETR (Estimated Time to Restore) shared with the customer deserves a special mention.
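A draft timeline can be checked against the 1-hour rule before the report is released. The sketch below assumes the timeline is a list of (timestamp, note) pairs with ISO 8601 timestamps. The entries and the `find_gaps` helper are hypothetical.

```python
from datetime import datetime, timedelta

MAX_GAP = timedelta(hours=1)

# Hypothetical draft timeline; all entries are illustrative.
timeline = [
    ("2024-01-10T09:01:30+00:00", "Link down alarm received"),
    ("2024-01-10T09:20:00+00:00", "NOC confirms fault, field team dispatched"),
    ("2024-01-10T10:45:00+00:00", "Field team reaches site, ETR shared"),
    ("2024-01-10T11:40:00+00:00", "Service restored"),
]


def find_gaps(entries, max_gap=MAX_GAP):
    """Return (earlier, later) pairs of consecutive entries more than max_gap apart."""
    times = sorted(datetime.fromisoformat(stamp) for stamp, _ in entries)
    return [(a, b) for a, b in zip(times, times[1:]) if b - a > max_gap]


gaps = find_gaps(timeline)
for earlier, later in gaps:
    print(f"No entry between {earlier:%H:%M} and {later:%H:%M} ({later - earlier})")
```

Any gap that is flagged is a prompt to look for an update that was sent or an action that was taken in that window and add it to the record, not to invent one.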

Reason for Outage (RFO) / Incident summary

RFO stands for Reason For Outage: the main reason the incident or outage occurred. The RFO is a fundamental part of any incident report. Many people use the terms RFO and incident report interchangeably, but they are not the same. The RFO, or incident summary, is a brief one-line statement of why the incident happened.

Restoring customer confidence

The following sections aim to restore the customer’s confidence that the service remains reliable.

Root Cause Analysis (RCA)

The RCA is a fundamental part of the incident report. It is prepared once the root cause of the incident or outage has been established. Finding the root cause can be straightforward or tedious, and it is mostly done by NOC engineers, field engineers, senior network engineers, etc. It is important to note that the root cause can lie on the provider side, on the customer side or somewhere else, and analyzing this is of the utmost importance.

5 Whys analysis

5 Whys is an analysis technique that may be applicable to some or all incidents. It is the process of repeatedly asking why something happened until the root cause is reached.

Problem statement. Why?
Reason for problem statement. Why?
Event-1 happened. Why?
Event-2 or reason for Event-1. Why?
Event-3 or reason for Event-2. Why?
Root cause of Event-3

To explain this with an example, see below:

The customer’s SaaS application stopped working. Why?
Network services used by the SaaS application went down. Why?
The network circuit at the provider faced an outage. Why?
The optical fiber towards the customer’s office location was cut. Why?
Road construction work accidentally damaged the provider’s fiber junction box. Why?
The fiber junction box was in the open and unprotected.
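For teams that keep incident data in a structured form, the chain above can be stored as an ordered list and rendered into the report. This is a minimal sketch: the `format_five_whys` helper and the indented layout are hypothetical choices, not a standard format.

```python
# The example chain from this article, ordered from symptom to root cause.
whys = [
    "The customer's SaaS application stopped working.",
    "Network services used by the SaaS application went down.",
    "The network circuit at the provider faced an outage.",
    "The optical fiber towards the customer's office location was cut.",
    "Road construction work accidentally damaged the provider's fiber junction box.",
    "The fiber junction box was in the open and unprotected.",
]


def format_five_whys(chain):
    """Indent each step one level deeper and mark the last one as the root cause."""
    lines = []
    last = len(chain) - 1
    for depth, statement in enumerate(chain):
        suffix = " (root cause)" if depth == last else " Why?"
        lines.append("  " * depth + statement + suffix)
    return "\n".join(lines)


report_section = format_five_whys(whys)
print(report_section)
```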

Further Actions / Actions to prevent recurrence

This section conveys whether the remedial action taken to resolve the incident was final, or whether further action is needed to resolve it permanently. It can also include any maintenance or plans to mitigate or prevent the same incident in future. It is important to note that preventive action is an assurance, not a guarantee, because things may break again. Even so, this section instills confidence by showing a genuine attempt to avoid a repeat of the incident.

Internal Review & Report Release

The incident report draws on inputs from various people, teams, vendors and devices, as well as the customer’s communications, so it is crucial to fact-check it and make it as accurate as possible in order to avoid objections from the customer. The report should also highlight the strengths of the provider company and how professionally the mistakes or weaknesses were dealt with. For this reason, the report usually goes through an internal managerial review before it is finally released to the customer.

Nikhil Mistry
