ICT Major Incident Review: what outcome are you really after?

After the successful mitigation of an ICT Major Incident in an organisation, you would want to conduct a Major Incident Review (MIR), sometimes also referred to as a Post Incident Review (PIR). However, in my experience, there are often differing opinions on what the outcomes of such a review should be.

When an organisation’s Executives request such a review, the outcomes they are after are often clear: how did this happen, and what are you doing to prevent it from recurring?

Now, if you have any experience with ITSM practices, you may recognise those outcomes as outputs of the Problem Management practice. I’ve often come into an organisation to find that its MIR procedure and documentation simply serve as a way to generate a high-level problem summary, quickly put together at the insistence of the C-suite and/or their direct reports.

I believe the pressure to determine the root cause and the actions to address the problem in a significantly shortened timeframe, often within 24 hours, is (quite honestly) utterly stupid. It potentially compromises the determination of the true root cause and the development of a sustainable, appropriate action plan, in favour of an expedited response for Executives who want to show their bosses or stakeholders that they have the answers.

Ideally, the Problem Management activities should be allowed to proceed as per usual to get these answers. But where does that leave the MIR if we are separating the Root Cause Analysis (RCA)?

For my part, I try to focus the MIR on the detection of, and response to, a Major Incident, across four key areas:

1. Detection of the incident and identification of its impact to the business (did we know about the incident and its impact as quickly as we could?)
2. Coordination and management of resources to respond and mitigate the business impact as safely and quickly as possible (did we get the right people involved as quickly as possible, and did they follow the appropriate procedures in response?)
3. Review of the actions taken and decisions made, to determine if future incident responses could be improved to reduce the length of the business impact (did the actions taken minimise the impact to the business in the shortest possible time?)
4. Communication of the incident to the appropriate stakeholders (did we follow the communications plans, and did they achieve the outcome of informing the relevant parties appropriately?)

Of course, the above is my ideal scenario, where the MIR and RCA are isolated to their respective practices. The reality, however, is that there will be stakeholder pressure to report on the incident’s cause regardless.

To address this need, I have included a Root Cause section early in my MIR template. However, unless the root cause has already been determined, the documentation makes clear that the root cause listed in the MIR is what was understood at the time the report was generated. If the root cause is still being worked on through a problem record, the MIR report should give the problem record reference where the complete RCA will be published once finalised.
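As a rough illustration of that Root Cause section, here is a minimal Python sketch. The class name, field names and the sample values (including the problem record number) are all hypothetical and not taken from any particular ITSM tool; the only point is that the report states whether the cause is confirmed and, if it is not, where the full RCA will be published.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RootCauseSection:
    """Root Cause section of a hypothetical MIR report template."""
    summary: str                       # cause as understood when the report was written
    confirmed: bool = False            # True only once the RCA has been finalised
    problem_ref: Optional[str] = None  # problem record that will hold the complete RCA

    def render(self) -> str:
        if self.confirmed:
            return f"Root cause (confirmed): {self.summary}"
        # Make the provisional status explicit to the reader of the report.
        text = f"Root cause (as understood at time of report): {self.summary}"
        if self.problem_ref:
            text += f" Full RCA to be published under {self.problem_ref}."
        return text


# Illustrative values only.
section = RootCauseSection(
    summary="Expired certificate on the load balancer.",
    problem_ref="PRB0001234",
)
print(section.render())
```

Keeping the confirmed flag separate from the summary text means the same template works whether or not Problem Management has finished its work.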

So you perform a review and end up with a number of actions to address the gaps it identified. How do you manage the MIR action items?

The second point of contention I find in the MIR procedure is how and where to manage the actions resulting from the review. I’ve seen dedicated tooling used to list, assign and manage the actions generated by MIR procedures. I’ve also seen MIR actions shoehorned into their own problem record and related tasks.

Both of the above present issues. A separate list of MIR actions is yet another procedure and list of items for teams to manage. Combined with all the other lists and work that needs to be done, it can lead to fatigue and a reluctance to engage with and truly address the actions raised. Managing MIR actions in a dedicated problem record misuses the problem practice for action items that are often unrelated to an ongoing problem, and it also inflates the problem practice’s metrics and measurements.

After wrangling with how to track and manage MIR action items, my current guidance is not to keep them as a separate action list. Instead, every action item should be an input into an existing practice where it can be managed.

Many items will be updates to procedures or improvements to ways of working, which are best managed via the Continual Service Improvement (CSI) practice or simply as a knowledge uplift. Sometimes there’s cleanup work to fully restore the service to its state prior to the Major Incident; this would result in a separate incident, request or change record, depending on the action needed. As much as the MIR should not be a problem discussion, items relating to the root cause or the underlying problem are naturally raised; these can be captured in the existing problem record or trigger a new one.

The MIR report should still list and outline the action items raised in the review meeting. For each one, however, it should then give the record reference number in the practice where that action item will be managed.
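The routing described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the practice names, the record reference formats and the sample actions are hypothetical, and a real implementation would live in whatever ITSM tool the organisation already uses.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Practice(Enum):
    """Existing practices an MIR action item can be routed to (illustrative list)."""
    CSI = "Continual Service Improvement"
    KNOWLEDGE = "Knowledge Management"
    INCIDENT = "Incident Management"
    REQUEST = "Service Request Management"
    CHANGE = "Change Management"
    PROBLEM = "Problem Management"


@dataclass
class ActionItem:
    description: str
    practice: Practice
    record_ref: Optional[str] = None  # reference of the record in the owning practice


def unmanaged(items):
    """Return the action items that do not yet have a record in an existing practice."""
    return [item for item in items if not item.record_ref]


def render(items):
    """Render the action list for the MIR report, one line per item."""
    lines = []
    for item in items:
        ref = item.record_ref or "NOT YET RAISED"
        lines.append(f"- {item.description} [{item.practice.value}: {ref}]")
    return "\n".join(lines)


# Illustrative values only.
actions = [
    ActionItem("Update the escalation procedure", Practice.CSI, "CSI-0042"),
    ActionItem("Remove temporary firewall rule", Practice.CHANGE, "CHG0005678"),
    ActionItem("Investigate certificate monitoring gap", Practice.PROBLEM),
]
print(render(actions))
```

The MIR report stays a simple index of actions, while ownership and tracking sit with the practice behind each reference; the unmanaged helper is one way to spot items that have not yet been raised anywhere.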

I feel it is important to outline that the above is where I’ve landed so far in relation to the Major Incident Review procedures, focusing on what I feel provides the best value for an organisation.

If you have different procedures or methods, I’m curious to know what they are and how they provide value to your Service Management practice and your organisation.