7 Major Incident Management Tips
Major incidents are stressful. They’re the bogeymen of IT issues, adversely affecting business operations and outcomes. ITIL 4 defines them as: “An incident with significant business impact, requiring an immediate coordinated resolution.” And by their very nature, major incidents are challenging.
To help, this blog looks at ways of managing them effectively and offers up 7 tips for a better major incident management capability.
1. Check your facts
The first thing you need to do when dealing with a major incident is to ensure that you have all the facts. This isn’t an exhaustive list, but here are the key questions to ask so that you can get a handle on things:
- Is everyone safe? Are there any immediate hazards? First and foremost, look after your people and ensure that everyone is away from any potential danger. It sounds dramatic, I know, but this is a very real issue, particularly if the incident is related to something like generators, UPS maintenance, or mains electrical work.
- What service is affected?
- What is the business impact?
- What user base is affected? Is it a specific team or location, or is it everyone?
- What support team is looking at it? Do we have the right people engaged?
- Do we need to make other support teams aware?
- When did this start happening? Was there any change activity around the time?
- Is there a workaround?
- Do we have an ETA on a fix yet?
- Will we need third-party support?
- Do we need to notify any onward customers?
- Are there any security issues that we need to raise?
- Are there any legal or compliance risks we need to raise?
- Do we need to involve disaster recovery capabilities?
- Is the IT service desk able to cope with the current volumes of related calls?
- What is a realistic time for promising an update?
Get the basics covered so that you’re able to answer all (or at least most) of the questions you’ll be asked by your customer(s) and senior management when you make them aware of the issue.
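If your team likes to keep its fact-finding in a script or chat bot, the checklist could be sketched roughly like this. It's only an illustration: the `IncidentFacts` class, its field names, and the `open_questions` helper are all made up for this example, and it covers just a handful of the questions above.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class IncidentFacts:
    """Answers to a few of the questions above; None means 'not known yet'.

    Hypothetical structure for illustration, not part of any ITSM tool.
    """
    everyone_safe: Optional[bool] = None
    affected_service: Optional[str] = None
    business_impact: Optional[str] = None
    affected_users: Optional[str] = None
    support_team: Optional[str] = None
    started_at: Optional[str] = None
    workaround: Optional[str] = None
    next_update_at: Optional[str] = None


def open_questions(facts: IncidentFacts) -> List[str]:
    """Return the names of the fields that still have no answer."""
    return [f.name for f in fields(facts) if getattr(facts, f.name) is None]


# Early in the incident we only know two things.
facts = IncidentFacts(everyone_safe=True, affected_service="Email")
print(open_questions(facts))
```

Anything still in that list is a question you can expect to be asked and can't yet answer.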
2. Tell the right people and quickly
In an ideal world, you’ll have a list of people who need to be notified in the event of a major incident and your communications are templated, automated, and professional.
However, the reality is often that it’s very hard to be perfect at everything all the time. So, make sure you’re primed to tell the right people, the right information, at the right time when a major incident strikes.
First off, let’s deal with the who. In a major incident situation, you may have to communicate with some or even all of the following:
- Angry customers
- Fraught business stakeholders and service delivery managers
- Under-pressure technical teams
- Regulatory bodies
- The press.
Make sure that the right people speak to the right stakeholders. For example, engage your compliance and legal teams if you need to deal with any external parties.
When issuing communications, make sure they’re as clear and easy to understand as possible. If you have any workaround information, put it front and center.
When communicating major incidents, ensure that you include the following:
- Incident title and reference
- Business impact
- Affected service and user base
- Any workarounds or self-help information
- Contact details for the service desk
- Time of the next update.
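A templated notification covering the items above might look something like this minimal sketch. The wording, the placeholder names, and the `render_notification` helper are assumptions for illustration only; your own comms template will differ.

```python
from string import Template

# Hypothetical template: one placeholder per item in the list above.
# The workaround sits near the top so readers see it straight away.
NOTIFICATION = Template(
    "MAJOR INCIDENT: $title ($reference)\n"
    "Workaround: $workaround\n"
    "Business impact: $impact\n"
    "Affected service: $service\n"
    "Affected users: $user_base\n"
    "Service desk: $service_desk\n"
    "Next update: $next_update\n"
)


def render_notification(**details: str) -> str:
    """Fill in the template; raises KeyError if any item is missing."""
    return NOTIFICATION.substitute(**details)


message = render_notification(
    title="Email outage",
    reference="INC0001",  # made-up reference number
    workaround="Use webmail until further notice",
    impact="Staff cannot send or receive email",
    service="Email",
    user_base="All office locations",
    service_desk="ext. 1234",  # made-up contact detail
    next_update="14:30",
)
print(message)
```

Because `substitute` refuses to render with a missing placeholder, a notification can't go out without, say, a time for the next update.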
3. Create a plan of action
Loop back around to your support team and start pulling together an action plan. Make sure that you’ve assembled all the key teams so that nothing is missed and you can figure things out quickly.
Think of yourself as Nick Fury or Maria Hill with your support teams as The Avengers (Avengers: Endgame, OMG). Your role as Major Incident Manager is to coordinate and smooth things over, with your service desk and support teams as the superheroes swooping in to save the day.
When working through the fix effort, you’ll often end up on conference calls or Skype sessions that seem to be attended by a cast of thousands. So, it’s really important that the call stays on track.
The tone you’re looking to set here is brisk, efficient, and kind. Remember that people are stressed and under pressure and sometimes the situation is less than ideal. When dealing with multiple stakeholders, things can also get tense.
Here are some suggestions for what to say to keep things calm:
| Situation | What to Say |
| --- | --- |
| A senior person is ranting at your techies. Ranting is never acceptable, but your immediate goal is to protect your people and de-escalate the situation. You can deal with the behavior itself later. | “Thank you for your feedback but we need to focus on the fix effort. We’ll pick X up once the immediate issue has been contained.” |
| No one knows what’s going on and people are starting to get panicky. | “This is going to be absolutely fine. Let’s walk through this step-by-step so we can get a better handle on things. Is there anyone else that we need to loop in, so we have everything covered?” |
| A senior person is panicking about not having enough details. | “The situation is under control. We’re just pulling together a timeline and you’ll have it in your inbox in X minutes.” |
4. Check in regularly
Have regular updates with support teams and the business. If you commit to a deadline, then meet it. And don’t make people chase you for updates – otherwise, you risk them bypassing the major incident management process (part of the incident management practice in ITIL 4) and going directly to the teams involved in the fix effort, causing further frustration and delays.
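If you want a nudge before a promised update slips, a tiny helper along these lines could do the job. It's a rough sketch: the 30-minute interval and the function names are assumptions, so swap in whatever cadence you actually committed to.

```python
from datetime import datetime, timedelta

# Assumption: updates are promised every 30 minutes.
UPDATE_INTERVAL = timedelta(minutes=30)


def next_update_due(last_update: datetime,
                    interval: timedelta = UPDATE_INTERVAL) -> datetime:
    """Return the time by which the next update was promised."""
    return last_update + interval


def is_update_overdue(last_update: datetime, now: datetime,
                      interval: timedelta = UPDATE_INTERVAL) -> bool:
    """True once the promised update time has passed."""
    return now > next_update_due(last_update, interval)


last_sent = datetime(2024, 1, 15, 10, 0)  # illustrative timestamp
print(next_update_due(last_sent))  # 2024-01-15 10:30:00
print(is_update_overdue(last_sent, datetime(2024, 1, 15, 10, 45)))  # True
```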
5. Engage change
When a resolution has been identified, get someone to double-check it and carry out some initial testing. The last thing you need is for a fix effort to go wrong and make the situation worse.
Support your resolving teams by helping them get an emergency change raised and, if appropriate, requesting an emergency change advisory board (CAB) meeting. Every organization is different. Some companies will allow the change to be raised retrospectively so the issue can be fixed immediately. Others will need a condensed version of the change process, so engage with change management (or change enablement, originally named change control, if ITIL 4 has been adopted) and do everything you can to smooth the way for the fix effort.
6. Tell people you’ve fixed it
Don’t forget the closure comms!
Once the fix has been deployed, check that the relevant team has carried out the appropriate checks and then contact some of the affected users to ensure everything is working correctly. Once you have confirmation that everything is as it should be, send out a final notification that the (major) incident has been resolved and all is well.
7. Capture any lessons learned
When the incident has been resolved, spend five or ten minutes capturing the key actions, events, and any lessons learned before standing your teams down. You can have a more comprehensive review later; for now, the aim is simply that nothing is lost or forgotten.
Once everyone has caught up on their BAU work, and is fed and sufficiently caffeinated, schedule a short review meeting. This can be done in conjunction with your problem management and continual improvement teams, because major incidents need root cause analysis, and potentially lessons learned, to prevent the issue from happening again.
When undertaking the review, establish ground rules so that everyone understands they’re in a safe environment and you can get an honest account of the event, any roadblocks, additional root cause information, and anything that needs to be done to prevent a recurrence. Work with problem management so that any workaround information is added to the correct database or wiki, and with continual improvement so that opportunities for improvement are captured on the improvement register.
What are your favorite tips for managing major incidents? Please let me know in the comments.
Posted by Joe the IT Guy