Managing a Major Problem
The ITIL Service Operation book says you should have a “major problem review” after resolving a major problem, but it doesn’t provide guidance on how to recognize a major problem or how to manage it. In this blog I’ll try to help you understand what a major problem is and some of the things you can do to help manage major problems.
What Is a Major Problem?
Let’s start by distinguishing a major problem from a major incident. If the service is currently not working and you need to restore service then this is an incident. If this incident has a significant business impact then it may be a major incident, depending on what you have agreed with your customers. If the service is currently working, but there have been many significant incidents and you have not yet understood or resolved the causes of these then you may have a major problem. If you (and your customer) expect to see repeats of the same incidents causing significant business impact then this is a very good indication that you may have a major problem.
Some organizations log a problem for every major incident, but remember that incident management is what enables the business to resume working; problem management activity does not usually start until after the incident is closed. In summary, you have a major problem when the service is currently working but you have every expectation that it will fail again and that this failure may have a significant business impact.
Examples of major problems are:
- Repeated crashes of a server that cause a loss of service while IT recover the situation
- Frequent periods of very slow performance
- A single unexplained event that led to loss of service for many users.
How Should You Manage a Major Problem?
I have been asked to manage many major problems and the first thing I always do is to get all the technical people together in one room – even if that involves flying them in from remote locations – and I tell them to STOP TRYING TO FIX THE PROBLEM! Sometimes the problem has been going on for many months, management are shouting and demanding that IT resolves the problem, and everyone is making conflicting recommendations about how to proceed. It is essential to focus on the important tasks first, and fixing the problem is NOT the most important. It can be very hard to persuade people to listen, but until I can get them to cooperate on the analysis they are usually just making things more confused.
Here are the key steps that I usually follow when I take control of a major problem:
- Get hold of ALL the incident records related to the problem. If the problem has been going on for a long time then this can be difficult, but it is essential to make sure we have good quality data to analyse.
- Work as a team, with all the technical support staff involved, to identify how many different problems we are dealing with. It is at this stage that we often discover that there are many more problems than were originally supposed. For example, one customer had frequent incidents where performance of their application was very slow. When we analysed the incidents we discovered seven different patterns of behavior, each of which almost certainly had a different underlying cause, and each of which needed to be analysed as a different problem.
- For each problem that has been identified you must decide what will happen next time there is an incident of this type:
  - Get the technical people to document the best workaround that they can based on the knowledge that they currently have
  - Document a trigger that can be used to recognize that this problem has occurred. This trigger may be manual activity carried out by the service desk, or you may be able to automate it via your monitoring tools and event management
  - Make sure that your service desk and/or your operations staff understand the trigger and the workaround, and are prepared to carry out the required activities next time this problem occurs
Carrying out these three steps can lead to a significant improvement in the situation. For one customer we achieved a 90% reduction in weekly downtime by just doing this. When you have good triggers and workarounds in place this will give you a bit of breathing space to allow you to investigate the problems properly.
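If you want to automate the trigger side of this, the idea can be sketched in a few lines of Python. This is only an illustration, not a description of any particular monitoring tool: the `KnownProblem` record, the event field names (`source`, `status`, `response_ms`), the 5000 ms threshold and the workaround text are all assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class KnownProblem:
    name: str
    trigger: Callable[[dict], bool]  # returns True when an event matches this problem
    workaround: str                  # documented steps for the service desk


# Hypothetical entries: names, event fields and thresholds are assumptions.
KNOWN_PROBLEMS = [
    KnownProblem(
        name="app-server-crash",
        trigger=lambda e: e.get("source") == "app-server" and e.get("status") == "down",
        workaround="Restart the application service and confirm users can log in.",
    ),
    KnownProblem(
        name="slow-response",
        trigger=lambda e: e.get("response_ms", 0) > 5000,
        workaround="Recycle the connection pool and notify the service desk.",
    ),
]


def match_problem(event: dict) -> Optional[KnownProblem]:
    """Return the first known problem whose trigger fires for this event."""
    for problem in KNOWN_PROBLEMS:
        if problem.trigger(event):
            return problem
    return None


if __name__ == "__main__":
    hit = match_problem({"source": "app-server", "status": "down"})
    if hit is not None:
        print(f"{hit.name}: {hit.workaround}")
```

The point of keeping the trigger and the workaround in the same record is that whoever recognizes the problem, person or tool, is immediately handed the agreed response.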
Don’t forget to keep communicating with the customer throughout this activity. Often the customer is reluctant to allow you to stop trying to fix the problems at first, but as soon as they see the benefit of improved workarounds they will come round.
Now that you have a good workaround in place you can start to investigate the underlying causes of the problem(s). If you have identified multiple distinct problems then get the customer to prioritise them, based on their perception of the impact, and just work on the most important problems – the number you can deal with depends on what resources you have available but don’t spread your resources too thinly. It’s better to fix a few things than none at all!
While you are working on this root cause analysis you must monitor the effectiveness of your workarounds. Each time one of the problems recurs you should ask three questions:
- Did the trigger work well? Can we make it faster or more reliable?
- Did the workaround work well? Can we make it faster or more reliable?
- Is the problem priority still correct? Should you still be working on the same problems?
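One way to answer these questions with data rather than impressions is to log a small record each time a problem recurs and summarise the records per problem. The sketch below is an assumption about how you might do that; the `Recurrence` fields and the `summarise` helper are hypothetical names, not part of ITIL or of any tool.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass
class Recurrence:
    problem: str
    minutes_to_detect: float    # from start of failure until the trigger fired
    minutes_to_restore: float   # from trigger until service was restored
    workaround_succeeded: bool


def summarise(recurrences):
    """Group recurrences by problem and report count, average times and success rate."""
    grouped = defaultdict(list)
    for r in recurrences:
        grouped[r.problem].append(r)

    summary = {}
    for problem, items in grouped.items():
        summary[problem] = {
            "count": len(items),
            "avg_detect": mean(i.minutes_to_detect for i in items),
            "avg_restore": mean(i.minutes_to_restore for i in items),
            "success_rate": sum(i.workaround_succeeded for i in items) / len(items),
        }
    return summary
```

A problem whose count is climbing, or whose workaround success rate is falling, is a candidate for moving up the priority list when you next review it with the customer.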
As you continue to work on the problems you should keep sending regular updates to the customer, including information about how well the workarounds are protecting their business. This will allow you sufficient time to do the root cause analysis thoroughly, and really understand the problems. As you fix each problem you can start to work on one of the lower priority problems.
After you close each problem don’t forget to carry out a “major problem review” to learn how you could do better next time.
Posted by Joe the IT Guy