ITSM Basics: A Simple Introduction to Problem Management

 

If you regularly read my blog you’ll know that I’ve already written a fair bit on the tough nut to crack that is problem management. It’s often something that’s started as part of the latest IT service management (ITSM) tool implementation project, but it’s not unusual for this initial investment in problem management (processes) to fail in execution due to one or more reasons.

From a problem management uptake perspective, if you believe what the annual industry surveys report, roughly two-thirds of IT organizations are already “doing” problem management. But it’s not always what it should be, i.e. the investment of time and resources to proactively investigate and address recurring IT and business issues, and their root causes. It’s this type of investigation that helps to identify the issues that cause (or may ultimately cause) repetitive and potentially serious IT and business issues or failures. Instead, IT organizations are often just doing major incident reviews, using problem management techniques, as and when needed. It’s problem management of sorts but not truly effective problem management.

In reality, problem management is often something of a “poor relative” to service desk and incident management activities. Whereas service desk and incident management commonly receive adequate investment in terms of staff, definition, training, and ongoing operation, problem management is often “something to be done later” and therefore often not done at all.


In my opinion, the low levels of proactive problem management adoption are quite ironic. The pressure to cut IT operational costs is why many IT organizations don’t do problem management, but it should be the reason why they need to be doing problem management. Of all the major ITIL processes, the investment of time and resources in truly effective problem management activity can provide some of the highest returns to an organization.

So to give you a simple introduction to problem management, I’ll quickly cover:

  • What problem management is
  • The objectives of problem management
  • The “problem lifecycle”
  • Problem modeling
  • The benefits of problem management

I refer to ITIL a fair bit (you might think too much), but you can quite easily use your own self-created problem management process and activities or look to alternative sources of ITSM and IT management advice such as ISO/IEC 20000, ISACA’s COBIT, USMBOK, or Microsoft’s MOF.

Where Problem Management Fits In

Problems (definition below) can be identified throughout the IT ecosystem. For example: acceptance into production, changes, updates/patches, vendor products, user errors, production execution, and failures. However, the main source for problem identification within an organization is probably the analysis of incidents as part of what is often called the “proactive problem management process.”
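
To make the incident-analysis idea a little more concrete, here’s a minimal Python sketch. It isn’t a prescribed ITIL technique, just one simple way to spot recurring incidents. It assumes each incident record carries a service name and a normalized symptom; the Incident fields, the candidate_problems helper, and the threshold of three occurrences are all hypothetical choices made for illustration.

```python
from collections import Counter
from typing import NamedTuple


class Incident(NamedTuple):
    incident_id: str
    service: str   # hypothetical field: the affected service
    symptom: str   # hypothetical field: a normalized symptom or category


def candidate_problems(incidents, min_occurrences=3):
    """Return ((service, symptom), count) pairs that recur at least
    `min_occurrences` times, most frequent first."""
    counts = Counter((i.service, i.symptom) for i in incidents)
    return [(key, n) for key, n in counts.most_common() if n >= min_occurrences]


# Made-up sample data, purely for illustration.
incidents = [
    Incident("INC001", "email", "mailbox full"),
    Incident("INC002", "vpn", "login timeout"),
    Incident("INC003", "email", "mailbox full"),
    Incident("INC004", "vpn", "login timeout"),
    Incident("INC005", "vpn", "login timeout"),
    Incident("INC006", "printing", "paper jam"),
]

for (service, symptom), n in candidate_problems(incidents):
    print(f"{service}: '{symptom}' recurred {n} times, consider raising a problem record")
```

In practice the grouping key and threshold would depend on how your ITSM tool categorizes incidents; the point is only that repeated symptoms are a signal worth investigating rather than closing one ticket at a time.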

Beyond problem management often being solely associated with major incidents, another barrier to effective problem management is that problems are often confused with incidents (with the terminology wrongly interchanged). Or they are seen as an incident state rather than a separate entity requiring a different type of ITSM response.

If it helps, an easy way to remember the difference between the two is that:

  • Incident management is “put the fire out ASAP!” (so it’s firefighting), whereas
  • Problem management is “how did this happen?” and “how do we stop this happening again?” (so it’s arson investigation/fire prevention).

To succeed at problem management, IT senior management needs to appreciate that far too many costly, and possibly scarce, IT resources are currently spent fighting repetitive fires and that these resources would be better utilized supporting problem management activity to tackle the root causes, rather than the symptoms, of IT failures.

Problem Management Definition

ITIL, the ITSM best-practice framework formally known as the IT Infrastructure Library, uses the term problem to describe:

“The unknown cause of one or more incidents.”

With problem management defined as:

“The process of minimizing the adverse effect on the business of incidents and problems caused by errors in IT infrastructure and systems, and to proactively prevent the occurrence of incidents, problems, and errors.”

ITIL 4 defines the key purpose of problem management as being “to reduce the likelihood and impact of incidents by identifying actual and potential causes of incidents and managing workarounds and known errors.” A problem will become a “known error” when the root cause is known and a temporary “workaround” or a permanent alternative solution has been identified. Known errors are a part of an organization’s technical debt and should be removed where reasonably practicable.

For completeness, although I state my own benefits below, ITIL states that the value of problem management includes:

  • “Higher availability of IT services by reducing the number and duration of incidents that those services may incur. Problem management works together with incident management and change management to ensure that IT service availability and quality are increased. When incidents are resolved, information about the resolution is recorded. Over time, this information is used to speed up the resolution time and identify permanent solutions, reducing the number and resolution time of incidents.
  • Higher productivity of IT staff by reducing unplanned labor caused by incidents and creating the ability to resolve incidents more quickly through recorded known errors and workarounds.
  • Reduced expenditure on workarounds or fixes that do not work.
  • Reduction in cost of effort in fire-fighting or resolving repeat incidents.”

Problem Management Objectives

ITIL defines the objectives of the problem management process as:

  • “Preventing problems and resulting incidents from happening.
  • Eliminating recurring incidents.
  • Minimizing the impact of incidents that cannot be prevented.”

Importantly, problem management can’t operate in a vacuum.

Problem management should have strong relationships with other key IT service management processes. In addition to the more-obvious linkages with incident and change management, it also needs to use configuration management data to help determine the impact of problems and resolutions. Let’s also not forget that availability management has a dependency on problem management information and activity, and some problems will require investigation by capacity management teams and techniques.

Problem management can also be an entry point into IT service continuity activity and major incident management, where a significant problem needs to be resolved before it starts to have a major adverse impact on the business. Finally, from a service level management perspective, problem management contributes to improvements in service levels, and its management information should be used as the basis for service review activity.

The “Problem Lifecycle”

While not a linear lifecycle like incident management, you can view a problem as going on a journey from identification through to “resolution,” where resolution might come from error control or the creation of a workaround.

Thus it’s worth understanding that there are three common problem management “sub-processes”:

  • Problem identification – which focuses on investigating the causes of incidents that have already happened, identifying problems before they cause incidents, assessing the related risks, and optimizing the response with the aim of minimizing the probability and/or the impact of incidents
  • Problem control – which focuses on transforming problems into known errors (and workarounds)
  • Error control – which focuses on resolving known errors via the corporate change management process

The result might be one of three outcomes:

  1. A change is required to correct the problem – the organization should use an “error control” process to correct the problem via the corporate change management process.
  2. The problem cannot be fixed but a workaround is identified – the problem is classified as a known error with a workaround (a temporary way of resolving the incident), logged in a known error database, and made available to all support teams for ongoing incident resolution activity.
  3. No fix or workaround is identified – the problem is recorded as a “known problem,” with the information again made available for the benefit of all support teams.

It’s important to recognize that these three problem states are not mutually exclusive and that a problem may move between them over time. For instance, when possible, a workaround should still be made available while a problem is awaiting the implementation of a required change.
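
If it helps to see those states and movements written down, the following Python sketch models them as a small state machine. Treat it as an illustration under stated assumptions rather than ITIL’s definition: the ProblemState names, the ALLOWED_TRANSITIONS rules, and the Problem class are all hypothetical, and the workaround is deliberately stored separately from the state so that it can stay available while a change is pending.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class ProblemState(Enum):
    IDENTIFIED = "identified"
    KNOWN_PROBLEM = "known problem"    # investigated, no fix or workaround yet
    KNOWN_ERROR = "known error"        # logged in the known error database
    CHANGE_PENDING = "change pending"  # fix submitted to change management
    RESOLVED = "resolved"


# Assumed transition rules for this sketch; adjust them to your own process.
ALLOWED_TRANSITIONS = {
    ProblemState.IDENTIFIED: {
        ProblemState.KNOWN_PROBLEM,
        ProblemState.KNOWN_ERROR,
        ProblemState.CHANGE_PENDING,
    },
    ProblemState.KNOWN_PROBLEM: {ProblemState.KNOWN_ERROR, ProblemState.CHANGE_PENDING},
    ProblemState.KNOWN_ERROR: {ProblemState.CHANGE_PENDING},
    ProblemState.CHANGE_PENDING: {ProblemState.RESOLVED},
    ProblemState.RESOLVED: set(),
}


@dataclass
class Problem:
    problem_id: str
    state: ProblemState = ProblemState.IDENTIFIED
    workaround: Optional[str] = None  # kept independently of the state
    history: List[ProblemState] = field(default_factory=list)

    def move_to(self, new_state: ProblemState) -> None:
        """Move to `new_state`, rejecting transitions the rules don't allow."""
        if new_state not in ALLOWED_TRANSITIONS[self.state]:
            raise ValueError(f"{self.state.value} -> {new_state.value} is not allowed")
        self.history.append(self.state)
        self.state = new_state


# Made-up example: a workaround is recorded, then a change is raised.
problem = Problem("PRB0001")
problem.workaround = "Restart the print spooler"
problem.move_to(ProblemState.KNOWN_ERROR)
problem.move_to(ProblemState.CHANGE_PENDING)
print(problem.state.value, "| workaround still available:", problem.workaround)
```

Keeping the workaround as its own field, instead of tying it to a single state, is what lets support teams keep using it while the permanent fix works its way through change management.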

My simple diagram hopefully provides a snapshot of what can happen with problem management.

ITIL's problem lifecycle: the flowchart

Problem Management Modeling

As well as hardware and software, ITIL 4 suggests researching documentation, third-party components, standard data, highly sensitive data, consumer resources, and highly regulated services and systems. By prompting problem management practitioners to focus on other aspects of the service, modeling gives support teams the space to manage the issue more appropriately. For example, if it appears that a hardware fault caused a problem, it’s tempting to swap out that component. However, without 100% certainty, the swap might not only fail to fix the issue but also make it more complicated to resolve. Leaving room for the possibility that the fault was caused by outdated software, working practices not being followed, technical debt, or even a compatibility issue with a supplier system makes colleagues focus on the overall problem rather than getting pulled into a silo, making it more likely that it’ll be fixed on the first attempt.

Problem Management Benefits

In my opinion the key benefits of problem management include, but are not restricted to:

  • Increased responsiveness from IT as there’ll be a reduction in time wasted on dealing with preventable issues
  • Decreasing downtime and thus potentially maximizing business productivity
  • Preventing incidents before they adversely impact business operations
  • Making better use of potentially scarce IT resources
  • Better collaboration between different IT teams in preventing recurring issues; defined roles and responsibilities and a single, consistent process not only speed things up but also reduce duplication of effort and wastage
  • The ability to leverage existing known error and “workaround” knowledge to prevent the proverbial “reinvention of the wheel” and to speed up resolution
  • Reducing the costs associated with both IT service delivery and IT support – best practice processes and automation can both save time and effort, and therefore cost
  • Reducing the adverse effect of business-impacting incidents through prevention or workarounds; this might potentially include lost revenue, lost reputation, or even lost customers
  • Improving customer service and the business’s perceptions of the IT organization as a whole

Well there you have it, a quick guide and a simple introduction to problem management. Hopefully you found it helpful.

If you want to read more from me, and a few of my friends, on problem management, then please look at:


Posted by Joe the IT Guy

Native New Yorker. Loves everything IT-related (and hugs). Passionate blogger and Twitter addict. Oh...and resident IT Guy at SysAid Technologies (almost forgot the day job!).