Do Quick Wins Exist for Availability Management?
To do their jobs, users need applications, systems and services to be there when they need them. They expect them to be available. When systems are offline, users get frustrated at the lost productivity, and that frustration is quickly directed at IT. Revenue may be lost, and IT incurs the cost of remedial action to get systems back online. Often, IT managers are called in to answer difficult questions about why service downtime is hurting the business – and IT’s already fragile reputation takes another beating. Service downtime is a big issue for the business. Predictable, reliable services are what it wants.
If you’re looking for quick wins to improve availability, there are none. High availability relies on all of the moving parts of IT working together – as well as an understanding in the business that building high quality, high availability services takes time and costs money. That means every area of service design and service support must be primed to focus on availability.
Fundamentally, the availability of a service relies on the infrastructure components that support it, and on robust ITSM processes to maintain, restore and optimise performance. To begin with, infrastructure quality dictates availability. Poor infrastructure quality means plenty of downtime and a lot of calls to the service desk. When a link in the chain breaks, the service goes down and remedial support processes kick in. Depending on how IT deals with the issue, the service might be out of action for a short while, or a long time. The part IT operations plays is damage limitation and continual improvement.
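To put a rough number on the "link in the chain" idea, here is a minimal sketch in Python. It assumes the service needs every component to be up and that components fail independently of each other, which real infrastructure only approximates. The component names and availability figures are made up for illustration.

```python
from math import prod


def serial_availability(component_availabilities):
    """Availability of a service that needs every component to be up.

    Assumes component failures are independent of each other.
    """
    return prod(component_availabilities)


# Hypothetical figures for illustration only, not measurements.
chain = {
    "network": 0.999,
    "web server": 0.995,
    "application server": 0.995,
    "database": 0.99,
}

service = serial_availability(chain.values())
hours_down_per_year = (1 - service) * 24 * 365

print(f"Service availability: {service:.4f}")
print(f"Expected downtime: {hours_down_per_year:.0f} hours per year")
```

Under these assumptions the service as a whole comes out less available than its weakest component, which is why every link matters.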
The development perspective – high availability begins with good service design
There’s no such thing as perfect service design, nor the perfect infrastructure. Designs have flaws. Components fail. 100% uptime is impossible, but high availability always starts with good service design. IT managers like me (and indeed the business people asking for services) need to be acutely aware of this fact.
High quality infrastructure costs money, so high availability costs money. You get what you pay for. But IT budgets only stretch so far, so this is where the balancing act lies. Yes, with unlimited budget and time IT could design a theoretically unbreakable system – elegantly composed of only the finest equipment, designed with multiple component redundancy and automated failover. But IT doesn’t have unlimited time and budget. IT is constrained on all sides. Individual projects compete for budget. Resources are thin on the ground. The business always wants it now. So in many cases, IT has to make do and mend.
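The trade-off between redundancy and budget can be sketched with a simple calculation. This is a simplified model, not a design tool: it assumes identical components, independent failures and instant, flawless failover, and the 99% figure is hypothetical.

```python
def redundant_availability(single, copies):
    """Availability of `copies` identical components where one is enough.

    Assumes independent failures and instant, flawless failover.
    """
    # The group is down only when every copy is down at the same time.
    return 1 - (1 - single) ** copies


# Hypothetical component that is up 99% of the time.
for copies in (1, 2, 3):
    availability = redundant_availability(0.99, copies)
    print(f"{copies} copies: {availability:.6f}")
```

In this model each extra copy costs about the same again but buys a smaller absolute improvement than the one before, which is the balancing act in a nutshell.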
The problem is that cutting corners in design costs money in operation. More money. Eliminating an error by design incurs a fraction of the cost of fixing an error in a production system. This is something that software designers are all too familiar with. When you consider the cost of repeated remedial action to get an interrupted service back online (often involving overtime for expensive technical resources), it makes economic sense to design high availability into the system from the start. If the budget isn’t there, IT can’t do a proper job, so it’s up to IT to articulate the cost of downtime and persuade the business that investing in availability is the right thing to do in the long run. Otherwise, it’s “save now, pay later”. And who gets the blame when the sky is falling down? The IT department.
The operations perspective – effective support processes are critical to high availability
High availability isn’t just the responsibility of the service development team. IT Operations are responsible for keeping production systems running and services online. That means the core ITSM processes – incident, problem and change – play a big part in overall availability. When a service goes down it has to be restored quickly. Every minute of downtime is potentially costing the business money. An efficient incident management process is critical to identifying and resolving interruptions as quickly as possible. The clock is ticking and frustration in the business is rising. The service desk needs to know what to do – fast – whether that be taking direct action or collecting information and routing it to the problem management function for further analysis.
Of course, with limited IT resources, prioritisation is critical. The service desk needs to understand the business impact of downtime for each service in order to focus limited IT resources on the most pressing issues. Business critical services must take priority.
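As a rough illustration of impact-based prioritisation, the sketch below sorts an incident queue so that business critical services come first. The Incident fields, the service names and the cost figures are all hypothetical; a real service desk tool will have its own priority model.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    service: str
    business_critical: bool
    cost_per_hour: float  # hypothetical estimate of lost business value
    minutes_open: int


def prioritise(incidents):
    """Order incidents: business critical first, then by hourly cost,
    then by how long they have been open."""
    return sorted(
        incidents,
        key=lambda i: (not i.business_critical, -i.cost_per_hour, -i.minutes_open),
    )


# Made-up queue for illustration.
queue = [
    Incident("intranet wiki", False, 200.0, 90),
    Incident("order processing", True, 15000.0, 10),
    Incident("email", True, 4000.0, 45),
]

for incident in prioritise(queue):
    print(incident.service)
```

The point of the design is that the ordering comes from agreed business impact rather than from whoever shouts loudest.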
The problem management function has two jobs to do that impact availability. The first is to react to issues coming from the service desk. This needs to be done quickly and efficiently, as the clock is still ticking. Problem management has to have the right mix of people, processes and tools to find the root cause of the disruption and formulate an effective response to fix the issue and get the service back online. The second job of the problem management function is to proactively seek out infrastructure errors that are affecting performance – or might cause disruption in the future. Over time, as errors are eliminated, systems become more robust and availability improves.
Over 70% of incidents come from badly planned changes and human error. IT people introduce new errors into the infrastructure when they add, remove or reconfigure components without considering the broader consequences (particularly in virtualised environments). When you’ve got a complex infrastructure with highly interdependent components, it is easy to fix one system and inadvertently break another. So, if you want high availability, you need to have control over the IT estate and who does what, when.
Each change needs to be carefully planned and follow a defined process – including a thorough impact analysis by the Change Advisory Board (CAB) to ensure any actions do more good than harm. In turn, this requires support from configuration management to ensure a map of the infrastructure is available to support change planning – and to make sure this map is up to date. Without it, change planning is a difficult and time-consuming manual task.
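To show why an up-to-date map makes change planning easier, here is a minimal sketch of an impact lookup. It assumes the configuration data can be expressed as a simple "what depends on what" table; the item names and relationships are invented, and a real CMDB will be far richer.

```python
from collections import deque

# Hypothetical configuration map: each item lists what depends on it.
DEPENDENTS = {
    "storage array": ["database server"],
    "database server": ["order service", "reporting service"],
    "hypervisor host": ["database server", "web server"],
    "web server": ["order service"],
    "order service": [],
    "reporting service": [],
}


def impacted_by(changed_item, dependents=DEPENDENTS):
    """Return every item downstream of a change (breadth-first search)."""
    seen = set()
    todo = deque([changed_item])
    while todo:
        item = todo.popleft()
        for dependent in dependents.get(item, []):
            if dependent not in seen:
                seen.add(dependent)
                todo.append(dependent)
    return seen


print(sorted(impacted_by("hypervisor host")))
```

With a map like this, the CAB can see in seconds which services sit downstream of a proposed change. If the map is stale, the answer is wrong, which is why keeping it current matters as much as having it.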
So we see that high availability is the product of good service design and robust support processes working together at different stages of the service lifecycle. At each stage, the IT people involved need to consider the impact of what they are doing on service availability. After all, availability is the number one quality attribute of a service. If the service isn’t ready for use, nothing else matters.
Posted by Joe the IT Guy