In this final article in the series, "The Seven Sources Of Problems", we look at Source #7 - "Failures".
In this context we mean anything that could fail and affect service, or your capability to provide service, but that is not attributable to the other sources of problems, such as changes, upgrades, or suppliers.
"Things just sometimes fail". That's why we have maintenance contracts. That's why we have health insurance, come to think of it.
There are a number of areas where failures can directly impact service provision and reduce availability:-
• Mechanical or hardware failure
• Data transfer or data formatting failures
• Processing schedule failure
Now, there will often be a 'cause and effect' challenge; for example - the air conditioning system would not have failed if we had realised that there was an extra load on it due to 100 new business users starting in that building last week.
So, what can you do about Source #7 - failures?
There are several working practices that can be adopted to help relieve the pressure of new problems coming from this particular source:-
Preventative maintenance schedules. Not only the schedules but the actual smooth execution of the work. Talk to your facilities team about their schedules too. Why not combine schedules so that they work together and minimise overall downtime to the business? For every component you should be able to determine its health, its current status and when it is next due a check-up from the maintenance teams.
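As a rough illustration of the "when is it due a check-up" question, here is a minimal Python sketch that combines IT and facilities components into one list and reports what falls due within a planning window. The `Component` fields, the owner labels and the `due_within` helper are all hypothetical names chosen for this example; a real asset register will hold far more detail.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Component:
    """One maintainable item; all fields are illustrative assumptions."""
    name: str
    owner: str          # e.g. "IT" or "Facilities"
    last_checked: date  # date of the last completed check-up
    interval_days: int  # agreed maintenance interval

    def next_due(self) -> date:
        return self.last_checked + timedelta(days=self.interval_days)


def due_within(components, today, window_days):
    """Return components due on or before today + window_days, soonest first."""
    cutoff = today + timedelta(days=window_days)
    return sorted(
        (c for c in components if c.next_due() <= cutoff),
        key=lambda c: c.next_due(),
    )


if __name__ == "__main__":
    register = [
        Component("core-router-1", "IT", date(2024, 1, 10), 90),
        Component("aircon-floor-2", "Facilities", date(2024, 2, 1), 60),
        Component("ups-a", "Facilities", date(2024, 3, 1), 180),
    ]
    for c in due_within(register, date(2024, 3, 25), 30):
        print(c.next_due(), c.owner, c.name)
```

Listing both teams' items in one view is what makes it possible to spot check-ups that could share a single maintenance window.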
Manufacturers' root cause analysis, including faulty batch validation. It's amazing how many times you have a problem - say due to failure - but you don't realize that you're not the first, and you're not alone. Manufacturers and systems integrators occasionally suffer from potential mass failures - and sometimes need to perform an emergency field upgrade (to prevent the failure) or an emergency field upgrade (note - same title!) to replace a failing or failed item. It's important when these occur to check your full inventory for any other similar or related items, and also to ask the vendor or manufacturer which 'batch' those failing items came from, what the root cause of the problem is and, most importantly, whether the new item will work. How do they know? What if the same symptoms occur again? Ask tough questions - because you have been impacted.
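The inventory check described here can be sketched in a few lines of Python. This is only an illustration under assumptions: the `InventoryItem` fields and the `at_risk_items` helper are hypothetical, and it presumes your inventory actually records a model and a manufacturing batch for each item, which many do not.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InventoryItem:
    """Minimal inventory record; field names are assumptions for this sketch."""
    asset_id: str
    model: str
    batch: str
    location: str


def at_risk_items(inventory, failed):
    """Split the rest of the inventory into same-batch and same-model items.

    Same-batch items are the most urgent to raise with the manufacturer;
    same-model items from other batches are worth asking about too.
    """
    others = [i for i in inventory if i.asset_id != failed.asset_id]
    same_batch = [i for i in others
                  if i.model == failed.model and i.batch == failed.batch]
    same_model = [i for i in others
                  if i.model == failed.model and i.batch != failed.batch]
    return same_batch, same_model


if __name__ == "__main__":
    failed = InventoryItem("PSU-001", "PSU-X", "B17", "DC1")
    inventory = [
        failed,
        InventoryItem("PSU-002", "PSU-X", "B17", "DC2"),
        InventoryItem("PSU-003", "PSU-X", "B22", "DC1"),
        InventoryItem("FAN-001", "FAN-Y", "B17", "DC1"),
    ]
    same_batch, same_model = at_risk_items(inventory, failed)
    print("Same batch:", [i.asset_id for i in same_batch])
    print("Same model, other batch:", [i.asset_id for i in same_model])
```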
“Where else could this failure occur?”. Simple question - but it's often effective at obtaining an answer that helps to prevent further failures. When you are impacted by a failure, ask yourself this question. Take action to prevent similar failures elsewhere. For example, if you have four highly redundant, all singing, all dancing, super powerful routers and one suffers a power supply failure that shorts the entire backplane, taking it out of service... then what ensures that the other three won't suffer the same sometime soon? People will give you a hundred reasons why it was a "one in a million" - but is it really?
Proactive failure prevention programme. Sounds a bit grand this one, doesn't it? But you will be pleasantly surprised if you just call in your top support people, along with the facilities team and some of your key vendors, and "brainstorm" this topic for a couple of hours. I suspect they will tell you a few home truths about cable infrastructure, labelling accuracy, and items that failed maintenance and have not been replaced or re-checked. It's real life and these things happen. But getting everyone together, and creating the right atmosphere where people can contribute and help to identify the items most likely to fail, is really useful.
Know your SPOFs and mitigate them. Identify, hunt down and eliminate (or mitigate) single points of failure where you can afford to do so - or decide explicitly that you can afford to accept the risk of leaving the SPOF in place. Single points of failure can also exist at a logical, rather than a physical, level, especially with network configurations and databases. So, keep your perspective pretty broad on this topic.
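One simple way to start the SPOF hunt is to model the dependencies as a graph and ask, for each component, "if this one disappears, can everything else still reach everything else?". The Python sketch below does this by brute force, removing one node at a time. It rests on several assumptions: the topology is hypothetical, links are treated as two-way, every node appears as a key in the dictionary, and the network is fully connected to begin with. It also only finds SPOFs that show up in the map you give it, so logical dependencies need to be drawn in as nodes too.

```python
from collections import deque


def reachable(graph, start, removed):
    """Breadth-first search from start, pretending `removed` has failed."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour != removed and neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen


def single_points_of_failure(graph):
    """Return nodes whose loss splits the remaining nodes apart.

    Brute force: fail each node in turn and test whether the rest
    are still mutually reachable. Fine for small maps.
    """
    nodes = set(graph)
    spofs = []
    for candidate in sorted(nodes):
        remaining = nodes - {candidate}
        if not remaining:
            continue
        start = min(remaining)
        if reachable(graph, start, candidate) != remaining:
            spofs.append(candidate)
    return spofs


if __name__ == "__main__":
    # Hypothetical topology: two routers, but everything hangs off one switch.
    topology = {
        "router-a": {"switch-1"},
        "router-b": {"switch-1"},
        "switch-1": {"router-a", "router-b", "db-primary"},
        "db-primary": {"switch-1"},
    }
    print(single_points_of_failure(topology))
```

In this made-up example the redundant routers are not the weak spot; the single switch they both depend on is, which is typical of the surprises this kind of exercise turns up.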
As a final point on the Seven Sources model, you should consider the initiatives, tactics and general approach to the actions required holistically, rather than in silos across the individual sources.
An integrated programme approach, where you keep costs low by improving the way that your support people currently work, is recommended.
Nothing we've listed here is really 'rocket science', but within our ever-changing environments it's key to keep on top of these initiatives to prevent problems in the first place.
I guess this old saying sums it up...
"Problems are not like wine - they don't get better with age!"