Last week we introduced the “Seven Sources of Problems” framework and listed the core seven sources for all production problems.
Today’s article highlights source #6 – “Production Execution”.
This source looks directly at how Operations, Production Support and any other maintenance and support functions actually deliver the service needed to ensure non-stop, continuous operations for a demanding commercial organization.
The way we actually manage, control, change and improve our production environment correlates with the volume and frequency of new problems.
Stop/start recovery procedures. Failing to have adequate stop/start procedures for your core infrastructure, job scheduling, databases and other application support mechanisms will ultimately lead to new problems. It’s all about being in control. Can your IT support teams effectively take service down and then, after some remedial action has been taken, start it again?
The goal is minimum impact and disruption. It is important to note that a significant proportion of new problems stem from the way that service was taken down and/or restarted. We have seen numerous times that, whether during routine (non-service-impacting) maintenance or in the course of a run-of-the-mill incident, incorrectly executing ‘stop’ procedures or inadvertently executing the wrong start script, without due consideration of the time of day and the end state of the systems, can lead to far greater problems than necessary.
Checkpoints:-
- Have you mapped all your underpinning Infrastructure to a Service Map – so you know what needs to be available to drive your service?
- Have you clear and unambiguous stop/start procedures that are known and understood?
- Do your people know how and when to execute these procedures?
- Once service has been restored, are all data feeds and file transfers as you would expect them?
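As a minimal sketch of the first checkpoint, a Service Map can be held as a simple dependency table and used to derive a safe start order and its reverse for stopping. The component names below are hypothetical, and a real map would come from your own configuration records rather than a hard-coded table.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical service map: each component lists what it depends on.
SERVICE_MAP = {
    "database": [],
    "job_scheduler": ["database"],
    "file_transfer": ["database"],
    "application": ["database", "job_scheduler", "file_transfer"],
}


def start_order(service_map):
    """Return components so that every dependency starts before its dependents."""
    return list(TopologicalSorter(service_map).static_order())


def stop_order(service_map):
    """Stop in the reverse of the start order, so dependents go down first."""
    return list(reversed(start_order(service_map)))


if __name__ == "__main__":
    print("Start:", start_order(SERVICE_MAP))
    print("Stop: ", stop_order(SERVICE_MAP))
```

Deriving both orders from one map means a stop procedure and its matching start procedure cannot drift apart when a new dependency is added.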
Automated tools and the ability to handle out of line situations. Tools are great – they cut down on human intervention and help us to manage more effectively. However, the modern plethora of tools and their in-built complexity often means that they “sprawl” out on their own and no-one really knows what’s happening with them, in terms of when they execute their own housekeeping and tidy-up routines and perform their core processing. So when problems occur and the root cause requires rapid identification and elimination, you really need to be able to use the full power of your diagnostic tools to assist.
Checkpoints:-
- Have you got the right tools, measuring and capturing the right things?
- Will your tool configuration proactively support your problem management process?
- Are your tools paying off? Do you spend more time ‘tuning’ them than they give back in benefit to your environment?
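One way to get a grip on tool “sprawl” is simply to write down when each tool runs its own housekeeping and compare that with your core processing window. The sketch below assumes you can express each window as (start, end) in minutes after midnight; the tool names and times are invented for illustration, and windows that cross midnight are not handled.

```python
def overlaps(a, b):
    """True if two (start, end) windows share any time."""
    return a[0] < b[1] and b[0] < a[1]


def clashes(housekeeping, core_window):
    """Return the tools whose housekeeping overlaps the core processing window."""
    return sorted(
        tool for tool, window in housekeeping.items()
        if overlaps(window, core_window)
    )


# Hypothetical housekeeping windows, in minutes after midnight.
HOUSEKEEPING = {
    "monitoring_agent": (120, 150),
    "log_archiver": (30, 60),
    "backup_tool": (200, 260),
}
CORE_PROCESSING = (90, 210)  # hypothetical overnight batch window

if __name__ == "__main__":
    print(clashes(HOUSEKEEPING, CORE_PROCESSING))
```

Even a table this small makes it obvious which tools are competing with production work, which is a useful starting point when a root cause needs to be found quickly.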
Impact of changes on production schedules, in particular business/environmental changes. The key point here is really simple: when aspects of service change (start time, number of users, new interfaces), does your underlying processing also change? There are many occurrences where the business aspect changes, but someone back in Operations forgot to re-schedule or re-design the overnight processing flows to accommodate these changes – resulting in new problems.
Checkpoints:-
- Are tools and scheduling products automatically linked to your change management process?
- Have you got your processing schedules really ‘nailed down’ and under control?
- If an overnight job fails, are clear re-start instructions in place for your teams to continue processing without the need to call for support?
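To illustrate the point about service changes, here is a small sketch that checks whether an overnight batch still finishes before service opens once the business moves the start time. The job names, durations and times are assumptions made up for the example, and the jobs are assumed to run back to back.

```python
from datetime import datetime, timedelta


def batch_finishes_in_time(batch_start, job_minutes, service_start):
    """Return (ok, projected_finish) for jobs that run one after another."""
    finish = batch_start + timedelta(minutes=sum(job_minutes.values()))
    return finish <= service_start, finish


# Hypothetical overnight jobs and their typical run times in minutes.
JOBS = {"extract": 60, "load": 90, "reconcile": 45}

batch_start = datetime(2024, 1, 1, 1, 0)        # batch kicks off at 01:00
old_service_start = datetime(2024, 1, 1, 7, 0)  # service used to open at 07:00
new_service_start = datetime(2024, 1, 1, 4, 0)  # the business now wants 04:00

if __name__ == "__main__":
    for label, start in (("old", old_service_start), ("new", new_service_start)):
        ok, finish = batch_finishes_in_time(batch_start, JOBS, start)
        print(f"{label}: finishes {finish:%H:%M}, in time: {ok}")
```

A check like this, run as part of the change itself, is one way of making sure the schedule is reviewed whenever the service hours are.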
Human error. This is the final area to explore in this source of problems – but perhaps the most important. Human intervention by support teams, service teams and business admin people can lead to procedures being executed incorrectly and errors being made that cause new problems. It is critical that this source of problems is identified and that the relevant person understands ‘the error of their ways’. The biggest reason why people do not own up to human error is that their working environment suffers from a ‘blame culture’. Our advice is simple: eradicate any ‘blame culture’ and focus on building a supportive culture, with education to help people fully understand how to do things correctly the first time.
In the final part of this series of articles we explore source #7 - "Failures" where we recommend some key actions to help minimise how this source can impact your environment.