The challenges of providing support to real time operations

An IT solution that works is one that doesn't generate problems, right? Well, then no IT solution works, because the only certainty is that if an IT solution is used, sooner or later problems will occur. It's not just about avoiding problems; it's about being able to respond adequately to them and minimize their impact.

In a business environment that requires continuous 24x7 operation, the demands on mission-critical applications are simple: they cannot stop! Ideally they would never fail, but in reality the probability of failure increases with the complexity and dynamics of the processes and with their dependence on external variables, such as systems integration, hardware or the human factor.

The main challenge that support faces is to act as the guardian of operational stability, keeping the solution's efficiency within the expected levels. For that reason, it's critical to adopt a set of good practices, both for problem prevention and for problem solving.

Therefore, I would like to emphasize the importance of, on the one hand, anticipating failure scenarios or performance degradation in key processes and, on the other hand, containing faults efficiently.

Predicting failure or performance degradation (preventive support) is achieved through continuous monitoring of indicators, covering not only infrastructure performance but also application performance and data efficiency. Fault containment (corrective support), whether through redundancy or through temporary alternatives that keep the operation running, even with reduced performance or functionality, allows a careful diagnosis of the root cause and the implementation of effective corrective measures.

Preventive strand

Continuous monitoring processes and automated alarm systems are essential tools for monitoring process performance objectively. The variables involved are not always linear, although some are easily observed, for example hardware resource usage. However, these have to be cross-referenced with application processing indicators to yield relevant performance information. This set of metrics becomes more useful the more unambiguously it points to the possible source of failure.
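As an illustration of cross-referencing, the following is a minimal sketch, not a definitive implementation. The indicator names (cpu_percent, orders_per_minute), the thresholds and the alarm messages are all assumptions chosen for the example; a real operation would use its own indicators and limits.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    cpu_percent: float        # infrastructure indicator (hypothetical)
    orders_per_minute: float  # application indicator (hypothetical)


def classify(sample, cpu_limit=85.0, min_throughput=100.0):
    """Combine one infrastructure and one application indicator.

    The thresholds are illustrative assumptions, not recommended values.
    """
    high_cpu = sample.cpu_percent >= cpu_limit
    low_throughput = sample.orders_per_minute < min_throughput

    if high_cpu and low_throughput:
        # Both indicators agree: resources are saturated and work is slowing.
        return "alarm: application degraded under resource saturation"
    if high_cpu:
        # Busy hardware alone is not a failure if throughput is healthy.
        return "warning: high load, throughput still healthy"
    if low_throughput:
        # Points away from hardware, e.g. towards an integration or data issue.
        return "warning: low throughput without resource pressure"
    return "ok"


if __name__ == "__main__":
    print(classify(Sample(cpu_percent=95.0, orders_per_minute=40.0)))
    print(classify(Sample(cpu_percent=50.0, orders_per_minute=200.0)))
```

The point of the sketch is that neither indicator alone identifies the source of a problem; only the combination narrows it down.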

This early detection allows interventions to be planned in time to eliminate the risk, whether through maintenance, repair or even rescaling of the required resources. It also means that even if an intervention requires a stoppage, it can be scheduled to minimize operational impact.

At the same time, an extremely important factor is the level of redundancy, both in infrastructure (critical equipment available for immediate replacement, communication channels) and in the adoption of application solutions that implement native distributed processing. I emphasize the latter because it not only allows a more cost-effective infrastructure, but also makes the operation less dependent on physical factors, since the operation flow can be guaranteed by the available equipment, even with lower performance, while the anomaly is resolved.
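The idea of continuing on the available equipment can be sketched as follows. This is a simplified illustration under stated assumptions: the node names, the health flags and the round-robin policy are hypothetical, and a real distributed solution would handle health checks and rebalancing itself.

```python
def assign(tasks, nodes):
    """Spread tasks round-robin over the nodes currently marked healthy.

    `nodes` maps a node name to True (healthy) or False (failed).
    With fewer healthy nodes each one receives more work, so the
    operation continues at lower performance instead of stopping.
    """
    healthy = [name for name, up in nodes.items() if up]
    if not healthy:
        raise RuntimeError("no healthy node available")

    plan = {name: [] for name in healthy}
    for index, task in enumerate(tasks):
        plan[healthy[index % len(healthy)]].append(task)
    return plan


if __name__ == "__main__":
    # Node "b" has failed; "a" and "c" absorb its share of the work.
    print(assign(list(range(6)), {"a": True, "b": False, "c": True}))
```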

Corrective strand

Despite the effort dedicated to the preventive component, it will not be possible to prevent all scenarios. When support is activated, the key is to keep a positive attitude in the face of the customer's predictable pressure and to focus immediately on the solution, because the window for intervention is minimal. Correctly assessing the support request is necessary because it can take several forms, including information requests, operational errors or simply unfamiliarity with the process.

The expectation is a quick resolution of the specific case so that the operation can continue, followed by identification and resolution of the root cause, with the crucial goal of preventing further occurrences. A good practice is to follow an approach similar to the Ford Motor Company's 8D method (http://quality-one.com/8d). Following the resolution, always assess the need to add the scenario to the control mechanisms of the preventive strand.

Finally, I would like to stress the importance of defining the average stoppage cost (cost / hour) of the operation. This indicator is very useful when evaluating the success of support strategies. Support is more successful the lower the number of requests for assistance (Incident Ratio) and the greater the ability to resolve them within the expected timeframe (SLA). So the better the performance in prevention and the agility in resolution, the greater the likelihood of achieving that success.
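To make these indicators concrete, here is a minimal sketch of how they might be computed. The figures are invented for illustration, and the definitions are assumptions: the Incident Ratio is taken here as incidents per operating hour, and SLA compliance as the share of incidents resolved within the agreed timeframe. An actual support contract may define both differently.

```python
def stoppage_cost(cost_per_hour, downtime_hours):
    """Cost of stoppage: average hourly cost times hours stopped."""
    return cost_per_hour * downtime_hours


def incident_ratio(incident_count, operating_hours):
    """Incidents per operating hour (an assumed definition)."""
    return incident_count / operating_hours


def sla_compliance(incidents):
    """Fraction of incidents resolved within the expected timeframe."""
    if not incidents:
        return 1.0
    met = sum(1 for _, within_sla in incidents if within_sla)
    return met / len(incidents)


if __name__ == "__main__":
    # Each incident: (downtime in hours, resolved within SLA?). Invented data.
    incidents = [(0.5, True), (2.0, False), (0.25, True)]
    downtime = sum(hours for hours, _ in incidents)

    print("stoppage cost:", stoppage_cost(2000, downtime))
    print("incident ratio:", incident_ratio(len(incidents), 720))
    print("SLA compliance:", sla_compliance(incidents))
```

Tracking these values over time shows whether prevention (fewer incidents) and resolution agility (more incidents closed within the SLA) are actually reducing the cost of stoppage.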
