The outage gremlins: What are they? Where are they?

The European Union Agency for Network and Information Security (ENISA) has published its report on outages within the EU telecom sector during 2015. The agency takes in and collates reports from 21 EU countries on the details of major telephony outages. The main purpose of this exercise is to understand the risks currently affecting the service and to develop ways of mitigating them.

Before I started to read this report, I imagined that I knew what it was going to say. I expected it to confirm the idea that cyber-attack and other such malicious mischief would be the main source of angst amongst telecom providers. After all, the cyber threat is well publicised and very high on most organisations' threat lists. This ever-present menace was very powerfully demonstrated by the “New World Hackers” on 21 October 2016, when they caused Dyn, the DNS provider, to lose services for most of the day, affecting PayPal, Twitter and Facebook, to name but a few. It was demonstrated again in May this year by the ransomware attack that went global, particularly affecting the NHS.

Imagine my surprise, then, when I found that the root causes of 138 outages from 21 countries were as follows: System Failure >68%, Human Error >21%, Natural Phenomena >8% and finally Malicious Actions >2%. These are top-line figures, which the report breaks down into more detailed areas, but the general picture they produce is sufficient for this article.

Another report, published in September 2016, is the Marsh report on global loss trends and the causes of power generation claims. Again, I had expected cyber to play a part in the figures, particularly as many power generators and utility service providers use the Internet of Things to control various elements of their operation, but not so. The top-line figures are as follows: Equipment Failure 42%, Human Error 30%, Other 28% and Malicious Actions 0%.

I am not trying to say that the threat of cyber-attack has gone away; on the contrary, it remains a constant, viable, top-level threat to us all. Our IT strategies no doubt take these threats into consideration and provide appropriate countermeasures.

However, I did wonder whether this breakdown of statistics would hold true for other businesses and service delivery operations that had suffered outages. If so, that knowledge could provide an early opportunity to mitigate some of the roughly 70% of potential service interruptions that are caused by equipment failure, while at the same time adding resilience to the organisation and informing both the business impact analysis (BIA) and the business continuity (BC) plan.

This process could amount to simple actions such as regular servicing of equipment, phased transition into and out of service, early sourcing of long-lead-time equipment, and regular reviews of capability and load sharing. This is not the answer in every case, but it is an area worth visiting. I am aware that many companies like to “sweat the assets”, and so may not see this as a viable proposition. To counter that notion, I suggest that reducing equipment failure by a few percentage points would translate into larger savings and less downtime over the long term.
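To put a rough number on "a few percentage points", here is a minimal Python sketch. It reuses the ENISA top-line figures quoted earlier (138 outages, about 68% attributed to system failure) and assumes that maintenance prevents a given fraction of those failures. The function name and the reduction values are my own illustrations, not something taken from either report.

```python
def outages_avoided(total_outages, cause_share, relative_reduction):
    """Estimate outages avoided by cutting one cause's failures.

    total_outages: outages observed in the period
    cause_share: fraction of outages attributed to the cause (0..1)
    relative_reduction: fraction of that cause's outages prevented (0..1)
    """
    if not (0 <= cause_share <= 1 and 0 <= relative_reduction <= 1):
        raise ValueError("shares and reductions must be between 0 and 1")
    return total_outages * cause_share * relative_reduction


# ENISA top-line figures quoted earlier: 138 outages, about 68% system failure.
# The reduction values below are hypothetical, chosen only for illustration.
for reduction in (0.01, 0.03, 0.05):
    avoided = outages_avoided(138, 0.68, reduction)
    print(f"{reduction:.0%} fewer system failures: about {avoided:.1f} outages avoided")
```

Swapping in your own outage count and cause share gives a first estimate to set against the cost of extra servicing.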

In the same way, whilst it is impossible to eliminate all mistakes, human error can be reduced by timely, adequate and regular training. Again, reducing the outages caused by human interaction by a few percentage points can only be good for the business's income generation and service delivery.

The process that I am describing here does not just apply to the subjects mentioned in the reports; it is an approach to take across all BC activities, and it is particularly useful when looking for risk mitigations in the BIA.

Dave Brailsford, the Team Sky cycling general manager and performance director in 2010, described it as the “aggregation of marginal gains.” He says it is “the 1 percent margin for improvement in everything you do.” He believes that if you can improve every area related to your work, in his case cycling, by 1%, then those small gains will add up to a remarkable improvement. You only have to look at the record of Team Sky for confirmation.
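The arithmetic behind marginal gains can be sketched in a few lines of Python. This is an illustration only: it assumes the areas are independent and that their gains multiply together, which is my simplification rather than anything Brailsford has stated, and the function name is hypothetical.

```python
def aggregate_gain(per_area_gain, areas):
    """Combined improvement factor when each of `areas` independent areas
    improves by `per_area_gain` and the gains multiply together."""
    return (1 + per_area_gain) ** areas


# Assumption: every area improves by the same 1%.
for areas in (1, 10, 50):
    factor = aggregate_gain(0.01, areas)
    print(f"{areas} areas at 1% each: {factor - 1:.1%} overall")
```

Under that assumption, the overall improvement grows a little faster than the simple sum of the individual gains as the number of areas increases.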

If your organisation collects this type of data, I would encourage you to examine it and mine it for these small gains. You only need a 1% improvement here and there to produce benefits.