Operational Resilience and You
We are joined by Zog Gibbens, an Enterprise Architect with a strong retail background. Zog has a wide range of experience as a technologist as well as an architect. We asked Zog to share his thoughts on Operational Resilience. It is a subject close to many hearts, especially in regulated sectors such as Financial Services that are subject to special scrutiny in this area.
What is Operational Resilience to you?
In short, operational resilience is ensuring the business service can continue to operate. Through a technical lens, this could mean systems that you can fail over to in order to ensure continuity. From a practical perspective, this could mean backup solutions on the front line. I remember when we had manual credit card imprinter machines in stores to take card payments if our Point of Sale or network failed us. I haven’t seen one for over 20 years now.
In IT, we think about the IT systems that need to be resilient. However, with the world as it is now, those outside IT possibly think about Operational Resilience even more than IT professionals do. Think about the impact on our schools: their operational resilience is to revert to online teaching, which was not even a consideration just a few years ago. We also have human operational resilience, whether that is on a grand scale, with the pop-up distribution centres we saw during the pandemic, or simply putting a gazebo over a BBQ because we don’t trust the weather.
You have lots of experience in the retail sector. Given the customer-facing demands, do channels and systems need to be highly resilient?
In retail specifically, the margins are tight, and in some sectors even tighter since the pandemic began, with rising prices, supply chain issues, Brexit changes etc. It’s a tough sector, and as a business you must be careful where you put your money, and therefore how much you spend on Operational Resilience. At the end of the day, it is the front line (store and website) that is key: it must have the ability to execute a sale.
A store must be able to open its doors and take a payment. Obviously it also needs stock on site, and therefore has broad dependencies on people (sickness, experience, employees leaving retail etc.), transactions, stock replenishment, supply chain etc. Online sales differ slightly: you need your website, payment service and order queue service for fulfilment, but you are less dependent on stock replenishment, supply chain etc. Even so, the consumer must receive their goods in a reasonable time or you will lose them to a competitor in the future.
What is your view on the key things to consider when facing Operational Resilience challenges and solutions?
Through the lens of an IT Architect, we have seen a considerable shift in recent years. For operational systems, we have the transition from monolithic IT systems to microservices, enabling us to focus or scale our resilience on our most critical microservices and spend less on the less critical ones. Alongside this, we have seen significant growth in cloud adoption and SaaS services that don’t require so much of our attention for resilience, although resilience does still need to be considered during vendor and product selection.
A big risk I have seen, however, is the marketing from the cloud vendors claiming that they provide multisite resilience, failover sites etc.
Whilst this is true, there is additional setup and configuration: additional services and network routing to be set up, failover testing, data synchronisation, egress/ingress bandwidth. The list goes on and the cost increases. So whilst a business executive is led to believe their chosen cloud vendor has all of this covered, it is down to that organisation’s IT team to inform the business that this doesn’t come out of the box for free, and that the work and costs require funding and scheduling.
Are there any common factors or themes that you have seen in terms of poor architecture or technology that contribute to failure?
To be honest, no, not really. Operational Resilience is about risk. It is about the team architecting or building the services being transparent with the business on the level of resilience being built in, and articulating the service levels in a way that is understandable to the people consuming that information and making choices based on it.
The business must decide between the cost of resilience and the risk to its operation. There are many organisations still running on a mainframe without resilience other than backups; however, they probably see the risk as low because they’ve not had an issue in the last 10 years. Other scenarios may involve backend services, perhaps supporting marketing, campaigns etc. Whilst the organisation may tolerate the service being down for days, there will be an impact on sales at some point. The business must weigh up the risk of that service going down, the duration of the outage and the cost of recovery against the cost of adding resilience (upgrades, standbys, backups etc.) and when there is a safe time to make those changes.
The real-life problems I have experienced haven’t been a result of IT, but of location or weather, such as flooding encroaching on the data centre, or a leak in the roof dripping onto a SAN cable. There is not much you can do as an IT architect, other than warn of the risk.
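As a rough illustration of that weighing-up, the trade-off can be sketched as simple arithmetic in Python. This is only a minimal sketch, not a method Zog describes: the names (ServiceRisk, expected_annual_loss, resilience_pays_off), the idea of a "residual fraction" of loss remaining after adding resilience, and every figure are hypothetical assumptions chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class ServiceRisk:
    """Hypothetical inputs for one service; every figure is illustrative."""
    outages_per_year: float          # expected number of outages in a year
    outage_days: float               # expected duration of each outage, in days
    loss_per_day: float              # estimated lost sales per day of outage
    recovery_cost: float             # one-off cost to recover from each outage
    resilience_cost_per_year: float  # annual cost of standbys, upgrades, backups
    residual_fraction: float         # share of the loss still expected with resilience


def expected_annual_loss(risk: ServiceRisk) -> float:
    """Expected yearly cost of outages if no resilience is added."""
    per_outage = risk.outage_days * risk.loss_per_day + risk.recovery_cost
    return risk.outages_per_year * per_outage


def resilience_pays_off(risk: ServiceRisk) -> bool:
    """True when paying for resilience costs less than accepting the risk."""
    loss_without = expected_annual_loss(risk)
    loss_with = (loss_without * risk.residual_fraction
                 + risk.resilience_cost_per_year)
    return loss_with < loss_without


# Assumed example: a marketing backend that can be down for days.
marketing = ServiceRisk(0.5, 3, 20_000, 10_000, 60_000, 0.1)
# Assumed example: in-store payments, where every hour of outage hurts.
payments = ServiceRisk(2, 0.5, 400_000, 5_000, 150_000, 0.1)

for name, risk in [("marketing", marketing), ("payments", payments)]:
    print(name, expected_annual_loss(risk), resilience_pays_off(risk))
```

With these assumed numbers, the tolerated marketing outage is cheaper to accept than to engineer away, while the payment service justifies the spend. A real assessment would also need to factor in reputational impact and the timing of changes, which this sketch ignores.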
Are we moving to a world where we should take Ops resilience more seriously?
The organisations I have worked with have all taken Operational Resilience seriously, and I have always taken care to ensure transparency on the resilience of the service.
The risk is the perceived simplicity: the belief that you can spin up a replicated service rapidly. Whilst that might be the case with mature DevOps, InfraOps, SecOps, MLOps etc., it isn’t the case for the majority of well-established organisations. Counterbalancing that, with the evolution of cloud services it is far easier, and more cost effective, for an organisation to have several environments enabling different types of testing, such as regression, non-functional, QA, UAT, Prod-Copy etc., to support a significantly more robust path to production for new feature releases.
Another angle to consider is the growth in Citizen IT, and the capabilities provided to non-IT teams with self-service solutions available through Office 365, Power Suite and similar services. This has moved on from the constraints we had with Access Databases to quite powerful services that an organisation could rapidly become dependent upon, but without the necessary protection for sensitive data, or against the accidental deletion of a key file etc. It will be interesting to see how IT changes as self-service and democratised IT grows.
Zog reminds us that these are his personal views and don’t necessarily reflect those of his current employer, previous employers or clients.