Lessons from Personal Resilience
Recently I experienced a resilience failure, but at a personal level. I was running a marathon in France, and a bit over 30 km in I suffered heatstroke and required medical attention. As with any serious incident, I afterwards reflected on what had happened, what led up to it, and what I could do differently in the future. But how is any of this relevant to my work in Operational Resilience?
Lack of Preparation
In hindsight, I had not done sufficient training and preparation for the marathon. While I had been doing lots of training runs, I started too late and hadn’t done enough long-distance runs.
When considering Operational Resilience, we need to start as early as possible. It needs to be part of the design process right from the start, not an afterthought bolted on at the end. Throughout your design process, are you considering how your choices will affect your resilience? Are you continually testing and adapting to the results?
Having worked in technology for a while now, I have seen the threat models we face change drastically. We're no longer just designing systems to keep running when a disgruntled electrician turns off the data centre; we need to consider how our designs hold up against active, malicious attackers. While our cyber colleagues do great work with controls and tools to protect us, how do we cope when those fail? Many organisations do an excellent job of eliminating single points of failure and providing high availability, but how well can they cope with data corruption replicated through numerous, complex systems?
As we design our systems and processes we need to adopt an adversarial mindset and constantly attack our designs and choices. This constant testing early on can help identify potential weaknesses when they are much easier to remediate.
Monitoring Failures
Monitoring failures came up in two areas: early warning of potential issues, and identification of an incident as it occurs. A few days before the marathon I had felt the mild onset of a possible cold. I rested and tried to take care of myself, but didn't think much more of it. Talking to the friends I was with after the incident, a number commented that they thought I hadn't looked well in the days before, and on the day of, the marathon. Monitoring failure number one was not acting on the early warning signs of problems and adapting my approach.
The same applies to systems. Long-term trend monitoring can highlight potential future issues which, left unchecked, could cause operational impacts. One example is resource utilisation growing disproportionately compared to user or transaction growth. Do you have a view across the systems that make up an end-to-end service? Do you understand their interactions, and how issues in one area may cause impacts elsewhere? What are the early warning signs to catch problems before they impact your services?
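As a minimal sketch of the trend idea above, the check below flags a service whose cost-per-transaction is creeping upwards week over week. The metric names, figures and 10% threshold are all illustrative assumptions, not taken from any particular monitoring stack:

```python
# Illustrative sketch: flag a service whose resource usage is growing
# faster than its transaction volume over a trend window.
# All metric values and the threshold are hypothetical examples.

def utilisation_trend(cpu_by_week, transactions_by_week):
    """Return week-over-week growth factors of CPU-per-transaction."""
    ratios = [cpu / tx for cpu, tx in zip(cpu_by_week, transactions_by_week)]
    return [later / earlier for earlier, later in zip(ratios, ratios[1:])]

def early_warning(cpu_by_week, transactions_by_week, threshold=1.10):
    """Warn if cost-per-transaction grows more than 10% in any week."""
    return any(g > threshold
               for g in utilisation_trend(cpu_by_week, transactions_by_week))

# Example: transactions grow steadily, but CPU grows faster still.
cpu = [100, 115, 140, 180]     # CPU hours per week (hypothetical)
tx = [1000, 1100, 1200, 1300]  # transactions per week (hypothetical)
print(early_warning(cpu, tx))  # True: usage is outpacing demand
```

In practice this kind of check would sit on top of your existing metrics store, but the principle is the same: trend the ratio, not the raw number, so growth in demand doesn't mask growth in cost.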
The second monitoring failure was during the marathon itself. I knew I wasn't feeling well, so I slowed down and made sure I was drinking plenty of water. I worked out that I still had time to finish if I took it easier and plodded along. Having never experienced heatstroke before, it wasn't something I considered, and I didn't think to ask my friends. In future I'll be looking out for different symptoms and paying closer attention to how I'm feeling.
Monitoring of our systems and processes during an incident is critical to being able to rapidly understand what is impacted, and work to identify mitigations and the root cause. With complex, interconnected systems it can be challenging to separate out the relevant information from the noise. Do you understand the key metrics that are important to delivery of your services? What information will help speed and guide your incident response? The faster you can move into fix mode from triage mode, the better.
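To make the signal-versus-noise point concrete, here is a small sketch of rolling a handful of key service indicators into a single triage view, so responders see only what is breached. The metric names and thresholds are hypothetical examples, not a real service definition:

```python
# Hypothetical sketch: surface only the key metrics that are currently
# outside their thresholds, so triage starts from signal, not noise.
# Metric names and limits below are illustrative assumptions.

KEY_METRICS = {
    "checkout_error_rate": 0.01,  # max acceptable fraction of failed requests
    "p99_latency_ms": 800,        # max acceptable 99th-percentile latency
    "queue_depth": 5000,          # max acceptable pending work items
}

def triage_view(observations):
    """Return only the metrics currently above their threshold."""
    return {name: value
            for name, value in observations.items()
            if name in KEY_METRICS and value > KEY_METRICS[name]}

obs = {"checkout_error_rate": 0.04, "p99_latency_ms": 450, "queue_depth": 12000}
print(triage_view(obs))  # only the two breached metrics surface
```

The design choice here is deliberate: defining the handful of metrics that matter to the service *before* an incident is what makes the filtered view trustworthy at 3 a.m.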
Learning from Incidents
Incidents themselves are a key area to learn from. Much as we want to believe our designs are perfect, reality will always find new ways to challenge us. When we do experience incidents, we must take the time to learn from them. That may mean future design changes; it may mean additional or more focused monitoring. The key is to take every opportunity to learn and improve our designs.
People
When assessing our Operational Resilience we must always consider the people involved. People can be single points of failure too. How do we handle their unavailability? In this instance, I wasn't critical to any projects or activities. I used to refer to this as the "big red bus" problem, until I met someone who had actually been hit by a bus, and whose project lost a critical lead for several months. Thankfully, they recovered well and were able to share the story themselves.
The other people aspect is during an active incident. Severe incidents impacting multiple areas will take an extended period to resolve. Even once basic service has been restored, there are normally significant clean-up and remedial activities needed. Do you have plans in place for how your teams will handle incidents lasting 24, 48 hours or longer? This isn't just your incident management and technical teams; also consider the teams that run your business processes and how they will be impacted. Are they part of your recovery? Will an incident generate large processing backlogs that teams will need to scale up, or work extended hours, to clear?
This wasn’t my first marathon, but I still had a lot to learn. Thankfully, quick, effective medical care followed by support from family and friends means this is nothing worse than an unpleasant memory.
We should always learn from our experiences and think about how those learnings can be applied in the future, personally and professionally.
Enterprise Blueprints is a specialist IT Strategy and Architecture consultancy helping clients create business value by solving complex IT problems. If you would like to discuss how we can help you to advance your platform thinking, bolster your operational resilience, accelerate your cloud migration programme, drive out costs from your legacy estate, or accelerate your digital transformation then please contact [email protected]