On December 13th, the team working on a large-scale mobile application project gathered in our Montreal office to tackle a terrible catastrophe (fortunately, a fictitious one). The scenario was as follows:
In 5 hours, our client is set to launch its biggest campaign of the decade, no less! But disaster strikes: AWS services are down across Canada. After an urgent privacy impact assessment (a PIA; don’t forget Law 25), the decision is made to migrate the entire infrastructure to the United States.
Although this scenario is far-fetched, the exercise itself is incredibly valuable. In fact, such simulations are increasingly common in the industry. In the rest of this article, we’ll discuss the various benefits and share some tips to help you get the most out of these exercises.
Choosing the Right Disaster
Every system is unique, and the scenario you choose should be tailored to its specific characteristics. A scenario that is too challenging can be demoralizing, while one that is too simple might add little value. Beyond the technological maturity of the project, a critical factor in selecting a disaster scenario is the team composition. If the original creators of the system are no longer part of the team, it’s often necessary to lower the difficulty of the scenario.
It’s crucial to remember that the primary goal is to uncover the blind spots in the project.
Taking Notes
Many organizations have disaster recovery policies that have never been tested. The “game day” presents an ideal opportunity to put them to the test. Throughout the exercise, it’s essential to document pain points and necessary fixes. The aim of the exercise isn’t to solve every problem on the spot but to identify and document them so they can be addressed afterward.
Finally, taking notes only adds value if it leads to concrete actions. It’s vital to schedule a post-mortem session in the days following the exercise to calmly analyze the results of the simulation. Ideally, tangible actions related to the project and its policies should be incorporated into the next sprint.
Balancing Speed and Learning
Even if the scenario includes a time constraint, it’s important to make time for learning. For many team members, such an exercise is a unique opportunity to understand how certain system components work or to familiarize themselves with less common concepts in a developer’s daily routine. For instance, during our last exercise, a team member had the chance to grasp the full process of configuring DNS records.
Some moderation may be needed if someone becomes overly competitive and prioritizes speed alone. Conversely, adding some pressure can be beneficial if the team isn’t taking the exercise seriously enough. One of the main objectives remains testing the team’s resilience in difficult situations.
Conclusion
For a relatively low cost, disaster recovery exercises are an extremely valuable tool for any organization that values operational excellence. They help test the team’s resilience under stress, uncover and address blind spots in the project and internal policies, and preserve institutional knowledge that might otherwise fade over time. For these reasons, we’ve decided to include these exercises as an option in the maintenance plans we offer to our clients.
For those curious, the team managed to recreate the environment in 3 hours and 41 minutes, which (barring another catastrophe this year) keeps us compliant with our 99.95% SLO. This success was made possible by modern technologies such as ECS/Fargate and CloudFormation, along with a robust database backup strategy.
The exercise was still highly relevant! We identified poorly documented Lambda functions, an outdated SQS queue, and several other opportunities for improvement, which we will address in the coming weeks.
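For readers who want to check the SLO arithmetic, here is a minimal sketch of how a recovery time of 3 hours and 41 minutes compares with a 99.95% availability target. It assumes the SLO is measured over a 365-day year and that the whole recovery counts as downtime; both are assumptions for illustration, and the function name downtime_budget is hypothetical.

```python
from datetime import timedelta


def downtime_budget(slo_percent: float, window: timedelta) -> timedelta:
    """Return the maximum downtime an availability SLO allows over a window."""
    return window * (1 - slo_percent / 100)


# Assumption: the SLO window is a 365-day year.
YEAR = timedelta(days=365)

budget = downtime_budget(99.95, YEAR)
# Recovery time reported for the exercise, assumed to count fully as downtime.
recovery = timedelta(hours=3, minutes=41)

print(f"Yearly downtime budget: {budget}")
print(f"Recovery time:          {recovery}")
print(f"Budget left:            {budget - recovery}")
print(f"Within SLO:             {recovery <= budget}")
```

Under these assumptions the yearly budget comes to roughly 4 hours and 22 minutes, so a single recovery of this length fits, with about 40 minutes to spare. A shorter measurement window, such as a month, would give a much smaller budget.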