How a tech startup masters Disaster Recovery


The technology news has recently seen a few Disaster Recovery (DR) stories that remind us, as both consumers and professionals, how much confidence we need to have in the services we provide to you.

The expectations for digital products have never been higher -- we all want intuitive insights and functions that are always on, backed by vast data sources, with high performance across multiple devices.

For any technology company there needs to be a laser focus on the service and the data within that service. I regularly ask myself:

  • Is the service stable, and operating within its expectations?
  • Is the service performant and able to serve its clients?
  • Is the data secure?
  • Is the data backed up?

And further…  if we lose the whole stack -- could we re-create it? (really quickly!)

DR Fire Drills

As part of a rolling calendar of 'fire drills', the whole DevOps team at E Fundamentals run these failure scenarios, ensuring that we have a common understanding of the service and all the elements required to provide it. By taking down each part of the service in turn, we practice, learn and improve.

Like many cloud products, our service is made up of a large number of interconnecting parts. In practicing system-loss we spread knowledge among the team, we tune our ability to spot failures and we instinctively follow the detailed recovery plans that are in place.

Each time we run the process, we not only spread learning across the wider team but also discover what doesn’t work, or has stopped working. Our natural pace of development means that our platform is always changing, and our ability to restore services by script needs to evolve in step; a minimal sketch of what that looks like follows below.
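To make 'restore by script' a little more concrete, here is a minimal sketch of the kind of drill runner we have in mind: it replays each restore step in order and records how long each one takes. The step names and commands are purely hypothetical placeholders, not our actual tooling.

    import subprocess
    import time

    # Hypothetical restore steps. In reality these would be the scripted
    # commands from the recovery runbook, kept in version control alongside
    # the platform code so they evolve with it.
    RESTORE_STEPS = [
        ("provision infrastructure", ["echo", "provision"]),
        ("restore database backup",  ["echo", "restore latest backup"]),
        ("redeploy services",        ["echo", "deploy all services"]),
        ("run smoke tests",          ["echo", "smoke tests"]),
    ]

    def run_drill():
        """Run each restore step in order, timing it, and stop on failure."""
        for name, command in RESTORE_STEPS:
            started = time.monotonic()
            result = subprocess.run(command)
            elapsed = time.monotonic() - started
            status = "OK" if result.returncode == 0 else "FAILED"
            print(f"{name}: {status} in {elapsed:.1f}s")
            if result.returncode != 0:
                break

    if __name__ == "__main__":
        run_drill()

The point of keeping it this simple is that the drill itself then exercises exactly the same scripts you would reach for in a real incident.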

Finally, we look to improve. We benchmark against two core values -- the Recovery Time Objective and the Recovery Point Objective (RTO and RPO). In basic terms that means: how quickly can you restore how much of the service? In product terms, it means how long the service is unavailable for and, following a failure, how much data will be restored to the re-created system. A pioneering DevOps team’s target will be: the same day, and everything (all the data used by your clients). And whilst all this is going on behind the scenes, we also set up processes to ensure that data is still accurate and accessible for our clients during the drill.
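To put rough numbers on RTO and RPO, here is a minimal sketch with entirely made-up timestamps: the recovery time achieved is simply how long the service was down, and the recovery point achieved is how much history sat between the last good backup and the failure.

    from datetime import datetime

    # Hypothetical incident timeline; illustrative values only.
    last_good_backup = datetime(2017, 3, 1, 2, 0)    # last successful backup
    failure          = datetime(2017, 3, 1, 9, 30)   # the stack is lost
    service_restored = datetime(2017, 3, 1, 10, 10)  # clients can use the service again

    # Recovery time achieved: how long the service was unavailable.
    recovery_time = service_restored - failure

    # Recovery point achieved: how much data written after the last
    # recoverable backup is at risk of being lost.
    recovery_point = failure - last_good_backup

    print(f"Downtime (measure against RTO): {recovery_time}")        # 0:40:00
    print(f"Data-loss window (measure against RPO): {recovery_point}")  # 7:30:00

Each drill then becomes a measurement: compare the achieved figures against the objectives and work on whichever gap is larger.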

Practice Makes Perfect

In the last fire drill (March 2017), we deleted and restored both the core back-end data-gathering system and the client reports/dashboards system. Running this across environments (test and production systems), we made it an all-hands, immersive experience, with process documents open, whiteboards ready and, of course, the stopwatch. In a rolling ownership model, one developer was shadowed by the rest while following a detailed procedure step by step. Not only did we rewrite the document as we went, but we also sought to increase automation at every step, updating scripts along the way.

It was pleasing not only to see a positive result, but also to see restoration times fall with each iteration (through environments). Our first run of over two hours was reduced to 40 minutes for our data-gathering system (the one that creates our daily insights across thousands of products).

Amazon's Tech Failure

Within the same week we heard two stories of DR process failure -- Amazon S3 and GitLab -- that act as a reminder to be prepared at all times, to the best of your abilities. In both cases great products were temporarily lost not through any failing in the underlying brilliance of the product and platform, but through human error and a slip in process; a typo brought down the whole system.

We also learned a thing or two about the boundaries of our own technology stack, which reach far beyond our 'walls'. During one cycle, we overlapped with the Amazon S3 outage -- but we mainly use Google Cloud, so that’s okay, right? Well, not if parts of your process depend on Amazon S3, which may well include third-party dependency services and Docker container storage.
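One practical habit that falls out of this (sketched below with entirely hypothetical entries) is keeping an explicit inventory of which external provider each part of your stack ultimately sits on, so that when 'they' go down you know straight away whether you do too.

    # A minimal sketch of an external-dependency inventory; the entries are
    # hypothetical examples, not our actual stack.
    DEPENDENCIES = {
        "docker-registry":  "aws-s3",
        "asset-storage":    "aws-s3",
        "primary-database": "google-cloud",
        "compute-cluster":  "google-cloud",
    }

    def affected_by(provider: str) -> list[str]:
        """Return the parts of the stack that fail if this provider is down."""
        return [name for name, backing in DEPENDENCIES.items() if backing == provider]

    print(affected_by("aws-s3"))  # ['docker-registry', 'asset-storage']

Even a table this crude is enough to turn "we mainly use Google Cloud, so we're fine" into a question you can actually answer.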

The exciting world of cloud makes many things possible, but it also creates a web of dependencies that you need to be well aware of. So, if 'they' are down, so are you...

DR and system protection may not be the most exciting part of the product creation process -- but always ask of your product and your team: what if we lose the whole stack? And whilst you’re rallying around that cause, do remember what the two services mentioned above did brilliantly during their experiences -- communicate, communicate, communicate. A void is always filled with pessimism, and transparency is the natural antidote.

Are you prepared? The clock is ticking...


Adrian Butter is the CTO of E Fundamentals, an eCommerce analytics software provider for the enterprise. He has a background in technology, consulting, product design and programme delivery at Accenture and Deloitte, and now leads a team of in-house developers to deliver world-class technology to global brand owners.
