
Disaster Recovery Part 2: Common Failures & Predictable Recovery

Albert Ahdoot

This is part two of two in our series on Disaster Recovery (you can find part one here). It covers common failures and predictable disaster recovery methods.

A key component of a Disaster Recovery plan is incorporating processes for different failure modes. Implementing a plan that only covers serious infrastructure failures would be overkill if the issue were merely a local disk failure. Below, we get into the specifics of how redundancy can protect your system against system failures, infrastructure failures and natural disasters.

System Failures


System failures are the most common and tend to be the easiest and cheapest to prepare for. These should be your first concern with your Disaster Recovery plan. Such failures cover the components of a server including disk failure, power supply failure, RAM failure, etc.

Protecting your system against component failure can be as simple as keeping a backup hardware component that your system fails over to, or as complex as multiple disk images spread across multiple servers with backup cooling and power, ensuring redundancy across the entire system.

Infrastructure Failures

A business doesn’t have to be very large to need a system more secure than one operated out of an office closet in the company building. A severe power outage can last many days or weeks, and an office building’s backup systems are rarely prepared to operate for that long.

This is where colocation and dedicated data centers come in. A dedicated data center will have expert staff on board to handle all the different aspects of a fully redundant system. A Tier-3 comparable colocation facility, such as those operated by Colocation America, will have redundant (N+1 / 2N) UPS systems and redundant cooling (N+1 CRAC Units) systems that are connected to diesel generators that are banked using Automatic Transfer Switches (ATSs). The ATSs provide redundancy against generator malfunction or failure.
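To make the N+1 and 2N notation concrete, here is a minimal sketch of how many units each scheme calls for. N is the smallest number of units that can carry the load, N+1 adds one spare, and 2N duplicates the whole set. The function name, the 400 kW load and the 150 kW module size are hypothetical figures chosen for illustration, not specifications of any particular facility.

```python
import math


def units_required(load_kw: float, unit_capacity_kw: float, scheme: str) -> int:
    """Return the number of units a redundancy scheme calls for.

    N is the minimum number of units needed to carry the load.
    "N+1" adds one spare unit; "2N" doubles the whole set.
    """
    if load_kw <= 0 or unit_capacity_kw <= 0:
        raise ValueError("load and unit capacity must be positive")
    n = math.ceil(load_kw / unit_capacity_kw)
    if scheme == "N":
        return n
    if scheme == "N+1":
        return n + 1
    if scheme == "2N":
        return 2 * n
    raise ValueError(f"unknown scheme: {scheme}")


# Hypothetical figures: a 400 kW load served by 150 kW UPS modules.
for scheme in ("N", "N+1", "2N"):
    print(scheme, units_required(400, 150, scheme))
```

With these made-up numbers, three modules carry the load, N+1 needs four, and 2N needs six, which shows why 2N costs noticeably more than N+1 for the same load.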


Automatic transfer switches

Furthermore, good colocation facilities will also ensure that they have adequate fuel to last for at least seven days, and have preferred fuel supply contracts to ensure adequate fuel supplies past seven days.
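The seven-day figure translates into a simple fuel calculation. The sketch below assumes a constant burn rate, which is a simplification since real consumption varies with generator load. The 100 litres per hour rate and both function names are hypothetical.

```python
HOURS_PER_DAY = 24


def fuel_needed_litres(burn_rate_lph: float, days: float) -> float:
    """Fuel needed to run a generator for `days` at a constant burn rate."""
    return burn_rate_lph * HOURS_PER_DAY * days


def runtime_days(tank_litres: float, burn_rate_lph: float) -> float:
    """Days of runtime a given amount of stored fuel provides."""
    return tank_litres / (burn_rate_lph * HOURS_PER_DAY)


# Hypothetical generator burning 100 litres of diesel per hour.
needed = fuel_needed_litres(burn_rate_lph=100, days=7)
print(f"Fuel for seven days: {needed:.0f} litres")
print(f"Runtime on a 5,000 litre tank: {runtime_days(5000, 100):.1f} days")
```

Running the same numbers against an office building's on-site tank is a quick way to see how far it falls short of a multi-day outage.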

Clearly, it is not possible for every business to maintain this level of redundancy at an affordable cost. Colocation facilities can amortize the cost of maintaining such systems due to the scale of the operation and the fractional cost of the systems spread over many customers.

It’s also often cheaper than many businesses realize, especially considering the level of protection a data center can provide over an in-house setup. A data center’s job is to make sure your servers are running 24/7.

Another consideration besides power is Internet connectivity or bandwidth. The best colocation facilities make use of Internet backbone connections to ensure that connectivity problems have limited impact—if they ever arise.

If handled correctly, colocation is one of the most powerful steps your business can take to compete with rivals whose in-house network capabilities might exceed your own.

Natural Disasters

For larger companies, or companies that require a higher level of redundancy than described above, distributing your systems across data centers in different geographies can be a necessary option. As an example, during Superstorm Sandy, a data center in New York experienced issues so serious that it required a bucket brigade formed by customers and their staff to bring fuel up the 17 flights to their backup generator (full story here).


This is not something any serious business would want to deal with, so spreading your mission critical data and compute systems across a variety of geographic profiles becomes the required option to ensure that they keep running.

Some examples of issues that one may encounter with a single data center approach:

  • Due to a severe thunderstorm, snowstorm, or even a tornado, power goes out at a data center and backup power systems may or may not work
  • With extended outages, backup power systems run out of fuel and supply lines for emergency fuel delivery are also disrupted
  • A torrential downpour floods a data center located in a flood zone
  • A severe heat wave hits and cooling systems break down
  • An earthquake hits and cabinets are knocked over and seriously damaged

Or an entire utility fails, as was the case in March 2011 when a tsunami seriously damaged the Fukushima nuclear plant in Japan. The point is that these disasters are not uncommon, and it is prudent to be prepared for them.

A solid DR strategy that incorporates multiple geographies that are quite far apart can help mitigate downtime caused by natural disasters. (A DR strategy that pairs a data center in San Francisco with another one in Oakland, for example, is not viable.) A good managed service provider with a solid DR service can help you set one up.
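One rough way to sanity-check whether two sites are "quite far apart" is to compute the great-circle distance between them. The sketch below uses the haversine formula with approximate city-center coordinates. The 500 km minimum separation is a hypothetical threshold chosen for illustration, not an industry rule, and distance alone does not capture shared fault lines, flood plains or power grids.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius
MIN_SEPARATION_KM = 500.0  # hypothetical threshold, not an industry standard


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def far_enough(site_a: tuple, site_b: tuple) -> bool:
    """True if two (lat, lon) sites exceed the assumed minimum separation."""
    return haversine_km(*site_a, *site_b) >= MIN_SEPARATION_KM


# Approximate city-center coordinates (latitude, longitude).
san_francisco = (37.7749, -122.4194)
oakland = (37.8044, -122.2712)
new_york = (40.7128, -74.0060)

print("SF / Oakland far enough:", far_enough(san_francisco, oakland))
print("SF / New York far enough:", far_enough(san_francisco, new_york))
```

San Francisco and Oakland sit only a few kilometres apart across the bay, so the pair fails the check, while a cross-country pairing passes it easily.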

Ultimate plan “Just short of the Moon”

The ultimate Disaster Recovery plan would incorporate a worldwide distributed system with data, compute and power redundant across multiple geographic profiles in different countries across the planet. Basically a data center everywhere except the Moon.

data center in a chapel

This would probably include data centers such as this underground “fit for a James Bond villain” model. This may seem like a laughable proposition, but underground data centers are actually more common than you might suspect.

Risk/Cost Analysis of Disaster Recovery

Having reviewed how redundancy can protect your system against system failures, infrastructure failures and natural disasters, you can now perform a Business Impact or risk/cost analysis, as discussed in our first blog post in the series.

One of the simplest things that you can do is relocate your systems to a Tier-3 comparable data center and protect yourself from infrastructure failures. Following that, if a colocation provider can also provide other managed or cloud services, you can then have a conversation with them about disaster recovery, tape backup or other managed hosting options that can give you additional resiliency and improve your uptime.

The world’s most meticulous planning for disasters striking your IT systems won’t mean much if you can’t reliably recover your systems once disaster strikes. If you’ve been following along with our series on Disaster Recovery, you’ve already done your Business Impact Analysis (BIA). If not, hop over to part one of the series as the BIA is going to be a critical part of your recovery plan.

Business Impact Analysis (BIA)

Your BIA will act as a guide to which systems are considered high priority when implementing your Disaster Recovery plan. The most vital systems, naturally, should be addressed first. They will also be the first to be tested when you begin to develop your disaster recovery plan, and down the road as you implement a testing schedule to ensure your plan always works in case of a disaster.
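As an illustration of using BIA output to decide what to recover and test first, the sketch below sorts systems by estimated hourly downtime cost, breaking ties by the shorter recovery time objective (RTO). The field names, the ranking rule and every figure are assumptions made for this example. Your own BIA may weigh other factors.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class System:
    name: str
    downtime_cost_per_hour: float  # hypothetical BIA estimate
    rto_hours: float               # hypothetical recovery time objective


def recovery_order(systems: list) -> list:
    """Most costly outage first; ties go to the tighter RTO."""
    return sorted(systems, key=lambda s: (-s.downtime_cost_per_hour, s.rto_hours))


# Made-up inventory for illustration only.
inventory = [
    System("internal wiki", 50, 72),
    System("order database", 5000, 1),
    System("email", 800, 4),
    System("payment gateway", 5000, 0.5),
]

for rank, system in enumerate(recovery_order(inventory), start=1):
    print(rank, system.name)
```

The resulting list doubles as the order in which systems get exercised during DR tests.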

Testing Your Disaster Recovery Plan

Testing is another integral part of your DR planning—without it, how will you know if the systems you’ve put in place for recovery actually work? How will you know whether or not you’ve accounted for all of the issues you might encounter? Will you be sure you included the most efficient processes for recovery?


Considering how many possible problems you could encounter that would prevent your system from operating as planned, it’s best to view your initial plan as the starting point, and regularly add improvements to the overall strategy.

With all of this in mind, it’s hard to believe that some businesses don’t test their Disaster Recovery plans. Yet, according to SearchDataCenter, most businesses actually don’t test out their systems, whether it’s the whole plan or just parts of it.

Besides making sure that your DR plan actually works, testing will also help you find ways to make the execution of your DR plan more efficient. For example, as you test out the system, you may find a way to recover your company’s email faster than your original method.

Additionally, it’ll allow you to discover areas your initial plan didn’t cover, reinforcing the need to continually maintain and update your DR plan as new needs arise.

For your IT team, testing will allow them to demonstrate your organization’s ability to recover from a disaster, something that will be very important from management’s perspective, and may even be a requirement if you work in a regulated industry.

Furthermore, the initial test of your system is only the first step: your DR plan should incorporate a schedule to continually test the recovery of your systems. A regular testing schedule allows you to continue to identify inefficiencies and pieces missing from your original plan. Adding a new application to your workflow may introduce a new variable that you need to consider when performing recovery. You can account for it as you review your DR run-book with your DR partner.

The most important reason, though, for setting up a testing schedule is because things will always change. Your IT systems will never stand still as your business grows and technology improves. What worked for recovery one quarter may not work the next as your critical systems are improved and scaled to handle increased demand.

A testing schedule will provide training for your entire team—it’s easy to forget or become inefficient at a process you only perform once a year. Also, as new team members join and former members leave, it becomes vital that new ones are properly trained in DR procedures so that your company’s DR expertise doesn’t leave with the employee.

While there are various types of tests you can run, the following are the most common:

  • Run-Book (Checklist) Tests: A set of run-book items the Managed DR team and the corporation’s internal IT team are supposed to follow in the event of a disaster. A periodic review of the run-book may uncover some inconsistencies or modifications that may be required if an internal process, application, or software has changed. An actual DR drill is not required in this case. This test can be conducted every quarter with the Managed DR Services partner.
  • Simulation Tests: Used to simulate possible scenarios that would require DR. It can be a scenario where a particular set of applications go down, or a scenario where the whole IT staff has taken ill (for example if the IT staff all go out to a restaurant the previous night and have succumbed to food poisoning or a hepatitis outbreak), and the Managed DR partner is tasked with supplementing the basic IT functions. The idea of a Simulation Test is to simulate the actual disaster and exercise the portion of the checklist that has been chosen for that scenario. This test can be conducted every 6 months with the Managed DR Services partner.
  • Full Interruption Tests: Full-scale Business Continuity exercises that assume all systems have gone down at the primary site. This is the most comprehensive DR test for all systems and provides assurance that the business can continue to function, possibly in a degraded manner.
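The test cadences above can be tracked with a small scheduling helper. The sketch below uses the quarterly and six-month intervals given for run-book and simulation tests. The 12-month interval for full interruption tests is an assumption, since no frequency is stated for them, and all function names are hypothetical.

```python
import calendar
from datetime import date

TEST_INTERVAL_MONTHS = {
    "run-book": 3,            # every quarter
    "simulation": 6,          # every 6 months
    "full interruption": 12,  # assumption: no interval is given in the text
}


def add_months(d: date, months: int) -> date:
    """Shift a date forward by whole months, clamping to the month's last day."""
    month_index = d.month - 1 + months
    year = d.year + month_index // 12
    month = month_index % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def next_due(last_run: date, test_type: str) -> date:
    """Date by which the next test of this type should be run."""
    return add_months(last_run, TEST_INTERVAL_MONTHS[test_type])


def overdue(last_runs: dict, today: date) -> list:
    """Test types whose next due date has already passed."""
    return sorted(t for t, last in last_runs.items() if next_due(last, t) < today)


# Hypothetical history of the last completed tests.
history = {
    "run-book": date(2024, 1, 15),
    "simulation": date(2024, 1, 15),
    "full interruption": date(2023, 3, 1),
}
print(overdue(history, today=date(2024, 5, 1)))
```

A check like this could run as part of a regular operations review so that a missed drill is flagged rather than quietly skipped.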

We hope we’ve provided a thorough, high-level overview of the importance of Business Continuity and Disaster Recovery Planning. We’ve discussed possible disasters that can affect your business, looked at performing a Business Impact Analysis to identify mission-critical systems, and discussed common failure modes, how to mitigate them, and the importance of recovery and recovery drills. If you have any additional questions, download our “Comprehensive Disaster Recovery Guide” or feel free to contact a member of our team.
