How To Develop A Disaster Recovery Plan
Where Do I Start?
When you (or your superiors) have decided that your disaster recovery plan is woefully inadequate, or when a recent disaster has highlighted the need to develop something, anything, to mitigate your company's risks, there are a number of steps you can take to at least start thinking about what you're up against.
Begin With a Risk Assessment
Okay, I know, this is the boring part. But aside from your valuable time, it's free!
How high above sea level is your building? What's the nearest body of water? Is it near a river or creek? Why so many water questions right off the bat? Because water is one of the most common enemies of computer rooms.
Equally important are your fire alerting and prevention systems. If your computer room can't tell you (or the fire department) that it's on fire, then it's as good as gone. If your computer room doesn't have a fire suppression system, it's probably not going to survive. If it does have a fire suppression system, is it the water kind? If so, you might almost as well not have one. Servers are like cats in that they don't care for water much.
FM-200, a gaseous clean agent, is apparently the best option out right now for fire suppression in an enclosed computer room.
Speaking of fire, do you have any fire extinguishers? Get some electronics-friendly fire extinguishers, in case you happen to be around when a fire breaks out, and have them checked often. Put one right outside the door to your computer room and one inside (depending on the size of the room, of course). Don't be a hero! If it's a raging inferno, it's better to fall back on your alternate data center than to be scarred for life. If you don't have an alternate data center or disaster plan, then carefully weigh the options of losing your nice-paying job, being scarred for life, or catching the terminal illness you sometimes get from fire, called death.
Why would you lose your job? It's not your fault that the fire happened, right? Well, it won't be just you who loses a job, but everyone else who depended on the ERP systems, databases, and applications, and on the customer lists, open and historical orders, vendor lists, company secrets, and so on that they contain and that keep a company running. In other words, your company is probably no longer going to be in business, so it can't keep you on its payroll, whether or not it blames you for the loss of the datacenter.
What about airport runways? A lot of businesses are located close to an airport for various reasons. Can you hear the roar of the engines as the planes pass overhead? Can you go outside your building and count the rivets on the underbelly of any of these flying buses? If so, you're at risk! It's a very small risk, but it can happen.
What kind of security system does your company use? Passive or monitored? Do you have separate access for your data center? If not, you should probably think about having at least a self-closing and locking door. If your servers are at your desk, you and your company are obviously not concerned about your data, and you don't need a disaster recovery plan. Stop reading! Otherwise, theft and vandalism are another common "disaster" that can be cheaply mitigated.
Raised floors can help mitigate a small flash flood, but sometimes all the wiring is hidden underneath, making any benefit in this area moot.
Believe it or not, an air conditioning malfunction can be a risk all by itself! If your air conditioning goes bad in the middle of the night, your servers may be toast by the morning, especially if it happened on the weekend when no one's there.
What about electricity? The building I maintain loses power regularly, sometimes for hours or longer at a time. I hope you're at least using UPSes. However, that's not where it stops. How long will they last during an electrical outage? Will that give you enough time to gracefully shut down the servers or turn on the generators? Will they notify you if the power goes out? Does your building have an automatic generator (and a backup generator)? Are you laughing at me at this point? You don't need an expensive generator system if your datacenter isn't mission critical. Even if it is, you can get by with some high-VA, voltage-regulating UPSes and enough portable gas generators to run your whole data center.
You'll probably need some specially made extension cords (of the right gauge) that fit the generator and your UPSes and are long enough to reach your parking lot. You probably don't want to run the generators inside or near the building: exhaust fumes can find the nearest air leaks into your building, and no amount of Red Bull will keep you from falling asleep and dying from carbon monoxide poisoning, no matter what the commercials might tell you. Portable generators typically do not produce clean power, so make sure all your UPSes regulate voltage, or you might damage your servers. You might even damage your UPSes, so only use portable generators as a last resort.
How can you tell how long your servers will run on your UPSes? Unplug them! (Unplug the UPSes from the wall, that is.) With a stopwatch, time how long it takes before the UPSes start to complain, and write down the time. If possible, do this when no one else is using your servers, and be prepared to plug the UPSes right back in when they start complaining! Also, it's a good idea to test the UPSes before you do this. Most UPSes have a button that lets you test them without disruption while everything is plugged in. Test your UPSes at least once a month.
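Once you have a measured runtime, a quick sanity check is whether it actually covers a graceful shutdown. The Python sketch below is only an illustration: the function name, the per-server shutdown times, and the 50% safety margin are all assumptions made up for the example, so substitute your own measurements.

```python
def shutdown_fits(runtime_min, shutdown_times_min, safety_margin=0.5):
    """Return True if every server can be shut down, one after another,
    within the usable part of the measured UPS runtime.

    safety_margin is the fraction of the measured runtime we are willing
    to count on (an assumed figure; batteries weaken as they age).
    """
    usable_min = runtime_min * safety_margin
    return sum(shutdown_times_min) <= usable_min


# Hypothetical numbers: 20 minutes measured with the stopwatch,
# three servers that take 3, 4, and 2 minutes to shut down cleanly.
measured_runtime_min = 20
server_shutdown_min = [3, 4, 2]

if shutdown_fits(measured_runtime_min, server_shutdown_min):
    print("The UPS should carry a graceful shutdown.")
else:
    print("Not enough runtime: add battery capacity or shed load.")
```

Rerunning the check after each monthly test gives you an early warning when aging batteries start eating into the margin.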
What kind of backup strategy have you employed? I hope you're backing up your data at least weekly, and sending this data offsite to a reputable storage company (with disaster-resistant facilities). How often do you check your backups? How long would it take to restore all of your data? Is that length of time acceptable to management? How long would it take you to get duplicate equipment? If the replacement isn't duplicate equipment (exactly the same or as close as possible), then a full system restore is probably not possible, and you will probably have to reinstall your applications. Do you store copies of your application media (CDs and DVDs) separately? Would you know how to reinstall those applications on a brand-new server if needed?
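For a first rough answer to "how long would it take to restore all of your data?", you can divide the amount of data by the sustained speed of your restore path. The sketch below assumes a hypothetical 2,000 GB of data, an 80 MB/s restore rate, and an 8-hour limit from management; none of these figures come from a real system, and an actual restore is usually slower than the arithmetic suggests, so time a real test restore whenever you can.

```python
def restore_hours(data_gb, throughput_mb_per_s):
    """Estimate restore time in hours from data size and sustained throughput.

    Uses decimal units (1 GB = 1000 MB) and ignores tape changes,
    verification passes, and retrieval of the tapes from offsite storage.
    """
    seconds = (data_gb * 1000) / throughput_mb_per_s
    return seconds / 3600


# Hypothetical figures for illustration only.
data_gb = 2000
throughput_mb_per_s = 80
acceptable_hours = 8  # what management says it can live with (assumed)

estimate = restore_hours(data_gb, throughput_mb_per_s)
print(f"Estimated restore time: {estimate:.1f} hours")
if estimate <= acceptable_hours:
    print("Within the acceptable window, at least on paper.")
else:
    print("Too slow: rethink the backup strategy.")
```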
What about tornadoes, landslides, hurricanes, earthquakes, terrorism (dirty bomb or nuclear), insurrections, war, or even avian flu and plague? Yes, you have to guard against personnel losses as well.
Write down all the disaster scenarios you and your coworkers and business leaders can think of. You'd be surprised at some of the things they will say! Alien invasions, meteors, raptures, the second coming... After you compile the list, assign each scenario a value for how likely it is to happen. Then assign each one a value for the amount of damage it may cause. A scenario may be as simple as losing a power supply, or as complex as a hurricane or dirty bomb.
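One simple way to turn that list into priorities is to multiply the two values and sort by the result. The sketch below assumes 1-to-5 scales for both likelihood and damage and uses their product as the score; the scales, the multiplication, and every number in the sample list are illustrative assumptions, not a standard you have to follow.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    name: str
    likelihood: int  # assumed scale: 1 (rare) to 5 (almost certain)
    damage: int      # assumed scale: 1 (minor) to 5 (catastrophic)

    @property
    def score(self):
        # Simple product; a weighted formula would work just as well.
        return self.likelihood * self.damage


def rank_risks(risks):
    """Return the risks sorted from highest score to lowest."""
    return sorted(risks, key=lambda r: r.score, reverse=True)


# Made-up values for a few of the scenarios mentioned above.
risks = [
    Risk("Failed power supply", likelihood=4, damage=1),
    Risk("Extended power outage", likelihood=4, damage=3),
    Risk("Computer room fire", likelihood=2, damage=5),
    Risk("Meteor strike", likelihood=1, damage=5),
]

for risk in rank_risks(risks):
    print(f"{risk.score:>2}  {risk.name}")
```

With these made-up values the extended power outage lands at the top of the list, ahead of the fire, which is a reasonable hint about where to spend money first.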
Then, ask your business leaders how long they can be without technical services (databases, applications, ERP systems, etc.). Tell them to be realistic, because the cost of the backup systems they will need goes up exponentially the shorter the time they can go without one system or another.
There are two terms you can throw around that'll make you look like you know what you're talking about: RTO and RPO.
RTO (Recovery Time Objective)
RTO is the "Recovery Time Objective": how long it will take you to get back up and running after a disaster. Think hard about this, because it basically means how long it will take you to duplicate the network, computer systems, and applications, all from your computer vendors, application vendors, and tape backups. What about securing a facility? How long would that take? Buying the equipment and reinstalling applications, database software, and ERP systems all take a certain amount of time. How long will it take you to restore the data using your backup system (which, don't forget, you're going to have to buy)? What about your phone system or PBX? Do you have a backup of the configuration (if you can even find your aging phone system for sale somewhere with all the licenses installed)? Did you know it can take 30 days or more to order a new PRI (local voice T1) line? If the only one you have goes dead (because some backhoe operator cut the lines leading into your building), it can take days to fix.
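To get a first rough RTO figure, list the recovery tasks, group the ones that can run at the same time, and add up the slowest task in each group. In the sketch below only the 30-day PRI lead time comes from the text above; every other duration, and the grouping into phases, is a hypothetical placeholder for your own vendor quotes.

```python
# Each phase is a dict of task -> estimated days. Tasks within a phase are
# assumed to run in parallel; phases are assumed to run one after another.
phases = [
    {
        "secure a facility": 14,               # assumed
        "order servers and network gear": 10,  # assumed
        "order a new PRI line": 30,            # lead time mentioned above
    },
    {
        "reinstall applications, database software, ERP": 5,  # assumed
    },
    {
        "restore data from tape": 2,           # assumed
    },
]


def estimate_rto_days(phases):
    """Sum the longest task of each phase to get a rough RTO in days."""
    return sum(max(tasks.values()) for tasks in phases)


def bottlenecks(phases):
    """Return the slowest task in each phase, in order."""
    return [max(tasks, key=tasks.get) for tasks in phases]


print(f"Rough RTO: {estimate_rto_days(phases)} days")
print("Slowest task per phase:", bottlenecks(phases))
```

Listing the slowest task per phase shows where a standing contract or a pre-ordered spare would shorten the total the most. In this made-up plan the phone line, not the servers, sets the pace.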
RPO (Recovery Point Objective)
The RPO is the "Recovery Point Objective": when you're done purchasing equipment, finding a place to house it, reinstalling your applications, and restoring the data, from what point in time does that data date? If you only do weekly backups, and you run them on Friday nights, and you ship them offsite on Monday afternoons, you could lose 9 to 10 days' worth of data right off the bat! Let's say you start the backup on Friday at 6:00 PM, and the first thing it does is back up your critical database. Let's say this takes an hour. It's now 7:00 PM, and the job moves on to the rest of your network and runs until Sunday or Monday morning. These tapes go offsite as planned. The next Friday, you start your backups at the same time, but before the offsite storage guy comes to pick up your tapes on Monday afternoon, a fire tears through your datacenter. The last database backup that survives is the one from 7:00 PM the previous Friday! So, that's 5 hours on that Friday, then Saturday, Sunday, Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, and Sunday, and 13 hours on Monday (assuming that's when the fire happened, you had left the tapes in the computer room, and the offsite guy hadn't picked them up yet). That's just shy of ten days of RPO! Can your business afford to rekey 10 days' worth of data, presuming it has backup paperwork to refer to? What if the whole building was destroyed, and there was no backup paperwork? Yikes!
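The arithmetic in that scenario is easy to check with a few lines of Python. The calendar dates below are arbitrary (any Friday and the Monday ten days later would do), and the function name is made up for the example; only the times of day, 7:00 PM for the surviving database backup and 1:00 PM for the fire, come from the scenario.

```python
from datetime import datetime

# Arbitrary example dates: 5 January 2024 was a Friday and
# 15 January 2024 was the Monday ten days later.
last_good_db_backup = datetime(2024, 1, 5, 19, 0)  # Friday, 7:00 PM
fire = datetime(2024, 1, 15, 13, 0)                # Monday, 1:00 PM


def data_loss_window(last_good_backup, disaster):
    """Return how much data (as a timedelta) is lost in the disaster."""
    return disaster - last_good_backup


window = data_loss_window(last_good_db_backup, fire)
total_hours = window.total_seconds() / 3600

print(f"{window.days} days and {window.seconds // 3600} hours")
print(f"{total_hours:.0f} hours, or {total_hours / 24:.2f} days")
```

This works out to 9 days and 18 hours (234 hours, or 9.75 days), which matches the count of 5 hours on the first Friday, nine full days, and 13 hours on the Monday.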
These are just some of the first steps to take in developing a comprehensive I.T. disaster recovery plan.