Saturday, February 14, 2009

Simple Thoughts on Disaster Recovery

Introduction to a disaster recovery plan
In its simplest form a disaster recovery plan is just what it sounds like: you plan to recover from a disaster. This may sound vague, but much like the advice “you should purchase insurance,” the details differ for every person, group, department, and company. The insurance you need is not the same as the insurance your 19-year-old college-bound daughter needs, and it is certainly not the same insurance Xerox Corp of America needs. However, I will try to give some general advice about disaster recovery plans for small businesses, and then some simple how-to sections that will get you started.
This posting will cover:
  1. The justification for a disaster recovery plan, including how to calculate your budget for implementing a disaster recovery plan.
  2. How to determine what you need and build a plan.
  3. And a simple example of how I do backups.

Justification:
Q) Do I need to worry about disaster recovery? If I have an insurance policy would that not cover everything?

A) Yes, you should still have an active disaster recovery plan. No, your insurance policy does not cover the value of your data loss, and it likely does not cover the income you will lose from being unable to conduct business. Your theft/fire/flood insurance is great for replacing your hard assets, such as your building. It may even compensate you for some of your soft assets. Yet it cannot replace soft assets such as your operational data, and the compensation will never equal the damage from the loss.

Q) How much will it cost me?
A) Less than you think, and far less than it’s worth. However, the end cost depends on your needs. A range for most small businesses that I feel confident about starts at a variable cost of $250 per year on the low end and goes up to a total allocated cost of $25K a month on the high end. I know that is a big range, so I'll try to break it down a bit over the course of our discussion.

Q) How will I know how much my company should budget for this?
A) That depends on what you need to do. One obvious answer is: how much can you afford? If you think you can’t afford anything, then it is time to see what other opportunities you need to forgo first in order to take care of this, because the one thing you cannot afford is to have no recovery plan at all. First, perform some basic risk assessment. You are really looking for two key components: how likely is it that something will happen, and what will it cost when it does?

Calculating the risk and reward
(Skip if you are not the cost accounting type).

Risk Beta
In its simplest form, Risk Beta is the likelihood of a specific event occurring in relation to the likelihood of the whole group of events happening. This helps us decide if a particular investment is good or bad when compared to other investments. What is the likelihood you will have a catastrophic data failure in some period of time? Let us use a five year period, assume you have one server in your office, no paper files, and no backups. Now let's add up the odds of each way you can lose the server (fire, flood, theft, hardware failure, etc.) in the next five years. We will call this number X, and it will range from 0 (no chance any of it will happen) to 1 (100% chance something will happen). What X actually is depends on a lot of variables and some statistical shortcuts. In addition you may choose to weight certain effects differently based on impact cost. (Side Note 1: Consult with a professional?)
If you really simplify it, then most people can come up with a rough number from general experience. Have you or any of your neighbors been broken into? Have there been any fires or floods? How often does this happen? If two of your four neighbors have had a break-in during the last five years, and their computer equipment was damaged or stolen, then you may assume you have a 40% chance (2 businesses out of 5 total in your area) of being broken into in a five year period. Each chance of a failure is cumulative and should be added together; at the same time, each option of redundancy reduces the chance. If you have a 40% chance of being broken into and a 100% chance of losing half of your computer equipment during the break-in (a 50% chance of losing any one computer), then the chance you will lose your server is 40% x ½. If you also have a 10% chance of fire, then you would have a (40% x ½) + 10%, or 30%, chance of data loss over five years, which works out to a 0.5% chance you will lose all of your data in any given month. Similarly, if you have two locations, each with a 0.5% chance of total data loss in any given month, each site backs up the other site, and it takes at least one month to get a site back up with the data from the other site, then you would have a 0.5% x 0.5%, or 0.0025%, chance of total data loss in a month. As you can see, the backup reduces the likelihood of total data loss considerably. For our purposes we will make up a number like .0234, or a 2.34% chance of total data loss.
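If it helps to see the arithmetic laid out, here is a small Python sketch of the calculation above. The percentages are the same made-up illustrative numbers from the text, not real actuarial data:

    # Simplified five-year risk beta for a single office server.
    p_breakin = 0.40        # chance of a break-in over five years (made up)
    p_server_taken = 0.50   # chance the server is among the lost equipment
    p_fire = 0.10           # chance of a fire over five years (made up)

    p_loss_5yr = p_breakin * p_server_taken + p_fire   # simplified: just add
    p_loss_month = p_loss_5yr / 60                     # spread over 60 months
    print("Five-year chance of losing the server: %.0f%%" % (p_loss_5yr * 100))   # 30%
    print("Chance in any given month: %.2f%%" % (p_loss_month * 100))             # 0.50%

    # Two sites backing each other up: both must fail in the same month.
    p_both = p_loss_month ** 2
    print("Monthly chance of total loss with two sites: %.4f%%" % (p_both * 100)) # 0.0025%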
Please note that a comprehensive qualitative risk analysis, one which separately handles each risk, vulnerability, and control, is an amazing asset and should be required for every medium business and many small businesses. In addition, there is more to worry about than just one server and theft.
Here is a sample chart of how data loss can occur: [chart not reproduced]


Risk value:
What is the data worth? The value of the risk is the total potential loss times the risk beta, i.e., the likelihood the loss will happen. Again, let us assume you have no data that carries a legal liability for loss, such as client social security numbers or medical records. Odds are high that if you had to put your staff through something like HIPAA training, you have professional IT people on hand who understand these issues. So, those situations aside, let’s look at just the effects on your average small company. Typically we like to say that 45% of companies that suffer a catastrophic data loss never reopen, and 90% are out of business within two years. Again, I am greatly oversimplifying this, but let us say that over our five year period you could expect to earn ten million in revenue, or two million a year. Now let us say you go out of business exactly two years after losing your data. I realize you would not have earned to your full potential during those two years, but we are keeping things simple. In essence you lost the potential revenue for the remaining three years, or six million. You have a 90% chance of this happening if you lose your data and a 2.34% chance of the loss occurring. So really you are looking at a normalized cost of ~$126K over five years, or ~$25K per year.
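For those who like to check the math, here is the same simplified expected-loss calculation in a few lines of Python, using the made-up figures from the paragraph above:

    # Simplified expected loss from a catastrophic data failure.
    lost_revenue = 6000000      # years 3-5 of forgone revenue after going under
    p_out_of_business = 0.90    # chance data loss eventually kills the business
    risk_beta = 0.0234          # made-up chance of total data loss over 5 years

    expected_loss = lost_revenue * p_out_of_business * risk_beta
    print("Expected loss over five years: $%.0f" % expected_loss)    # ~$126K
    print("Normalized cost per year: $%.0f" % (expected_loss / 5))   # ~$25K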
Return on Investment (Simplified)
Now, if you have a current risk beta of 2.34% for a risk value of $25K/year, and you can reduce your risk of loss (beta) down to 0.234%, then your new risk value is $2.5K per year, and thus you have a return of $22.5K per year on the investment. Now let us assume you can do this (reduce the risk) for only $2.25K per year. This would give you a 1,000% return on your investment. What you actually need to spend, what kind of a reduction in risk you can get, and what your opportunity costs are, are all things to consider when making your budget and evaluating the project. Now, many people will argue that cost reduction and risk abatement are not the same as revenue return, but that is a discussion for another time. We are, after all, trying to keep things simple.
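And the return-on-investment figure works out like so (again, all numbers are the illustrative ones from above):

    # Simplified ROI on the risk-reduction spend.
    risk_value_before = 25000   # $/year at a 2.34% beta
    risk_value_after = 2500     # $/year at a 0.234% beta
    annual_cost = 2250          # assumed yearly cost of the recovery plan

    annual_return = risk_value_before - risk_value_after        # $22,500
    print("ROI: %.0f%%" % (annual_return / annual_cost * 100))  # 1000%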
(Side Note 2: Does every business require this?)

How to plan
If you’re lucky enough to be working in a company that has a continuity plan, take a look at it and pull out the metrics by which you need to measure success. The main thing to consider is how long you can be offline, as a whole, and as each individual piece of your company. Odds are that your creative team can be offline longer than your accounting department, which could probably stand a longer outage than your sales force. Then again, perhaps not. It depends on what drives your business. No matter how you stack it, every project gets broken down into six stages: Investigation, Evaluation, Proposal, Implementation, Documentation, and Testing. Please notice I place documentation towards the end. This is because what happens during implementation may be different than what was originally planned. In addition, documents need to be constantly updated to stay useful. Similarly, testing needs to be regularly performed. I cannot stress this enough: no matter how good the plan, no matter how foolproof you feel your implementation is, you will find issues during testing. Addressing these issues makes all the difference when a real disaster strikes.
A disaster recovery plan, in essence, determines how you will rebuild your business. For any department, include: each worker's PC, all the servers and services, phone systems, sales systems, etc. It is not enough to think about how you will back up your data; instead you must plan for how you are going to get everything working again. At this time I should also note that if your work is primarily paper driven, you should consider initiating a paperless workplace, or at the very least digital archiving. The reason is simple: it is much easier to preserve, safeguard, and restore digital documents than paper files.
The big divide, as I see it, is whether you need data only, a cold site, a warm site, or a full hot site. Some essential functions may need multiple points of redundant fault tolerance with a full hot site waiting for them. Others may need little more than their data. Consider this example: if the sole server controlling your company goes down and you have a tape with all the business data, up to the moment it went down, how long will it take you to get the business back up and running normally? If you have just the tape and need to order a new server, it could take days, or even weeks, to get a box; then you need to install and configure the server and all of your applications before you can restore the data. On the other hand, if you already had a cold box, you would just need to do the configuration and restore. If you had something hot, with everything already installed and all the data on it, you could point people to it and start running. As you can see, I am defining data only (sub-cold), cold, warm, and hot as follows:
Data only: The data only plan is by far the cheapest, since very little capital expenditure is required to implement this strategy. However, it also has the longest delay before you can start working again. In the old days, a data only plan would cause a work delay of a minimum of one month. However, with the advent of commodity server farms, like Go Daddy, and the increasing ability to telecommute, a good shop can have some services up the same day, and the entire IT/IS infrastructure can usually be rebuilt in less than a month. The fact remains that your IT/IS shop will need to procure a new site and set up, from scratch, everything you need to get people working.
Cold site: A cold site is a place you can go and start anew at a moment's notice. However, you will have to reproduce everything your people need to work. For example, if you have an agreement to use an associate's unused office space, which has desks, power, and basic connectivity, then you have a cold site. You will still need to provide all the equipment required for your people to do their jobs.
Warm site: With a warm site everything is in place; you just need to make some adjustments, upload the latest data, and away you go. The best way that I have seen this plan work is when a company has some extra space, be it the factory floor, a warehouse, or office space arranged through corporate agreements. With an in-place phone system, a spare server or two in waiting, and a stack of spare desktop PCs waiting for users, you can often have a warm site configured for use in a matter of hours. This sort of setup is not regularly maintained, uses excess resources to keep capital costs down, and basically provides a head start on rebuilding the enterprise. If you need a warm site but do not have the spare resources for it, then I would recommend contracting with a company that specializes in providing warm sites for your use. Nearly every metropolitan city has one, and the costs vary, but if you need one you will usually find the costs can be reasonable. However, if you have strong business relations with companies that, for one reason or another, have unused office space, I would recommend setting up an arrangement to use their space if required. Often the involved parties can come to an extremely reasonable and mutually beneficial agreement.
Hot site: With a hot site, you not only have all of the physical plant items in place, but you regularly maintain them so that at a moment's notice you can switch your entire enterprise over. Obviously this is the most expensive option and is often implemented for only the most critical divisions of the company.
Concerns for manufacturers: If the company in question has critical divisions that produce goods, not services, then you will also need to consider what would happen if your production floor were suddenly unusable. If you have ample storage, a stockpile of completed work, and work in a primarily push-driven business, you may be fine until you can replace the lost PPE. But if you are working with a JIT inventory system in a pull environment, you may need to consider a way to reproduce your production center in a matter of hours, not days. I only state this because much of what I am discussing here concerns disaster recovery for your knowledge workers (sales, marketing, accounting, customer service, etc.) and does not meet the needs of manufacturing environments.

Categorizing your needs:
Of course this kind of planning extends to more than just your data. For example, if you need to have your sales force on the phones and at their desks working on computers 24 hours a day, 7 days a week, and you cannot be down for more than an hour, then you need phones, desks, computers, the server, your data, and your people up and running in a new space within an hour. Fortunately, if you really do need this kind of insurance, there are a number of companies that sell this sort of service by overbooking space for just such an occasion. A simplistic view of this situation is like hoteling your workers someplace. Some company has phones, internet, desks, and computers for 100 people. They sell you an insurance policy, depending on terms and conditions, that allows you to place a given number of people in that space at a moment's notice and leave workers there for a period of time. They sell a similar policy to 1,000 other companies and gamble that no more than five will need it at any given time. But most of this is covered in your continuity plan, should you be fortunate enough to have one.
Minor loss contingency – be prepared
I should also note that your disaster recovery plan should take into account other forms of loss than just total catastrophic failure caused by such things as fire. For example, you need to think of things like power outages. How often does the power go out in your area? How long is it out, and how much does that outage affect your workers? What level of emergency power is required to mitigate this loss? And is it worth the investment? How often do people overwrite their documents, with or without backups? How can you ensure backups happen, and how can you make recovery more efficient? There is a wide variety of scenarios you should consider. A comprehensive IT/IS plan will spend more time on day-to-day operation and accident mitigation than it will on catastrophic failure. However, it is potential catastrophic failure that usually prompts the initiation of the plan.
The Plan
Now that you know what you need to get back online and how quickly it needs to be done, you should have an idea of how to prepare for such an event and how to execute a recovery plan. There is no magic formula; however, there are some decent templates out there that are industry based. In addition, your insurance company may have a specific template or general requirements for you to follow, so check with them first. Personally, I prefer a document that can be used for everyday issues, not just total loss situations. These operational style manuals are often more time consuming to produce and maintain, but they are well worth it. I like to categorize the document by resource, functional area, and rank. For example, you should know what to do if you lose a small group of PCs, or a copier, or a key component in your assembly line; i.e., the resource. Furthermore, you should know what to do differently if the loss occurs in the accounts receivable department versus the marketing department. And finally, you should know if there are any differences if the resource is primarily used by a worker versus a director. If you are a larger company, you may want to consider looking into some of the newly emerging standards, such as BS 25999 and NFPA 1600. In any case, now is a good time to look for a template and start filling in the details.

A simple example:
Let us assume that I have three physical boxes sitting in my server room. Collectively they run three non-public web sites. All three of them are file servers; maybe one is also a fax server, one a print server, one an email server and domain controller, and four applications and a unifying database are served from among them. What is our plan?
First off, what were our causes of loss? 40% of our loss came from hardware failure. Typically this means failure of the component storing the data: the hard drive. How can you protect against this? Well, how often do you need to back up, and how much time do you have to recover? If the answer is as often and as quickly as possible, I would recommend starting with a RAID. RAID stands for redundant array of independent disks; there is a wide variety of implementations, and if you want a discussion of them let me know, but I am going to recommend RAID 5 as a general purpose implementation. In RAID 5, data is stored across many disks instead of one, and a portion of each disk is used to store parity information about the data on the other disks. If one disk fails, you can take it out, put another one in, and rebuild the information. You can even configure a hot spare that just sits there idle waiting for a failure; when one occurs, the data from the lost disk is rebuilt on the spare. All you need to do is order another spare and toss away the bad one (read: dispose of it according to company policy to safeguard against data leakage). The odds of more than one disk (n+1 disks, if you have n hot spares) failing at the same time are exponentially lower than the odds of any single disk failing. The more disks you have in the array and the more spares you have waiting for failure, the less likely you are to have an unrecoverable failure.
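To make the “exponentially lower” claim concrete, here is a back-of-the-envelope Python sketch. The 5% annual per-disk failure rate and the one-day rebuild window are assumptions I picked purely for illustration:

    # Rough odds that a RAID 5 array loses data: a second disk must die
    # before the first failed disk has been rebuilt.
    p_disk_year = 0.05    # assumed chance any one disk fails within a year
    n_disks = 4           # disks in the array

    p_any_failure = 1 - (1 - p_disk_year) ** n_disks   # some disk fails this year
    p_second_in_rebuild = 1 - (1 - p_disk_year / 365) ** (n_disks - 1)

    p_array_loss = p_any_failure * p_second_in_rebuild
    print("Single-disk failure per year: %.1f%%" % (p_any_failure * 100))   # ~18.5%
    print("Array data loss per year: %.4f%%" % (p_array_loss * 100))        # ~0.0076%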
The next largest piece of the pie was human error. Typically this does not result in total loss of all the data, but it can. Most often it occurs when one person overwrites changes made by another person, or accidentally deletes a file or folder, or overwrites work they did earlier in the day, and so on. There are two methods I like to use to guard against it: the snapshot and the repository.
The snapshot simply takes a picture (a copy) of the data at certain times of the day. Thus if I spend all night working on something and save the last piece at 4am, a snapshot is taken at 6am, and someone deletes the file at 8am, then at 10am I can recover the version I had on disk the last time the snapshot was taken; in this case at 6am. Snapshots are really just a few very convenient backups taken at certain points in time. Often they reside on the same machine as the original data, and a half dozen copies are saved before overwriting. The space required depends on how they are implemented. Systems that use a pre-cache setup with a diff mechanism only require space equivalent to N*X + C, where N represents the average size of the changes made during any given snapshot interval, X represents the number of snapshots kept, and C is some predefined overhead. In the middle are those that use a post-cache and diff system. The space used for this is estimated as A + (N*X) + C, where A now represents the original size of all the data being monitored. On the high end you copy all the original files plus a full copy of each changed file, resulting in A + (K*X), where K represents the average full size of the files changed during a snapshot window. Please keep space requirements in mind when selecting a snapshot method.
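The three space formulas are easy to compare with a quick Python sketch. The sample sizes below are invented purely to show the formulas in action:

    # Space estimates for the three snapshot schemes described above.
    GB = 1024 ** 3
    A = 200 * GB   # total size of the data being monitored
    N = 1 * GB     # average size of changes per snapshot interval
    K = 5 * GB     # average full size of the files that changed
    X = 10         # number of snapshots kept
    C = 2 * GB     # predefined overhead

    precache_diff = N * X + C        # diffs only
    postcache_diff = A + N * X + C   # full base copy plus diffs
    full_copies = A + K * X          # base copy plus full copies of changed files

    for name, size in [("pre-cache diff", precache_diff),
                       ("post-cache diff", postcache_diff),
                       ("full copies", full_copies)]:
        print("%-16s %6.0f GB" % (name, size / GB))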
Windows has its own implementation (Shadow Copy) for its servers, as do Mac and Solaris; however, you can use rsnapshot with just about any system out there. Personally, for an 8-5 business, I like to take snapshots twice a day, once at 7:00am and once at 12:00pm, as most of these human errors occur first thing in the morning and just after lunch. Taking the snapshots shortly before these problematic times provides the best recovery scenario. In addition, I will typically dedicate 15% of the available drive space to snapshots, which for many companies is about a week's worth at twice a day.
I am sure you can see the limitation of this method. Inherently, it is time based, not change based. I may start on something at 12:30pm, finish at 9pm, have the work destroyed at 4am, and never have it captured by these snapshots! In addition, if you have a dozen people working on the same set of documents, they could accidentally overwrite each other's changes all the time and not notice for weeks. This is where the repository comes in, and we get into the world of version control.
Let me give an example of two lawyers and a paralegal working on a large case document. Typically, not everyone will be writing to this document at the same time (if that is the case, there are versions of version control for you too); instead, one person may open it to review and update a section and then close it. Let us look at the case where lawyer A opens it to draft a new section. Lawyer B asks paralegal C to take a look at another section and correct the language, which she promptly does. Lawyer B then verifies the work was done. When lawyer A saves the document, the changes made by paralegal C are gone. What is worse, no one knows that those changes are gone. With version control, you check documents in and out of the repository, and the repository keeps track of everyone who has a copy and all the changes. If we had version control enabled, then when lawyer A tried to save the changes a conflict notification would have been raised, indicating the document had been changed since it was checked out. Depending on what kind of repository you have, it will even highlight those changes for lawyer A so that they can be accepted or rejected. Even if they are rejected, a copy of those changes is still saved in the repository, and you can go back and forth to look at any version you desire. The downside is that if you use a repository designed for, say, text files, and you start dumping large CAD files in there, you are going to run out of space fairly quickly, with every change being kept as a separate file. But not to worry: there are versions specifically for CAD files that understand the file format and will keep only the changes. This is also true for products like Adobe Photoshop, where file size can be a major issue.
A number of implementations exist from companies like Microsoft, Amazon, and Google, and which one you should use depends on what is being held in the repository. For text documents and other simple files I like SVN. It is easy to set up, and easy-to-use clients exist for almost every platform.
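As a taste of the check-out/commit cycle, here is a minimal sketch driving the standard svn command-line client from Python. The repository URL and commit message are hypothetical; the key point is that svn refuses an out-of-date commit rather than silently overwriting someone else's work:

    import subprocess

    REPO = "http://svn.example.com/repos/case-files"  # hypothetical URL

    # Check out a working copy, then edit files in it as usual.
    subprocess.run(["svn", "checkout", REPO, "workingcopy"], check=True)

    # Try to commit. If the file changed in the repository since our
    # checkout, svn rejects the commit with an "out of date" error.
    result = subprocess.run(
        ["svn", "commit", "workingcopy", "-m", "Revised section 4"],
        capture_output=True, text=True)

    if result.returncode != 0:
        # Pull down the other person's changes; svn merges them or marks
        # a conflict to resolve by hand. Nothing is lost either way.
        subprocess.run(["svn", "update", "workingcopy"], check=True)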
Backups - The other 21%
To take care of the other 21% of your loss candidates, I like good old-fashioned backups. When most people think of backups they think of tapes; I, however, do not advocate tapes. The main reasons are that high capacity tape drives are mildly expensive, tapes require manual attention, and tapes are fairly vulnerable to environmental conditions. The other issue I have is that tapes are cumbersome to back up to and restore from. Unless you have really big money, enough to afford a halon equipped, environmentally controlled vault with your servers serviced by robotic tape changers, you are instead likely to have a person changing the tapes once or twice a day by hand, with those tapes stored and treated improperly. What if they forget? What if they get a tape out of sequence when doing an incremental backup? What if the tapes get left next to the server (which they usually do) and are stolen, or burn, or are flooded out with the server? What if I have them taken off site in a hot car, or in a car that’s stolen? It just is not a practical or safe way to keep your backups. For a lot less money you can do off-site backups to a machine you control using rsync and SSH. Or you can outsource to Google, Amazon, Microsoft, Apple, or even Iron Mountain (the last one is exceptional for long term storage of your paper records too).
Personally, for smaller companies, I like one or two run-of-the-mill PCs with large hard drives sitting in a secure location (preferably a secure co-lo, but if you're small enough, perhaps even the CIO’s house), with an encrypted file system (in case of theft) and a VPN tunnel to the office network (no external exposure), and I lock the box down to just SSH (Side Note 3: Limiting exposure). Then you put a simple script on every computer in the office to back up to a secure spot on one of the servers. I prefer using rsync for the backups and ensuring different logins for each system. The files on the servers are backed up over a secure connection, again using rsync, to the off-site servers. For the mail and databases, I like to use a combination of offline backups and log transfers (like HADR for DB2) and keep regular copies replicated amongst the office servers. I then follow this up by moving the offline backups to the off-site servers via SSH. The off-site servers then run rsnapshot after each backup so that you can maintain a history of changes. Don’t worry about moving that much data off site; it happens in the wee hours of the morning, and for most setups you can do this with little to no real bandwidth issues. In my next few posts I will show you how to do exactly this, as well as explore some of the variations offered by Windows, Mac, and Linux operating systems.
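To give a flavor of what those scripts look like, here is a stripped-down sketch of the nightly office-to-off-site push. The host name, paths, and key file are placeholders for whatever your setup uses; rsync over SSH does the real work:

    # Nightly push of the office data directory to the off-site box.
    import datetime, subprocess, sys

    SOURCE = "/srv/data/"   # trailing slash: sync the directory's contents
    DEST = "backup@offsite.example.com:/backups/office1/"  # placeholder host
    LOG = "/var/log/offsite-backup.log"

    cmd = ["rsync",
           "-az",           # archive mode, compress over the wire
           "--delete",      # mirror deletions so the copy matches the source
           "-e", "ssh -i /root/.ssh/backup_key",  # key-based, unattended login
           SOURCE, DEST]

    result = subprocess.run(cmd, capture_output=True, text=True)
    with open(LOG, "a") as log:
        log.write("%s exit=%d\n%s" % (datetime.datetime.now(),
                                      result.returncode, result.stderr))
    sys.exit(result.returncode)   # non-zero exit flags a failed backup run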
If you need to perform fast “bare metal” restores, regularly imaging the system is the only practical method. However, this has traditionally been difficult to achieve, given that the machine needs to be offline for a complete cold image to be taken. To get around this I often employ virtualization. Virtualization is incredibly affordable, with a low TCO and high returns. Snapshots of the entire operating system can be taken at regular intervals without adversely affecting operation. Systems can be easily duplicated, split, merged, and managed. Furthermore, the images can be made into incredibly compact files for easy backup to an off-site repository. The only drawback to virtualization is the use of specialized low level equipment that the virtual server cannot emulate. This could be anything from a BrookTrout fax card to the serial controller designed to operate your tandem mass spec machines. If this is the case, I would suggest you start by attempting to divorce what you need for BMR (bare metal recovery) from the specialized hardware. Working with the vendor, you may find that a client/server relationship can be set up, or you may find they offer virtual hardware controllers for popular platforms, like VMware, that then operate the real hardware, often at a minimal cost. Either way, if you need cold images, virtualization is one of your best avenues to a solution. (Side Note 4: additional value in virtualization).

Redundancy: I would like to point out that live redundancy is your first line of defense and should be incorporated wherever possible. If I have five sales offices and I can never miss a call, I probably don't need to dramatically overstaff, keep spare offices for each, or contract with a call center. Instead I just need to make sure my phone system rolls any unanswered call going to one office over to all the other locations. If someone calls the main line of my sales office in Mesa and no one answers the call (perhaps due to an emergency evacuation), my phone system rings every other sales office until someone picks up. In addition, all sales staff are appropriately trained to handle the customer. While it is much nicer for someone to get a representative they have a relationship with, any Tempe sales rep can still assist any Mesa customer, and when the Mesa reps are back they know their customer called and what the Tempe rep helped them with. The end result is, for almost no cost, a dramatic reduction in missed calls and unhappy customers.
Another example close to my heart is front end web services. I would never have a single web server, application server, or database server. Similarly, I would not have cold spares sitting around idle. Instead I would have a variety of machines clustered together with fault tolerance as part of the core design. If one server, or even half the servers, fail, no one using the system should ever notice a glitch, even mid-transaction. Instead the automated systems should seamlessly fail over to the working systems when a system becomes unavailable due to a failure or just high load.

Testing
The single most important piece of advice I can give, no matter what your disaster recovery plan is, is to test it. You may back your data up every night, but when was the last time you made sure those backups worked? I cannot tell you the number of IT shops I have worked with that have never tested their backup plan. For that matter, I have found that in nearly two-thirds of the small businesses I have consulted for, the automated nightly backups fail every night. I like to test backups the same way I like to do inventory checks: once a day/week/month you randomly test a small subsystem. Often you will be forced to do this as people use the system, but controlled tests are better. Each of these periodic tests should not last very long; it should be just enough to ensure that particular aspect of your plan works. Once a year, or maybe once every two years, you should revisit your plan, make changes if necessary, and then run through a complete disaster recovery simulation. This is the only way to know for sure you are being adequately protected.
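A spot check can be as simple as pulling one randomly chosen file back from the backup copy and comparing checksums. Here is a sketch of that idea; the two paths are placeholders, and a file that changed since the last backup run will also flag, which is acceptable for a quick sanity check:

    # Spot-check one random file against its backed-up twin.
    import hashlib, os, random

    LIVE = "/srv/data"            # placeholder: the live data tree
    BACKUP = "/mnt/restore-test"  # placeholder: backup restored or mounted here

    def sha256(path):
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(65536), b""):
                h.update(chunk)
        return h.hexdigest()

    candidates = [os.path.join(root, name)
                  for root, _, files in os.walk(LIVE) for name in files]
    sample = random.choice(candidates)
    twin = os.path.join(BACKUP, os.path.relpath(sample, LIVE))

    if sha256(sample) == sha256(twin):
        print("OK: %s restored intact" % sample)
    else:
        print("MISMATCH: %s -- investigate your backups!" % sample)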


Notes:

Side Note 1:
If you are a medium-sized business or intend to use this number for more than the most general ballpark, I cannot overemphasize the need to hire a professional to help you; actuarial adjustment can be quite difficult to master. What I am presenting here is an overly simplistic example for illustration purposes only. However, depending on your insurance policy, you may be able to obtain a decent risk adjuster at a discounted rate. In addition, the implementation of a comprehensive disaster recovery plan may reduce your corporate insurance premiums. This is also true for a continuity plan and a reasonable set of standard operating procedures. Even a simple employee manual for each department can often win you a decent insurance discount.



Side Note 2:
This entire post was spurred by a discussion with a friend who said that some businesses would have no issue with data loss, such as the local pizza guy. I firmly disagreed. I believe that any company that has records corresponding to real or potential money, such as client lists, prepaid expenses, unearned revenue, accounts receivable, accounts payable, short or long term liabilities, or perhaps a warranty or service agreement with their clients, will suffer greatly when all of that information is lost. But this is a discussion for another time.



Side Note 3:
Please note that limiting external exposure is essential, but VPNs are not always convenient. In addition, if you have a box where, in case of emergency, you cannot get to the physical console, you need to have at least SSH exposed. However, you can hide the SSH port, limit its access to specified remote networks, implement port knocking, or apply a myriad of other minor security precautions that help add peace of mind.


Side Note 4:
In addition to offering a great way to image running servers, virtualization offers a phenomenal way to reduce your costs and increase departmental performance. This is obviously a topic for another post, but if you are wrestling with scalability, recovery, consistency, remote management, or security, I recommend looking into virtualization. It is not right for all instances, and using it as a panacea can cause more harm than good, but it can also be a wonderful tool in your IT arsenal.
