Disaster Recovery Plan Analysis

By: Mark Bole © 1992, 2004 BIN Computing

Disaster: a sudden or great misfortune; unforeseen mischance bringing with it destruction of life or property.

To plan for disaster recovery first requires identification of possible disasters. The "Types of Disasters" summary below describes the disasters that threaten us.

Following is an analysis of the recovery actions to be performed for each type of disaster situation.

Disaster Recovery

In any disaster, there are three main functions to be performed: communication, coordination, and recovery.

The first two functions apply generically to every situation. Individual recovery actions for each type of disaster will then be described.

Discussion of the overall approach: some disaster recovery planning methodologies depend on extremely detailed write-ups of the steps to follow in a given situation. While superficially reassuring, these large documents are in practice costly to maintain, are not necessarily available when and where they are needed in an emergency, and in any event still require live, hands-on training! Given the highly technical and unique nature of every computer application, and APPLICATION in particular, a strong investment in training, rather than rote procedure development, is the approach preferred here. In other words, we are better off with a significant number of staff who can analyze and creatively respond to any situation than with a thick document that describes every situation except the one currently at hand.

Communication

First and foremost, the APPLICATION system operations staff needs to keep one another informed of current plans, developments, etc., to avoid efforts that are redundant or even at cross-purposes. Next, other APPLICATION support personnel, plus THE COMPANY network support personnel (if applicable), need to be kept informed. Finally, non-technical management needs to be apprised of the high-level decisions to be made. Methods of communication include face-to-face conversation, telephone, electronic mail, commercial electronic mail (outside THE COMPANY), fax, etc.

Plan: at present, a partial list of essential contact information (home phone numbers, for example) has been assembled. A fuller list, including home addresses, email IDs, and remote "rendezvous points" (either physical or electronic), will be assembled. Commercial email accounts will be obtained. Staff will be reminded regularly to give communication a high priority during any disaster situation.
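As an illustration only, the fuller list might be kept in a simple machine-readable form so it can be printed, copied off-site, and updated in one place. In the sketch below, every name, number, address, and rendezvous point is a hypothetical placeholder:

# Illustrative sketch of a machine-readable contact list (Python).
# All names, numbers, addresses, and rendezvous points below are
# hypothetical placeholders, not actual contact data.
CONTACTS = [
    {
        "name": "A. Example",
        "role": "System operations",
        "home_phone": "555-0100",
        "home_address": "123 Example St.",
        "email": "a.example@example.com",
        "commercial_email": "a.example@commercial-mail.example",
        "rendezvous_point": "North lobby, or the dial-in conference bridge",
    },
    # ... one entry per APPLICATION support person ...
]

def print_call_sheet(contacts):
    """Print a one-page call sheet suitable for keeping at home and off-site."""
    for c in contacts:
        print(f"{c['name']:<20} {c['role']:<20} {c['home_phone']:<12} {c['commercial_email']}")

if __name__ == "__main__":
    print_call_sheet(CONTACTS)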

Coordination

In a disaster, it will become necessary very quickly to determine who is "in charge" for the purpose of making risky technical decisions on short notice.  In a crisis, there can be honest differences of opinion; however, to make quick progress, participative decision making may need to temporarily yield to more authoritative styles.

Plan: a "call-out" list will be developed, ranking technical support individuals in order of seniority for the purposes of urgent disaster recovery. A procedure will be designed so that any individual, without knowing whether others are available or even aware that a disaster exists, can determine within a reasonable time who is the disaster recovery coordinator in charge.
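A minimal sketch of how such a procedure might work, assuming a seniority-ranked call-out list and a simple record of who has been confirmed reachable so far (the names and the reachability check are hypothetical):

# Sketch of the call-out procedure: the disaster recovery coordinator in
# charge is the most senior person on the ranked list who can be reached.
# The names below are placeholders for the real seniority ranking.
CALL_OUT_LIST = ["Senior analyst", "Second analyst", "Third analyst", "Operator on duty"]

def coordinator_in_charge(ranked_list, reachable):
    """Return the highest-ranked person confirmed reachable, or None.

    `reachable` is the set of people confirmed available so far (by phone,
    email, or in person); everyone ranked above the result is presumed
    unreachable after a reasonable number of attempts.
    """
    for person in ranked_list:
        if person in reachable:
            return person
    return None

# Example: only the third analyst and the on-duty operator have been reached,
# so the third analyst acts as coordinator until someone more senior appears.
print(coordinator_in_charge(CALL_OUT_LIST, {"Third analyst", "Operator on duty"}))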

Elements subject to disaster (the "Affects" entries in each disaster description below):

  1. Server hardware
  2. Network connectivity
  3. Application software and data

Types of Disasters

Each disaster type listed below[1] is described by its nature and scope, its effect on each of the elements above (the "Affects" entries), and its expected result[2] with an estimated duration.

I.   Earthquake, flood, hurricane, war, riot, insurrection, epidemic, labor strike, power failure.
     Nature: large scale, natural or man-made, unpredictable and unpreventable.
     Scope: geographically close sites.
     Affects server hardware: access blocked; possible physical damage.
     Affects network connectivity: unreliable connectivity.
     Affects application software and data: machine-readable media (tapes) possibly destroyed or inaccessible.
     Expected result: some or all users cannot use APPLICATION; loss of capital equipment.
     Estimated duration: between 4 hours and 2 weeks.

II.  Fire, vandalism, accidental damage, mechanical failure.
     Nature: small scale, man-made, unpredictable but preventable.
     Scope: one part of a single building or one piece of equipment.
     Affects server hardware: possible physical damage.
     Affects network connectivity: unreliable connectivity.
     Affects application software and data: machine-readable media (tapes) possibly destroyed or inaccessible.
     Expected result: some or all users cannot use APPLICATION; loss of capital equipment.
     Estimated duration: between 4 hours and 4 days.

III. "Hackers" with malicious intent, employee alteration or theft of data for personal gain or vengeance, or accidental software bugs (including failures in interfaces with other systems).
     Nature: unknown scale, unpredictable but preventable.
     Scope: all or part of the application or data.
     Affects server hardware: unaffected.
     Affects network connectivity: unreliable connectivity.
     Affects application software and data: inaccurate or missing business data; system performance (response time) degradation.
     Expected result: financial loss due to bad decisions based on faulty information, unfair advantage for competitors, missed payroll, or reduction in employee productivity.
     Estimated duration: one minute to several years.

IV.  Unavailability of key personnel due to illness, injury, death, kidnapping, or termination of employment.
     Nature: unknown scale, unpredictable and unpreventable.
     Scope: all or part of APPLICATION.
     Affects server hardware: delay in maintenance and upgrades.
     Affects network connectivity: delay in maintenance and upgrades.
     Affects application software and data: delay in maintenance and upgrades.
     Expected result: no immediate effect; increasing chance of financial loss and/or unavailability of APPLICATION depending on duration.
     Estimated duration: 1 day to 2 weeks.

V.   Legal, regulatory, or organizational prohibition from using APPLICATION.
     Nature: unknown scale, partially predictable and preventable.
     Scope: all or part of APPLICATION.
     Affects server hardware: unaffected.
     Affects network connectivity: unaffected.
     Affects application software and data: partial or complete lack of access to APPLICATION.
     Expected result: all users cannot use some or all of APPLICATION.
     Estimated duration: one day to one year.

Recovery Actions

Type I disaster: The most obvious and best recovery from the first type of disaster involves purchasing and maintaining an "off-site" server capable of running a full or "stripped" copy of APPLICATION. There is no resource available to create a "stripped" copy of our application; therefore, an off-site machine with the same capacity as our production system is the only viable option. Such a machine would cost several hundred thousand dollars to purchase, plus a significant fraction of an FTE to maintain in a ready-to-use condition.

Plan: in the event of a less severe Type I disaster, there are a few options. First, off-site (Bay Area only) backup tapes are currently maintained by a commercial data storage service. These can be used if and when replacement equipment is available to restore the system to an operational state. Instructions for accessing these tapes will be made part of the communication plan (see above). Next, negotiations will proceed with our two primary hardware vendors to provide an "emergency spare" type of service; in other words, for a premium payment they will agree to store, off-site, dedicated replacement hardware compatible with our system. This is similar to purchasing an off-site server (as described above), but is done using expense dollars rather than capital dollars.

To make this plan work in any case requires a great degree of "depth on the bench" for the system operations staff. To facilitate training for all key personnel, drills involving partial and full system recovery to a machine other than our production servers will be designed and carried out.
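One hypothetical way to make such drills measurable (the manifest file, checksum approach, and paths below are assumptions for illustration, not an existing APPLICATION procedure) is to record checksums of key files on the production system and verify them after the trial restore:

# Drill verification sketch: after a trial restore to a non-production
# machine, compare checksums of restored files against a manifest recorded
# on the production system.  The paths and manifest format are assumptions.
import hashlib
import os

def checksum(path, bufsize=1 << 20):
    """MD5 checksum of a file, read in chunks to handle large files."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(bufsize), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_restore(manifest_path, restore_root):
    """Each manifest line holds '<md5>  <relative path>'; return mismatches."""
    failures = []
    with open(manifest_path) as manifest:
        for line in manifest:
            expected, rel_path = line.split(None, 1)
            rel_path = rel_path.strip()
            restored = os.path.join(restore_root, rel_path)
            if not os.path.exists(restored) or checksum(restored) != expected:
                failures.append(rel_path)
    return failures

# Example (hypothetical paths): verify_restore("/drill/manifest.txt", "/drill/restored")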

Finally, although of limited use (and a potential security risk), telephone dial-in access can be provided directly into the production servers, possibly allowing login access even when physical or network access is unavailable. While this will not meet the needs of APPLICATION users, it may speed eventual recovery onto another system.

Type II disaster: this is very similar to a Type I disaster, except that the chances of rapid recovery are greater. Also, it may be possible (subject to an economic analysis) to keep several spare disk drives off-site (as opposed to an entire server), the idea being that disk drives are among the most likely sources of hardware failure.
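The economic analysis mentioned above could take the form of a simple annualized comparison, sketched below; all of the failure rates and dollar figures are illustrative placeholders rather than actual APPLICATION costs:

# Illustrative expected-cost comparison for keeping spare disk drives off-site.
# Every figure below is a placeholder assumption, not an actual cost estimate.
failures_per_year = 0.5          # assumed annual rate of disk failures
hours_saved_per_failure = 24     # assumed downtime avoided per failure by having a spare on hand
downtime_cost_per_hour = 500.0   # assumed cost of APPLICATION downtime per hour
annual_cost_of_spares = 3000.0   # assumed annual cost of buying and storing spare drives

expected_loss_avoided = failures_per_year * hours_saved_per_failure * downtime_cost_per_hour

print(f"Expected annual loss avoided by spares: ${expected_loss_avoided:,.0f}")
print(f"Annual cost of keeping spares:          ${annual_cost_of_spares:,.0f}")
if expected_loss_avoided > annual_cost_of_spares:
    print("Under these assumptions, keeping spare drives off-site pays for itself.")
else:
    print("Under these assumptions, keeping spare drives off-site is not justified.")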

Plan: identical to Type I disaster recovery plan.

Type III disaster: this type of disaster is both easier to handle and harder to detect than the first two. The worst case is an intrusion into the system software that goes undetected for a long period of time, for even when it is eventually discovered, it is likely to be impossible to identify, let alone recover, whatever data may have been lost or corrupted. However, once a software "bug" (accidental or deliberate) is clearly identified, it is normally a fairly straightforward matter to fix it.

This type of disaster also lends itself to a number of prophylactic measures. Already in place in APPLICATION are both management and technical means for keeping users honest and accountable, and for detecting and correcting software bugs (including interface failures with other systems). The management approach involves having each employee and their immediate supervisor sign an acknowledgement of their responsibility to treat the application and data as a valuable corporate asset; the technical approach involves such standard activities as requiring password changes, disabling unused accounts, keeping users from accessing parts of the system they have no need to access, etc. Also, system test and other "quality assurance" procedures are in place to catch accidental bugs. Incremental backups, which allow the system to be restored to its state at many different points in the past, also provide flexibility to restore data which has been corrupted.
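As one hypothetical illustration of such a technical control (the account records and the 90-day threshold are assumptions, not a description of APPLICATION's existing mechanism), unused accounts could be flagged for disabling from a list of last-login dates:

# Sketch: flag accounts with no recent login so they can be disabled.
# The account records and the 90-day threshold are illustrative assumptions,
# not an existing APPLICATION mechanism.
from datetime import date, timedelta

ACCOUNTS = [
    {"user": "user_a", "last_login": date(2004, 1, 15)},  # placeholder records
    {"user": "user_b", "last_login": date(2003, 6, 2)},
    {"user": "user_c", "last_login": None},               # never logged in
]

def stale_accounts(accounts, today, max_idle_days=90):
    """Return users whose last login is missing or older than the threshold."""
    cutoff = today - timedelta(days=max_idle_days)
    return [a["user"] for a in accounts
            if a["last_login"] is None or a["last_login"] < cutoff]

print("Candidates for disabling:", stale_accounts(ACCOUNTS, date(2004, 4, 1)))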

Plan: continue to enforce and expand managerial and technical controls on who has what types of access to APPLICATION. Continue to improve system testing and other quality assurance procedures. Management input will be required to determine how far to go; in other words, application access security can be tightened to the point of inconvenience to legitimate users.

Type IV disaster: straying too far from industry-standard software and hardware configurations can cause excessive vulnerability in this area. Failure to adequately compensate staff is another concern.

Plan: create "system documentation" (not the same as the less valuable "disaster recovery manuals" referred to in the "overall approach" section above) to show the general data relationships and flows of our system. Continue to improve the organization and storage of key source code on the system. Spend adequate time on cross-training (ideally, this can be 20% of total productive time for a particular group). Create liaisons to other key technical support groups both inside and outside of THE COMPANY, for mutual assistance and advice. Provide adequate career paths for staff. Note: these activities have historically received little management support, and cannot be carried out unless sufficient time is allowed.

Type V disaster: although the risk may seem generally low, THE COMPANY, as a "deep pockets" organization, is a special target for this type of disaster. For example, some special legal steps were required not long ago to keep us from being sued for breach of contract by a software outfit whose product was originally purchased at the beginning of the APPLICATION project. Also, vendors may at any time choose to aggressively audit our licensing agreements, looking for violations.

Plan: maintain management vigilance over software license agreements.



[1]The list of disaster types is intended to encompass, in a general sense, virtually all types of "disaster". The word "preventable" is used in the sense that steps can be taken to lessen the likelihood of occurrence, as in "preventable accidents"; it does not indicate that the event can be completely avoided at any reasonable cost.

[2]The expected result can be used to make a management assessment of the potential cost of a disaster, to support decisions about self-insurance, calculated risk-taking, etc.