CSA CCM BCR-11
Equipment Redundancy

To ensure business continuity and operational resilience, it's critical to have redundant equipment for your most important systems. This redundant equipment should be located independently from the primary systems at a reasonable distance to avoid a single point of failure. Industry standards provide guidance on the appropriate separation between redundant systems.

Where did this come from?

This control comes from the CSA Cloud Controls Matrix v4.0.10, released on 2023-09-26. You can download the full matrix from the Cloud Security Alliance website at https://cloudsecurityalliance.org/artifacts/cloud-controls-matrix-v4.

The CSA CCM provides a comprehensive set of cloud security controls mapped to various industry standards and regulations. It serves as a great starting point for securing cloud workloads and demonstrating compliance.

For more background on business continuity and disaster recovery in AWS specifically, the AWS documentation on disaster recovery is worth reviewing.

Who should care?

This control is most relevant for:

  • IT managers responsible for ensuring the high availability of critical business applications
  • Compliance officers needing to demonstrate adherence to industry standards around business continuity
  • Executives and business owners whose operations depend on IT systems being continuously available

What is the risk?

Without redundant equipment in a geographically separate location, an adverse event impacting the primary site could completely take down critical systems. This could lead to:

  • Interruption of essential business functions and loss of productivity
  • Damage to brand reputation from downtime of customer-facing services
  • Potential loss of data if disaster recovery capabilities are also impacted
  • Financial losses and liability from breaching service level agreements

Having redundant equipment as prescribed allows failover to a backup system to maintain availability. It significantly reduces the likelihood of an extended outage from a datacenter-level incident.

What's the care factor?

For any organization with low tolerance for IT outages, this should be considered a high priority control. The cost of downtime in terms of lost business, emergency remediation efforts, and reputational harm will often far exceed the expense of maintaining redundant equipment.

However, not all systems demand the same level of availability. A tiered approach can be used to provide cost-effective resilience based on the criticality of each workload. Focus efforts on the systems where downtime would be most damaging to the business.
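One way to make the tiered approach concrete is to map each criticality tier to explicit recovery targets and a matching redundancy strategy. The tier names, RTO/RPO values, and strategies below are illustrative assumptions, not values from the CCM; set them from your own business impact analysis:

```python
# Illustrative mapping of workload criticality tiers to recovery targets.
# All tier names, numbers, and strategies are assumptions for illustration,
# to be replaced with outputs of a real business impact analysis.
TIERS = {
    "tier-1": {"rto_minutes": 15, "rpo_minutes": 5, "strategy": "active-active across sites"},
    "tier-2": {"rto_minutes": 240, "rpo_minutes": 60, "strategy": "warm standby"},
    "tier-3": {"rto_minutes": 1440, "rpo_minutes": 1440, "strategy": "backup and restore"},
}

def recovery_targets(tier: str) -> dict:
    """Look up the recovery targets and strategy for a workload's tier."""
    if tier not in TIERS:
        raise ValueError(f"unknown tier: {tier}")
    return TIERS[tier]
```

Recording targets this way keeps the redundancy spend proportionate: only tier-1 workloads justify full duplicate equipment, while lower tiers get cheaper strategies.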

When is it relevant?

Scenarios where equipment redundancy is highly relevant include:

  • Delivering external services with uptime commitments in SLAs or contracts
  • Supporting internal line-of-business applications that employees rely on to do their jobs
  • Hosting databases that require high availability and minimal data loss
  • Operating in a regulated industry with strict requirements around business continuity

It may be less crucial for:

  • Test/dev environments that can tolerate some downtime
  • Low criticality departmental applications
  • Stateless or easily recreatable workloads
  • Organizations with high risk tolerance and low cost of downtime

What are the trade-offs?

Maintaining redundant equipment has costs and challenges like:

  • Increased capital expenses to procure duplicate hardware
  • Operational overhead to keep the failover environment in sync
  • Added complexity in the infrastructure architecture and failover procedures
  • Potential performance impact from data replication between sites

The business needs to weigh these factors against the benefits and make a risk-based decision on where to apply redundancy. Complete 1:1 duplication is often cost-prohibitive.

How to make it happen?

  1. Identify business critical workloads that require high availability. Understand the tolerance for downtime and data loss for each.
  2. Determine locations for redundant equipment. Select at least one geographically separate datacenter or cloud region a reasonable distance from the primary site. Refer to relevant industry standards on minimum separation.
  3. For each in-scope application, provision equivalent compute, storage, and network resources in the failover location(s). The redundant systems should have sufficient capacity to take over the production load.
  4. Implement data replication between the primary and secondary equipment. This could involve block storage replication, database mirroring, or application-level synchronization. Ensure consistency of data in the failover environment.
  5. Establish monitoring to detect failure of the primary systems. This may include external health checks of the application endpoints and internal telemetry on system availability.
  6. Configure automatic failover where possible to minimize downtime. This could utilize load balancer health checks, DNS failover, database mirroring with witness instances, etc.
  7. For manual failover, document step-by-step runbooks. Identify decision makers empowered to initiate failover. Detail instructions to cut traffic over to the backup environment.
  8. Regularly test failover procedures. Verify monitoring detects outages, failover automation works as expected, and runbooks can be executed successfully. Identify and remediate any gaps.
  9. Plan fallback procedures to revert traffic to primary systems once an outage is resolved. Have criteria defined to determine when to fail back. Keep redundant environment in sync for future need.
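The detection-and-failover logic from steps 5 and 6 can be sketched as a monitor that flips the active site only after several consecutive failed health checks. The threshold and class are illustrative assumptions; in practice your monitoring system or load balancer would run the probes and trigger the cutover:

```python
# Sketch of automatic failover driven by consecutive health-check failures
# (steps 5 and 6 above). The threshold value and "primary"/"secondary" site
# names are illustrative assumptions.
from collections import deque

class FailoverMonitor:
    def __init__(self, failure_threshold: int = 3):
        self.failure_threshold = failure_threshold
        self.recent = deque(maxlen=failure_threshold)  # sliding window of results
        self.active_site = "primary"

    def record_check(self, primary_healthy: bool) -> str:
        """Record one health-check result; fail over after N straight failures."""
        self.recent.append(primary_healthy)
        if (len(self.recent) == self.failure_threshold
                and not any(self.recent)           # every recent check failed
                and self.active_site == "primary"):
            # In a real system this would flip DNS records or load balancer targets.
            self.active_site = "secondary"
        return self.active_site
```

Requiring multiple consecutive failures avoids flapping on a single transient error, and the monitor deliberately stays on the secondary site until a deliberate fallback (step 9) is performed.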

What are some gotchas?

  • Adequate network connectivity is needed between the primary and failover environments. Ensure sufficient bandwidth for data replication and capacity to take on production traffic. Consider dedicated, redundant connectivity.
  • Keeping the failover environment consistent is critical. Besides data, also replicate application code, configuration, DNS records, certificates, and other dependencies. Utilize automation where possible.
  • Some applications may require reconfiguration to support running in the failover environment. For example, updating database connection strings, API endpoints, file paths, etc. Have this well documented.
  • RPO (recovery point objective) reflects the allowable data loss when failing over. Replication frequency and snapshot retention impact your effective RPO. Make sure these align with business needs.
  • RTO (recovery time objective) reflects how quickly you can fail over. Automation, runbook refinement, and regular testing help minimize RTO. Compare against availability SLAs.
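The RPO and RTO checks above reduce to simple comparisons: worst-case data loss is roughly one replication interval, and your real RTO is what failover drills actually measure. The helper functions below are a simplified sketch of that reasoning, not a formal definition:

```python
# Rough sanity checks that replication cadence supports the target RPO and
# that measured drill times meet the target RTO. The one-interval worst-case
# assumption is a simplification for illustration.
def meets_rpo(replication_interval_min: float, target_rpo_min: float) -> bool:
    """Worst-case data loss is about one full replication interval."""
    return replication_interval_min <= target_rpo_min

def meets_rto(measured_failover_min: float, target_rto_min: float) -> bool:
    """Use the time observed in failover drills, not an estimate."""
    return measured_failover_min <= target_rto_min
```

For example, replicating every 15 minutes cannot satisfy a 5-minute RPO no matter how fast failover is, so the replication schedule, not the failover automation, is what needs fixing.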

What are the alternatives?

  • Utilize high availability features within the application stack, such as load balanced web servers, database clustering, serverless functions, etc. This provides redundancy at the component level.
  • For cloud hosted workloads, leverage multiple availability zones within a region. This protects against datacenter failures while keeping resources geographically close for synchronous replication and minimal latency.
  • Consider active-active architectures where traffic is simultaneously served from multiple environments. While more complex, this avoids a hard failover and provides even better availability.
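To illustrate the active-active idea, one common pattern is to route each request deterministically to one of the healthy sites by hashing a request key; losing a site then just shrinks the pool rather than triggering a hard failover. This is a minimal sketch with made-up site names, not a production routing layer:

```python
# Minimal active-active routing sketch: each request hashes to one of the
# currently healthy sites. Site names are illustrative assumptions.
import hashlib

def route(request_id: str, healthy_sites: list) -> str:
    """Pick a site deterministically so the same request always lands together."""
    if not healthy_sites:
        raise RuntimeError("no healthy sites available")
    sites = sorted(healthy_sites)  # stable ordering keeps routing deterministic
    digest = int(hashlib.sha256(request_id.encode()).hexdigest(), 16)
    return sites[digest % len(sites)]
```

With both sites healthy, traffic is spread across them; if one is removed from the healthy list, every request simply resolves to the survivor with no cutover step.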

Explore further

  • ISO 22301 provides a framework for implementing business continuity management systems
  • NIST SP 800-34 contains guidelines for contingency planning and disaster recovery
