To ensure business continuity and operational resilience, it's critical to have redundant equipment for your most important systems. This redundant equipment should be located independently from the primary systems at a reasonable distance to avoid a single point of failure. Industry standards provide guidance on the appropriate separation between redundant systems.
Where did this come from?
This control comes from the CSA Cloud Controls Matrix v4.0.10, released on 2023-09-26. You can download the full matrix from the Cloud Security Alliance website at https://cloudsecurityalliance.org/artifacts/cloud-controls-matrix-v4.
The CSA CCM provides a comprehensive set of cloud security controls mapped to various industry standards and regulations. It serves as a great starting point for securing cloud workloads and demonstrating compliance.
For more background on business continuity and disaster recovery in AWS specifically, see the AWS documentation on the topic.
Who should care?
This control is most relevant for:
- IT managers responsible for ensuring the high availability of critical business applications
- Compliance officers who need to demonstrate adherence to industry standards for business continuity
- Executives and business owners whose operations depend on IT systems being continuously available
What is the risk?
Without redundant equipment in a geographically separate location, an adverse event impacting the primary site could completely take down critical systems. This could lead to:
- Interruption of essential business functions and loss of productivity
- Damage to brand reputation from downtime of customer-facing services
- Potential loss of data if disaster recovery capabilities are also impacted
- Financial losses and liability from breaching service level agreements
Having redundant equipment as prescribed allows failover to a backup system to maintain availability. It significantly reduces the likelihood of an extended outage from a datacenter-level incident.
What's the care factor?
For any organization with low tolerance for IT outages, this should be considered a high priority control. The cost of downtime in terms of lost business, emergency remediation efforts, and reputational harm will often far exceed the expense of maintaining redundant equipment.
However, not all systems demand the same level of availability. A tiered approach can be used to provide cost-effective resilience based on the criticality of each workload. Focus efforts on the systems where downtime would be most damaging to the business.
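As a rough illustration of the tiered approach, the sketch below assigns each workload a resilience tier from its tolerance for downtime and data loss. The tier names, the thresholds, and the example workloads are all hypothetical; substitute values from your own business impact analysis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    name: str
    max_downtime_minutes: float   # tolerated downtime, a hypothetical input
    max_data_loss_minutes: float  # tolerated data loss, a hypothetical input


def resilience_tier(workload: Workload) -> str:
    """Map a workload to an illustrative tier. Thresholds are assumptions."""
    if workload.max_downtime_minutes <= 15 and workload.max_data_loss_minutes <= 5:
        return "tier-1: redundant site with automatic failover"
    if workload.max_downtime_minutes <= 240:
        return "tier-2: warm standby with manual failover"
    return "tier-3: backup and restore only"


# Hypothetical workloads, used only to exercise the function.
workloads = [
    Workload("payments-api", max_downtime_minutes=10, max_data_loss_minutes=1),
    Workload("intranet-wiki", max_downtime_minutes=180, max_data_loss_minutes=60),
    Workload("dev-sandbox", max_downtime_minutes=2880, max_data_loss_minutes=1440),
]

for workload in workloads:
    print(f"{workload.name}: {resilience_tier(workload)}")
```

Keeping the thresholds in one function makes the tiering policy easy to review and adjust as business priorities change.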
When is it relevant?
Scenarios where equipment redundancy is highly relevant include:
- Delivering external services with uptime commitments in SLAs or contracts
- Supporting internal line-of-business applications that employees rely on to do their jobs
- Hosting databases that require high availability and minimal data loss
- Operating in a regulated industry with strict requirements around business continuity
It may be less crucial for:
- Test/dev environments that can tolerate some downtime
- Low-criticality departmental applications
- Stateless or easily recreatable workloads
- Organizations with high risk tolerance and low cost of downtime
What are the trade-offs?
Maintaining redundant equipment brings costs and challenges such as:
- Increased capital expenses to procure duplicate hardware
- Operational overhead to keep the failover environment in sync
- Added complexity in the infrastructure architecture and failover procedures
- Potential performance impact from data replication between sites
The business needs to weigh these factors against the benefits and make a risk-based decision on where to apply redundancy. Complete 1:1 duplication is often cost-prohibitive.
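One simple way to frame that risk-based decision is to compare the expected annual cost of outages with and without redundancy. The sketch below is a deliberately simplified model: it assumes outage frequency, outage duration, and cost per hour can be estimated, and every figure shown is hypothetical.

```python
def expected_annual_outage_cost(
    outages_per_year: float, hours_per_outage: float, cost_per_hour: float
) -> float:
    """Expected yearly loss from outages under a simple linear cost model."""
    return outages_per_year * hours_per_outage * cost_per_hour


def redundancy_pays_off(
    outages_per_year: float,
    hours_per_outage: float,
    hours_per_outage_with_failover: float,
    cost_per_hour: float,
    annual_redundancy_cost: float,
) -> bool:
    """Return True if redundancy lowers the total expected annual cost."""
    without_redundancy = expected_annual_outage_cost(
        outages_per_year, hours_per_outage, cost_per_hour
    )
    with_redundancy = annual_redundancy_cost + expected_annual_outage_cost(
        outages_per_year, hours_per_outage_with_failover, cost_per_hour
    )
    return with_redundancy < without_redundancy


# Hypothetical figures: one site outage every two years, 8 hours each,
# reduced to 15 minutes with failover, at a redundancy cost of 50,000 per year.
critical = redundancy_pays_off(0.5, 8, 0.25, cost_per_hour=20_000, annual_redundancy_cost=50_000)
minor = redundancy_pays_off(0.5, 8, 0.25, cost_per_hour=500, annual_redundancy_cost=50_000)
print(f"critical workload: {critical}, minor workload: {minor}")
```

A model like this ignores reputational and regulatory impact, so treat its result as one input to the decision rather than the answer.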
How to make it happen?
- Identify business-critical workloads that require high availability. Understand the tolerance for downtime and data loss for each.
- Determine locations for redundant equipment. Select at least one geographically separate datacenter or cloud region a reasonable distance from the primary site. Refer to relevant industry standards on minimum separation.
- For each in-scope application, provision equivalent compute, storage, and network resources in the failover location(s). The redundant systems should have sufficient capacity to take over the production load.
- Implement data replication between the primary and secondary equipment. This could involve block storage replication, database mirroring, or application-level synchronization. Verify that data in the failover environment stays consistent with the primary.
- Establish monitoring to detect failure of the primary systems. This may include external health checks of the application endpoints and internal telemetry on system availability.
- Configure automatic failover where possible to minimize downtime. Options include load balancer health checks, DNS failover, and database mirroring with witness instances.
- For manual failover, document step-by-step runbooks. Identify the decision makers empowered to initiate failover. Detail the instructions for cutting traffic over to the backup environment.
- Regularly test failover procedures. Verify monitoring detects outages, failover automation works as expected, and runbooks can be executed successfully. Identify and remediate any gaps.
- Plan failback procedures to return traffic to the primary systems once an outage is resolved. Define the criteria that determine when to fail back. Keep the redundant environment in sync for future need.
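The monitoring, automatic failover, and failback steps above can be sketched as a small state machine. This is a minimal illustration and not a production failover controller: the `FailoverMonitor` class, the site names, and the threshold of three consecutive failed checks are all assumptions, and the health-check results are passed in rather than fetched from a real endpoint.

```python
from dataclasses import dataclass


@dataclass
class FailoverMonitor:
    failure_threshold: int = 3     # consecutive failed checks before failover (assumed value)
    active_site: str = "primary"   # hypothetical site names
    consecutive_failures: int = 0

    def record_check(self, primary_healthy: bool) -> str:
        """Record one health check of the primary; return the site that should serve traffic."""
        if primary_healthy:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
            if (
                self.active_site == "primary"
                and self.consecutive_failures >= self.failure_threshold
            ):
                self.active_site = "secondary"
        return self.active_site

    def fail_back(self) -> str:
        """Deliberate failback, permitted only when the latest check of the primary passed."""
        if self.consecutive_failures == 0:
            self.active_site = "primary"
        return self.active_site


monitor = FailoverMonitor()
# Simulated check results: the primary fails three times in a row, then recovers.
for healthy in [True, False, False, False, True]:
    print(monitor.record_check(healthy))
print(monitor.fail_back())
```

Requiring several consecutive failures avoids failing over on a single transient error. Failback is a separate, explicit call so that traffic does not flap between sites while the primary is still unstable.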
What are some gotchas?
- Adequate network connectivity is needed between the primary and failover environments. Ensure sufficient bandwidth for data replication and capacity to take on production traffic. Consider dedicated, redundant connectivity.
- Keeping the failover environment consistent is critical. Besides data, also replicate application code, configuration, DNS records, certificates, and other dependencies. Utilize automation where possible.
- Some applications may require reconfiguration to run in the failover environment, such as updated database connection strings, API endpoints, or file paths. Document these changes thoroughly.
- RPO (recovery point objective) reflects the allowable data loss when failing over. Replication frequency and snapshot retention impact your effective RPO. Make sure these align with business needs.
- RTO (recovery time objective) reflects how quickly you can fail over. Automation, runbook refinement, and regular testing help minimize RTO. Compare your achieved RTO against availability SLAs.
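The RPO and RTO points lend themselves to a quick arithmetic check. The sketch below assumes interval-based asynchronous replication, where the worst-case data loss is roughly one replication interval plus the time a replication cycle takes to complete, and it assumes availability is measured over a 30-day period. The function names and figures are illustrative only.

```python
def worst_case_rpo_minutes(
    replication_interval_minutes: float, replication_duration_minutes: float
) -> float:
    """Approximate worst-case data loss for interval-based asynchronous replication."""
    return replication_interval_minutes + replication_duration_minutes


def allowed_downtime_minutes(availability_percent: float, period_days: int = 30) -> float:
    """Downtime budget implied by an availability target over the given period."""
    period_minutes = period_days * 24 * 60
    return (1 - availability_percent / 100) * period_minutes


def rto_fits_sla(
    rto_minutes: float,
    availability_percent: float,
    period_days: int = 30,
    expected_failovers: int = 1,
) -> bool:
    """Check whether the expected failovers fit inside the downtime budget."""
    budget = allowed_downtime_minutes(availability_percent, period_days)
    return rto_minutes * expected_failovers <= budget


# Hypothetical values: replication every 15 minutes, 5 minutes per cycle, 30-minute RTO.
print(f"worst-case RPO: {worst_case_rpo_minutes(15, 5):.0f} min")
print(f"99.9% budget: {allowed_downtime_minutes(99.9):.1f} min per 30 days")
print(f"30-minute RTO fits 99.9%: {rto_fits_sla(30, 99.9)}")
print(f"30-minute RTO fits 99.99%: {rto_fits_sla(30, 99.99)}")
```

A 99.9% target over 30 days leaves a budget of roughly 43 minutes, so a single 30-minute failover fits. A 99.99% target leaves only a few minutes, which generally calls for automated failover or an active-active design.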
What are the alternatives?
- Use high availability features within the application stack, such as load-balanced web servers, database clustering, and serverless functions. This provides redundancy at the component level.
- For cloud-hosted workloads, use multiple availability zones within a region. This protects against datacenter failures while keeping resources geographically close for synchronous replication and minimal latency.
- Consider active-active architectures where traffic is simultaneously served from multiple environments. While more complex, this avoids a hard failover and provides even better availability.
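To get a feel for how much these options improve availability, the standard formula for components in parallel can be applied: if each environment is independently available a fraction `a` of the time, at least one of `n` environments is available `1 - (1 - a)^n` of the time. The sketch below assumes failures are fully independent and failover is instantaneous, which real deployments only approximate.

```python
def combined_availability(site_availability: float, sites: int) -> float:
    """Availability of `sites` redundant environments, assuming independent failures."""
    if not 0.0 <= site_availability <= 1.0:
        raise ValueError("site_availability must be between 0 and 1")
    if sites < 1:
        raise ValueError("sites must be at least 1")
    return 1.0 - (1.0 - site_availability) ** sites


# Hypothetical per-site availability of 99%.
for n in (1, 2, 3):
    print(f"{n} site(s): {combined_availability(0.99, n):.6f}")
```

Under these assumptions, two sites at 99% each combine to roughly 99.99%. The independence assumption is the reason separation matters: environments that share power, network paths, or a disaster zone fail together, and the real figure falls short of the formula.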
Explore further
- ISO 22301 provides a framework for implementing business continuity management systems
- NIST SP 800-34 contains guidelines for contingency planning and disaster recovery
For more technical guidance specific to a cloud platform, see that provider's own documentation on resilience and disaster recovery.