Disaster Recovery for OpenStack

Disaster Recovery (DR) for OpenStack is an umbrella topic that describes what needs to be done for applications and services (generally referred to as a workload) running in an OpenStack cloud to survive a large-scale disaster. Providing DR for a workload is a complex task involving infrastructure, software and an understanding of the workload. To enable recovery following a disaster, the administrator needs to execute a complex set of provisioning operations that mimic the day-to-day setup in a different environment. DR for OpenStack-hosted workloads requires support (APIs) in OpenStack components (e.g., Cinder) and tools that may live outside of OpenStack (e.g., scripts) to invoke, orchestrate and leverage the component-specific APIs.

What is Disaster Recovery?

Disaster Recovery is the process of ensuring continuity of a set of workloads following, or in advance of, a large-scale disaster that disrupts the current environment or infrastructure. By large-scale disaster, we mean an event that can lead to the complete loss of a data center, such as a flood, tornado, hurricane or fire. To provide DR, we need a geographically distant site that will be the target of recovery. Any resources, data, etc., that the application needs in order to recover must be at the target site prior to the disaster.

High Availability versus Disaster Recovery

While both High Availability (HA) and Disaster Recovery strive to achieve continued operations in the face of failures, High Availability usually deals with individual component failures, while Disaster Recovery deals with large-scale failures.

Some distinguish HA from DR by networking scope: LAN for HA and WAN for DR. In the cloud context, a better distinction is probably the autonomy of management. High Availability is the mechanism for continued operations within a single cloud environment: one deployment of OpenStack in a single location or in multiple locations. Disaster Recovery is the mechanism for continued operations across multiple cloud environments: multiple OpenStack deployments in various locations. In this context, DR is the continued operation of a workload in an alternative deployment, the recovery target cloud.

Scope and Scenarios

The goal is to provide a mechanism to mark applications and services (a set of OpenStack entities), also referred to as a hosted workload, and to protect them from disaster. In this context the cloud is the equivalent of the physical hardware, and the recovery process focuses on the applications and services, including their data, which are running in the cloud. The mechanism to determine the exact set of VMs, VM images, volumes, etc., to be recovered can be based on a tenant or on a per-entity mechanism. In its most basic case, the protected set could be a single VM, but it can also be all the entities associated with a user.
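
As a minimal sketch of the two selection mechanisms described above, the following Python code picks the protected set from an in-memory inventory, either by tenant or by an explicit list of entity IDs. The `Resource` record, its fields, and the `select_protected` helper are hypothetical names used only for illustration; they are not part of any OpenStack API.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


@dataclass(frozen=True)
class Resource:
    """Hypothetical record for one OpenStack entity (VM, image, volume, ...)."""
    id: str
    kind: str      # e.g. "vm", "image", "volume"
    tenant: str


def select_protected(inventory: Iterable[Resource],
                     tenant: Optional[str] = None,
                     entity_ids: Optional[Set[str]] = None) -> List[Resource]:
    """Return resources that belong to `tenant` or are listed in `entity_ids`."""
    ids = entity_ids or set()
    return [r for r in inventory
            if (tenant is not None and r.tenant == tenant) or r.id in ids]


inventory = [
    Resource("vm-1", "vm", "acme"),
    Resource("vol-1", "volume", "acme"),
    Resource("vm-2", "vm", "globex"),
]

# Per-tenant protection: everything owned by the (made-up) tenant "acme".
by_tenant = select_protected(inventory, tenant="acme")

# Per-entity protection: the most basic case, a single VM.
single_vm = select_protected(inventory, entity_ids={"vm-2"})
```

A real implementation would presumably query the cloud for this inventory; the list here is hard-coded so the selection logic can be read in isolation.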

A separate recovery mechanism, outside the scope of this work, should address making the primary cloud available to run workloads following a disaster. The disaster recovery mechanism for applications and services will handle the fail-back to the primary cloud.

Examples

Application service running on customer cloud and protected by recovery on hosted cloud.
Application service running on customer cloud in data center #1 and protected by recovery on customer data center #2.

The plan is to provide a solution for both born-in-the-cloud applications and legacy applications that require storage and state.

Is this a new OpenStack project?

Not necessarily. A better description would be an umbrella topic that describes the APIs and features that OpenStack needs in order to support DR for hosted workloads. Some APIs and features will be integrated into existing projects such as Nova (DR features for compute) and Cinder (storage replication). Some functionality, like DR orchestration, may leverage Heat, become a new project, or even remain outside the scope of OpenStack.

Disaster Recovery is a complex task where different applications and use-cases have different requirements; some use-cases can be easily supported while others may be more complex. This is therefore targeted as a long-term effort with incremental steps.

Vision and Roadmap

Disaster Recovery should include support for:

Capturing the metadata of the cloud management stack, relevant for the protected workloads/resources: either as point-in-time snapshots of the metadata, or as continuous replication of the metadata.
Making available the VM images needed to run the hosted workload on the target cloud.
Replication of the workload data using storage replication, application level replication, or backup/restore.

We note that metadata changes are less frequent than application data changes, and that different mechanisms can handle replication of different portions of the metadata and data (volumes, images, etc.).
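
To illustrate the point-in-time option for metadata, here is a small sketch that serializes a snapshot of workload metadata and later restores it. The metadata layout and the `snapshot_metadata` / `restore_metadata` helpers are assumptions made for this example and do not reflect how any OpenStack service stores its state.

```python
import json
import time
from typing import Any, Dict, Optional


def snapshot_metadata(metadata: Dict[str, Any],
                      taken_at: Optional[float] = None) -> str:
    """Serialize a point-in-time copy of the workload metadata as JSON."""
    record = {
        "taken_at": time.time() if taken_at is None else taken_at,
        "metadata": metadata,
    }
    return json.dumps(record, sort_keys=True)


def restore_metadata(snapshot: str) -> Dict[str, Any]:
    """Return the metadata exactly as it was when the snapshot was taken."""
    return json.loads(snapshot)["metadata"]


# Hypothetical metadata for one protected VM.
live = {"vm-1": {"flavor": "m1.small", "volumes": ["vol-1"]}}
snap = snapshot_metadata(live, taken_at=0.0)

# A later change on the primary cloud does not affect the snapshot.
live["vm-1"]["flavor"] = "m1.large"
recovered = restore_metadata(snap)
```

Because the snapshot is a serialized copy, changes made after it was taken are lost on recovery; continuous replication of the metadata would narrow that window at the cost of more machinery.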

The approach is built around:

Identify required enablement and missing features in OpenStack projects
Create enablement in specific OpenStack projects
Create orchestration scripts to demonstrate DR

When the resources to be protected are logically associated with a workload (or a set of inter-related workloads), both the replication and the recovery processes should be able to incorporate hooks to ensure consistency of the replicated data and metadata, as well as to enable customization (automated or manual) of the individual workload components at the recovery site. Heat can be used to represent such workloads, as well as to automate the above processes (when applicable).

Design Tenets

The DR is between a primary cloud and a target cloud - independently managed.
The approach should enable a hybrid deployment between private and public cloud.
Note that some of the work related to DR may be relevant to enabling high-availability between regions, availability zones or cells which do share some of the OpenStack services.
Ideally (but not as an immediate step) one of the clouds (primary or target) could be a non-OpenStack or even a non-cloud bare-metal environment.
The primary and target cloud interact through a “mediator” - a DR middleware or gateway to make sure the clouds are decoupled.
The DR scheme will protect a set of VMs and related resources (VM images, persistent storage, network definitions, metadata, etc.). The resources would typically be associated with a workload or a set of workloads owned by a tenant.
Allow flexibility in choice of Recovery Point Objective (RPO) and Recovery Time Objective (RTO).

Disaster Recovery functionality to be supported

Fail-over - switch to recovery site upon failure
Fail-back - switch back to primary site
Test - test application in a sandbox at the recovery site
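
The three operations above can be sketched as a tiny state model. This is only an illustration under stated assumptions: the `ProtectedWorkload` class, the `Site` enum and the method names are hypothetical, and real fail-over would involve provisioning and data replication that this sketch leaves out.

```python
from enum import Enum


class Site(Enum):
    PRIMARY = "primary"
    RECOVERY = "recovery"


class ProtectedWorkload:
    """Hypothetical model tracking where a protected workload is running."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.active_site = Site.PRIMARY
        self.sandbox_running = False

    def fail_over(self) -> None:
        """Switch production to the recovery site."""
        if self.active_site is Site.RECOVERY:
            raise RuntimeError("already running at the recovery site")
        self.active_site = Site.RECOVERY

    def fail_back(self) -> None:
        """Switch production back to the primary site."""
        if self.active_site is Site.PRIMARY:
            raise RuntimeError("already running at the primary site")
        self.active_site = Site.PRIMARY

    def test(self) -> None:
        """Start a sandbox copy at the recovery site; production does not move."""
        self.sandbox_running = True


workload = ProtectedWorkload("web-shop")
workload.test()          # sandbox only, production stays on the primary site
site_after_test = workload.active_site
workload.fail_over()
site_after_fail_over = workload.active_site
workload.fail_back()
site_after_fail_back = workload.active_site
```

The key property shown is that a test run never changes the active site, while fail-over and fail-back are the only transitions that do.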

End goal for Disaster Recovery

Define RPO/RTO objectives
- Defines the replication params (sync/async, bandwidth, etc.)
- Defines DR policy type
Enablement of multiple DR Policy options
- backup to Swift
- Active - Cold standby
- Active - Hot standby
- Active - Active (requires application awareness and support)
- Pluggable DR policies - e.g. DR to the cloud
Ability to mark a complete composite application as protected
Ability to elect DR region or availability zone per application
Ability to create one to many DR relationships per application
Ability to scale down the application at the recovery site if needed
Replication of all configuration and metadata required by an application - Neutron, Cinder, Nova, etc.
Ability to ensure consistency of the replicated data & metadata
Supporting a wide range of data replication methods
- Storage systems based replication
- Hypervisor assisted (possibly between heterogeneous storage systems). For example, using DRBD or Qemu based replication
- Backup and Restore methods
- Pluggable application level replication methods
Integration with Horizon for basic DR orchestration
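
As a rough sketch of how RPO/RTO objectives might drive the replication parameters and the DR policy type listed above, consider the following. The `DRPlan` record, the `plan_for` function and, in particular, the numeric thresholds are assumptions chosen for illustration; this page does not define any such values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DRPlan:
    replication_mode: str   # "sync" or "async"
    policy: str             # one of the DR policy options listed above


def plan_for(rpo_seconds: int, rto_seconds: int) -> DRPlan:
    """Map RPO/RTO objectives to a replication mode and a DR policy type."""
    if rpo_seconds < 0 or rto_seconds < 0:
        raise ValueError("RPO and RTO must be non-negative")

    # Assumption: only a zero RPO (no data loss) forces synchronous replication.
    mode = "sync" if rpo_seconds == 0 else "async"

    # Assumption: illustrative RTO thresholds, not values taken from this page.
    if rto_seconds <= 300:
        policy = "active-hot-standby"
    elif rto_seconds <= 4 * 3600:
        policy = "active-cold-standby"
    else:
        policy = "backup-to-swift"
    return DRPlan(mode, policy)


strict = plan_for(rpo_seconds=0, rto_seconds=60)
moderate = plan_for(rpo_seconds=900, rto_seconds=3600)
relaxed = plan_for(rpo_seconds=86400, rto_seconds=86400)
```

In practice the mapping would also need to account for bandwidth, cost and application awareness (for example, Active - Active cannot be chosen from RPO/RTO alone), which is why it is left out of this sketch.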

Activities

Related sessions at the Icehouse summit

Surviving the worst: A vision for OpenStack disaster recovery - November 7, 9:50am
Storage replication (Cinder) - Volume continuous replication

Contacts and (current) team

Ronen Kat (ronenkat) (IBM)
Ayal Baron (abaron) (Red Hat)
Sean Cohen (scohen) (Red Hat)
Alex Glikson (glikson) (IBM)
Avishay Traeger (avishay-il) (IBM)
