NovaResiliency
Nova Resiliency -- Overview (Draft)
There are many situations in which the failure of an individual OpenStack Nova component can cause unexpected behavior -- ranging from failure to perform a user's request, up to irreversible corruption of cloud state and/or data.
In order to make the Nova 'fabric' more resilient, we propose to introduce several 'circles' of resiliency management, each detecting and reacting to potential failure events at different levels.
- Resiliency mechanisms within each node (potentially running one or more Nova services)
- Service failure
- Event: partial or full failure of a service, detected by a dedicated watchdog mechanism (which could potentially monitor 'liveness' of service-specific aspects)
- Action (isolation & recovery): restart the service (a minimal watchdog sketch follows this block)
- Rejuvenation
- Event: timer (periodic)
- Action (prevention): (graceful) restart of services
- Network failure (in a redundant configuration)
- Event: failure of a network interface, link, or switch
- Action: continue uninterrupted, thanks to the redundant configuration; replace the failed hardware component to prevent an outage due to a repeated failure (note: if replacement requires shutting down the node, evacuate all running VMs first)
- Network disconnect
- Event: node cannot see other nodes
- Action (isolation): shut down services and VMs that are likely to be taken over elsewhere (note: applicable to 'singleton' services that use the Leader Election instrumentation, and to VMs that can be failed over to another node by an HA mechanism)
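To make the node-level mechanisms concrete, here is a minimal watchdog sketch in Python, combining the service-failure and rejuvenation mechanisms above; the service names, intervals, and the pgrep-based liveness probe are illustrative assumptions, not actual Nova code.

```python
# Minimal node-local watchdog sketch (illustrative; service names,
# intervals, and the liveness probe are assumptions, not Nova code).
import subprocess
import time

SERVICES = ["nova-compute", "nova-network"]  # services this node should run
REJUVENATION_PERIOD = 24 * 3600              # periodic graceful restart (s)
CHECK_INTERVAL = 10                          # liveness polling period (s)

def is_alive(service):
    # Process-level probe; a real watchdog could also check
    # service-specific 'liveness' aspects (e.g., RPC responsiveness).
    return subprocess.call(["pgrep", "-f", service]) == 0

def restart(service):
    subprocess.call(["service", service, "restart"])

def main():
    last_rejuvenation = time.time()
    while True:
        for svc in SERVICES:
            if not is_alive(svc):
                restart(svc)                 # isolation & recovery
        if time.time() - last_rejuvenation > REJUVENATION_PERIOD:
            for svc in SERVICES:
                restart(svc)                 # prevention: rejuvenation
            last_rejuvenation = time.time()
        time.sleep(CHECK_INTERVAL)

if __name__ == "__main__":
    main()
```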
- Cell/pool-wide resiliency mechanisms
- Service failure
- Event: failure of a service, detected by a heartbeat/membership mechanism (see NovaZooKeeperServiceHeartbeat; a kazoo-based sketch follows this block)
- Action (isolation):
- the resiliency agent on the node is asked to perform recovery/rejuvenation (e.g., via SSH)
- if that is not successful, or the node is unreachable on the network, the node's power is cycled via the HW management interface (if accessible). Optionally, a fresh image is loaded via PXE.
- Action (recovery):
- for 'singleton' services, a 'standby' copy is promoted to the 'leader' role using the leader election mechanism (see NovaZooKeeperLeaderElection)
- for services that depend on the failed service, the dependent services are notified (e.g., the scheduler, which should stop directing requests to the failed compute node)
- if the failed service is nova-compute, an HA mechanism restarts the instances whose storage is accessible from other nodes (e.g., shared storage, boot from volume, etc)
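The heartbeat/membership and leader-election mechanisms referenced above could be built on ZooKeeper via the kazoo library, roughly as in the following sketch; the znode paths, identifiers, and the recovery callback are assumptions, not the actual NovaZooKeeper* design.

```python
# Sketch of ZooKeeper-based membership + leader election using kazoo;
# paths and identifiers are illustrative assumptions.
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

# Each service registers an ephemeral znode; if the service (or its node)
# dies, the znode disappears when the ZooKeeper session expires, turning
# the failure into a visible membership event.
zk.ensure_path("/nova/services")
zk.create("/nova/services/compute-host1", ephemeral=True)

# A watcher elsewhere notices disappearing members and triggers
# isolation/recovery (SSH to the node, power-cycle, etc).
@zk.ChildrenWatch("/nova/services")
def on_membership_change(children):
    print("live services:", children)

# 'Singleton' services contend for leadership; the 'standby' copy blocks
# in run() until it wins the election, then executes the leader body.
def lead():
    print("promoted to leader; serving requests")

election = zk.Election("/nova/election/scheduler", "host1")
election.run(lead)  # blocks until elected, then calls lead()
```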
- Resiliency mechanisms for stateful operations
- keep track of in-progress operations (e.g., using a workflow engine)
- keep track of success and failure of individual steps
- when a failure is detected, apply a retry mechanism (a sketch follows this list)
- make each stateful component able to detect previously failed attempts and recover (e.g., avoid several attempts to create the same VM)
- garbage collection: periodically check the consistency of the distributed state (e.g., data model in the DB versus the actual libvirt configuration on the nodes), apply cleanup when needed
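A minimal sketch of the stateful-operations idea, assuming a simple in-memory record of completed steps (a real implementation would persist this state, e.g., in the DB or a workflow engine):

```python
# Sketch of idempotent step tracking with retries for a stateful
# operation (e.g., VM creation); step names and storage are assumptions.
import time

MAX_RETRIES = 3

def run_operation(op_id, steps, state):
    # 'state' records which steps already succeeded for this operation,
    # so a previously failed attempt can be resumed without redoing work
    # (e.g., avoiding a second attempt to create the same VM).
    done = state.setdefault(op_id, set())
    for name, step in steps:
        if name in done:
            continue                      # completed in an earlier attempt
        for attempt in range(MAX_RETRIES):
            try:
                step()
                done.add(name)
                break
            except Exception:
                time.sleep(2 ** attempt)  # back off, then retry
        else:
            raise RuntimeError("step %s failed after retries" % name)

# Example: a two-step 'create VM' workflow, resumable after a crash.
state = {}
run_operation("vm-42", [("allocate", lambda: None),
                        ("boot", lambda: None)], state)
```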
Nova Resiliency -- Next Level of Detail (Draft)
Possible failures
- A) node failure (hardware failure, OS crash, etc) -- see below
- B) node network connectivity failure (adapter, cable, port, etc) -- N/A, assuming redundant network connectivity
- management network
- VMs communication network
- storage network
- C) nova service failure (e.g., process crashed) -- detected and restarted by a local watchdog process (see below)
- compute
- volume
- network
- scheduler
- api
- D) Fabric component failure -- N/A, assuming redundant/highly available configuration
- ZK
- DB
- RPC
- E) Failure of other OpenStack services -- N/A, assuming redundant/highly available configuration
- Glance
- Keystone
Recovery from (C) service failure:
A watchdog process (nova-node-keeper) will monitor the following failure conditions and react to recover from them (a sketch of the escalation ladder follows the steps):
Steps:
- checks the liveness of the nova services that are supposed to run on the node. If a service died (but should be up according to the DB) -- restart it; if it is stuck -- kill & restart; if unsuccessful -- restart the OS
- for nodes running nova-compute, checks the liveness of libvirtd. If it died -- restart; if it is stuck -- kill & restart; if unsuccessful -- restart the OS
- other conditions TBD
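The escalation ladder above (restart, then kill & restart, then restart the OS) might look roughly as follows; the commands are illustrative, and the libvirtd probe assumes the libvirt Python binding's isAlive() call.

```python
# Sketch of the nova-node-keeper escalation ladder; commands and the
# libvirt liveness probe are illustrative assumptions.
import subprocess

def libvirtd_alive():
    try:
        import libvirt
        conn = libvirt.open("qemu:///system")
        alive = bool(conn.isAlive())
        conn.close()
        return alive
    except Exception:
        return False

def recover(service, alive):
    if alive():
        return True
    subprocess.call(["service", service, "restart"])   # 1) restart
    if alive():
        return True
    subprocess.call(["pkill", "-9", "-f", service])    # 2) kill & restart
    subprocess.call(["service", service, "start"])
    if alive():
        return True
    subprocess.call(["reboot"])                        # 3) restart the OS
    return False

if __name__ == "__main__":
    recover("libvirtd", libvirtd_alive)
```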
Recovery from (A) node failure:
We want to start the VMs that were running on the failed node on another compute node. The exact recovery procedure depends on the storage configuration:
- A.1) no shared storage -- a new VM is provisioned (same "identity")
- A.2) shared storage -- the same VM is restarted
- A.3) boot from volume -- the same VM is restarted
Steps (a rough sketch of the overall flow follows this list):
- failure detection (using svcgroup, with timeout that allows the system to 'stabilize' in case of transient failures -- e.g., C.1: nova-compute crashed, restarted by watchdog and re-connected to membership service) -- notifying a new service called "nova-vm-keeper"
- verify that the node and the VMs are not accessible on the network (note: should be redundant, but wouldn't hurt to double-check)
- find new placement (e.g., using existing placement/scheduler logic)
- deploy (for A.1) or re-create (for A.2 and A.3) the instance on the new host (down to libvirt, including DB update, etc)
- re-connect volumes/networks (incl. cleanup of the connections to the old host/instance, DB update, etc)
- ensure same "identity" and "state" (db entry, IP, credentials, what else?)
- try restarting the node via HW management interface (or manually)
- when/if the node is back online, ensure that the VMs are not launched again (e.g., make sure nova-node-keeper starts before libvirt, checks in the DB whether the instances have been recovered on a different node, and cleans up stale VM definitions in libvirt)
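Putting the steps together, a rough and heavily simplified sketch of the proposed nova-vm-keeper flow; every helper below is a hypothetical stub standing in for the corresponding step above.

```python
# Rough sketch of the proposed nova-vm-keeper recovery flow; all helpers
# are hypothetical stubs for the corresponding steps above.
class VM(object):
    def __init__(self, name, storage):
        self.name, self.storage = name, storage

def is_reachable(node):  return False            # step 2: verify failure
def find_placement(vm):  return "host2"          # step 3: scheduler logic
def provision(vm, host): print("provision %s on %s" % (vm.name, host))
def recreate(vm, host):  print("recreate %s on %s" % (vm.name, host))
def reconnect(vm, host): print("reconnect volumes/networks of %s" % vm.name)
def update_db(vm, host): print("DB: %s -> %s" % (vm.name, host))
def power_cycle(node):   print("power-cycling %s" % node)

def recover_node(node, instances):
    if is_reachable(node):                       # double-check before acting
        return
    for vm in instances:
        host = find_placement(vm)
        if vm.storage in ("shared", "volume"):
            recreate(vm, host)                   # A.2/A.3: restart same VM
        else:
            provision(vm, host)                  # A.1: new VM, same "identity"
        reconnect(vm, host)
        update_db(vm, host)                      # keep identity and state
    power_cycle(node)                            # try restarting the node

recover_node("host1", [VM("vm-1", "shared"), VM("vm-2", "local")])
```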