Nova Resiliency (Draft)
There are many situations in which the failure of an individual OpenStack Nova component may cause unexpected behavior, ranging from failure to perform a user's request up to irreversible corruption of cloud state and/or data.
To make the Nova 'fabric' more resilient, we propose to introduce several 'circles' of resiliency management, each detecting and reacting to potential failure events at a different level.
- Resiliency mechanisms within each node (potentially running one or more Nova services)
- Service failure
- Event: partial or full failure of a service, detected by a dedicated watchdog mechanism (which could potentially monitor 'liveness' of service-specific aspects)
- Action (isolation & recovery): restart the service
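As a rough illustration, such a per-node watchdog might look like the sketch below (the systemd unit name, interval, and process-level probe are assumptions; a real watchdog would monitor service-specific liveness, e.g., responsiveness on the message queue):

```python
# Hypothetical per-node watchdog: probe liveness periodically and restart
# the service on failure (unit name and interval are assumptions).
import subprocess
import time

SERVICE = "nova-compute"
CHECK_INTERVAL = 10  # seconds between liveness probes

def is_alive(service):
    # 'systemctl is-active --quiet' exits 0 while the unit is running.
    return subprocess.call(["systemctl", "is-active", "--quiet", service]) == 0

while True:
    if not is_alive(SERVICE):
        # Isolation & recovery: restart the failed service.
        subprocess.check_call(["systemctl", "restart", SERVICE])
    time.sleep(CHECK_INTERVAL)
```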
- Rejuvenation
- Event: timer (periodic)
- Action (prevention): (graceful) restart of services
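The rejuvenation timer can be as simple as the following sketch (the period and unit name are assumptions; 'reload-or-restart' falls back to a full restart when the unit does not support a graceful reload):

```python
# Hypothetical rejuvenation loop: gracefully restart services on a timer,
# before latent faults (leaks, fragmentation) turn into failures.
import subprocess
import time

REJUVENATION_PERIOD = 7 * 24 * 3600  # e.g., once a week (assumption)

while True:
    time.sleep(REJUVENATION_PERIOD)
    subprocess.check_call(["systemctl", "reload-or-restart", "nova-api"])
```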
- Network failure (in a redundant configuration)
- Event: failure of a network interface, link, or switch
- Action: continue uninterrupted, thanks to the redundant configuration; replace the failed hardware component to prevent an outage due to a repeated failure (note: if this requires shutting down the node, evacuate all the running VMs first)
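For example, a simple link monitor can flag the loss of one leg of a redundant pair so the hardware is replaced before redundancy is exhausted (the interface names are assumptions):

```python
# Hypothetical link monitor for a redundant NIC pair.
import pathlib

def link_up(iface):
    # /sys/class/net/<iface>/carrier reads "1" while the link is up;
    # reading it raises OSError when the interface is administratively down.
    try:
        return pathlib.Path(f"/sys/class/net/{iface}/carrier").read_text().strip() == "1"
    except OSError:
        return False

for iface in ("eth0", "eth1"):  # the redundant pair (assumption)
    if not link_up(iface):
        print(f"{iface}: link down -- schedule hardware replacement")
```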
- Network disconnect
- Event: the node cannot see other nodes
- Action (isolation): shut down services and VMs which are likely to be taken over elsewhere (Note: applicable to 'singleton' services which use Leader Election instrumentation, and for VMs that can be failed over to another node by an HA mechanism)
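A minimal self-fencing check along these lines (the peer list, the ping probe, and the services to stop are all assumptions):

```python
# Hypothetical self-fencing: if no peer is reachable, assume this node is
# partitioned and stop services that will be taken over elsewhere.
import subprocess

PEERS = ["10.0.0.2", "10.0.0.3", "10.0.0.4"]  # other nodes in the pool

def reachable(host):
    # One ping with a 2-second timeout; exit code 0 means a reply arrived.
    return subprocess.call(["ping", "-c", "1", "-W", "2", host]) == 0

if not any(reachable(p) for p in PEERS):
    # Isolation: let the surviving partition take over the singleton role.
    subprocess.call(["systemctl", "stop", "nova-scheduler"])
```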
- Cell/pool-wide resiliency mechanisms
- Service failure
- Event: failure of a service, detected by heartbeat/membership mechanism (see NovaZooKeeperServiceHeartbeat)
- Action (isolation):
- the resiliency agent on the node is asked to perform recovery/rejuvenation (e.g., via SSH)
- if not successful, or the node is unreachable on the network, the node's power is cycled via the HW management interface (if accessible); optionally, a fresh image is loaded via PXE (this escalation is sketched below, after the list)
- Action (recovery):
- for 'singleton' services, a 'standby' copy is promoted to the 'leader' role using the leader election mechanism (see NovaZooKeeperLeaderElection and the second sketch below)
- for services that depend on the failed service, the dependent services are notified (e.g., the scheduler should stop directing requests to the failed compute node)
- if the failed service is nova-compute, the HA mechanism restarts instances whose storage is accessible from other nodes (e.g., shared storage, boot from volume, etc.)
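A sketch of the isolation escalation described above, from in-band recovery over SSH to an out-of-band power cycle (the host names, credentials, and the use of ipmitool as the HW management interface are assumptions):

```python
# Hypothetical escalation: try SSH-based recovery first, then power-cycle
# the node via its baseboard management controller.
import subprocess

def ssh_restart(node, service):
    # In-band attempt: ask the node to restart the failed service.
    return subprocess.call(
        ["ssh", f"root@{node}", "systemctl", "restart", service]) == 0

def power_cycle(bmc_host, user, password):
    # Out-of-band fallback via IPMI.
    subprocess.check_call(
        ["ipmitool", "-I", "lanplus", "-H", bmc_host,
         "-U", user, "-P", password, "power", "cycle"])

if not ssh_restart("compute-07", "nova-compute"):
    power_cycle("compute-07-bmc", "admin", "secret")
```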
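And a sketch of promoting a 'standby' copy of a singleton service, here using the kazoo ZooKeeper client's election recipe (the election path and the callback are assumptions; the proposed mechanism itself is described in NovaZooKeeperLeaderElection):

```python
# Hypothetical standby promotion via ZooKeeper leader election (kazoo).
from kazoo.client import KazooClient
from kazoo.recipe.election import Election

def run_as_leader():
    # Promotion point: this standby copy now acts as the singleton leader.
    print("promoted to leader; starting singleton service")

zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
zk.start()

# run() blocks until this candidate wins the election, then invokes the
# callback; if the leader's ZooKeeper session dies, a standby is promoted.
Election(zk, "/nova/election/scheduler").run(run_as_leader)
```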
- Resiliency mechanisms for stateful operations
- keep track of in-progress operations (e.g., using a workflow engine)
- keep track of success and failure of individual steps
- when a failure is detected, apply a retry mechanism
- make each stateful component able to detect previously failed attempts and recover (e.g., avoid several attempts to create the same VM)
- garbage collection: periodically check the consistency of the distributed state (e.g., data model in the DB versus the actual libvirt configuration on the nodes), apply cleanup when needed
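As a minimal sketch of the retry and idempotency points above (the in-memory step registry stands in for durable workflow state, e.g., a DB table):

```python
# Hypothetical idempotent, retried execution of stateful steps.
import time

completed_steps = set()  # would be persisted durably in a real system

def run_step(op_id, name, fn, retries=3, delay=2):
    key = (op_id, name)
    if key in completed_steps:
        return  # idempotency: skip a step that already succeeded
    for _ in range(retries):
        try:
            fn()
            completed_steps.add(key)
            return
        except Exception:
            time.sleep(delay)  # simple fixed back-off between retries
    raise RuntimeError(f"step {name!r} of {op_id!r} failed after {retries} attempts")

run_step("vm-42", "allocate_network", lambda: None)
run_step("vm-42", "spawn_instance", lambda: None)
```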