
= Nova Resiliency -- Overview (Draft) =

There are many possible situations in which a failure of an individual OpenStack Nova component may cause unexpected behavior -- ranging from failure to perform a user's request up to irreversible corruption of the cloud state and/or data.

In order to make the Nova 'fabric' more resilient, we propose to introduce several 'circles' of resiliency management, each detecting and reacting to potential failure events at different levels.


 * Resiliency mechanisms within each node (potentially running one or more Nova services)
  * Service failure
   * Event: partial or full failure of a service, detected by a dedicated watchdog mechanism (which could potentially monitor 'liveness' of service-specific aspects)
   * Action (isolation & recovery): restart the service
  * Rejuvenation
   * Event: timer (periodic)
   * Action (prevention): (graceful) restart of services
  * Network failure (in a redundant configuration)
   * Event: failure of a network interface, link, or switch
   * Action: continue uninterrupted, thanks to the redundant configuration; replace the failed hardware component to prevent an outage due to a repeated failure (note: if this requires shutting down the node, evacuate all running VMs first)
  * Network disconnect
   * Event: the node cannot see other nodes
   * Action (isolation): shut down services and VMs which are likely to be taken over elsewhere (note: applicable to 'singleton' services which use Leader Election instrumentation, and to VMs that can be failed over to another node by an HA mechanism)
 * Cell/pool-wide resiliency mechanisms
  * Service failure
   * Event: failure of a service, detected by a heartbeat/membership mechanism (see NovaZooKeeperServiceHeartbeat)
   * Action (isolation):
    * the resiliency agent on the node is approached to perform recovery/rejuvenation (e.g., via SSH)
    * if that is not successful, or the node is unreachable on the network, the node's power is cycled via the HW management interface (if accessible); optionally, a fresh image is loaded via PXE
   * Action (recovery):
    * for 'singleton' services, a 'standby' copy is promoted to the 'leader' role using the leader election mechanism (see NovaZooKeeperLeaderElection)
    * services that depend on the failed service are notified (e.g., the scheduler should stop redirecting requests to a failed compute node)
    * if the failed service is nova-compute, the HA mechanism restarts instances whose storage is accessible from other nodes (e.g., in case of shared storage, boot from volume, etc.)
 * Resiliency mechanisms for stateful operations
  * keep track of in-progress operations (e.g., using a workflow engine)
  * keep track of success and failure of individual steps
  * when a failure is detected, apply a retry mechanism
  * make each stateful component able to detect previously failed attempts and recover (e.g., avoid several attempts to create the same VM)
  * garbage collection: periodically check the consistency of the distributed state (e.g., the data model in the DB versus the actual libvirt configuration on the nodes) and apply cleanup when needed (see the sketch after this list)
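
The garbage-collection item above can be illustrated with a minimal sketch (Python, using the libvirt bindings; get_instances_expected_on_host() is a hypothetical helper standing in for the corresponding Nova DB query) that compares the data model in the DB against the domains actually defined in libvirt on a node:

{{{#!python
import libvirt


def get_instances_expected_on_host(host):
    """Hypothetical helper: return the set of instance names the Nova DB
    believes should exist on this host (stands in for a real DB query)."""
    return set()


def check_host_consistency(host):
    # Connect to the local hypervisor and list every defined domain.
    conn = libvirt.open('qemu:///system')
    try:
        actual = set(dom.name() for dom in conn.listAllDomains())
    finally:
        conn.close()

    expected = get_instances_expected_on_host(host)

    # Domains libvirt knows about but the DB does not: candidates for cleanup
    # (e.g., left over after an instance was recovered on another node).
    orphans = actual - expected
    # Instances the DB expects but libvirt does not define: candidates for
    # re-creation or for being marked as failed.
    missing = expected - actual
    return orphans, missing
}}}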

= Nova Resiliency -- Next Level of Details (Draft) =

Possible failures:
 * A) node failure (hardware failure, OS crash, etc.) -- see below
 * B) node network connectivity failure (adapter, cable, port, etc.) -- N/A, assuming redundant network connectivity
  * management network
  * VMs communication network
  * storage network
 * C) nova service failure (e.g., process crashed) -- detected and restarted by a local watchdog process (see below)
  * compute
  * volume
  * network
  * scheduler
  * api
 * D) Fabric component failure -- N/A, assuming redundant/highly available configuration
  * ZK
  * DB
  * RPC
 * E) Failure of other OpenStack services -- N/A, assuming redundant/highly available configuration
  * Glance
  * Keystone

Recovery from (C) service failure:
A watchdog process (nova-node-keeper) will monitor the following failure conditions and react to recover from them:

Steps:
 * 1) check liveness of the nova services which are supposed to run on the node. If a service has died (but should be up according to the DB) -- restart it; if stuck -- kill & restart; if unsuccessful -- restart the OS (a minimal sketch of such a check loop follows this list)
 * 2) Note: the liveness check should be aware of the service internals. Each service will have instrumentation to monitor/report its liveness, which would be used by nova-node-keeper. Additionally, nova-node-keeper may apply proactive rejuvenation activities (e.g., periodically restart the services).
 * 3) for nodes running a service which acts as a proxy to another function (e.g., nova-compute acting as a proxy to a local libvirtd to manage the VM instances), the watchdog will check liveness of that function, and will attempt to restart/recover it if it becomes unavailable (e.g., restart libvirtd; if stuck -- kill & restart; if unsuccessful -- restart the OS)
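
A minimal sketch of the per-service check loop such a watchdog might run is given below (Python; only the 'died -- restart' case is shown, and the service names, the pidfile layout, and service_should_be_up() are illustrative assumptions rather than actual nova-node-keeper code):

{{{#!python
import os
import subprocess
import time

# Assumed set of services and pidfile layout -- for illustration only.
SERVICES = ['nova-compute', 'nova-network', 'nova-volume']


def pid_is_alive(pidfile):
    """Return True if the process recorded in the pidfile is still running."""
    try:
        with open(pidfile) as f:
            pid = int(f.read().strip())
        os.kill(pid, 0)  # signal 0: existence check only, no signal delivered
        return True
    except (IOError, ValueError, OSError):
        return False


def service_should_be_up(name):
    """Hypothetical helper: ask the Nova DB whether this service is expected
    to be running on this node."""
    return True


def restart(name):
    # Delegate the restart to the init system ('service' command assumed).
    subprocess.call(['service', name, 'restart'])


def watchdog_loop(interval=10):
    while True:
        for name in SERVICES:
            pidfile = '/var/run/%s.pid' % name
            if service_should_be_up(name) and not pid_is_alive(pidfile):
                restart(name)
        time.sleep(interval)
}}}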

Recovery from (A) node failure:
For nodes running nova-compute, we want to start the VMs that were running on the failed node on another compute node (either all of them, or only those specifically labeled as 'HA-enabled'). The exact recovery procedure depends on the storage configuration:
 * 1) no shared storage -- a new VM is provisioned (with the same "identity")
 * 2) shared storage -- the same VM is restarted
 * 3) boot from volume -- the same VM is restarted

Steps (a high-level sketch of this recovery flow follows the list):
 * 1) failure detection (using svcgroup, with a timeout that allows the system to 'stabilize' in case of transient failures -- e.g., C.1: nova-compute crashed, restarted by the watchdog and re-connected to the membership service) -- notifying a new service called "nova-vm-keeper"
 * 2) verify that the node and the VMs are not accessible on the network (note: this check should be redundant, but it wouldn't hurt to double-check)
 * 3) find new placement (e.g., using existing placement/scheduler logic)
 * 4) deploy (for A.1) or re-create (for A.2 and A.3) the instance on the new host (down to libvirt, including DB update, etc)
 * 5) re-connect volumes/networks (incl. cleanup of the connections to the old host/instance, DB update, etc)
 * 6) ensure same "identity" and "state" (db entry, IP, credentials, what else?)
 * 7) try restarting the node via HW management interface (or manually)
 * 8) when/if the node is back online, ensure that the VMs are not launched again (e.g., make sure nova-node-keeper starts before libvirt, checks in the DB whether the instances have been recovered on a different node, and cleans up stale VM definitions in libvirt)
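
The steps above could be orchestrated roughly as follows (Python-style sketch; every helper named here is a hypothetical stand-in for existing Nova scheduler, DB, and driver calls, so only the overall flow is meant to be illustrative):

{{{#!python
# Hypothetical stand-ins for the real Nova calls.
def host_is_reachable(host): return False           # e.g., ping/SSH probe
def instances_on_host(host): return []               # DB query
def select_target_host(instance): return 'nodeX'     # placement/scheduler
def storage_accessible_elsewhere(instance): return True
def recreate_instance(instance, host): pass          # cases A.2 / A.3
def provision_new_instance(instance, host): pass     # case A.1
def reconnect_volumes_and_networks(instance, old, new): pass
def update_db_host(instance, host): pass
def power_cycle(host): pass                          # HW management interface


def recover_failed_compute_node(failed_host):
    # Step 2: make sure this is not a transient glitch before touching
    # the instances.
    if host_is_reachable(failed_host):
        return

    for instance in instances_on_host(failed_host):
        # Step 3: reuse the existing placement/scheduler logic.
        target = select_target_host(instance)

        # Step 4: deploy (A.1) or re-create (A.2, A.3) on the new host.
        if storage_accessible_elsewhere(instance):
            recreate_instance(instance, target)
        else:
            provision_new_instance(instance, target)

        # Steps 5-6: move volume/network attachments, keep the same identity.
        reconnect_volumes_and_networks(instance, failed_host, target)
        update_db_host(instance, target)

    # Step 7: power-cycle the failed node via the HW management interface,
    # so it does not come back and launch stale VM definitions (step 8).
    power_cycle(failed_host)
}}}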

For nodes running nova-api or nova-scheduler (as well as nova-volume and nova-network, in cases where they act as a proxy to other hardware), the extended leader election service (with support for n leaders/active nodes) is used to spawn another instance of the failed service(s) on other node(s); a sketch of the underlying election primitive follows.
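
For the single-leader case, the election primitive could look roughly like the following (Python, using the kazoo ZooKeeper client purely as an illustration, not the actual NovaZooKeeperLeaderElection code; the n-leader extension would generalize this by letting the first n contenders become active):

{{{#!python
from kazoo.client import KazooClient


def run_as_leader():
    """Hypothetical callback: start (or keep serving as) the active copy of
    the service, e.g., nova-scheduler, on this node."""
    pass


# Assumed ZooKeeper ensemble address.
zk = KazooClient(hosts='127.0.0.1:2181')
zk.start()

# Contend for leadership of the 'nova-scheduler' role; run_as_leader() is
# invoked only on the contender that wins the election, while the others
# block until the current leader releases leadership or its session is lost.
election = zk.Election('/nova/election/nova-scheduler', 'node-1')
election.run(run_as_leader)
}}}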