NovaResiliency
Nova Resiliency -- Overview (Draft)
There are many situations in which the failure of an individual OpenStack Nova component can cause unexpected behavior -- ranging from failure to perform a user's request, up to irreversible corruption of cloud state and/or data.
In order to make the Nova 'fabric' more resilient, we propose to introduce several 'circles' of resiliency management, each detecting and reacting to potential failure events at different levels.
- Resiliency mechanisms within each node (potentially running one or more Nova services)
- Service failure
- Event: partial or full failure of a service, detected by a dedicated watchdog mechanism (which could potentially monitor 'liveness' of service-specific aspects)
- Action (isolation & recovery): restart the service (a minimal watchdog sketch follows this block)
- Rejuvenation
- Event: timer (periodic)
- Action (prevention): (graceful) restart of services
- Network failure (in a redundant configuration)
- Event: failure of a network interface, link, or switch
- Action: continue uninterrupted, thanks to the redundant configuration; replace the failed hardware component to prevent an outage due to a repeated failure (note: if replacement requires shutting down the node, evacuate all running VMs first)
- Network disconnect
- Event: node cannot see other nodes
- Action (isolation): shut down services and VMs that are likely to be taken over elsewhere (note: applicable to 'singleton' services that use the Leader Election instrumentation, and to VMs that can be failed over to another node by an HA mechanism)
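To make the node-level mechanisms concrete, here is a minimal watchdog sketch in Python, combining the service-failure and rejuvenation mechanisms above; the service names, intervals, and the pgrep-based liveness probe are illustrative assumptions, not actual Nova code.

```python
# Minimal node-local watchdog sketch (illustrative; service names,
# intervals, and the liveness probe are assumptions, not Nova code).
import subprocess
import time

SERVICES = ["nova-compute", "nova-network"]  # services this node should run
REJUVENATION_PERIOD = 24 * 3600              # periodic graceful restart (s)
CHECK_INTERVAL = 10                          # liveness polling period (s)

def is_alive(service):
    # Process-level probe; a real watchdog could also check
    # service-specific 'liveness' aspects (e.g., RPC responsiveness).
    return subprocess.call(["pgrep", "-f", service]) == 0

def restart(service):
    subprocess.call(["service", service, "restart"])

def main():
    last_rejuvenation = time.time()
    while True:
        for svc in SERVICES:
            if not is_alive(svc):
                restart(svc)                 # isolation & recovery
        if time.time() - last_rejuvenation > REJUVENATION_PERIOD:
            for svc in SERVICES:
                restart(svc)                 # prevention: rejuvenation
            last_rejuvenation = time.time()
        time.sleep(CHECK_INTERVAL)

if __name__ == "__main__":
    main()
```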
- Cell/pool-wide resiliency mechanisms
- Service failure
- Event: failure of a service, detected by a heartbeat/membership mechanism (see NovaZooKeeperServiceHeartbeat; a kazoo-based sketch follows this block)
- Action (isolation):
- the resiliency agent on the node is asked to perform recovery/rejuvenation (e.g., via SSH)
- if that is not successful, or the node is unreachable on the network, the node's power is cycled via the HW management interface (if accessible). Optionally, a fresh image is loaded via PXE.
- Action (recovery):
- for 'singleton' services, a 'standby' copy is promoted to the 'leader' role using the leader election mechanism (see NovaZooKeeperLeaderElection)
- for services that depend on the failed service, the dependent services are notified (e.g., the scheduler, which should stop directing requests to the failed compute node)
- if the failed service is nova-compute, an HA mechanism restarts the instances whose storage is accessible from other nodes (e.g., shared storage, boot from volume, etc)
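The heartbeat/membership and leader-election mechanisms referenced above could be built on ZooKeeper via the kazoo library, roughly as in the following sketch; the znode paths, identifiers, and the recovery callback are assumptions, not the actual NovaZooKeeper* design.

```python
# Sketch of ZooKeeper-based membership + leader election using kazoo;
# paths and identifiers are illustrative assumptions.
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

# Each service registers an ephemeral znode; if the service (or its node)
# dies, the znode disappears when the ZooKeeper session expires, turning
# the failure into a visible membership event.
zk.ensure_path("/nova/services")
zk.create("/nova/services/compute-host1", ephemeral=True)

# A watcher elsewhere notices disappearing members and triggers
# isolation/recovery (SSH to the node, power-cycle, etc).
@zk.ChildrenWatch("/nova/services")
def on_membership_change(children):
    print("live services:", children)

# 'Singleton' services contend for leadership; the 'standby' copy blocks
# in run() until it wins the election, then executes the leader body.
def lead():
    print("promoted to leader; serving requests")

election = zk.Election("/nova/election/scheduler", "host1")
election.run(lead)  # blocks until elected, then calls lead()
```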
- Resiliency mechanisms for stateful operations
- keep track of in-progress operations (e.g., using a workflow engine)
- keep track of success and failure of individual steps
- when a failure is detected, apply a retry mechanism (a sketch follows this list)
- make each stateful component able to detect previously failed attempts and recover (e.g., avoid several attempts to create the same VM)
- garbage collection: periodically check the consistency of the distributed state (e.g., data model in the DB versus the actual libvirt configuration on the nodes), apply cleanup when needed
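A minimal sketch of the stateful-operations idea, assuming a simple in-memory record of completed steps (a real implementation would persist this state, e.g., in the DB or a workflow engine):

```python
# Sketch of idempotent step tracking with retries for a stateful
# operation (e.g., VM creation); step names and storage are assumptions.
import time

MAX_RETRIES = 3

def run_operation(op_id, steps, state):
    # 'state' records which steps already succeeded for this operation,
    # so a previously failed attempt can be resumed without redoing work
    # (e.g., avoiding a second attempt to create the same VM).
    done = state.setdefault(op_id, set())
    for name, step in steps:
        if name in done:
            continue                      # completed in an earlier attempt
        for attempt in range(MAX_RETRIES):
            try:
                step()
                done.add(name)
                break
            except Exception:
                time.sleep(2 ** attempt)  # back off, then retry
        else:
            raise RuntimeError("step %s failed after retries" % name)

# Example: a two-step 'create VM' workflow, resumable after a crash.
state = {}
run_operation("vm-42", [("allocate", lambda: None),
                        ("boot", lambda: None)], state)
```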
Nova Resiliency -- Next Level of Detail (Draft)
Possible failures
- A) node failure (hardware failure, OS crash, etc) -- see below
- B) node network connectivity failure (adapter, cable, port, etc) -- N/A, assuming redundant network connectivity
- management network
- VMs communication network
- storage network
- C) nova service failure (e.g., process crashed) -- detected and restarted by a local watchdog process (see below)
- compute
- volume
- network
- scheduler
- api
- D) Fabric component failure -- N/A, assuming redundant/highly available configuration
- ZK
- DB
- RPC
- E) Failure of other OpenStack services -- N/A, assuming redundant/highly available configuration
- Glance
- Keystone
Recovery from (C) service failure:
A watchdog process (nova-node-keeper) will monitor the following failure conditions and react to recover from them (a sketch of the escalation ladder follows the steps):
Steps:
- checks the liveness of the nova services that are supposed to run on the node. If a service died (but should be up according to the DB) -- restart it; if it is stuck -- kill & restart; if unsuccessful -- restart the OS
- for nodes running nova-compute, checks the liveness of libvirtd. If it died -- restart; if it is stuck -- kill & restart; if unsuccessful -- restart the OS
- other conditions TBD
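The escalation ladder above (restart, then kill & restart, then restart the OS) might look roughly as follows; the commands are illustrative, and the libvirtd probe assumes the libvirt Python binding's isAlive() call.

```python
# Sketch of the nova-node-keeper escalation ladder; commands and the
# libvirt liveness probe are illustrative assumptions.
import subprocess

def libvirtd_alive():
    try:
        import libvirt
        conn = libvirt.open("qemu:///system")
        alive = bool(conn.isAlive())
        conn.close()
        return alive
    except Exception:
        return False

def recover(service, alive):
    if alive():
        return True
    subprocess.call(["service", service, "restart"])   # 1) restart
    if alive():
        return True
    subprocess.call(["pkill", "-9", "-f", service])    # 2) kill & restart
    subprocess.call(["service", service, "start"])
    if alive():
        return True
    subprocess.call(["reboot"])                        # 3) restart the OS
    return False

if __name__ == "__main__":
    recover("libvirtd", libvirtd_alive)
```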
Recovery from (A) node failure:
We want to start the VMs that were running on the failed node on another compute node. The exact recovery procedure depends on the storage configuration:
- A.1) no shared storage -- a new VM is provisioned (same "identity")
- A.2) shared storage -- the same VM is restarted
- A.3) boot from volume -- the same VM is restarted
Steps (a rough sketch of the overall flow follows this list):
- failure detection (using svcgroup, with timeout that allows the system to 'stabilize' in case of transient failures -- e.g., C.1: nova-compute crashed, restarted by watchdog and re-connected to membership service) -- notifying a new service called "nova-vm-keeper"
- verify that the node and the VMs are not accessible on the network (note: should be redundant, but wouldn't hurt to double-check)
- find new placement (e.g., using existing placement/scheduler logic)
- deploy (for A.1) or re-create (for A.2 and A.3) the instance on the new host (down to libvirt, including DB update, etc)
- re-connect volumes/networks (incl. cleanup of the connections to the old host/instance, DB update, etc)
- ensure same "identity" and "state" (db entry, IP, credentials, what else?)
- try restarting the node via HW management interface (or manually)
- when/if the node is back online, ensure that the VMs are not launched again (e.g., make sure nova-node-keeper starts before libvirt, checks in the DB whether the instances have been recovered on a different node, and cleans up stale VM definitions in libvirt)
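Putting the steps together, a rough and heavily simplified sketch of the proposed nova-vm-keeper flow; every helper below is a hypothetical stub standing in for the corresponding step above.

```python
# Rough sketch of the proposed nova-vm-keeper recovery flow; all helpers
# are hypothetical stubs for the corresponding steps above.
class VM(object):
    def __init__(self, name, storage):
        self.name, self.storage = name, storage

def is_reachable(node):  return False            # step 2: verify failure
def find_placement(vm):  return "host2"          # step 3: scheduler logic
def provision(vm, host): print("provision %s on %s" % (vm.name, host))
def recreate(vm, host):  print("recreate %s on %s" % (vm.name, host))
def reconnect(vm, host): print("reconnect volumes/networks of %s" % vm.name)
def update_db(vm, host): print("DB: %s -> %s" % (vm.name, host))
def power_cycle(node):   print("power-cycling %s" % node)

def recover_node(node, instances):
    if is_reachable(node):                       # double-check before acting
        return
    for vm in instances:
        host = find_placement(vm)
        if vm.storage in ("shared", "volume"):
            recreate(vm, host)                   # A.2/A.3: restart same VM
        else:
            provision(vm, host)                  # A.1: new VM, same "identity"
        reconnect(vm, host)
        update_db(vm, host)                      # keep identity and state
    power_cycle(node)                            # try restarting the node

recover_node("host1", [VM("vm-1", "shared"), VM("vm-2", "local")])
```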