Nova Resiliency (Draft)
There are many situations in which the failure of an individual OpenStack Nova component may cause unexpected behavior, ranging from failure to perform a user's request up to irreversible corruption of cloud state and/or data.
To make the Nova 'fabric' more resilient, we propose to introduce several 'circles' of resiliency management, each detecting and reacting to potential failure events at a different level.
- Resiliency mechanisms within each node (potentially running one or more Nova services)
- Service failure
- Event: partial or full failure of a service, detected by a dedicated watchdog mechanism (which could potentially monitor 'liveness' of service-specific aspects)
- Action (isolation & recovery): restart the service
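As a rough illustration, such a per-node watchdog might look like the sketch below (the systemd unit name, interval, and process-level probe are assumptions; a real watchdog would monitor service-specific liveness, e.g., responsiveness on the message queue):

```python
# Hypothetical per-node watchdog: probe liveness periodically and restart
# the service on failure (unit name and interval are assumptions).
import subprocess
import time

SERVICE = "nova-compute"
CHECK_INTERVAL = 10  # seconds between liveness probes

def is_alive(service):
    # 'systemctl is-active --quiet' exits 0 while the unit is running.
    return subprocess.call(["systemctl", "is-active", "--quiet", service]) == 0

while True:
    if not is_alive(SERVICE):
        # Isolation & recovery: restart the failed service.
        subprocess.check_call(["systemctl", "restart", SERVICE])
    time.sleep(CHECK_INTERVAL)
```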
- Rejuvenation
- Event: timer (periodic)
- Action (prevention): (graceful) restart of services
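The rejuvenation timer can be as simple as the following sketch (the period and unit name are assumptions; 'reload-or-restart' falls back to a full restart when the unit does not support a graceful reload):

```python
# Hypothetical rejuvenation loop: gracefully restart services on a timer,
# before latent faults (leaks, fragmentation) turn into failures.
import subprocess
import time

REJUVENATION_PERIOD = 7 * 24 * 3600  # e.g., once a week (assumption)

while True:
    time.sleep(REJUVENATION_PERIOD)
    subprocess.check_call(["systemctl", "reload-or-restart", "nova-api"])
```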
- Network failure (in a redundant configuration)
- Event: failure of a network interface, link, or switch
- Action: continue uninterrupted, thanks to the redundant configuration; replace the failed hardware component to prevent an outage due to a repeated failure (note: if this requires shutting down the node, evacuate all the running VMs first)
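For example, a simple link monitor can flag the loss of one leg of a redundant pair so the hardware is replaced before redundancy is exhausted (the interface names are assumptions):

```python
# Hypothetical link monitor for a redundant NIC pair.
import pathlib

def link_up(iface):
    # /sys/class/net/<iface>/carrier reads "1" while the link is up;
    # reading it raises OSError when the interface is administratively down.
    try:
        return pathlib.Path(f"/sys/class/net/{iface}/carrier").read_text().strip() == "1"
    except OSError:
        return False

for iface in ("eth0", "eth1"):  # the redundant pair (assumption)
    if not link_up(iface):
        print(f"{iface}: link down -- schedule hardware replacement")
```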
- Network disconnect
- Event: the node cannot see other nodes
- Action (isolation): shut down services and VMs which are likely to be taken over elsewhere (Note: applicable to 'singleton' services which use Leader Election instrumentation, and for VMs that can be failed over to another node by an HA mechanism)
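A minimal self-fencing check along these lines (the peer list, the ping probe, and the services to stop are all assumptions):

```python
# Hypothetical self-fencing: if no peer is reachable, assume this node is
# partitioned and stop services that will be taken over elsewhere.
import subprocess

PEERS = ["10.0.0.2", "10.0.0.3", "10.0.0.4"]  # other nodes in the pool

def reachable(host):
    # One ping with a 2-second timeout; exit code 0 means a reply arrived.
    return subprocess.call(["ping", "-c", "1", "-W", "2", host]) == 0

if not any(reachable(p) for p in PEERS):
    # Isolation: let the surviving partition take over the singleton role.
    subprocess.call(["systemctl", "stop", "nova-scheduler"])
```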
- Cell/pool-wide resiliency mechanisms
- Service failure
- Event: failure of a service, detected by heartbeat/membership mechanism (see NovaZooKeeperServiceHeartbeat)
- Action (isolation):
- the resiliency agent on the node is asked to perform recovery/rejuvenation (e.g., via SSH)
- if not successful, or the node is unreachable on the network, the node's power is cycled via the HW management interface (if accessible); optionally, a fresh image is loaded via PXE (this escalation is sketched below, after the list)
- Action (recovery):
- for 'singleton' services, a 'standby' copy is promoted to the 'leader' role using the leader election mechanism (see NovaZooKeeperLeaderElection and the second sketch below)
- for services that depend on the failed service, the dependent services are notified (e.g., the scheduler should stop directing requests to the failed compute node)
- if the failed service is nova-compute, the HA mechanism restarts instances whose storage is accessible from other nodes (e.g., shared storage, boot from volume, etc.)
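A sketch of the isolation escalation described above, from in-band recovery over SSH to an out-of-band power cycle (the host names, credentials, and the use of ipmitool as the HW management interface are assumptions):

```python
# Hypothetical escalation: try SSH-based recovery first, then power-cycle
# the node via its baseboard management controller.
import subprocess

def ssh_restart(node, service):
    # In-band attempt: ask the node to restart the failed service.
    return subprocess.call(
        ["ssh", f"root@{node}", "systemctl", "restart", service]) == 0

def power_cycle(bmc_host, user, password):
    # Out-of-band fallback via IPMI.
    subprocess.check_call(
        ["ipmitool", "-I", "lanplus", "-H", bmc_host,
         "-U", user, "-P", password, "power", "cycle"])

if not ssh_restart("compute-07", "nova-compute"):
    power_cycle("compute-07-bmc", "admin", "secret")
```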
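And a sketch of promoting a 'standby' copy of a singleton service, here using the kazoo ZooKeeper client's election recipe (the election path and the callback are assumptions; the proposed mechanism itself is described in NovaZooKeeperLeaderElection):

```python
# Hypothetical standby promotion via ZooKeeper leader election (kazoo).
from kazoo.client import KazooClient
from kazoo.recipe.election import Election

def run_as_leader():
    # Promotion point: this standby copy now acts as the singleton leader.
    print("promoted to leader; starting singleton service")

zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
zk.start()

# run() blocks until this candidate wins the election, then invokes the
# callback; if the leader's ZooKeeper session dies, a standby is promoted.
Election(zk, "/nova/election/scheduler").run(run_as_leader)
```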
- Resiliency mechanisms for stateful operations
- keep track of in-progress operations (e.g., using a workflow engine)
- keep track of success and failure of individual steps
- when a failure is detected, apply a retry mechanism
- make each stateful component able to detect previously failed attempts and recover (e.g., avoid several attempts to create the same VM)
- garbage collection: periodically check the consistency of the distributed state (e.g., data model in the DB versus the actual libvirt configuration on the nodes), apply cleanup when needed
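As a minimal sketch of the retry and idempotency points above (the in-memory step registry stands in for durable workflow state, e.g., a DB table):

```python
# Hypothetical idempotent, retried execution of stateful steps.
import time

completed_steps = set()  # would be persisted durably in a real system

def run_step(op_id, name, fn, retries=3, delay=2):
    key = (op_id, name)
    if key in completed_steps:
        return  # idempotency: skip a step that already succeeded
    for _ in range(retries):
        try:
            fn()
            completed_steps.add(key)
            return
        except Exception:
            time.sleep(delay)  # simple fixed back-off between retries
    raise RuntimeError(f"step {name!r} of {op_id!r} failed after {retries} attempts")

run_step("vm-42", "allocate_network", lambda: None)
run_step("vm-42", "spawn_instance", lambda: None)
```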