Latest revision as of 23:30, 17 February 2013
Nova Resiliency -- Overview (Draft)
There are many possible situations in which the failure of an individual OpenStack Nova component may cause unexpected behavior -- ranging from failure to perform a user's request up to irreversible corruption of cloud state and/or data.
In order to make the Nova 'fabric' more resilient, we propose to introduce several 'circles' of resiliency management, each detecting and reacting to potential failure events at different levels.
- Resiliency mechanisms within each node (potentially running one or more Nova services)
- Service failure
- Event: partial or full failure of a service, detected by a dedicated watchdog mechanism (which could potentially monitor 'liveness' of service-specific aspects)
- Action (isolation & recovery): restart the service
- Rejuvenation
- Event: timer (periodic)
- Action (prevention): (graceful) restart of services
- Network failure (in a redundant configuration)
- Event: failure of a network interface, link, or switch
- Action: continue uninterrupted, thanks to the redundant configuration. Replace the failed hardware component to prevent an outage due to a repeated failure (note: if this requires shutting down the node, evacuate all the running VMs first)
- Network disconnect
- Event: the node cannot see other nodes
- Action (isolation): shut down services and VMs which are likely to be taken over elsewhere (Note: applicable to 'singleton' services which use Leader Election instrumentation, and for VMs that can be failed over to another node by an HA mechanism)
- Cell/pool-wide resiliency mechanisms
- Service failure
- Event: failure of a service, detected by heartbeat/membership mechanism (see NovaZooKeeperServiceHeartbeat)
- Action (isolation):
- the resiliency agent on the node is contacted to perform recovery/rejuvenation (e.g., via SSH)
- if not successful, or the node is unreachable on the network, the node's power is cycled via the HW management interface (if accessible). Optionally, a fresh image is loaded via PXE.
- Action (recovery):
- for 'singleton' services, a 'standby' copy is promoted to the 'leader' role using leader election mechanism (see NovaZooKeeperLeaderElection)
- for services that depend on the failed service, the dependent services are notified (e.g., the scheduler should stop directing requests to a failed compute node)
- if the failed service is nova-compute, HA mechanism restarts instances whose storage is accessible from other nodes (e.g., in case of shared storage, boot from volume, etc)
- Resiliency mechanisms for stateful operations
- keep track of in-progress operations (e.g., using a workflow engine)
- keep track of success and failure of individual steps
- when a failure is detected, apply a retry mechanism
- make each stateful component able to detect previously failed attempts and recover (e.g., avoid several attempts to create the same VM)
- garbage collection: periodically check the consistency of the distributed state (e.g., data model in the DB versus the actual libvirt configuration on the nodes), apply cleanup when needed
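The retry and idempotency behavior described above for stateful operations can be sketched as follows. This is an illustrative sketch, not existing Nova code: `run_step`, `is_done`, and the bounded-retry policy are all assumptions.

```python
def run_step(step, is_done, attempts=3):
    """Sketch of one idempotent, retried workflow step.

    'step' performs the operation; 'is_done' lets a restarted component
    detect a previously completed attempt (e.g., avoiding a second attempt
    to create the same VM). Both names are hypothetical.
    """
    if is_done():                  # recovery path: step already completed earlier
        return "skipped"
    last_error = None
    for _ in range(attempts):      # simple bounded-retry policy
        try:
            step()
            return "done"
        except Exception as exc:
            last_error = exc       # remember the failure and retry
    raise last_error               # retries exhausted: surface the last error
```

A workflow engine would persist each step's status so that `is_done` can be answered from the DB even after a crash of the component executing the workflow.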
Nova Resiliency -- Next Level of Details (Draft)
Possible failures
- A) node failure (hardware failure, OS crash, etc) -- see below
- B) node network connectivity failure (adapter, cable, port, etc) -- N/A, assuming redundant network connectivity
- management network
- VMs communication network
- storage network
- C) nova service failure (e.g., process crashed) -- detected and restarted by a local watchdog process (see below)
- compute
- volume
- network
- scheduler
- api
- D) Fabric component failure -- N/A, assuming redundant/highly available configuration
- ZK
- DB
- RPC
- E) Failure of other OpenStack services -- N/A, assuming redundant/highly available configuration
- Glance
- Keystone
Recovery from (C) service failure:
A watchdog (nova-node-keeper) process will monitor the following failure conditions, and react to recover from the failure:
Steps:
- checks liveness of the nova services which are supposed to run on the node. If a service has died (but should be up according to the DB) -- restart it; if stuck -- kill & restart; if unsuccessful -- restart the OS
  - Note: the liveness check should be aware of the service internals. Each service will have instrumentation to monitor/report its liveness, which will be used by nova-node-keeper. Additionally, nova-node-keeper may apply proactive rejuvenation (e.g., periodically restarting the services).
- for nodes running a service which acts as a proxy to another function (e.g., nova-compute acting as a proxy to a local libvirtd to manage the VM instances), the watchdog will check liveness of that function and will attempt to restart/recover it if it becomes unavailable (e.g., restart libvirtd; if stuck -- kill & restart; if unsuccessful -- restart the OS)
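A minimal sketch of the keeper's decision logic, assuming a heartbeat-style liveness instrument. All names here (`is_live`, `next_action`, the 30-second timeout, the restart budget) are illustrative assumptions, not part of any existing implementation.

```python
# Escalation labels corresponding to the steps above.
RESTART, KILL_RESTART, REBOOT_OS = "restart", "kill-and-restart", "reboot-os"

def is_live(last_heartbeat, now, timeout=30.0):
    """Hypothetical liveness test: each service periodically records a
    heartbeat timestamp; a stale heartbeat means the service is stuck."""
    return (now - last_heartbeat) <= timeout

def next_action(alive, stuck, failed_restarts, max_restarts=2):
    """Pick the escalation step for one monitored service."""
    if alive and not stuck:
        return None               # healthy -- nothing to do
    if failed_restarts >= max_restarts:
        return REBOOT_OS          # restarts did not help -> restart the OS
    if stuck:
        return KILL_RESTART       # hung process: kill first, then restart
    return RESTART                # plain crash: just restart
```

In practice `alive` would come from a process/pid check and `stuck` from the service's own liveness instrumentation, so a process that is running but not heartbeating still gets the kill-and-restart path.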
Recovery from (A) node failure:
For nodes running nova-compute, we want to start the VMs that were running on the failed node on another compute node (either all of them, or only those specifically labeled as 'HA-enabled'). The exact recovery procedure depends on the storage configuration:
- no shared storage -- new VM is provisioned (same "identity")
- shared storage -- the same VM is restarted
- boot from volume -- the same VM is restarted
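The three cases above can be captured as a simple dispatch table. This is a sketch: the labels follow the A.1-A.3 list, but the configuration keys and action names are hypothetical.

```python
# Storage configuration -> recovery action, per the A.1-A.3 cases above.
RECOVERY_ACTIONS = {
    "no-shared-storage": "provision-new-vm",  # A.1: new VM with the same identity
    "shared-storage":    "restart-same-vm",   # A.2: same VM restarted elsewhere
    "boot-from-volume":  "restart-same-vm",   # A.3: same VM restarted elsewhere
}

def recovery_action(storage_config):
    """Return the recovery action for a failed instance's storage setup."""
    return RECOVERY_ACTIONS[storage_config]
```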
Steps:
- failure detection (using svcgroup, with a timeout that allows the system to 'stabilize' in case of transient failures -- e.g., C.1: nova-compute crashed, restarted by the watchdog and re-connected to the membership service) -- notifying a new service called "nova-vm-keeper"
- verify that the node and the VMs are not accessible on the network (note: should be redundant, but wouldn't hurt to double-check)
- find new placement (e.g., using existing placement/scheduler logic)
- deploy (for A.1) or re-create (for A.2 and A.3) the instance on the new host (down to libvirt, including DB update, etc)
- re-connect volumes/networks (incl. cleanup of the connections to the old host/instance, DB update, etc)
- ensure same "identity" and "state" (db entry, IP, credentials, what else?)
- try restarting the node via HW management interface (or manually)
- when/if the node is back online, ensure that the VMs are not launched again (e.g., make sure nova-node-keeper starts before libvirt, checks in the DB whether the instances have been recovered on a different node, and cleans up stale VM definitions in libvirt)
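The boot-time cleanup in the last step could look roughly like this. It is a sketch with injected stand-ins: `local_domains` for libvirt's domain list, `recovered_elsewhere` for the DB query, and `undefine` for removing a stale definition -- all hypothetical names.

```python
def clean_stale_instances(local_domains, recovered_elsewhere, undefine):
    """Before libvirt autostarts anything, drop definitions of VMs that the
    cell has already recovered on another host, so they are not launched twice."""
    removed = []
    for domain in local_domains:
        if recovered_elsewhere(domain):   # DB says another node now owns this VM
            undefine(domain)              # remove the stale local definition
            removed.append(domain)
    return removed
```

Running this before libvirtd starts (or before domain autostart) is what guarantees a recovered instance is never running on two hosts at once.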
For nodes running nova-api or nova-scheduler (as well as nova-volume and nova-network, in cases when they act as proxies to other hardware), the extended leader election service (with support for n leaders/active nodes) will spawn another instance of the failed service(s) on other node(s).
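The 'n leaders' rule could be modeled on top of ZooKeeper-style sequential ephemeral nodes: the n members with the lowest sequence numbers are active, and when one disappears the next member in order takes over on the following membership change. The sketch below models only the selection rule, not ZooKeeper itself.

```python
def active_members(members, n):
    """Given the current membership (ZooKeeper-style sequential node names),
    return the n members that should be running the service."""
    return sorted(members)[:n]
```

For example, with members n0001, n0003, n0007 and n=2, n0001 and n0003 are active; if n0001's session expires, n0007 becomes active.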