
NovaResiliency

Nova Resiliency -- Overview (Draft)

There are many possible situations in which the failure of an individual OpenStack Nova component may cause unexpected behavior -- ranging from failure to perform a user's request, up to irreversible corruption of cloud state and/or data.

In order to make the Nova 'fabric' more resilient, we propose to introduce several 'circles' of resiliency management, each detecting and reacting to potential failure events at different levels.

  • Resiliency mechanisms within each node (potentially running one or more Nova services)
    1. Service failure
      • Event: partial or full failure of a service, detected by a dedicated watchdog mechanism (which could potentially monitor 'liveness' of service-specific aspects)
      • Action (isolation & recovery): restart the service
    2. Rejuvenation
      • Event: timer (periodic)
      • Action (prevention): (graceful) restart of services
    3. Network failure (in a redundant configuration)
      • Event: failure of a network interface, link, or switch
      • Action: continue uninterrupted, thanks to the redundant configuration; replace the failed hardware component to prevent an outage due to a repeated failure (note: if this requires shutting down the node, evacuate all the running VMs first)
    4. Network disconnect
      • Event: the node cannot see other nodes
      • Action (isolation): shut down services and VMs which are likely to be taken over elsewhere (note: applicable to 'singleton' services which use the Leader Election instrumentation, and to VMs that can be failed over to another node by an HA mechanism)
  • Cell/pool-wide resiliency mechanisms
    1. Service failure
      • Event: failure of a service, detected by heartbeat/membership mechanism (see NovaZooKeeperServiceHeartbeat)
      • Action (isolation):
        1. the resiliency agent on the node is approached to perform recovery/rejuvenation (e.g., via SSH)
        2. if this is not successful, or the node is unreachable on the network, the node's power is cycled via the HW management interface (if accessible). Optionally, a fresh image is loaded via PXE.
      • Action (recovery):
        1. for 'singleton' services, a 'standby' copy is promoted to the 'leader' role using the leader election mechanism (see NovaZooKeeperLeaderElection)
        2. for services that depend on the failed service, the dependent services are notified (e.g., the scheduler should stop redirecting requests to a failed compute node)
        3. if the failed service is nova-compute, an HA mechanism restarts instances whose storage is accessible from other nodes (e.g., in case of shared storage, boot from volume, etc.)
  • Resiliency mechanisms for stateful operations
    1. keep track of in-progress operations (e.g., using a workflow engine)
    2. keep track of success and failure of individual steps
    3. when a failure is detected, apply a retry mechanism (see the sketch after this list)
    4. make each stateful component able to detect previously failed attempts and recover (e.g., avoid several attempts to create the same VM)
    5. garbage collection: periodically check the consistency of the distributed state (e.g., data model in the DB versus the actual libvirt configuration on the nodes), apply cleanup when needed
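
As an illustration of items 3 and 4 above, here is a minimal sketch of how a step of a stateful operation could be retried a bounded number of times while skipping steps that already succeeded in a previous attempt. This is plain Python written for this page, not existing Nova code; names such as run_step, MAX_RETRIES and the in-memory _completed_steps record are made up for the example (a real implementation would persist step results in the DB or a workflow engine).

  # Illustrative sketch only -- not Nova code.
  import time

  MAX_RETRIES = 3        # assumed bound on retries per step
  RETRY_DELAY = 5        # seconds to wait between attempts

  _completed_steps = {}  # step_id -> result; stands in for a persistent record


  def run_step(step_id, func, *args, **kwargs):
      """Run one step of a stateful operation idempotently, with bounded retries."""
      if step_id in _completed_steps:
          # A previous attempt already completed this step (e.g., the VM was
          # already created), so do not repeat it.
          return _completed_steps[step_id]
      last_error = None
      for attempt in range(MAX_RETRIES):
          try:
              result = func(*args, **kwargs)
              _completed_steps[step_id] = result  # record success before moving on
              return result
          except Exception as exc:  # real code would catch narrower exceptions
              last_error = exc
              time.sleep(RETRY_DELAY)
      raise RuntimeError("step %s failed after %d attempts: %s"
                         % (step_id, MAX_RETRIES, last_error))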

Nova Resiliency -- Next Level of Details (Draft)

Possible failures

  • A) node failure (hardware failure, OS crash, etc) -- see below
  • B) node network connectivity failure (adapter, cable, port, etc) -- N/A, assuming redundant network connectivity
    1. management network
    2. VMs communication network
    3. storage network
  • C) nova service failure (e.g., process crashed) -- detected and restarted by a local watchdog process (see below)
    1. compute
    2. volume
    3. network
    4. scheduler
    5. api.
  • D) Fabric component failure -- N/A, assuming redundant/highly available configuration
    1. ZK
    2. DB
    3. RPC
  • E) Failure of other OpenStack services -- N/A, assuming redundant/highly available configuration
    1. Glance
    2. Keystone

Recovery from (C) service failure:

A watchdog process (nova-node-keeper) will monitor for the following failure conditions and react to recover from them:

Steps:

  1. check the liveness of the nova services which are supposed to run on the node. If a service died (but should be up according to the DB) -- restart it; if it is stuck -- kill & restart; if unsuccessful -- restart the OS
    • Note: the liveness check should be aware of the service internals. Each service will have instrumentation to monitor/report its liveness, which would be used by nova-node-keeper. Additionally, nova-node-keeper may apply proactive rejuvenation activities (e.g., periodically restart the services).
  2. for nodes running a service which acts as a proxy to another function (e.g., nova-compute acting as a proxy to a local libvirtd to manage the VM instances), the watchdog will check the liveness of that function, and will attempt to restart/recover it if it becomes unavailable (e.g., restart libvirtd; if stuck -- kill & restart; if unsuccessful -- restart the OS)
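
A minimal sketch of the watchdog loop described above, under simplifying assumptions: the services are managed through the init system, and a plain process lookup is a good-enough liveness check (the real nova-node-keeper would use the per-service liveness instrumentation mentioned in the note). The service list and interval are made-up examples.

  # Illustrative sketch only; escalation to an OS restart is left out.
  import subprocess
  import time

  SERVICES = ["nova-compute", "nova-network"]  # services expected on this node
  CHECK_INTERVAL = 10                          # seconds between liveness checks


  def is_alive(name):
      """Crude liveness check: is a process with this name running?"""
      return subprocess.call(["pidof", "-x", name]) == 0


  def restart(name):
      """Ask the init system to (re)start the service."""
      subprocess.call(["service", name, "restart"])


  def watchdog_loop():
      while True:
          for name in SERVICES:
              if not is_alive(name):
                  # Died or stuck: the init script handles kill & restart;
                  # repeated failures should escalate (e.g., restart the OS).
                  restart(name)
          time.sleep(CHECK_INTERVAL)


  if __name__ == "__main__":
      watchdog_loop()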

Recovery from (A) node failure:

For nodes running nova-compute, we want to start the VMs that were running on the failed node on another compute node (either all of them, or only those specifically labeled as 'HA-enabled'). The exact recovery procedure depends on the storage configuration:

  1. no shared storage -- new VM is provisioned (same "identity")
  2. shared storage -- the same VM is restarted
  3. boot from volume -- the same VM is restarted
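
A minimal sketch of the decision between these three cases; the storage-configuration values and the two helper functions are placeholders written for this page, not existing Nova APIs.

  # Illustrative sketch only -- the helpers stand in for the real provisioning
  # and restart logic (scheduler call, libvirt definition, DB update, etc).
  def provision_replacement(instance_id):
      """Case 1 (no shared storage): provision a new VM with the same 'identity'."""
      pass


  def restart_on_new_host(instance_id):
      """Cases 2 and 3: the disk is still reachable, so restart the same VM."""
      pass


  def recover_vm(instance_id, storage_config):
      if storage_config in ("shared_storage", "boot_from_volume"):
          restart_on_new_host(instance_id)
      else:  # "no_shared_storage"
          provision_replacement(instance_id)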

Steps:

  1. failure detection (using svcgroup, with timeout that allows the system to 'stabilize' in case of transient failures -- e.g., C.1: nova-compute crashed, restarted by watchdog and re-connected to membership service) -- notifying a new service called "nova-vm-keeper"
  2. verify that the node and the VMs are not accessible on the network (note: should be redundant, but wouldn't hurt to double-check)
  3. find new placement (e.g., using existing placement/scheduler logic)
  4. deploy (for A.1) or re-create (for A.2 and A.3) the instance on the new host (down to libvirt, including DB update, etc)
  5. re-connect volumes/networks (incl. cleanup of the connections to the old host/instance, DB update, etc)
  6. ensure same "identity" and "state" (db entry, IP, credentials, what else?)
  7. try restarting the node via HW management interface (or manually)
  8. when/if the node is back online, ensure that the VMs are not launched again (e.g., make sure nova-node-keeper starts before libvirt, checks in the DB whether the instances have been recovered on a different node, and cleans up stale VM definitions in libvirt)
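
A minimal sketch of step 1 above, assuming the membership scheme from NovaZooKeeperServiceHeartbeat in which each live nova-compute holds an ephemeral znode under a well-known path. The path, the stabilization timeout and recover_instances() are assumptions made up for the example; a real nova-vm-keeper would continue with steps 2-8.

  # Illustrative sketch only, using the kazoo ZooKeeper client.
  import time
  from kazoo.client import KazooClient

  MEMBERSHIP_PATH = "/nova/compute"   # assumed path of per-host ephemeral znodes
  STABILIZE_TIMEOUT = 60              # seconds; lets a restarted service re-register


  def recover_instances(host):
      """Placeholder for steps 2-8: verify unreachability, find new placement,
      re-create the instances, re-connect volumes/networks, power-cycle the node."""
      pass


  def watch_membership():
      zk = KazooClient(hosts="127.0.0.1:2181")
      zk.start()
      known = set(zk.get_children(MEMBERSHIP_PATH))
      while True:
          time.sleep(STABILIZE_TIMEOUT)
          current = set(zk.get_children(MEMBERSHIP_PATH))
          for host in known - current:
              # The host's znode disappeared and did not come back within the
              # stabilization window -- treat this as a node failure.
              recover_instances(host)
          known = current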

For nodes running nova-api or nova-scheduler (as well as nova-volume and nova-network, in cases when they act as a proxy to other hardware), the extended leader election service (with support for n leaders/active nodes) will be used to spawn another instance of the failed service(s) on other node(s).
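
A minimal sketch of the takeover idea for singleton services, using the standard ZooKeeper leader election recipe as provided by the kazoo client (see NovaZooKeeperLeaderElection). The znode path and run_service() are made-up placeholders, and the 'n leaders' extension mentioned above is not shown.

  # Illustrative sketch only. Election.run() blocks until this contender is
  # elected; when the current leader fails, a standby takes over here.
  import socket
  from kazoo.client import KazooClient


  def run_service():
      """Placeholder for running the singleton service (e.g., nova-scheduler)
      once this node has been elected as the active instance."""
      pass


  def main():
      zk = KazooClient(hosts="127.0.0.1:2181")
      zk.start()
      election = zk.Election("/nova/leader/nova-scheduler",
                             identifier=socket.gethostname())
      election.run(run_service)


  if __name__ == "__main__":
      main()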