Highly available transactional task management

Problem

Tasks in Nova, such as launching instances, are complicated and error-prone for the following reasons:

Each task may have multiple steps and can take a long time to execute.
A task involves multiple nodes and is subject to unexpected errors, node failures, and transient network partitions. Because such errors are rare, the cleanup procedure that runs after them is often not well tested. For example, see bugs 837687 and 839910.
Multiple tasks can work on the same resources (e.g. a VM, a disk volume), leading to race conditions. For example, a VM might be migrating when the user clicks the terminate button, or while a background arbiter process (see the instance-state-arbiter blueprint) is trying to fix an inconsistent state.

Currently there is no systematic, reusable way to keep track of distributed task execution. There is also no mechanism to find out which tasks are currently using which resources. Task management is implicitly assumed to be VM state management.

Goal

This blueprint proposes building a highly available service that offers first-class APIs for task and resource-lock management. With this service, an operator gains the following capabilities:

progress monitoring for each task, including which step the task is executing, which node it is running on, and which resource locks it is holding
automatic atomic rollback: undo functions handle unexpected errors during execution so that no inconsistent state is left behind
easy retry of failed tasks and abort (kill) of running tasks, with resource locks acquired and released appropriately
worry-free concurrency control that automatically avoids race conditions and deadlocks among tasks
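
The rollback capability can be illustrated with a minimal Python sketch. It assumes each step is declared as a (do, undo) pair; the names run_with_rollback and TaskAborted are hypothetical and are not part of Nova or of this blueprint. In the proposed service the list of completed steps would live in the task's execution log rather than in local memory.

```python
class TaskAborted(Exception):
    """Raised after a failed task has been rolled back (hypothetical name)."""


def run_with_rollback(steps):
    """Run (do, undo) pairs in order; on failure, undo completed steps in reverse."""
    undo_stack = []
    try:
        for do, undo in steps:
            do()
            # Only record the undo once the step has actually completed.
            undo_stack.append(undo)
    except Exception as exc:
        # Roll back the completed steps, most recent first.
        while undo_stack:
            undo_stack.pop()()
        raise TaskAborted(str(exc)) from exc


def _fail():
    raise RuntimeError("hypervisor unreachable")


if __name__ == "__main__":
    trace = []
    steps = [
        (lambda: trace.append("allocate ip"), lambda: trace.append("release ip")),
        (lambda: trace.append("create disk"), lambda: trace.append("delete disk")),
        (_fail, lambda: trace.append("never runs")),
    ]
    try:
        run_with_rollback(steps)
    except TaskAborted:
        print(trace)
```

The undo of the failing step itself is never invoked, because that step did not complete. A real worker would also need its undo functions to be idempotent, since a node can fail in the middle of a rollback.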

Design

storage: task state, including resource locks, is hard state, and should be stored in a highly available and lightweight component, such as ZooKeeper. MySQL is probably acceptable but not ideal.
task content:

  - task_id, owner
  - shared and exclusive locks, e.g. ["instance-deadbeef", "volume-1234"]
  - task execution logs for rollback and retry
  - last_updated timestamp
  - state: e.g. [running on node X, waiting on lock Y, aborting, ended]
  - opaque, task-specific information: e.g. imaging, booting, migrating
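
As a rough illustration, the task content listed above could be modeled as a single record. This is only a sketch: the class name TaskRecord, the TaskState enum, and the field names node, waiting_on, and info are assumptions made for the example, and the blueprint does not prescribe a serialization format or a storage layout.

```python
import time
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class TaskState(Enum):
    RUNNING = "running"     # running on some node
    WAITING = "waiting"     # waiting on a lock
    ABORTING = "aborting"
    ENDED = "ended"


@dataclass
class TaskRecord:
    task_id: str
    owner: str
    shared_locks: set = field(default_factory=set)
    exclusive_locks: set = field(default_factory=set)
    # Completed steps, kept so the task can be rolled back or retried.
    execution_log: list = field(default_factory=list)
    last_updated: float = field(default_factory=time.time)
    state: TaskState = TaskState.RUNNING
    node: Optional[str] = None        # node X, when RUNNING
    waiting_on: Optional[str] = None  # lock Y, when WAITING
    # Opaque, task-specific information, e.g. {"phase": "imaging"}.
    info: dict = field(default_factory=dict)

    def log_step(self, step):
        """Append a completed step and refresh the timestamp."""
        self.execution_log.append(step)
        self.last_updated = time.time()
```

Keeping the record small matters if it is stored in a component such as ZooKeeper, which is designed for small pieces of coordination data rather than bulk storage.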

APIs to change task data structures:

  - task_id = create_task()
  - send_signal(task_id, ...) # kill, terminate, etc.
  - acquire_locks(tid, type, resource_objects)
  - release_locks(tid, ...)
  - update_task(tid, ...) # execution log, task specific state
  - end_task()
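
To make the intended call sequence concrete, here is an in-memory, single-process stand-in for these APIs. It is a sketch under several assumptions that the blueprint does not state: the class name TaskService, the string lock types "shared" and "exclusive", the boolean return value of acquire_locks, and the tid argument to end_task are all invented for the example. Granting every lock in a request or none of them is one possible way to keep a task from holding a partial lock set within a single call; the blueprint does not say which deadlock-avoidance scheme should be used.

```python
import itertools


class TaskService:
    """In-memory stand-in for the proposed service (no persistence, no HA)."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.tasks = {}      # task_id -> {"state", "log", "signals"}
        self.exclusive = {}  # resource -> task_id holding it exclusively
        self.shared = {}     # resource -> set of task_ids sharing it

    def create_task(self):
        task_id = "task-%d" % next(self._ids)
        self.tasks[task_id] = {"state": "running", "log": [], "signals": []}
        return task_id

    def send_signal(self, task_id, signal):
        # Workers are expected to poll this list between steps.
        self.tasks[task_id]["signals"].append(signal)

    def acquire_locks(self, tid, lock_type, resources):
        """Grant every requested lock, or none of them."""
        for res in resources:
            holder = self.exclusive.get(res)
            sharers = self.shared.get(res, set()) - {tid}
            if holder not in (None, tid):
                return False
            if lock_type == "exclusive" and sharers:
                return False
        for res in resources:
            if lock_type == "exclusive":
                self.exclusive[res] = tid
            else:
                self.shared.setdefault(res, set()).add(tid)
        return True

    def release_locks(self, tid, resources=None):
        """Release the given resources, or everything held if None."""
        held = [r for r, t in self.exclusive.items() if t == tid]
        for res in held:
            if resources is None or res in resources:
                del self.exclusive[res]
        for res, holders in self.shared.items():
            if resources is None or res in resources:
                holders.discard(tid)

    def update_task(self, tid, log_entry=None, state=None):
        if log_entry is not None:
            self.tasks[tid]["log"].append(log_entry)
        if state is not None:
            self.tasks[tid]["state"] = state

    def end_task(self, tid):
        self.release_locks(tid)
        self.tasks[tid]["state"] = "ended"
```

A scheduler would call create_task and acquire_locks before dispatching work; a False result corresponds to the "waiting on lock Y" state in the task content. A real implementation would have to perform the check and the grant atomically against the shared store.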

Implementation

change schedulers to use the APIs to create, abort, or retry tasks.
change workers (compute, network) to update task state and the execution log, and to check for abort signals.
add undo declarations to workers to enable automatic atomic rollback.
build command-line management tools that query task state, in the spirit of "ps, kill, top, lsof, strace".
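
The management tools in the last item could start as thin queries over the stored task state. The sketch below assumes task state is available as a plain dictionary keyed by task id with owner, state, step, and locks entries; that layout and the function names ps and lsof are illustrative only.

```python
def ps(tasks):
    """Return one summary line per task, ordered by task id (like `ps`)."""
    lines = []
    for task_id in sorted(tasks):
        t = tasks[task_id]
        lines.append("%-8s %-10s %-10s %s"
                     % (task_id, t["owner"], t["state"], t["step"]))
    return lines


def lsof(tasks, resource):
    """Return ids of tasks holding a lock on `resource` (like `lsof`)."""
    return sorted(tid for tid, t in tasks.items() if resource in t["locks"])


if __name__ == "__main__":
    # Hypothetical snapshot of task state, as a tool might read it.
    snapshot = {
        "task-1": {"owner": "alice", "state": "running", "step": "imaging",
                   "locks": ["instance-deadbeef", "volume-1234"]},
        "task-2": {"owner": "bob", "state": "waiting", "step": "migrating",
                   "locks": ["volume-5678"]},
    }
    for line in ps(snapshot):
        print(line)
    print(lsof(snapshot, "volume-1234"))
```

A kill-style tool would be a similarly thin wrapper around send_signal.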

Target: post-Essex release

This blueprint is related to:

https://blueprints.launchpad.net/nova/+spec/instance-state-arbiter

https://blueprints.launchpad.net/nova/+spec/transaction-orchestration

https://blueprints.launchpad.net/nova/+spec/rpc-improvements

http://etherpad.openstack.org/vmstatemachine

https://blueprints.launchpad.net/nova/+spec/fail-gracefully-on-resource-overcommit