TaskFlow
Revised on: 6/27/2013 by Harlowja
Executive Summary
Taskflow is a Python library for OpenStack that helps make task execution easy, consistent, and reliable. It allows the creation of lightweight task objects and/or functions that are combined together into flows (aka: workflows). It includes components for running these flows in a manner that can be stopped, resumed, and safely reverted. Projects implemented using the Taskflow library gain state resiliency and fault tolerance, and crash recovery becomes simpler. Think of it as a way to protect an action, similar to the way transactions protect operations in an RDBMS. If a manager process is terminated while an action is in progress, unprotected code risks leaving the system in a degraded or inconsistent state. With Taskflow, interrupted actions may be resumed or rolled back automatically when the manager process is resumed.
Using Taskflow to organize actions into lightweight task objects makes atomic code sequences easily testable. Lightweight tasks are arranged into flows (aka: workflows). A flow facilitates the execution of a defined sequence of ordered tasks. Because a flow is a structure (a set of tasks linked together), the calling code and the workflow are decoupled, so flows can be reused. Taskflow provides a few mechanisms for running flows and lets the developer pick whichever one works for their needs.
Conceptual Example
This pseudo code illustrates how a flow would work, for those who are familiar with SQL transactions.
START TRANSACTION
    task1: call nova API to launch a server || ROLLBACK
    task2: when task1 finished, call cinder API to attach block storage to the server || ROLLBACK
    ...perform other tasks...
COMMIT
The above flow could be used by Heat as part of an orchestration to add a server with block storage attached. Heat may launch several of these flows in parallel to prepare a number of identical servers.
Why
OpenStack code has grown organically, and has no standard, consistent way to run sequences of code so that they can be safely resumed or rolled back if the calling process is unexpectedly terminated mid-operation. Most projects don't even attempt to make tasks restartable or revertible. Numerous failure scenarios are simply skipped, and some recovery scenarios are not possible in today's code. Taskflow makes it easy to address these concerns. With widespread use of Taskflow, OpenStack can become more predictable and reliable, even in situations where it's not deployed in high availability configurations.
Design
Key primitives: StructuredWorkflowPrimitives
Tasks
A task is the smallest possible unit of work that can have a rollback sequence associated with it. It could be as simple as a single API call, or a block of code (although the latter is not always preferable, since a block of code is usually hard to resume or revert, especially if it contains complicated logic).
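As a rough illustration of the idea, and not Taskflow's actual API, a task can be pictured as an object that pairs one small unit of work with the code that undoes it. The class name `CreateFile`, the method names `execute` and `revert`, and the helper `demo_task` are all hypothetical names chosen for this sketch.

```python
import os
import tempfile


class CreateFile:
    """Hypothetical task: one small unit of work plus its rollback."""

    def __init__(self, path):
        self.path = path

    def execute(self):
        # The unit of work: a single, easily understood action.
        with open(self.path, "w") as handle:
            handle.write("created by task\n")
        return self.path

    def revert(self):
        # The rollback: undo exactly what execute() did, tolerating the
        # case where execute() never ran or was already undone.
        if os.path.exists(self.path):
            os.remove(self.path)


def demo_task():
    """Run the task, then revert it; report whether the file exists each time."""
    with tempfile.TemporaryDirectory() as directory:
        task = CreateFile(os.path.join(directory, "example.txt"))
        task.execute()
        exists_after_execute = os.path.exists(task.path)
        task.revert()
        exists_after_revert = os.path.exists(task.path)
    return exists_after_execute, exists_after_revert
```

Keeping the work this small is what makes the rollback easy to write: `revert` only has to undo one thing.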
Flows
A flow is a structure that links one or more tasks together in an ordered sequence. When a flow rolls back, it executes the rollback code for each of its child tasks, using whatever reverting mechanism each task has defined for undoing the logic it applied.
Activation
Distributed
When you want your application's tasks to be performed in a system that is highly available and resilient to individual failures.
Patterns offered:
Traditional
When you want your application's tasks to just run inside your application and still take advantage of the functionality Taskflow offers (less resilient).
Patterns offered:
- Linear
- Runs a set of tasks, one after the other. Predecessor tasks may satisfy the requirements of successor tasks.
- Parallel
- Runs a set of tasks using threads, in parallel. The amount of parallelization is limited by the dependencies between tasks.
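The parallel pattern can be sketched with the standard library alone. The following is a minimal illustration under stated assumptions, not how Taskflow itself is implemented: the function `run_parallel`, the `tasks` mapping, and the task names are hypothetical. Each task declares which other tasks it depends on, tasks whose dependencies are all satisfied run together on threads, and results from predecessors are handed to successors.

```python
from concurrent.futures import ThreadPoolExecutor


def run_parallel(tasks, max_workers=4):
    """Run tasks in dependency order, using threads where possible.

    `tasks` maps a task name to (function, set of names it depends on).
    Each function receives a dict of the results produced so far.
    Returns (results, waves), where waves lists which tasks ran together.
    """
    results = {}
    waves = []
    pending = dict(tasks)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while pending:
            # A task is ready once every task it depends on has finished.
            ready = sorted(name for name, (_, deps) in pending.items()
                           if set(deps).issubset(results))
            if not ready:
                raise ValueError("unsatisfiable or circular dependencies")
            # Submit the whole wave before collecting any result, so the
            # tasks in it can run concurrently.
            futures = {name: pool.submit(pending[name][0], dict(results))
                       for name in ready}
            for name, future in futures.items():
                results[name] = future.result()
                del pending[name]
            waves.append(ready)
    return results, waves


# Hypothetical tasks: "network" and "volume" are independent, while
# "server" needs the network and "attach" needs both server and volume.
tasks = {
    "network": (lambda r: "net-1", set()),
    "volume": (lambda r: "vol-1", set()),
    "server": (lambda r: "server on " + r["network"], {"network"}),
    "attach": (lambda r: r["volume"] + " -> " + r["server"],
               {"server", "volume"}),
}
results, waves = run_parallel(tasks)
```

Here only `network` and `volume` can run at the same time; `server` and `attach` must wait, which is the sense in which dependencies limit the amount of parallelization.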
Reversion
Both tasks and flows can be reverted by executing the related rollback code on the task object(s).
For example, suppose a flow asks for a server with a block volume attached by combining two tasks:
    task1: create server || rollback by delete server
    task2: create+attach volume || rollback by delete volume
If the attach volume code fails, all tasks in the flow would be reverted using their rollback code, causing both the server and the volume to be deleted.
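The scenario above can be sketched in a few lines of plain Python. This is an illustrative sketch only: `FakeCloud` is a hypothetical in-memory stand-in for the real nova and cinder APIs, and `run_flow`, `CreateServer`, and `CreateAndAttachVolume` are names invented here rather than parts of Taskflow.

```python
class FakeCloud:
    """Hypothetical in-memory stand-in for the nova and cinder APIs."""

    def __init__(self):
        self.servers = set()
        self.volumes = set()


class CreateServer:
    def execute(self, cloud):
        cloud.servers.add("server-1")

    def revert(self, cloud):
        # Rollback by deleting the server.
        cloud.servers.discard("server-1")


class CreateAndAttachVolume:
    def execute(self, cloud):
        cloud.volumes.add("volume-1")
        # Simulate the attach step failing after the volume exists.
        raise RuntimeError("attach failed")

    def revert(self, cloud):
        # Rollback by deleting the volume.
        cloud.volumes.discard("volume-1")


def run_flow(tasks, cloud):
    """Run tasks in order; on failure, revert them in reverse order.

    Returns True if every task succeeded, False if the flow was reverted.
    """
    started = []
    try:
        for task in tasks:
            # Record the task before running it, so a task that fails
            # part-way through still gets a chance to clean up.
            started.append(task)
            task.execute(cloud)
    except Exception:
        for task in reversed(started):
            task.revert(cloud)
        return False
    return True


cloud = FakeCloud()
succeeded = run_flow([CreateServer(), CreateAndAttachVolume()], cloud)
```

After the run, `succeeded` is `False` and both `cloud.servers` and `cloud.volumes` are empty. Reverting in reverse order means the volume is removed before the server it was meant to attach to.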
Resumption
If a flow is started but is interrupted before it finishes (perhaps the controlling process was killed), the flow may be safely resumed at its last checkpoint. This allows for safe and easy crash recovery for services.
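One way to picture checkpoint-based resumption is to persist the name of each step as soon as it completes, and to skip already-recorded steps on the next run. The sketch below makes several assumptions that are not from Taskflow: the JSON checkpoint file, the function `run_with_checkpoints`, the `SimulatedCrash` exception, and the step names are all hypothetical.

```python
import json
import os
import tempfile


class SimulatedCrash(Exception):
    """Stands in for the controlling process being killed."""


def run_with_checkpoints(steps, checkpoint_path, log):
    """Run (name, function) steps in order, recording each completed name.

    On a later call with the same checkpoint file, steps already recorded
    as complete are skipped, so the flow picks up where it stopped.
    """
    done = []
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as handle:
            done = json.load(handle)
    for name, function in steps:
        if name in done:
            continue
        function(log)
        done.append(name)
        # Persist progress after every step: this is the checkpoint.
        with open(checkpoint_path, "w") as handle:
            json.dump(done, handle)


def demo_resume():
    """Interrupt a two-step flow once, then resume it from its checkpoint."""
    log = []
    crash = {"armed": True}

    def create_server(log):
        log.append("create server")

    def attach_volume(log):
        if crash["armed"]:
            crash["armed"] = False
            raise SimulatedCrash()
        log.append("attach volume")

    steps = [("create_server", create_server),
             ("attach_volume", attach_volume)]
    with tempfile.TemporaryDirectory() as directory:
        path = os.path.join(directory, "flow.json")
        try:
            run_with_checkpoints(steps, path, log)
        except SimulatedCrash:
            pass  # the "process" died part-way through the flow
        run_with_checkpoints(steps, path, log)  # resume after restart
    return log
```

In this sketch the server is created only once even though the flow runs twice. A step that was interrupted before its checkpoint was written would run again on resume, so steps in a scheme like this need to tolerate being retried.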
Examples
Coming soon!
History
Taskflow started as a prototype built by NTTData together with Yahoo! for nova, and has since moved toward a more general solution/library that can form the underlying structure of multiple OpenStack projects at once.
Wiki with requirements and more background:
Future
Taskflow is the library needed to build the Convection service. Convection will add a REST API that allows tasks and flows to be executed in a remote container.
Contributors
- Keith Bray (Rackspace)
- Kevin Chen (Rackspace)
- Joshua Harlow (Yahoo!)
- Rohit Karajgi (NTTData)
- Jessica Lucci (Rackspace)
- Adrian Otto (Rackspace)
- Tushar Patil (NTTData)