Revised on: 4/13/2014 by Harlowja
- 1 Summary
- 2 Why
- 3 Design
- 4 Paradigm shifts
- 5 Best practices
- 6 Past & Present & Future
- 7 Examples
- 8 Contributors (past and present)
- 9 Meeting
- 10 Similar libraries
- 11 Similar languages
- 12 Inspiring papers
- 13 Join us!
- 14 Code
Taskflow is a Python library for OpenStack that helps make task execution easy, consistent, and reliable. It allows the creation of lightweight task objects and/or functions that are combined together into flows (aka: workflows). It includes engines for running these flows in a manner that can be stopped, resumed, and safely reverted. Projects implemented using this library enjoy added state resiliency, fault tolerance and simplified crash recovery. Think of it as a way to protect an action, similar to the way transactions protect operations in an RDBMS. Typically, if a manager process is terminated while an action is in progress, there is a risk that unprotected code will leave the system in a degraded or inconsistent state. With this library, interrupted actions may be resumed or rolled back automatically when a manager process is resumed.
Using Taskflow to organize actions into lightweight task objects also makes atomic code sequences easily testable (since a task does one and only one thing). A flow facilitates the execution of a defined sequence of ordered tasks (and imposes constraints on the way an engine can run those tasks). Because a flow is a structure (a set of tasks linked together), the calling code and the workflow can be decoupled, which allows flows to be reused. Taskflow provides a few mechanisms for structuring & running flows and lets the developer pick and choose which one will work for their needs.
This pseudo code illustrates how a flow would work for those who are familiar with SQL transactions.
START TRANSACTION
    task1: call nova API to launch a server || ROLLBACK
    task2: when task1 finished, call cinder API to attach block storage to the server || ROLLBACK
    ...perform other tasks...
COMMIT
The above flow could be used by Heat (for example) as part of an orchestration to add a server with block storage attached. It may launch several of these in parallel to prepare a number of identical servers (or do other work depending on the desired request).
OpenStack code has grown organically, and does not have a standard, consistent way to run sequences of code so that they can be safely resumed or rolled back if the calling process is unexpectedly terminated while the code is busy doing something. Most projects don't even attempt to make tasks restartable or revertible. Numerous failure scenarios are simply skipped, and some recovery scenarios are not possible in today's code. Taskflow makes it easy to address these concerns.
Goal: With widespread use of Taskflow, OpenStack can become very predictable and reliable, even in situations where it's not deployed in high availability configurations.
Service stop/upgrade/restart (at any time)
A typical issue in the runtime components that OpenStack projects provide is the question of what happens to a daemon (and the state of what the daemon was actively doing) if the daemon is forcibly stopped (this typically happens when software is upgraded, hardware fails, during maintenance windows and for other operational reasons...).
service stop [glance-*, nova-*, quantum...]
Currently many of the OpenStack components do not handle this forced stop in a way that leaves the system in a reconcilable state. Typically the actions that a service was actively doing are immediately forced to stop, cannot be resumed, and are in a way forgotten (a later scanning process may attempt to clean up these orphaned resources).
Note: Taskflow will help in this situation by tracking the actions, tasks, and their associated states so that when the service is restarted (even after the service's software is upgraded) the service can easily resume (or roll back) the tasks that were interrupted when the stop/kill command was triggered. This helps avoid orphaned resources and helps reduce the need for further daemon processes to attempt to do cleanup related work (said daemons typically activate periodically and cause network or disk I/O even if there is no work to do).
Due to the lack of transactional semantics, many of the OpenStack projects will leave resources in an orphaned state (or in an ERROR state). This is largely unacceptable if OpenStack is to be driven by automated systems (for example Heat), which will have no way of analyzing what orphans need to be cleaned up. Taskflow's task-oriented model enables semantics that can be used to correctly track resource modifications. This allows all actions done on a resource (or a group of resources) to be undone in an automated fashion, ensuring that no resource is left behind.
Metrics and history
When OpenStack services are structured into flow objects and patterns, they gain the ability to add metric reporting and action history simply by asking taskflow to record the metrics/history associated with running a flow. The various OpenStack services currently accomplish a similar set of features in varied ways; by using taskflow, those varied ways can be unified into one that works for all the OpenStack services (and the developer using taskflow does not have to be concerned with how taskflow records this information). This helps decouple the metrics & history associated with running a flow from the actual code that defines the flow.
In many of the OpenStack projects there is an attempt to show the progress of actions the project is doing. Unfortunately the implementation of this varies across the projects, leading to inconsistent and not very accurate progress/status reporting and/or tracking. Taskflow can help here by providing a low-level mechanism that makes it much easier (and simpler) to track progress by letting you plug in to taskflow's built-in notification system. This avoids having to intrusively add code to the actions that are being performed to accomplish the same goal. It also makes your progress/status mechanism resilient to future changes by decoupling status/progress tracking from the code that performs the underlying actions.
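The decoupling described above can be sketched in plain Python. This is only an illustration of the idea, not taskflow's actual notification API: the `Notifier` class, the event names, and `run_tasks` are hypothetical names invented for this example.

```python
from collections import defaultdict


class Notifier:
    """Hypothetical stand-in for a notification system (not taskflow's API)."""

    def __init__(self):
        self._listeners = defaultdict(list)

    def register(self, event, callback):
        self._listeners[event].append(callback)

    def notify(self, event, details):
        for callback in self._listeners[event]:
            callback(event, details)


def run_tasks(tasks, notifier):
    """Run (name, callable) pairs in order, emitting events around each one."""
    for index, (name, func) in enumerate(tasks, start=1):
        notifier.notify("task_started", {"name": name})
        func()
        notifier.notify("task_finished",
                        {"name": name, "progress": index / len(tasks)})


# Progress tracking lives entirely in the listener; the tasks themselves
# contain no reporting code.
progress_log = []
notifier = Notifier()
notifier.register(
    "task_finished",
    lambda event, details: progress_log.append(
        (details["name"], details["progress"])))

run_tasks([("create_server", lambda: None),
           ("attach_volume", lambda: None)], notifier)
```

Swapping the listener for one that writes to a database or a metrics service would not require touching the task code, which is the point of the decoupling.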
Key primitives: StructuredWorkflowPrimitives
Ten thousand foot view
An atom is the smallest unit in taskflow. An atom acts as the base for other classes in taskflow (to avoid duplicated functionality). An atom is expected to name its desired input values/requirements and its outputs/provided values, as well as its own name and a version (if applicable).
A task (derived from an atom) is the smallest possible unit of work that can have an execute & rollback sequence associated with it.
A retry (derived from an atom) is the unit that controls a flow's execution. It handles flow failures and can retry the flow with new parameters.
See: Retry for more details.
A flow is a structure that links one or more tasks together in an ordered sequence. When a flow rolls back, it executes the rollback code for each of its child tasks using whatever reverting mechanism the task has defined as applicable to reverting the logic it applied.
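As a rough sketch of the atom and task primitives described above, the following plain Python classes show how a unit of work might name its inputs, outputs, and version. The class layout and the attribute names (`requires`, `provides`, `version`) are assumptions made for illustration; they are not taken from taskflow's real classes.

```python
class Atom:
    """Base unit: names itself, its inputs and its outputs (illustrative)."""

    def __init__(self, name, requires=(), provides=(), version=None):
        self.name = name
        self.requires = tuple(requires)   # names of the needed input values
        self.provides = tuple(provides)   # names of the produced output values
        self.version = version


class Task(Atom):
    """Smallest unit of work with an execute and a rollback step."""

    def execute(self, **inputs):
        raise NotImplementedError

    def revert(self, **inputs):
        """Undo whatever execute() did; the default is to do nothing."""


class CreateServer(Task):
    """Hypothetical task that pretends to create a server."""

    def __init__(self):
        super().__init__("create_server", requires=["flavor"],
                         provides=["server_id"], version="1.0")

    def execute(self, flavor):
        return {"server_id": "server-for-%s" % flavor}


task = CreateServer()
result = task.execute(flavor="small")
```

Because each unit declares what it needs and what it produces, a flow can check that the pieces fit together before anything runs.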
Also known as: how you structure your work to be done (via tasks and flows) in a programmatic manner.
- Linear
- Description: Runs a list of tasks/flows, one after the other in a serial manner.
- Constraints: Predecessor tasks' outputs must satisfy successive tasks' inputs.
- Use-case: This pattern is useful for structuring tasks/flows that are fairly simple, where a task/flow follows a previous one.
- Benefits: Simple.
- Drawbacks: Serial, no potential for concurrency.
- Unordered
- Description: Runs a set of tasks/flows, in any order.
- Use-case: This pattern is useful for tasks/flows that are fairly simple and are typically embarrassingly parallel.
- Constraints: Disallows intertask dependencies.
- Benefits: Simple. Inherently concurrent.
- Drawbacks: No dependencies allowed. Tracking and debugging are harder due to the lack of reliable ordering.
- Directed acyclic graph
- Description: Runs a graph (set of nodes and edges between those nodes) composed of tasks/flows in dependency driven ordering.
- Constraints: Dependency driven, no cycles. A task's dependencies are guaranteed to be satisfied before the task will run.
- Use-case: This pattern allows for a very high level of potential concurrency and is useful for tasks which can be arranged in a directed acyclic graph. Each independent task (or independent subgraph) in the graph could be run in parallel once its dependencies have been satisfied.
- Benefits: Allows for complex task ordering. Can be automatically made concurrent by running disjoint tasks/flows in parallel.
- Drawbacks: Complex. Cannot support cycles. Tracking and debugging are harder due to graph dependency traversal.
Note: any combination of the above can be composed together (aka, it is valid to add a linear pattern into a graph pattern, or vice versa).
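To make the dependency-driven ordering of the graph pattern concrete, here is a small sketch that uses Python's standard `graphlib` module (Python 3.9+) rather than taskflow itself. The task names and their `requires`/`provides` values are hypothetical, and the "waves" are just one way to show which tasks could run concurrently.

```python
from graphlib import TopologicalSorter

# Each hypothetical task lists what it requires and what it provides.
tasks = {
    "create_server": {"requires": set(), "provides": {"server_id"}},
    "create_volume": {"requires": set(), "provides": {"volume_id"}},
    "attach_volume": {"requires": {"server_id", "volume_id"},
                      "provides": {"attachment_id"}},
}

# Map every provided value back to the task that produces it.
producers = {value: name
             for name, spec in tasks.items()
             for value in spec["provides"]}

# A task depends on the producers of everything it requires.
graph = {name: {producers[value] for value in spec["requires"]}
         for name, spec in tasks.items()}

sorter = TopologicalSorter(graph)
sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle

waves = []
while sorter.is_active():
    ready = sorted(sorter.get_ready())  # tasks whose dependencies are done
    waves.append(ready)                 # these could be run concurrently
    sorter.done(*ready)
```

Here the two creation tasks have no dependencies on each other, so they land in the same wave, while the attach task has to wait for both.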
Also known as: how your tasks get from PENDING to FAILED/SUCCESS. The purpose of engines is to reliably execute your desired workflows and handle the control & execution flow around making this possible. This means that code using taskflow only has to worry about forming the workflow, and not about distribution, execution, reverting or how to resume (and more!).
See: engines for more details.
Also known as: the potential state transitions that a flow (and tasks inside of it) can go through.
See: the developer documentation on the states of task, retry and flow for more details.
Flows can be reverted by executing the related rollback code on their tasks.
For example, a flow could ask for a server with a block volume attached by combining two tasks:
task1: create server || rollback by delete server
task2: create+attach volume || rollback by delete volume
If the attach volume code fails, all tasks in the flow would be reverted using their rollback code, causing both the server and the volume to be deleted.
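The server and volume example can be sketched in a few lines of plain Python. This is a simplified illustration of the revert behaviour described above, not how taskflow's engines are implemented; `run_flow` and the four helper functions are hypothetical.

```python
def run_flow(tasks):
    """Run (name, execute, revert) triples in order.

    On failure, call revert for the failed task and for every task that
    already completed, newest first, then re-raise the error.
    """
    started = []
    try:
        for name, execute, revert in tasks:
            started.append((name, revert))
            execute()
    except Exception:
        for name, revert in reversed(started):
            revert()
        raise


events = []


def create_server():
    events.append("create server")


def delete_server():
    events.append("delete server")


def create_and_attach_volume():
    events.append("create volume")
    raise RuntimeError("attach failed")  # simulate the attach step failing


def delete_volume():
    events.append("delete volume")


flow = [
    ("task1", create_server, delete_server),
    ("task2", create_and_attach_volume, delete_volume),
]

try:
    run_flow(flow)
except RuntimeError:
    pass  # the flow failed, but both tasks have been rolled back
```

After the failure, `events` records the two creations followed by the two deletions in reverse order, so neither the server nor the volume is left behind.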
TaskFlow can save current task states, progress, arguments and results, as well as flows, jobs and more to a database (or any other place), which enables task resumption, check-pointing and reversion.
See: persistence for more details about why and how.
If a flow is started, but is interrupted before it finishes (perhaps the controlling process was killed), the flow may be safely resumed at its last checkpoint. This allows for safe and easy crash recovery for services. Taskflow offers different persistence strategies for the checkpoint log, letting you as an application developer pick and choose to fit your application's usage and desired capabilities.
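A minimal sketch of checkpoint-based resumption follows, assuming a JSON file as the checkpoint log. The file format, the `run_with_checkpoints` helper, and the task names are all hypothetical; taskflow's own persistence backends are not shown here.

```python
import json
import os
import tempfile


def run_with_checkpoints(tasks, log_path):
    """Run (name, callable) pairs, skipping those already recorded as done."""
    done = []
    if os.path.exists(log_path):
        with open(log_path) as handle:
            done = json.load(handle)
    executed = []
    for name, func in tasks:
        if name in done:
            continue  # finished before the interruption; do not repeat it
        func()
        done.append(name)
        executed.append(name)
        with open(log_path, "w") as handle:
            json.dump(done, handle)  # checkpoint after every task
    return executed


calls = []


def crash():
    raise RuntimeError("simulated kill of the controlling process")


log_path = os.path.join(tempfile.mkdtemp(), "checkpoints.json")

# First run: the second task is interrupted after the first one completed.
first_run = [("create_server", lambda: calls.append("create_server")),
             ("attach_volume", crash)]
try:
    run_with_checkpoints(first_run, log_path)
except RuntimeError:
    pass

# After a restart, the same flow is submitted again and picks up where
# the checkpoint log says it left off.
second_run = [("create_server", lambda: calls.append("create_server")),
              ("attach_volume", lambda: calls.append("attach_volume"))]
resumed = run_with_checkpoints(second_run, log_path)
```

On the second run only the interrupted task executes, because the completed one is already recorded in the checkpoint log.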
Inputs and Outputs
See: inputs and outputs for more details about the various inputs/outputs/notifications.
See: paradigm shifts to view a few of the changes that may result from programming with (or after using) taskflow.
See: best practices for a helpful set of best usages and best practices that are common when using taskflow.
Past & Present & Future
Taskflow started as a prototype with the NTTdata corporation along with Yahoo! for Nova and has moved into a more general solution/library that can form the underlying structure of multiple OpenStack projects at once.
Right now: We are gathering requirements, continuing work on taskflow (the library), and integrating with various projects.
Taskflow is the library needed to build the Convection service.
- Convection will add a REST API that allows remote execution of flows in a remote container.
Please go here
Contributors (past and present)
- Anastasia Karpinska (Grid Dynamics) [Core]
- Alexander Gorodnev (Grid Dynamics)
- Adrian Otto (Rackspace)
- Changbin Liu (AT&T Labs) [Core]
- Ivan Melnikov (Grid Dynamics) [Core]
- Joshua Harlow (Yahoo!) [PTL] [Core]
- Doug Hellmann (Dreamhost) [PTL] [Core]
- Jessica Lucci (Rackspace)
- Keith Bray (Rackspace)
- Kevin Chen (Rackspace)
- Rohit Karajgi (NTTData)
- Tushar Patil (NTTData)
- Stanislav Kudriashev (Grid Dynamics)
IRC: You will also find us in #openstack-state-management on freenode
Reviews: Help code review
|Version||Summary||Date||Notes||File a Bug|
|0.1.1||Small bug fixes for 0.1||11/15/2013|| ||Bugs?|