
TaskFlow

Revised on: 4/13/2014 by Harlowja

Summary

Taskflow is a Python library for OpenStack that helps make task execution easy, consistent, and reliable. It allows the creation of lightweight task objects and/or functions that are combined together into flows (aka: workflows). It includes engines for running these flows in a manner that can be stopped, resumed, and safely reverted. Projects implemented using this library enjoy added state resiliency, fault tolerance and simplified crash recovery. Think of it as a way to protect an action, similar to the way transactions protect operations in an RDBMS. Typically, if a manager process is terminated while an action is in progress, there is a risk that unprotected code will leave the system in a degraded or inconsistent state. With this library, interrupted actions may be resumed or rolled back automatically when the manager process is restarted.

Using Taskflow to organize actions into lightweight task objects also makes atomic code sequences easily testable (since a task does one and only one thing). A flow facilitates the execution of a defined sequence of ordered tasks (and imposes constraints on the way an engine can run those tasks). Because a flow is only a structure (a set of tasks linked together), it decouples the calling code from the workflow, which allows flows to be reused. Taskflow provides a few mechanisms for structuring and running flows and lets the developer pick whichever one fits their needs.

Conceptual Example

This pseudo code illustrates how a flow would work for those who are familiar with SQL transactions.

START TRANSACTION
   task1: call nova API to launch a server || ROLLBACK
   task2: when task1 finished, call cinder API to attach block storage to the server || ROLLBACK
   ...perform other tasks...
COMMIT

The above flow could be used by Heat (for example) as part of an orchestration to add a server with block storage attached. It may launch several of these in parallel to prepare a number of identical servers (or do other work depending on the desired request).

Why

OpenStack code has grown organically, and does not have a standard and consistent way to run sequences of code so that they can be safely resumed or rolled back if the calling process is unexpectedly terminated while the code is busy doing something. Most projects don't even attempt to make tasks restartable or revertible. There are numerous failure scenarios that are simply skipped and/or recovery scenarios which are not possible in today's code. Taskflow makes it easy to address these concerns.

Goal: With widespread use of Taskflow, OpenStack can become very predictable and reliable, even in situations where it's not deployed in high availability configurations.

Further use-cases

Service stop/upgrade/restart (at any time)

A typical issue in the runtime components that OpenStack projects provide is the question of what happens to a daemon (and to the state of whatever it was actively doing) if it is forcibly stopped. This typically happens when software is upgraded, when hardware fails, during maintenance windows, and for other operational reasons.

service stop [glance-*, nova-*, quantum...]

Currently many of the OpenStack components do not handle this forced stop in a way that leaves the system in a reconcilable state. Typically the actions that a service was actively performing are immediately forced to stop, cannot be resumed, and are in a way forgotten (a later scanning process may attempt to clean up these orphaned resources).

Note: Taskflow will help in this situation by tracking the actions, tasks, and their associated states so that when the service is restarted (even after the service's software is upgraded) the service can easily resume (or roll back) the tasks that were interrupted when the stop/kill command was triggered. This helps avoid orphaned resources and reduces the need for further daemon processes that attempt cleanup-related work (said daemons typically activate periodically and cause network or disk I/O even if there is no work to do).

Orphaned resources

Due to the lack of transactional semantics, many of the OpenStack projects will leave resources in an orphaned state (or in an ERROR state). This is largely unacceptable if OpenStack is to be driven by automated systems (for example Heat), which have no way of analyzing what orphans need to be cleaned up. Taskflow, by providing its task-oriented model, will enable semantics which can be used to correctly track resource modifications. This will allow all actions done on a resource (or a group of resources) to be undone in an automated fashion, ensuring that no resource is left behind.

Metrics and history

When OpenStack services are structured into task and flow objects and patterns, they gain the ability to add metric reporting and action history simply by asking taskflow to record the metrics/history associated with running a task and flow. The various OpenStack services currently accomplish a similar set of features in varied ways; by using taskflow, those varied ways can be unified into one that works for all the OpenStack services (and the developer using taskflow does not have to be concerned with how taskflow records this information). This helps decouple the metrics and history associated with running a task and flow from the actual code that defines the task and flow actions.

Progress/status tracking

In many of the OpenStack projects there is an attempt to show the progress of actions the project is doing. Unfortunately the implementation of this varies across projects, which leads to inconsistent and not very accurate progress/status reporting and/or tracking. Taskflow can help here by providing a low-level mechanism that makes it much easier to track progress: you plug in to taskflow's built-in notification system. This avoids having to intrusively add code to the actions that are being performed to accomplish the same goal. It also makes your progress/status mechanism resilient to future changes by decoupling status/progress tracking from the code that performs the underlying actions.
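The decoupling described above can be illustrated with a minimal publish/subscribe sketch. It is not taskflow's notification API: the `Notifier` class, the `run_tasks` helper, and the event names `started` and `finished` are all hypothetical names chosen for this illustration.

```python
from collections import defaultdict


class Notifier:
    """Tiny publish/subscribe helper (hypothetical, not taskflow's API)."""

    def __init__(self):
        self._listeners = defaultdict(list)

    def register(self, event, callback):
        self._listeners[event].append(callback)

    def notify(self, event, details):
        for callback in self._listeners[event]:
            callback(event, details)


def run_tasks(tasks, notifier):
    """Run (name, callable) pairs in order, emitting events around each one."""
    total = len(tasks)
    for index, (name, func) in enumerate(tasks, start=1):
        notifier.notify("started", {"task": name})
        func()
        notifier.notify("finished", {"task": name, "progress": index / total})


progress_log = []
notifier = Notifier()
# The listener lives outside the task code, so the tasks stay unaware of tracking.
notifier.register(
    "finished",
    lambda event, details: progress_log.append((details["task"], details["progress"])),
)

run_tasks(
    [("create_server", lambda: None), ("attach_volume", lambda: None)],
    notifier,
)
```

Because the tasks never call the progress code directly, the listener can be replaced (for example by one that writes to a database) without touching the tasks.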

Design

Key primitives: StructuredWorkflowPrimitives

Ten thousand foot view

Steps.png


Structure

Atoms

Flow2.png

An atom is the smallest unit in taskflow. It acts as the base for other classes in taskflow (to avoid duplicated functionality). An atom is expected to name its desired input values/requirements and its outputs/provided values, as well as its own name and a version (if applicable).
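As a rough illustration of why declaring inputs and outputs is useful, the sketch below models an atom as a small record and checks whether every requirement is satisfied by an earlier atom. `AtomSketch` and `missing_inputs` are hypothetical names invented for this example; they are not classes or functions from the taskflow library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AtomSketch:
    """Hypothetical stand-in for an atom: a name, a version, and declared data needs."""

    name: str
    requires: frozenset = frozenset()
    provides: frozenset = frozenset()
    version: tuple = (1, 0)


def missing_inputs(atoms, initial=()):
    """Return {atom name: unmet requirements} when atoms run in the given order."""
    available = set(initial)
    missing = {}
    for atom in atoms:
        unmet = atom.requires - available
        if unmet:
            missing[atom.name] = unmet
        available |= atom.provides
    return missing


create = AtomSketch(
    "create_server",
    requires=frozenset({"flavor"}),
    provides=frozenset({"server_id"}),
)
attach = AtomSketch(
    "attach_volume",
    requires=frozenset({"server_id", "volume_size"}),
)

# Only "flavor" is supplied up front, so "volume_size" is reported as unmet.
problems = missing_inputs([create, attach], initial={"flavor"})
```

A check like this can run before any work starts, which is one reason named requirements are worth the extra declaration.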

Tasks

A task (derived from an atom) is the smallest possible unit of work that can have an execute and rollback sequence associated with it.

Retries

A retry (derived from an atom) is the unit that controls a flow's execution. It handles flow failures and can retry the flow with new parameters.

See: Retry for more details.

Flows

A flow is a structure that links one or more tasks together in an ordered sequence. When a flow rolls back, it executes the rollback code for each of its child tasks, using whatever reverting mechanism each task has defined as applicable to reverting the logic it applied.

Patterns

Also known as: how you structure your work to be done (via tasks and flows) in a programmatic manner.

Linear

  • Description: Runs a list of tasks/flows, one after the other in a serial manner.
  • Constraints: Predecessor tasks' outputs must satisfy successive tasks' inputs.
  • Use-case: This pattern is useful for structuring tasks/flows that are fairly simple, where a task/flow follows a previous one.
  • Benefits: Simple.
  • Drawbacks: Serial, no potential for concurrency.

Unordered

  • Description: Runs a set of tasks/flows, in any order.
  • Use-case: This pattern is useful for tasks/flows that are fairly simple and are typically embarrassingly parallel.
  • Constraints: Disallows intertask dependencies.
  • Benefits: Simple. Inherently concurrent.
  • Drawbacks: No dependencies allowed. Tracking and debugging are harder due to the lack of reliable ordering.

Directed acyclic graph

  • Description: Runs a graph (set of nodes and edges between those nodes) composed of tasks/flows in dependency driven ordering.
  • Constraints: Dependency driven, no cycles. A task's dependencies are guaranteed to be satisfied before the task will run.
  • Use-case: This pattern allows for a very high level of potential concurrency and is useful for tasks which can be arranged in a directed acyclic graph. Each independent task (or independent subgraph) in the graph could be run in parallel once its dependencies have been satisfied.
  • Benefits: Allows for complex task ordering. Can be automatically made concurrent by running disjoint tasks/flows in parallel.
  • Drawbacks: Complex. Cannot support cycles. Tracking and debugging are harder due to graph dependency traversal.

Note: any combination of the above can be composed together (aka, it is valid to add a linear pattern into a graph pattern, or vice versa).
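To make the dependency driven ordering of the graph pattern concrete, here is a small sketch that uses only Python's standard `graphlib` module (Python 3.9+) rather than taskflow itself. The task names and the `execution_waves` helper are assumptions made for this illustration; each "wave" holds tasks that could run in parallel because none of them depends on another.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical tasks mapped to the set of tasks they depend on.
dependencies = {
    "create_server": set(),
    "create_volume": set(),
    "attach_volume": {"create_server", "create_volume"},
    "report_ready": {"attach_volume"},
}


def execution_waves(deps):
    """Group tasks into waves; tasks within one wave do not depend on each other."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph contains a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything whose dependencies are done
        waves.append(ready)
        sorter.done(*ready)
    return waves


waves = execution_waves(dependencies)
```

Under this sketch a linear pattern is simply a graph whose waves each contain one task, and an unordered pattern is a graph with a single wave.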

Engines

Also known as: how your tasks get from PENDING to FAILED/SUCCESS. Their purpose is to reliably execute your desired workflows and to handle the control and execution flow around making this possible. This means that code using taskflow only has to worry about forming the workflow, and not about distribution, execution, reverting, or how to resume (and more!).

See: engines for more details.

States

Also known as: the potential state transitions that a flow (and tasks inside of it) can go through.

See: developers documentation states of task, retry and flow for more details.

Reversion

Both tasks and flows can be reverted by executing the related rollback code on the task object(s).

For example, suppose a flow asks for a server with a block volume attached by combining two tasks:

task1: create server || rollback by delete server
task2: create+attach volume || rollback by delete volume

If the attach volume code fails, all tasks in the flow would be reverted using their rollback code, causing both the server and the volume to be deleted.
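The server and volume example can be sketched in plain Python as follows. This is a simplified illustration and not taskflow's engine: the `Task` base class, `run_linear`, the `state` dictionary, and the `fail_on_attach` switch are hypothetical, and real tasks would call the nova and cinder APIs instead of editing a dictionary.

```python
class Task:
    """Hypothetical base class: subclasses override execute() and revert()."""

    def execute(self, state):
        raise NotImplementedError

    def revert(self, state):
        pass


class CreateServer(Task):
    def execute(self, state):
        state["server"] = "server-1"

    def revert(self, state):
        state.pop("server", None)  # rollback by deleting the server


class CreateAndAttachVolume(Task):
    def __init__(self, fail_on_attach=False):
        self.fail_on_attach = fail_on_attach

    def execute(self, state):
        state["volume"] = "volume-1"
        if self.fail_on_attach:
            raise RuntimeError("attach failed")
        state["attached"] = True

    def revert(self, state):
        state.pop("attached", None)
        state.pop("volume", None)  # rollback by deleting the volume


def run_linear(tasks, state):
    """Run tasks in order; on failure revert every started task in reverse order."""
    started = []
    try:
        for task in tasks:
            started.append(task)
            task.execute(state)
    except Exception:
        for task in reversed(started):
            task.revert(state)
        raise
    return state


state = {}
try:
    run_linear([CreateServer(), CreateAndAttachVolume(fail_on_attach=True)], state)
except RuntimeError:
    pass  # the failure has already been rolled back; state is empty again
```

Note that the failing task is reverted too, since it may have done part of its work (here, creating the volume) before raising.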

Persistence

TaskFlow can save current task states, progress, arguments and results, as well as flows, jobs and more, to a database (or any other place), which enables flow and task resumption, check-pointing and reversion.

See: persistence for more details about why and how.

Resumption

If a flow is started but is interrupted before it finishes (perhaps the controlling process was killed), the flow may be safely resumed at its last checkpoint. This allows for safe and easy crash recovery for services. Taskflow offers different persistence strategies for the checkpoint log, letting you as an application developer pick the one that fits your application's usage and desired capabilities.
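The idea of resuming from a checkpoint log can be sketched with a JSON file standing in for the persistence backend. Everything here is an assumption made for illustration (the `run_with_checkpoints` helper, the file layout, and the simulated crash); taskflow's own persistence backends are described in its persistence documentation.

```python
import json
import os
import tempfile


def run_with_checkpoints(tasks, log_path):
    """Run (name, func) pairs in order, skipping names already in the log."""
    results = {}
    if os.path.exists(log_path):
        with open(log_path) as handle:
            results = json.load(handle)
    executed = []
    for name, func in tasks:
        if name in results:
            continue  # this task finished before the interruption
        results[name] = func(results)
        executed.append(name)
        # Checkpoint after every task (a real backend would write atomically).
        with open(log_path, "w") as handle:
            json.dump(results, handle)
    return results, executed


crash_once = {"pending": True}


def create_server(results):
    return "server-1"


def attach_volume(results):
    if crash_once.pop("pending", False):
        raise RuntimeError("simulated crash")
    return "volume-1 on " + results["create_server"]


tasks = [("create_server", create_server), ("attach_volume", attach_volume)]
log_path = os.path.join(tempfile.mkdtemp(), "checkpoints.json")

try:
    run_with_checkpoints(tasks, log_path)
except RuntimeError:
    pass  # the first attempt dies part-way through

# The second attempt reads the log and only runs what is still missing.
results, executed = run_with_checkpoints(tasks, log_path)
```

The second call does not repeat `create_server`, which is the property that makes resumption safe for actions that must not run twice.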

Inputs and Outputs

See: inputs and outputs for more details about the various inputs/outputs/notifications that flows, atoms and engines produce/consume/emit.

Paradigm shifts

See: paradigm shifts to view a few of the changes that may result from programming with (or after using) taskflow.

Best practices

See: best practices for a helpful set of best usages and best practices that are common when using taskflow.

Past & Present & Future

Past

Taskflow started as a prototype for Nova, developed with NTT Data along with Yahoo!, and has since moved into a more general solution/library that can form the underlying structure of multiple OpenStack projects at once.

See: StructuredStateManagement

See: StructuredWorkflows

Present

Right now: We are gathering requirements, continuing work on taskflow (the library), and integrating with various projects.

Active Integration

Future

Planned Integration

Convection

Taskflow is the library needed to build the Convection service.

  • Convection will add a REST API that allows remote execution of tasks and flows in a remote container.

Examples

Please go here

Contributors (past and present)

  • Anastasia Karpinska (Grid Dynamics) [Core]
  • Alexander Gorodnev (Grid Dynamics)
  • Adrian Otto (Rackspace)
  • Changbin Liu (AT&T Labs) [Core]
  • Ivan Melnikov (Grid Dynamics) [Core]
  • Joshua Harlow (Yahoo!) [PTL] [Core]
  • Doug Hellmann (Dreamhost) [PTL] [Core]
  • Jessica Lucci (Rackspace)
  • Keith Bray (Rackspace)
  • Kevin Chen (Rackspace)
  • Rohit Karajgi (NTTData)
  • Tushar Patil (NTTData)
  • Stanislav Kudriashev (Grid Dynamics)
  • You!!

Meeting

Similar libraries

Similar languages

Inspiring papers

Join us!

Launchpad: http://launchpad.net/taskflow

IRC: You will also find us in #openstack-state-management on freenode

Core: Team

Reviews: Help code review

Code

Developer docs

Releases

Version | Summary | Date | Notes | File a Bug
0.1 | Core functionality | 10/24/2013 | | Bugs?
0.1.1 | Small bug fixes for 0.1 | 11/15/2013 | | Bugs?
0.1.2 | Small bug fixes for 0.1.1; Python3 compatibility adjustments; Requirements updated to unbind sqlalchemy | 1/10/2014 | | Bugs?
0.1.3 | Small bug fixes for 0.1.2; Concurrency adjustments; Exception unicode text cleanup | 2/6/2014 | Notes | Bugs?
0.2 | Lots of great changes! | 4/1/2014 | Notes | Bugs?