TaskFlow

Revised on: 6/1/2017 by Zhenguo Niu

Summary

TaskFlow is a Python library for OpenStack (and other projects) that helps make task execution easy, consistent, scalable and reliable. It allows the creation of lightweight task objects and/or functions that are combined together into flows (aka: workflows) in a declarative manner. It includes engines for running these flows in a manner that can be stopped, resumed, and safely reverted. Projects implemented using this library can enjoy added state resiliency, natural declarative construction, easier testability (since a task does one and only one thing), workflow pluggability, fault tolerance and simplified crash recovery/tolerance (and more).

Conceptual example

This pseudocode illustrates how a flow would work, for those who are familiar with SQL transactions.

START TRANSACTION
   task1: call nova API to launch a server || ROLLBACK
   task2: when task1 finished, call cinder API to attach block storage to the server || ROLLBACK
   ...perform other tasks...
COMMIT

The above flow could be used by Heat (for example) as part of an orchestration to add a server with block storage attached. It may launch several of these in parallel to prepare a number of identical servers (or do other work depending on the desired request).

Why

OpenStack code has grown organically, and does not have a standard and consistent way to run sequences of operations so that they can be safely resumed or rolled back if the calling process is unexpectedly terminated while the code is busy doing something. Most projects don't even attempt to make tasks restartable or revertible. There are numerous failure scenarios that are simply skipped, and recovery scenarios that are not possible in today's code. TaskFlow makes it easy to address these concerns.

Goal: With widespread use of TaskFlow, OpenStack can become very predictable and reliable, even in situations where it's not deployed in high availability configurations.

Example use-cases

Service stop/upgrade/restart (at any time)

A typical issue in the run-time components that OpenStack projects provide is what happens to a daemon (and to the state of whatever the daemon was actively doing) if it is forcibly stopped. This typically happens when software is upgraded, when hardware fails, during maintenance windows and for other operational reasons.

service stop [glance-*, nova-*, quantum...]

Currently many of the OpenStack components do not handle this forced stop in a way that leaves the system in a reconcilable state. Typically the actions that a service was actively doing are immediately forced to stop, cannot be resumed, and are in a way forgotten (a later scanning process may attempt to clean up these orphaned resources). TaskFlow helps in this situation by tracking the actions, tasks, and their associated states so that when the service is restarted (even after the service's software is upgraded) the service can easily resume (or roll back) the tasks that were interrupted when the stop/kill command was triggered. This encourages a crash-tolerant architecture, avoids orphaned resources and reduces the need for further daemon processes that attempt to do cleanup-related work (those daemons typically activate periodically and cause network or disk I/O even if there is no work to do).

Orphaned resources

Due to the lack of transactional semantics, many of the OpenStack projects will leave resources in an orphaned state (or in an ERROR state). This is largely unacceptable if OpenStack will be driven by automated systems (for example Heat), which will have no way of analyzing which orphans need to be cleaned up. TaskFlow's task-oriented model enables semantics that can be used to correctly track resource modifications. This allows all actions done on a resource (or a group of resources) to be undone in an automated fashion, ensuring that no resource is left behind.

Metrics and history

When OpenStack services are structured into task and flow objects and patterns, they can add metric reporting and action history simply by asking TaskFlow to record the metrics/history associated with running a task and flow. The various OpenStack services currently accomplish a similar set of features in varied ways; by using TaskFlow, those varied ways can be unified into one that works for all OpenStack services (and the developer using TaskFlow does not have to be concerned with how TaskFlow records this information). This helps decouple the metrics and history associated with running a task and flow from the actual code that defines the task and flow actions.

Progress/status tracking

In many of the OpenStack projects there is an attempt to show the progress of actions the project is doing. Unfortunately the implementation varies across projects, which leads to inconsistent and not very accurate progress/status reporting and tracking. TaskFlow can help here by providing a low-level mechanism that makes it much simpler to track progress: you plug into TaskFlow's built-in notification system. This avoids having to intrusively add code to actions, which is bad because code that is not critical to the action makes the action harder to understand, debug, and review. It also makes your progress/status mechanism resilient to future changes by decoupling status/progress tracking from the code that performs the actions.
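The decoupling described above can be sketched in a few lines. This is a minimal illustration and not TaskFlow's notification API: the `Notifier` class, the `run_tasks` function and the event names `task_started` and `task_finished` are hypothetical names chosen for this example. The point is that the actions themselves contain no progress-reporting code; a listener registered from outside observes the events.

```python
from collections import defaultdict


class Notifier:
    """Hypothetical event registry: callbacks subscribe to named events."""

    def __init__(self):
        self._listeners = defaultdict(list)

    def register(self, event, callback):
        self._listeners[event].append(callback)

    def notify(self, event, details):
        for callback in self._listeners[event]:
            callback(event, details)


def run_tasks(tasks, notifier):
    """Run (name, callable) pairs in order, emitting events around each one."""
    total = len(tasks)
    for index, (name, action) in enumerate(tasks, start=1):
        notifier.notify("task_started", {"task": name})
        action()
        notifier.notify("task_finished",
                        {"task": name, "progress": index / total})


# Progress tracking lives entirely outside the actions being run.
progress_log = []
notifier = Notifier()
notifier.register(
    "task_finished",
    lambda event, details: progress_log.append(
        (details["task"], details["progress"])),
)

run_tasks(
    [("create_server", lambda: None), ("attach_volume", lambda: None)],
    notifier,
)
print(progress_log)
```

Changing how progress is reported (logging, a database row, a message bus) then only means registering a different listener; the actions stay untouched.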

Others...

Your use-case here!

Design

Big picture

Structure

Atoms

An atom is the smallest unit in TaskFlow. An atom acts as the base for other classes in TaskFlow (to avoid duplicated functionality). An atom is expected to name its desired input values/requirements and its outputs/provided values, as well as its own name and a version (if applicable).
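As a rough sketch of what naming inputs and outputs makes possible, the example below models an atom as a plain record and checks whether each atom's inputs are available when atoms run in a given order. The `Atom` dataclass and the `missing_inputs` helper are hypothetical and are not TaskFlow classes; they only illustrate the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Atom:
    """Hypothetical atom record: a name, a version, named inputs and outputs."""
    name: str
    requires: frozenset = frozenset()
    provides: frozenset = frozenset()
    version: tuple = (1, 0)


def missing_inputs(atoms, initial=()):
    """Return {atom name: unmet inputs} for atoms run in the given order."""
    available = set(initial)
    missing = {}
    for atom in atoms:
        unmet = set(atom.requires) - available
        if unmet:
            missing[atom.name] = unmet
        # Outputs of this atom become available to the atoms after it.
        available |= atom.provides
    return missing


boot = Atom("boot_server",
            requires=frozenset({"image"}),
            provides=frozenset({"server_id"}))
attach = Atom("attach_volume",
              requires=frozenset({"server_id", "volume_size"}))

# "image" is supplied up front; nothing supplies "volume_size".
problems = missing_inputs([boot, attach], initial={"image"})
print(problems)
```

Because requirements are declared rather than buried in code, a problem such as the missing `volume_size` can be detected before anything runs.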

See: atoms for more details.

Tasks

A task (derived from an atom) is the smallest possible unit of work that can have an execute & rollback sequence associated with it.

See: tasks for more details.

Retries

A retry (derived from an atom) is the unit that controls a flow execution. It handles flow failures and can (for example) retry the flow with new parameters.
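The "retry the flow with new parameters" behaviour can be sketched as follows. This is a simplified illustration, not TaskFlow's retry classes: `run_with_retry`, `boot_server` and the `zone` parameter are hypothetical names, and the flow is reduced to a single callable.

```python
def run_with_retry(flow, parameter_sets):
    """Call flow(**params) for each parameter set until one succeeds.

    Returns (params, result) for the first success and re-raises the last
    error if every attempt fails.
    """
    last_error = None
    for params in parameter_sets:
        try:
            return params, flow(**params)
        except Exception as exc:
            last_error = exc
    if last_error is None:
        raise ValueError("no parameter sets given")
    raise last_error


def boot_server(zone):
    # Stand-in for a flow; the first zone is assumed to be out of capacity.
    if zone == "az1":
        raise RuntimeError("no capacity in az1")
    return "server-in-" + zone


used, result = run_with_retry(boot_server, [{"zone": "az1"}, {"zone": "az2"}])
print(used, result)
```

A fuller version would also revert whatever the failed attempt had already done before starting the next attempt.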

See: retry for more details.

Flows

A flow is a structure that links one or more tasks together in an ordered sequence. When a flow rolls back, it executes the rollback code for each of its child tasks using whatever reverting mechanism the task has defined as applicable to reverting the logic it applied.

See: flows for more details.

Patterns

Also known as: how you structure your work to be done (via tasks and flows) in a programmatic manner.

Linear:
   Description: Runs a list of tasks/flows, one after the other in a serial manner.
   Constraints: Predecessor tasks' outputs must satisfy successive tasks' inputs.
   Use-case: This pattern is useful for structuring tasks/flows that are fairly simple, where a task/flow follows a previous one.
   Benefits: Simple.
   Drawbacks: Serial, no potential for concurrency.

Unordered:
   Description: Runs a set of tasks/flows, in any order.
   Constraints: Disallows intertask dependencies.
   Use-case: This pattern is useful for tasks/flows that are fairly simple and are typically embarrassingly parallel.
   Benefits: Simple. Inherently concurrent.
   Drawbacks: No dependencies allowed. Tracking and debugging harder due to lack of reliable ordering.

Graph:
   Description: Runs a graph (set of nodes and edges between those nodes) composed of tasks/flows in dependency driven ordering.
   Constraints: Dependency driven, no cycles. A task's dependencies are guaranteed to be satisfied before the task will run.
   Use-case: This pattern allows for a very high level of potential concurrency and is useful for tasks which can be arranged in a directed acyclic graph. Each independent task (or independent subgraph) in the graph could be run in parallel when its dependencies have been satisfied.
   Benefits: Allows for complex task ordering. Can be automatically made concurrent by running disjoint tasks/flows in parallel.
   Drawbacks: Complex. Cannot support cycles. Tracking and debugging harder due to graph dependency traversal.

Note: any combination of the above can be composed together (i.e. it is valid to add a linear pattern into a graph pattern, or vice versa).
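To make the graph pattern's dependency-driven ordering concrete, here is a small sketch using Python's standard `graphlib` module (Python 3.9 or later). It is not TaskFlow's graph flow; the `execution_batches` helper and the task names are assumptions made for illustration. Each batch contains tasks whose dependencies were all satisfied by earlier batches, so the tasks within a batch could run in parallel.

```python
from graphlib import CycleError, TopologicalSorter


def execution_batches(dependencies):
    """Group tasks into batches that respect a {task: dependencies} mapping."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises CycleError if the graph contains a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        batches.append(ready)
        sorter.done(*ready)
    return batches


deps = {
    "create_server": set(),
    "create_volume": set(),
    "attach_volume": {"create_server", "create_volume"},
}
batches = execution_batches(deps)
print(batches)

# Cycles are rejected, matching the "no cycles" constraint.
try:
    execution_batches({"a": {"b"}, "b": {"a"}})
    cycle_rejected = False
except CycleError:
    cycle_rejected = True
```

A linear pattern corresponds to a chain in which every batch has exactly one task, and an unordered pattern to a graph with no edges, where everything lands in the first batch.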

See: patterns for more details.

Engines

Also known as: how your tasks get from PENDING to FAILED/SUCCESS. Their purpose is to reliably execute your desired workflows and handle the control & execution flow around making this possible. This makes it so that code using taskflow only has to worry about forming the workflow, and not worry about execution, reverting, or how to resume (and more!).

See: engines for more details.

Jobs

Also known as: how to provide high availability and scalability to your tasks and flows, ensuring that forward progress occurs no matter how many crashes or failures the machines running your workloads endure (allowing your workflows to take a licking and keep on ticking). This concept makes it so that code using taskflow does not have to worry about distribution or high availability of the contained workflows (one less thing developers have to worry about!).

See: jobs for more details about the job (and associated jobboard) mechanism and concepts.

Conductors

Also known as: the way to plug-and-play the various concepts into a single easy to use runtime unit.

See: conductors for more details.

States

Also known as: the potential state transitions that a flow (and tasks inside of it) can go through.

See: states of task, retry and flow for more details.

Notifications

Also known as: how you can get notified about state transitions, task results, task progress, job postings and more...

See: notifications for more details.

Reversion

Both tasks and flows can be reverted by executing the related rollback code on the task object(s).

For example, if a flow asked for a server with a block volume attached, by combining two tasks:

task1: create server || rollback by delete server
task2: create+attach volume || rollback by delete volume

If the attach volume code fails, all tasks in the flow would be reverted using their rollback code, causing both the server and the volume to be deleted.
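The server and volume example above can be sketched like this. It is a simplified stand-in rather than TaskFlow's engine: the `Task` class, `run_flow` and the in-memory `resources` set are hypothetical, and no real nova or cinder calls are made. The failed task is reverted too, since it may have partially completed (here the volume was created before the attach failed).

```python
class Task:
    """Hypothetical stand-in for a task: one action plus its undo."""

    def __init__(self, name, execute, revert):
        self.name = name
        self.execute = execute
        self.revert = revert


def run_flow(tasks):
    """Run tasks in order; on failure, revert started tasks in reverse order.

    Returns True if every task succeeded, False if the flow was reverted.
    """
    started = []
    for task in tasks:
        started.append(task)
        try:
            task.execute()
        except Exception:
            for previous in reversed(started):
                previous.revert()
            return False
    return True


resources = set()  # stands in for what exists in the cloud


def create_server():
    resources.add("server")


def delete_server():
    resources.discard("server")


def create_and_attach_volume():
    resources.add("volume")
    raise RuntimeError("attach failed")  # simulated failure after creation


def delete_volume():
    resources.discard("volume")


ok = run_flow([
    Task("task1", create_server, delete_server),
    Task("task2", create_and_attach_volume, delete_volume),
])
print(ok, resources)
```

After the run, neither the server nor the volume remains, so nothing is orphaned.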

Persistence

TaskFlow can save current atom states, progress, arguments and results, as well as flows, jobs and more to a database (or any other place), which enables flow and atom resumption, check-pointing and reversion. A persistence API as well as base persistence types are provided with TaskFlow for the purpose of ensuring that jobs, flows, and their associated atoms can be backed up in a database or in memory (or elsewhere).

See: persistence for more details about why persistence is needed and how to use it.

Checkpointing

The concept of check-pointing is a work-in-progress (WIP) topic that is still under discussion.

See: checkpointing

Resumption

If a flow is started, but is interrupted before it finishes (perhaps the controlling process was killed) the flow may be safely resumed at its last checkpoint. This allows for safe and easy crash recovery for services. TaskFlow offers different persistence strategies for the checkpoint log, letting you as an application developer pick and choose to fit your application's usage and desired capabilities.
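A minimal sketch of resuming from a checkpoint log, assuming a JSON file as the persistence strategy. The `run_with_checkpoints` function, the task names and the file name are hypothetical; TaskFlow's own persistence backends are not used here.

```python
import json
import os
import tempfile


def run_with_checkpoints(tasks, log_path):
    """Run (name, callable) pairs in order, skipping names already in the log.

    Returns the list of task names executed during this call.
    """
    done = []
    if os.path.exists(log_path):
        with open(log_path) as log_file:
            done = json.load(log_file)
    executed = []
    for name, action in tasks:
        if name in done:
            continue  # finished before the interruption
        action()
        done.append(name)
        with open(log_path, "w") as log_file:
            json.dump(done, log_file)  # checkpoint after every task
        executed.append(name)
    return executed


def crash():
    raise RuntimeError("process killed")  # simulated interruption


log_path = os.path.join(tempfile.mkdtemp(), "flow.json")

try:
    run_with_checkpoints(
        [("create_server", lambda: None), ("create_volume", crash)],
        log_path,
    )
except RuntimeError:
    pass  # the "service" went down part-way through the flow

# After a restart, only the unfinished work is executed.
resumed = run_with_checkpoints(
    [("create_server", lambda: None), ("create_volume", lambda: None)],
    log_path,
)
print(resumed)
```

If a process dies after an action completes but before its checkpoint is written, that action runs again on resume, so actions in a scheme like this should be safe to repeat.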

Inputs and Outputs

See: inputs and outputs for more details about the various inputs, outputs and notifications that flows, atoms and engines consume, produce and emit.

Paradigm shifts

See: paradigm shifts to view a few of the changes that may result from programming with (or after using) taskflow.

Best practices

See: best practices for a helpful set of best usages and best practices that are common when using taskflow.

Architectures

See: architectures for a helpful set of *real-world* architectures that are being developed with/using taskflow.

Past & present & future

Past

TaskFlow started as a prototype for Nova, developed by NTT Data together with Yahoo!, and has since moved into a more general solution/library that can form the underlying structure of multiple OpenStack projects at once.

Archaeology

See: NovaOrchestration (Fall 2011)

See: StructuredStateManagement (Spring 2013)

See: StructuredWorkflows (Spring 2013)

See: StructuredWorkflowPrimitives (Spring 2013)

See: DistributedTaskManagement (celery inter-op attempt, 2013)

Present

Right now: We are gathering requirements, continuing work on taskflow (the library) and integrating with various projects.

Active integration

Cinder
Glance
Neutron
Cue
Octavia
Cloud Big Data (closed source)
Pumphouse
Mogan
Your project here!!

Future

Planned/desired/possible... integration

Nova
Heat
Mistral
Manila
Your project here!!

Planned development

We currently track and plan for new features using blueprints and/or using specifications.

Please read over blueprints before contributing.

Examples

Please go here

Core contributors (past and present)

Changbin Liu (AT&T Labs)
Doug Hellmann (HP)
Davanum Srinivas (Mirantis)
Anastasia Karpinska (Grid Dynamics)
Min Pae (HP)
Ivan Melnikov (Grid Dynamics)
Daniel Krause (Rackspace)
Greg Hill (Rackspace)
Jessica Lucci (Rackspace)
Joshua Harlow (Yahoo!)
Rohit Karajgi (NTT Data)
Pranesh Pandurangan (Yahoo!)
You!!

Meeting

Weekly IRC Meeting

Join us!

Launchpad: http://launchpad.net/taskflow

Core: Team

Reviews: Help code review

Contact us!

IRC: You will also find us in #openstack-state-management on freenode

Mailing list: openstack-dev (prefix the subject with [TaskFlow] to get a more immediate response)

Blogs/tutorials/videos/slides

Code

http://git.openstack.org/cgit/openstack/taskflow/ (github mirror)