TaskFlow
 
== Mind-shifts ==

Using taskflow requires a slight shift in mindset: it changes a little of how your normal code would run and how you typically structure your code while programming. The taskflow team has tried to keep the amount of mind-altering required to use taskflow to a minimum (since mind-altering means learning new concepts, or suppressing existing ones) to make it easy to adopt taskflow into your service/application/library. Below are common kinds of ''mind-blown'' experiences that may occur when starting to get used to taskflow. The effects may stay with you throughout your life (you have been warned).
 
 
 
=== Piece by piece ===
 
 
 
'''Mind-blown:''' atom, task, flow, what are these???
 
 
 
In taskflow, your code is structured differently than a typical programmer may be used to (functions, or object orientation + objects). In order to have workflows which are easy to introspect, easy to resume from, and possible to revert in an automated fashion, taskflow introduces its smallest unit: an atom. An atom is in many ways similar to an abstract interface, in that an atom specifies its desired input data/requirements and its output/provided values, and is given a name. A task in taskflow is an atom that has execute()/revert() methods which use the previous requirements to produce some output/provided values; it is one of the key derived classes of an atom (with more to come soon). The main difference between a task and a function is that a task explicitly declares its inputs and its outputs (since it derives from the atom base class), has an identifying name associated with it, and potentially has an associated way to revert what the task has done (if said task produces side-effects). In order to organize these smallest units into something useful, the concept of a flow was created; a flow describes the expected execution order that your set of tasks will go through to accomplish a goal. Because tasks declare their inputs and outputs, the ordering can also be inferred (although it does not need to be), which makes it that much simpler to make a group of small tasks accomplish some larger goal.
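The atom/task/flow relationship described above can be sketched in plain Python. This is an illustrative model only, not the real taskflow API; every name here (<code>Atom</code>, <code>Task</code>, <code>run_linear</code>, <code>MakeGreeting</code>) is hypothetical:

```python
# Illustrative sketch of the atom/task/flow idea (NOT the real taskflow API).
class Atom:
    """Smallest unit: carries a name plus declared inputs and outputs."""
    def __init__(self, name, requires=(), provides=()):
        self.name = name
        self.requires = tuple(requires)
        self.provides = tuple(provides)

class Task(Atom):
    """An atom with execute()/revert() methods."""
    def execute(self, **kwargs):
        raise NotImplementedError
    def revert(self, **kwargs):
        pass  # optional; only needed for tasks with side-effects

class MakeGreeting(Task):
    def __init__(self):
        super().__init__('make_greeting',
                         requires=('name',), provides=('greeting',))
    def execute(self, name):
        return {'greeting': 'Hello, %s' % name}

def run_linear(tasks, store):
    """A trivial 'engine': run tasks in order, feeding each task its declared
    inputs from the store and merging its declared outputs back into it."""
    for t in tasks:
        inputs = {r: store[r] for r in t.requires}
        store.update(t.execute(**inputs))
    return store

result = run_linear([MakeGreeting()], {'name': 'taskflow'})
```

Because each task names what it consumes and produces, the runner (and a real engine) can wire results between tasks without the tasks calling each other directly.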
 
 
 
'''NOTE:''' for further details on the task and flow structures that are built in to taskflow, please see the [[TaskFlow#Structure | structure]] overview page.
 
 
 
=== Resilience ===
 
 
 
'''Mind-blown:''' when ordering my work with flows and tasks and enabling persistence, is it possible to resume from a partial completion of those flows and tasks using taskflow?


Yes it is! (One of taskflow's key concepts/goals is to bring this functionality to as many OpenStack projects as possible.) Resilience for all!
 
 
 
=== Exceptions ===
 
 
 
'''Mind-blown:''' has my exception logic changed? What does it mean if a task throws an exception? Who catches it? What happens???
 
 
 
Exceptions that occur in a task, and which are not caught by the internals of that task, will by default currently trigger reversion of the '''entire''' workflow that task was in (the engine is responsible for handling this reversion process, just as it is responsible for handling the [http://en.wikipedia.org/wiki/Happy_path happy path]). If multiple tasks in a workflow raise exceptions (say they are executing at the same time via a parallel engine using processes/threads, or via a distributed engine), then the individual ''paths'' that lead to those tasks will be reverted (if an ancestor task is shared by multiple failing tasks, it will be reverted only once).
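The default revert-everything behavior can be sketched as follows. This is an illustrative stand-in, not the real taskflow engine; the names (<code>Task</code>, <code>run</code>) are hypothetical:

```python
# Illustrative sketch: a runner reverting completed tasks when a later task
# raises an uncaught exception (NOT the real taskflow engine).
log = []

class Task:
    def __init__(self, name, fail=False):
        self.name, self.fail = name, fail
    def execute(self):
        if self.fail:
            raise RuntimeError('%s failed' % self.name)
        log.append('execute:' + self.name)
    def revert(self):
        log.append('revert:' + self.name)

def run(tasks):
    done = []
    try:
        for t in tasks:
            t.execute()
            done.append(t)
    except Exception:
        # Default behavior: revert the *entire* workflow, newest first.
        for t in reversed(done):
            t.revert()
        return False
    return True

ok = run([Task('a'), Task('b'), Task('c', fail=True)])
# log is now: ['execute:a', 'execute:b', 'revert:b', 'revert:a']
```

Note the reversal order: tasks are undone in the opposite order from how they ran, so each revert sees the world its task left behind.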
 
 
 
'''NOTE:''' in the future, [https://blueprints.launchpad.net/taskflow/+spec/reversion-strategies reversion strategies] should make this more customizable (allowing more ways to handle or alter the reversion process so that you can better decide what to do with unhandled exceptions).
 
 
 
=== Execution flow ===
 
 
 
'''Mind-blown:''' all my tasks belong to engine???
 
 
 
When a set of tasks and the associated structure that contains those tasks (aka the flows that create that structure) are given to an engine, along with a possible (but not required) backend where the engine can store intermediate results (which is needed if the workflow should be able to resume on failure), the engine becomes the execution unit that is responsible for reliably executing the tasks contained in the flows you provide it. That engine will ensure the structure that is provided is retained when executing. For example, a linear ordering of tasks using a linear_flow structure will '''always''' be run in linear order. A set of tasks that are structured in dependency ordering will '''always''' be run in that dependency order. The engine '''must''' adhere to these ''constraints''; note that other constraints may be enforced by the engine type that is being activated (i.e. a single-threaded engine will only run in a single thread; a distributed or worker-based engine may run remotely). So when selecting an engine to use, make sure to carefully select the desired feature set that will work for your application.
 
 
 
=== Nesting ===
 
 
 
'''Mind-blown:''' without functions (which are now tasks), how do we model and execute actions that require composition/nesting (no infinite recursion please)???
 
 
 
First, let me describe a little bit of why this is hard, since it may not be very obvious. In a traditional structure & execution style (without a structured workflow), a function Y may call another function Z and treat what Z does as a [http://en.wikipedia.org/wiki/Black_box blackbox]. This type of structure and execution style does not inherently lead to a structure that can be executed by another party (the engine, in taskflow's case); it also does not easily (without language-level features/additions) allow any way to resume from the function Z if the program crashes while calling the function Z (and so on; if Z calls another function, this same problem occurs...). This is ''not'' to say that carefully designed software cannot do this, it just means that it will likely build something ''like'' taskflow to solve the problem anyway. To avoid this problem, and to enable the features that taskflow creates (resuming, execution control), we need to flip this kind of model on its head (or at least turn it 90 degrees).
 
 
 
The mindshift that taskflow introduces to get around the blackbox problem (Y calling Z, Z calling more functions, and so on) is to change the normal Y->Z structure into a set of dependencies & task inputs and outputs, with results being passed between those tasks (in a way similar to [http://en.wikipedia.org/wiki/Message_passing message passing]). This simple model then allows taskflow (and its engine concept) to restart from a given point by resuming from the last task that completed. Note that this still makes it difficult to nest tasks. To address this limitation taskflow provides a way to '''nest tasks and flows'''. For example, a linear_flow Y' can contain tasks [A, B, C] and then another linear_flow Z' can contain [D, E, Y', F]. This means that the task F can depend on everything that Y' (and D, E) have produced before F starts executing (Y' becomes like a blackbox that produces some output, similar in nature to the function Z from above). This kind of composition does not restrict taskflow from resuming, as taskflow internally knows what composes subflows (like Y') and can resume from that nested flow's task if it needs to.
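The Y'/Z' nesting above can be sketched with a recursive runner. This is an illustrative model (flows as lists, tasks as names), not the real taskflow composition API:

```python
# Illustrative sketch of flow nesting (NOT the real taskflow API):
# a "flow" may contain tasks or other flows, and execution recurses.
def run(item, trace):
    if isinstance(item, list):          # a "flow" is just an ordered list here
        for child in item:
            run(child, trace)
    else:                               # a "task" is just a name here
        trace.append(item)

y_prime = ['A', 'B', 'C']               # linear_flow Y' containing [A, B, C]
z_prime = ['D', 'E', y_prime, 'F']      # linear_flow Z' containing [D, E, Y', F]

trace = []
run(z_prime, trace)
# F only runs after everything inside Y' (and D, E) has completed.
```

Because the runner knows the internal structure of Y' (unlike an opaque function call), it could resume from any task inside the subflow rather than restarting Y' as a whole.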
 
 
 
'''NOTE:''' coroutines in future versions of python ([http://www.python.org/dev/peps/pep-3156/ pep-3156]) have a similar [http://www.python.org/dev/peps/pep-3156/#tasks task]-like model (not identical, but similar). The issue with coroutines is that they still do not provide the capability to resume, revert, or structure your code in a way that maps closely to the actual workflow to be executed. They do, though, create a base architecture that can be built on to help make this easier to accomplish. It is expected that taskflow's abstractions should ''relatively'' easily map onto python 3.4, which is expected to include a version of pep-3156 once python 3.4 matures.
 
 
 
=== Control flow ===
 
 
 
'''Mind-blown:''' where did my complex control flow go???
 
 
 
This one is a slight variation in how a programmer normally structures execution control flow. In order to be able to track the execution of your workflow, the desired workflow must be split up into small pieces (in a way similar to functions) '''ahead of time''', with limited ability to change that execution order at run-time. Taskflow engines can then run this relatively static structure in a well-defined and resumable manner (such a relatively static set has been shown to be ''good'' enough by papers & research such as [http://www.netdb.cis.upenn.edu/papers/tropic_tr.pdf tropic]).
 
 
 
This does, though, currently have a few side-effects, in that certain traditional operations (if-then-else, do-while, fork-join, switch...) become more ''complex'': those types of control flows do not easily map to a representation that can be easily resumed or run in a potentially distributed manner (since they alter control flow while executing, or create complex and hard-to-model dependencies between tasks). To keep taskflow relatively minimal (and simple) we have tried to reduce the allowed set to a more manageable and currently smaller one (do the simple things well, and add in complexity later, if and when it's needed). If these control flows become valuable then we will revisit if and how we should make them accessible to users of taskflow.
 
 
 
'''NOTE:''' inside of a task, the <code>execute()</code> method of that task may use whichever existing control flow it desires (any supported by python), but outside of <code>execute()</code> the set of control flow operators is more minimal (due to the above reasoning/limitations). Another way this can be accomplished is to have the <code>factory</code> function associated with creating your workflow (the method location that is persisted on logbook creation) perform most of the complex control flow (while constructing the needed tasks). For more information about this see the [[TaskFlow/Patterns_and_Engines/Persistence#Flow_Factory | flow factory]] reference.
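The factory idea can be sketched like this: the branching happens while ''building'' the flow, so the engine still receives a static structure. This is an illustrative sketch with hypothetical names (<code>make_flow</code> and the task names), not the real flow-factory API:

```python
# Illustrative sketch: complex control flow lives in the factory that builds
# the workflow, not inside the workflow itself (hypothetical names).
def make_flow(want_backup):
    """Factory: decide the structure up-front; the returned flow is static."""
    flow = ['create_server', 'attach_volume']
    if want_backup:
        # The if-then-else runs at construction time, not at execution time,
        # so the engine never has to model a runtime branch.
        flow.append('schedule_backup')
    return flow

flow_with = make_flow(True)
flow_without = make_flow(False)
```

Each returned flow is fully determined before execution starts, which is exactly what makes it introspectable and resumable.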
 
 
 
=== Workflow ownership transfer ===
 
 
 
'''Mind-blown:''' I have read that taskflow supports a way to automatically transfer workflows to workers who can complete that work, as well as the ability to resume partially completed work automatically. How is this possible? Is it?
 
 
 
It is possible, and it is desired that this would be a typical usage pattern. Taskflow's jobboard concept acts as a location where work is posted for some selected worker to complete, in a way similar to what a physical/virtual [http://en.wikipedia.org/wiki/Job_board jobboard] does. Posting work to a jobboard allows that work to be picked up by any type of worker watching that jobboard for new work to appear; in a way this is similar to a messaging system, but with notifications of when new messages/work appear. This allows workers to become aware of new work. That is one part of the puzzle; the second part is the ability to atomically claim that work (in other terms, the work will be assigned or received and accepted by the worker, to be completed at some end-date). This is where the similarity with a messaging system stops (since a messaging system does not have atomic ownership abilities), but systems like [http://zookeeper.apache.org/ zookeeper] or [https://github.com/coreos/etcd etcd] do provide these capabilities using their raft/paxos/zab algorithms (and they also provide enough posting/notification capabilities to supply the above messaging-like functionality). So this is how a worker accepts work and begins to complete it. That's all great, but what usually happens in large distributed systems is that a percentage of those workers will die/fail/crash (or otherwise disappear), and the entity requesting that work to be completed '''should not''' have to know that this has happened (why should it care?). Instead, what that entity typically wants is for the work to be resumed by another equivalent worker; this is where etcd/zookeeper/... provide the ability to release ownership in an atomic manner, allowing another worker to attempt to resume and complete what the previous failed worker partially completed (or fully completed but did not commit).
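The claim/abandon cycle described above can be sketched with an in-process stand-in for the atomic ownership that zookeeper/etcd provide across machines. The <code>JobBoard</code> class and its methods are hypothetical illustrations, not the real taskflow jobboard API:

```python
# Illustrative sketch of atomic job claiming (an in-process stand-in for
# what zookeeper/etcd provide across machines; names are hypothetical).
import threading

class JobBoard:
    def __init__(self):
        self._lock = threading.Lock()
        self._owners = {}            # job name -> worker name

    def claim(self, job, worker):
        """Atomically claim a job; only one worker can succeed."""
        with self._lock:
            if job in self._owners:
                return False         # someone else already owns it
            self._owners[job] = worker
            return True

    def abandon(self, job, worker):
        """Release ownership (e.g. the worker died) so another can resume."""
        with self._lock:
            if self._owners.get(job) == worker:
                del self._owners[job]

board = JobBoard()
first = board.claim('resize-server-42', 'worker-1')   # accepted
second = board.claim('resize-server-42', 'worker-2')  # rejected: already owned
board.abandon('resize-server-42', 'worker-1')         # worker-1 "dies"
third = board.claim('resize-server-42', 'worker-2')   # resume succeeds
```

In a real deployment the claim is a node in zookeeper (or a key in etcd) whose creation either succeeds atomically or fails, and whose disappearance (on worker death) signals that the job is up for grabs again.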
 
 
 
'''NOTE:''' this ties in with having persistence and state transitions that can be resumed from (in a repeatable manner), so that the new worker can pick up from the last state transition and attempt to make further forward progress on whatever work was requested, without making the entity that requested that work aware of any ownership transfer. Amazing right!?!?
 
  
 

Revision as of 22:18, 6 February 2014

Revised on: 2/6/2014 by Harlowja

Executive Summary

Taskflow is a Python library for OpenStack that helps make task execution easy, consistent, and reliable. It allows the creation of lightweight task objects and/or functions that are combined together into flows (aka: workflows). It includes engines for running these flows in a manner that can be stopped, resumed, and safely reverted. Projects implemented using this library enjoy added state resiliency, fault tolerance and simplified crash recovery. Think of it as a way to protect an action, similar to the way transactions protect operations in an RDBMS. Typically, if a manager process is terminated while an action is in progress, there is a risk that unprotected code will leave the system in a degraded or inconsistent state. With this library, interrupted actions may be resumed or rolled back automatically when a manager process is restarted.

Using Taskflow to organize actions into lightweight task objects also makes atomic code sequences easily testable (since a task does one and only one thing). A flow facilitates the execution of a defined sequence of ordered tasks (and imposes constraints on the way an engine can run those tasks). A flow is a structure (a set of tasks linked together), so the calling code and the workflow are decoupled, allowing flows to be reused. Taskflow provides a few mechanisms for structuring & running flows and lets the developer pick and choose which one will work for their needs.

Conceptual Example

This pseudo code illustrates how a flow would work, for those who are familiar with SQL transactions.

START TRANSACTION
   task1: call nova API to launch a server || ROLLBACK
   task2: when task1 finished, call cinder API to attach block storage to the server || ROLLBACK
   ...perform other tasks...
COMMIT

The above flow could be used by Heat (for example) as part of an orchestration to add a server with block storage attached. It may launch several of these in parallel to prepare a number of identical servers (or do other work depending on the desired request).

Why

OpenStack code has grown organically, and does not have a standard and consistent way to perform sequences of code in a way that can be safely resumed or rolled back if the calling process is unexpectedly terminated while the code is busy doing something. Most projects don't even attempt to make tasks restartable, or revertible. There are numerous failure scenarios that are simply skipped and/or recovery scenarios which are not possible in today's code. Taskflow makes it easy to address these concerns.

Goal: With widespread use of Taskflow, OpenStack can become very predictable and reliable, even in situations where it's not deployed in high availability configurations.

Further use-cases

Service stop/upgrade/restart (at any time)

A typical issue in the runtime components that OpenStack projects provide is the question of what happens to the daemon (and the state of what the daemon was actively doing) if the daemons are forcibly stopped (this typically happens when software is upgraded, hardware fails, during maintenance windows and for other operational reasons).

service stop [glance-*, nova-*, quantum...]

Currently many of the OpenStack components do not handle this forced stop in a way that leaves the state of the system in a reconcilable state. Typically, the actions that a service was actively doing are immediately forced to stop, cannot be resumed, and are in a way forgotten (a later scanning process may attempt to clean up these orphaned resources).

Note: Taskflow will help in this situation by tracking the actions, tasks, and their associated states, so that when the service is restarted (even after the service's software is upgraded) the service can easily resume (or roll back) the tasks that were interrupted when the stop command was triggered. This helps avoid orphaned resources and helps reduce the need for further daemon processes to attempt cleanup-related work (said daemons typically activate periodically and cause network or disk I/O even if there is no work to do).

Orphaned resources

Due to the lack of transactional semantics, many of the OpenStack projects will leave resources in an orphaned state (or in an ERROR state). This is largely unacceptable if OpenStack is to be driven by automated systems (for example Heat), which will have no way of analyzing what orphans need to be cleaned up. Taskflow, by providing its task-oriented model, enables semantics which can be used to correctly track resource modifications. This allows all actions done on a resource (or a group of resources) to be undone in an automated fashion, ensuring that no resource is left behind.

Metrics and history

When OpenStack services are structured into task and flow objects and patterns, they automatically gain the ability to add metric reporting and action history, simply by asking taskflow to record the metrics/history associated with running a task and flow. Right now the various OpenStack services have varied ways of accomplishing a similar set of features, but by using taskflow those varied ways can be unified into one that will work for all the OpenStack services (and the developer using taskflow does not have to be concerned with how taskflow records this information). This helps decouple the metrics & history associated with running task and flow code from the actual code that defines the task and flow actions.

Progress/status tracking

In many of the OpenStack projects there is an attempt to show the progress of actions the project is doing. Unfortunately, the implementation of this varies across projects, leading to inconsistent and not very accurate progress/status reporting and/or tracking. Taskflow can help here by providing a low-level mechanism that makes it much easier (and simpler) to track progress: you plug into taskflow's built-in notification system. This avoids having to intrusively add code to the actions that are being performed to accomplish the same goal. It also makes your progress/status mechanism resilient to future changes, by decoupling status/progress tracking from the code that performs the underlying actions.

Design

Key primitives: StructuredWorkflowPrimitives

Ten thousand foot view

(Diagram: Steps.png)


Structure

Atoms

(Diagram: Flow2.png)

An atom is the smallest unit in taskflow. An atom acts as the base for other classes in taskflow (to avoid duplicated functionality). An atom is expected to name its desired input values/requirements and its output/provided values, as well as carry its own name and a version (if applicable).

Tasks

A task (also an atom) is the smallest possible unit of work that can have an execute & rollback sequence associated with it.

Flows

A flow is a structure that links one or more tasks together in an ordered sequence. When a flow rolls back, it executes the rollback code for each of its child tasks, using whatever reverting mechanism the task has defined as applicable to reverting the logic it applied.

Patterns

Also known as: how you structure your work to be done (via tasks and flows) in a programmatic manner.

Linear
Description: Runs a list of tasks/flows, one after the other in a serial manner.
Constraints: Predecessor tasks' outputs must satisfy successor tasks' inputs.
Use-case: This pattern is useful for structuring tasks/flows that are fairly simple, where a task/flow follows a previous one.
Benefits: Simple.
Drawbacks: Serial; no potential for concurrency.

Unordered
Description: Runs a set of tasks/flows, in any order.
Constraints: Disallows inter-task dependencies.
Use-case: This pattern is useful for tasks/flows that are fairly simple and are typically embarrassingly parallel.
Benefits: Simple. Inherently concurrent.
Drawbacks: No dependencies allowed. Tracking and debugging are harder due to the lack of reliable ordering.

Directed acyclic graph
Description: Runs a graph (a set of nodes and edges between those nodes) composed of tasks/flows in dependency-driven ordering.
Constraints: Dependency-driven, no cycles. A task's dependencies are guaranteed to be satisfied before the task will run.
Use-case: This pattern allows for a very high level of potential concurrency and is useful for tasks which can be arranged in a directed acyclic graph. Each independent task (or independent subgraph) in the graph can be run in parallel once its dependencies have been satisfied.
Benefits: Allows for complex task ordering. Can be automatically made concurrent by running disjoint tasks/flows in parallel.
Drawbacks: Complex. Cannot support cycles. Tracking and debugging are harder due to graph dependency traversal.

Note: any combination of the above can be composed together (aka, it is valid to add a linear pattern into a graph pattern, or vice versa).
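The dependency-driven ordering used by the graph pattern can be sketched with a topological sort. This is an illustrative sketch using the Python standard library's graphlib (Python 3.9+), not taskflow's own graph flow implementation; the task names are hypothetical:

```python
# Illustrative sketch of the directed-acyclic-graph pattern: the run order
# is derived from declared dependencies via a topological sort.
from graphlib import TopologicalSorter

# task -> set of tasks it depends on
deps = {
    'attach_volume': {'create_server', 'create_volume'},
    'create_server': set(),
    'create_volume': set(),
}
order = list(TopologicalSorter(deps).static_order())
# 'attach_volume' always comes last; 'create_server' and 'create_volume'
# have no dependency between them, so an engine could run them in parallel.
```

This is also why the graph pattern "can be automatically made concurrent": any tasks with no path between them in the graph are safe to run simultaneously.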

Engines

Also known as: how your tasks get from PENDING to FAILED/SUCCESS. Their purpose is to reliably execute your desired workflows and handle the control & execution flow around making this possible. This makes it so that code using taskflow only has to worry about forming the workflow, and not about distribution, execution, reverting or how to resume (and more!).

See: engines for more details.

States

Also known as: the potential state transitions that a flow (and tasks inside of it) can go through.

See: states of task and flow for more details.

Reversion

Both tasks and flows can be reverted by executing the related rollback code on the task object(s).

For example, if a flow asked for a server with a block volume attached, by combining two tasks:

task1: create server || rollback by delete server
task2: create+attach volume || rollback by delete volume

If the attach volume code fails, all tasks in the flow would be reverted using their rollback code, causing both the server and the volume to be deleted.

Persistence

TaskFlow can save current task states, progress, arguments and results, as well as flows, jobs and more, to a database (or any other place), which enables flow and task resumption, check-pointing and reversion.

See: persistence for more details about why and how.

Resumption

If a flow is started, but is interrupted before it finishes (perhaps the controlling process was killed) the flow may be safely resumed at its last checkpoint. This allows for safe and easy crash recovery for services. Taskflow offers different persistence strategies for the checkpoint log, letting you as an application developer pick and choose to fit your applications usage and desired capabilities.
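The checkpoint-and-resume idea can be sketched as follows. This is an illustrative model with hypothetical names (<code>run</code> and the task names), not the real taskflow persistence API:

```python
# Illustrative sketch of resumption: record each completed task in a
# checkpoint log; on re-run, skip anything already logged.
def run(tasks, checkpoints, executed):
    for name in tasks:
        if name in checkpoints:
            continue                 # already done before the crash; skip
        executed.append(name)        # do the (pretend) work
        checkpoints.add(name)        # persist progress *after* success

tasks = ['download', 'unpack', 'install']
checkpoints = {'download'}           # pretend we crashed after 'download'
executed = []
run(tasks, checkpoints, executed)
# only the unfinished tail runs: executed == ['unpack', 'install']
```

A real backend would write the checkpoint set to durable storage (a database, zookeeper, etc.) so that a freshly started process sees the same log the crashed one left behind.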

Inputs and Outputs

See: inputs and outputs for more details about the various inputs/outputs/notifications that flows, tasks and engines produce/consume/emit.

Mind-shifts

Please visit the following [[TaskFlow/Mind_shifts|link]] to view common mind-shifts that occur when using taskflow.

Best practices

See: best practices for a helpful set of best usages and best practices that are common when using taskflow.

Past & Present & Future

Past

Taskflow started as a prototype with the NTTData corporation along with Yahoo! for Nova, and has since moved into a more general solution/library that can form the underlying structure of multiple OpenStack projects at once.

See: StructuredStateManagement

See: StructuredWorkflows

Present

Currently we are gathering requirements, continuing work on taskflow (the library), and integrating with various projects.

Active Integration

Future

Planned Integration

Convection

Taskflow is the library needed to build the Convection service.

  • Convection will add a REST API that allows remote execution of tasks and flows in a remote container.

Examples

Please go here

Contributors (past and present)

  • Anastasia Karpinska (Grid Dynamics) [Core]
  • Adrian Otto (Rackspace)
  • Changbin Liu (AT&T Labs) [Core]
  • Ivan Melnikov (Grid Dynamics) [Core]
  • Joshua Harlow (Yahoo!) [PTL] [Core]
  • Jessica Lucci (Rackspace)
  • Keith Bray (Rackspace)
  • Kevin Chen (Rackspace)
  • Rohit Karajgi (NTTData)
  • Tushar Patil (NTTData)
  • Stanislav Kudriashev (Grid Dynamics)
  • You!!

Meeting

Similar(ish) libraries

Join us!

Launchpad: https://launchpad.net/taskflow

IRC: You will also find us in #openstack-state-management on freenode

Core: Team

Reviews: Help code review

Code

Releases

Version Summary Release Date
0.1: Core functionality (released 10/24/2013)
0.1.1: Small bug fixes for 0.1 (released 11/15/2013)
0.1.2 (released 1/10/2014):
  • Small bug fixes for 0.1.1
  • Python 3 compatibility adjustments
  • Requirements updated to unbind sqlalchemy
0.1.3 (released 2/6/2014):
  • Small bug fixes for 0.1.2
  • Concurrency adjustments
  • Exception unicode text cleanup