TaskFlow/Paradigm shifts

Revised on: 2/9/2014 by Harlowja

⚠ WARNING ⚠

The effects of these paradigm shifts may stay with you throughout your life (you have been warned).

Piece by piece

Mind-blown: atom, task, flow, what are these???

In taskflow, your code is structured differently than a typical programmer may be used to (functions, or object orientation + objects). In order to have workflows that are easy to introspect, easy to resume, and easy to revert in an automated fashion, taskflow introduces its smallest unit: an atom. An atom is in many ways similar to an abstract interface, in that an atom specifies its desired input data/requirements and its output/provided values, and is given a name. A task in taskflow is an atom that has execute()/revert() methods which use the declared requirements to produce some output/provided values; it is one of the key derived classes of an atom (with more to come soon). The main difference between a task and a function is that a task explicitly declares its inputs and its outputs (since it derives from the atom base class), has an identifying name associated with it, and may have an associated way to revert what it has done (if the task produces side effects). In order to organize these smallest units into something useful, the concept of a flow was created; a flow describes the expected execution flow that your set of tasks will go through to accomplish a goal. Because each task declares its inputs and outputs, the ordering can also be inferred (although it does not need to be), which makes it that much simpler to make a group of small tasks accomplish some larger goal.
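
To make this concrete, here is a minimal sketch of two tasks wired together in a linear flow (the class names and the 'greeting' value are made up for illustration):

    from taskflow import engines
    from taskflow import task
    from taskflow.patterns import linear_flow


    class MakeGreeting(task.Task):
        # This task declares what it provides; the engine stores the
        # result under the name 'greeting' for later atoms to consume.
        default_provides = 'greeting'

        def execute(self):
            return 'hello world'


    class PrintGreeting(task.Task):
        # The 'greeting' argument doubles as this task's declared
        # requirement; the engine supplies it from the prior result.
        def execute(self, greeting):
            print(greeting)


    flow = linear_flow.Flow('greeter').add(MakeGreeting(), PrintGreeting())
    engines.run(flow)

Notice that the ordering here (MakeGreeting before PrintGreeting) could also have been inferred from the declared inputs and outputs alone.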

NOTE: for further details on the task and flow structures that are built into taskflow, please see the structure overview page.

Resilience

Mind-blown: when ordering my work with flows and tasks and enabling persistence, is it really possible to resume from a partial completion of those flows and tasks using taskflow?

Yes it is! (One of taskflow's key concepts/goals is to bring this functionality to as many OpenStack projects as possible.) Resilience for all!
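
A rough sketch of how that looks, assuming a sqlite file as the persistence backend (the connection string is an example, and the helper names follow the taskflow persistence examples, so check your release's docs):

    from taskflow import engines
    from taskflow.persistence import backends
    from taskflow.utils import persistence_utils

    backend = backends.fetch({'connection': 'sqlite:///flows.db'})
    with backend.get_connection() as conn:
        conn.upgrade()  # create the schema if it does not exist yet

    book, flow_detail = persistence_utils.temporary_flow_detail(backend)
    engine = engines.load(flow, flow_detail=flow_detail,
                          book=book, backend=backend)
    engine.run()

If the process dies mid-run, loading the same flow_detail from the backend later and calling run() again resumes from the last task that completed instead of starting over.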

Exceptions

Mind-blown: has my exception logic changed, what does it mean if a task throws an exception, who catches it, what happens???

Exceptions that occur in a task, and which are not caught by the internals of that task, will by default trigger reversion of the entire workflow that task was in (the engine is responsible for handling this reversion process, just as it is responsible for handling the happy path). If multiple tasks in a workflow raise exceptions (say they are executing at the same time via a parallel engine using processes/threads, or via a distributed engine), then the individual paths that lead to those tasks will be reverted (if an ancestor task is shared by multiple failing tasks, it will be reverted only once).
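
For example, a task that allocates something externally can declare how to undo that allocation; the revert() keyword arguments shown here (result and flow_failures) follow the taskflow task documentation, while the allocate/release helpers are hypothetical:

    from taskflow import task


    def allocate_port():
        ...  # hypothetical call that may raise


    def release_port(port_id):
        ...  # hypothetical cleanup call


    class AllocatePort(task.Task):
        default_provides = 'port_id'

        def execute(self):
            return allocate_port()  # an exception here triggers reversion

        def revert(self, result, flow_failures, **kwargs):
            # Invoked by the engine when this task, or a task after it,
            # fails; 'result' is whatever execute() returned (or a failure
            # object if execute itself raised) and 'flow_failures' maps
            # failing atom names to their failure details.
            release_port(result)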

NOTE: in the future reversion strategies should be able to make this more customizable (allowing more ways to handle or alter the reversion process so that you can better decide what to do with unhandled exceptions).

Execution flow

Mind-blown: all my tasks belong to engine???

When a set of tasks and the associated structure that contains those tasks (aka the flows that create that structure) are given to an engine, along with an optional backend where the engine can store intermediate results (needed if the workflow should be able to resume on failure), the engine becomes the execution unit that is responsible for reliably executing the tasks contained in the flows you provide it. The engine will ensure that the provided structure is retained when executing: for example, a linear ordering of tasks created by a linear_flow structure will always be run in linear order, and a set of tasks structured in dependency ordering will always be run in that dependency order. The engine must adhere to these constraints; note that other constraints may be imposed by the engine type being used (i.e. a single-threaded engine will only run in a single thread, while a distributed or worker-based engine may run tasks remotely). So when selecting an engine, make sure to carefully select the desired feature set that will work for your application.
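
As a sketch of that selection (the keyword for choosing an engine type has varied across taskflow releases, so treat the engine/executor arguments as assumptions to verify against your version's docs):

    from concurrent import futures

    from taskflow import engines

    # Same flow, different execution strategies.
    engines.run(flow)  # default serial engine, runs in the calling thread

    engine = engines.load(flow, engine='parallel',
                          executor=futures.ThreadPoolExecutor(max_workers=4))
    engine.run()  # independent tasks may now run concurrently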

Nesting

Mind-blown: without functions (which are now tasks) how do we model and execute actions that require composition/nesting (no infinite recursion please)???

First, let me describe a little why this is hard, since it may not be very obvious. In a traditional structure & execution style (without a structured workflow), a function Y may call another function Z and treat what Z does as a blackbox. This type of structure and execution style does not inherently lead to a structure that can be executed by another party (the engine, in taskflow's case), and it does not easily (without language-level features/additions) allow any way to resume from the function Z if the program crashes while calling Z (and so on; if Z calls another function, the same problem occurs...). This is not to say that carefully designed software cannot do this; it just means that it will likely end up building something like taskflow to solve the problem anyway. To avoid this problem, and enable the features that taskflow creates (resuming, execution control), we need to flip this kind of model on its head (or at least turn it 90 degrees).

The mindshift that taskflow introduces to get around the blackbox problem (Y calling Z, Z calling more functions, and so on) is to change the normal Y->Z structure into a set of dependencies & task inputs and outputs, with results being passed between those tasks (in a way similar to message passing). This simple model then allows taskflow (and its engine concept) to restart from a given point by resuming from the last task that completed. Note that this alone still makes it difficult to nest tasks. To address this limitation, taskflow provides a way to nest tasks and flows. For example, a linear_flow Y' can contain tasks [A, B, C] and then another linear_flow Z' can contain [D, E, Y', F]. This means that the F task can depend on all things that Y' (and D, E) have produced before F starts executing (Y' becomes like a blackbox that produces some output, similar in nature to the function Z from above). This kind of composition does not restrict taskflow from resuming, as taskflow internally knows what composes subflows (like Y') and can resume from a nested flow's task if it needs to.
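
In code, the Y'/Z' example above looks roughly like this (the task classes A through F are assumed to be defined elsewhere):

    from taskflow.patterns import linear_flow

    # Y' is a flow of its own...
    y_flow = linear_flow.Flow("Y-prime").add(A(), B(), C())

    # ...and nests inside Z' just like any other task would, so F runs
    # only after D, E and everything inside Y' have completed.
    z_flow = linear_flow.Flow("Z-prime").add(D(), E(), y_flow, F())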

NOTE: coroutines in future versions of python (pep-3156) have a similar task-like model (not identical, but similar). The issue with coroutines is that they still do not provide the capability to resume, revert, or structure your code in a way that maps closely to the actual workflow to be executed. They do, though, create a base architecture that can be built on to help make this easier to accomplish. It is expected that taskflow's abstractions should map relatively easily onto python 3.4, which is expected to include a version of pep-3156 once it matures.

Control flow

Mind-blown: where did my complex control flow go???

This one is a slight variation on how a programmer normally structures execution control flow. In order to be able to track the execution of your workflow, the desired workflow must be split up into small pieces (in a way similar to functions) ahead of time, without much ability to change that execution order at run-time. Using this relatively static structure, taskflow engines can then run your structure in a well-defined and resumable manner (such a relatively static set has been shown to be good enough by papers & research such as tropic).

This does, though, currently have a few side effects, in that certain traditional operations (if-then-else, do-while, fork-join, switch...) become more complex: those types of control flows do not easily map to a representation that can be easily resumed or run in a potentially distributed manner (since they alter control flow while executing, or create complex and hard-to-model dependencies between tasks). To keep taskflow relatively minimal (and simple) we have tried to reduce the allowed set to a more manageable and currently smaller one (do the simple things well and add in complexity later, if and when it's needed). If these control flows prove valuable then we will revisit if and how we should make them accessible to users of taskflow.

NOTE: inside a task, the execute() method may use whatever control flow it desires (anything supported by python), but outside of execute() the set of control flow operators is more minimal (due to the above reasoning/limitations). Another way to accomplish complex control flow is to have the factory function associated with creating your workflow (the method location that is persisted on logbook creation) perform most of it while constructing the needed tasks, as in the sketch below. For more information about this see the flow factory reference.
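
A sketch of that factory approach (the function, flag, and task names are made up for illustration); the branching happens once, while the flow is being built, so the structure the engine eventually runs is still static:

    from taskflow.patterns import linear_flow


    def make_resize_flow(needs_backup=False):
        # Ordinary python control flow is fine here because it runs at
        # construction time, not while the engine is executing the flow.
        flow = linear_flow.Flow('resize')
        if needs_backup:
            flow.add(BackupTask())
        flow.add(ResizeTask(), VerifyTask())
        return flow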

Workflow ownership transfer

Mind-blown: I have read that taskflow supports a way to automatically transfer workflows to workers who can complete that work, as well as the ability to resume partially completed work automatically. How is this possible? Is it?

It is possible, and it is desired that this would be a typical usage pattern. Taskflow's jobboard concept acts as a location where work is posted for some selected worker to complete, in a way similar to what a physical/virtual jobboard does. Posting work to a jobboard allows that work to be picked up by any type of worker watching that jobboard for new work to appear; in a way this is similar to a messaging system, but with notifications when new messages/work appear. This is how workers become aware of new work.

That is one part of the puzzle; the second part is the ability to atomically claim that work (in other terms, the work will be assigned to, or received and accepted by, the worker, to be completed by some end-date). This is where the similarity with a messaging system stops, since a messaging system does not have atomic ownership abilities, but systems like zookeeper or etcd do provide these capabilities using their raft/paxos/zab algorithms (and they also provide enough posting/notification capabilities to provide the above messaging-like functionality). This is how a worker accepts work and begins to complete it.

That's all great, but what usually happens in large distributed systems is that a percentage of those workers will die/fail/crash (or otherwise disappear), and the entity requesting the work should not have to know that this has happened (why should they care?). Instead, what that entity typically wants is for the work to be resumed by another equivalent worker. This is where etcd/zookeeper/... provide the ability to release ownership in an atomic manner, thus allowing another worker to attempt to resume and complete the work that the previous failed worker partially completed (or fully completed but never committed).
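
A rough sketch of that pattern (the zookeeper configuration, board name, and worker name are assumptions; the post/claim/consume calls follow the taskflow jobboard docs, so verify them against your release):

    from taskflow.jobs import backends as job_backends

    conf = {'board': 'zookeeper', 'hosts': ['localhost:2181']}

    # Producer side: post work for any watching worker to pick up.
    board = job_backends.fetch('my-board', conf)
    board.connect()
    job = board.post('resize-server-42')

    # Worker side: claim a job atomically, do it, then consume it. If
    # the worker dies after claiming, its claim is released and another
    # worker can claim the same job and resume the work.
    for job in board.iterjobs(only_unclaimed=True):
        board.claim(job, who='worker-1')
        # ... load and run the job's flows with an engine here ...
        board.consume(job, who='worker-1')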

NOTE: This ties in with having persistence and state transitions that can be resumed from (in a repeatable manner), so that the new worker can pick up at the last state transition and attempt to make further forward progress on whatever work was requested, without making the entity that requested that work aware of any ownership transfer. Amazing, right!?!?