TaskFlow/Engines
== The idea ==

As of this writing, flow classes (under the <code>taskflow.patterns</code> package) have several responsibilities.

'''Describing flow structure''': implicit and explicit dependencies between tasks, and task ordering, are part of the flow definition.

'''Holding runtime data''': task results and states are part of the flow instance's internal state (see e.g. <code>taskflow.patterns.linear_flow.Flow.results</code>).

'''Executing tasks''': the flow is responsible for selecting the next task(s) to run or revert, based on the flow structure and current state, and for actually running the code.

It would be nice, cool, and actually useful to split the flow into three entities so that these responsibilities become separated.
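A minimal sketch of what that split could look like -- all class and method names here are hypothetical illustrations, not the actual taskflow API:

```python
# Illustrative sketch of the proposed three-way split; names are
# hypothetical, not the real taskflow API.

class LinearPattern:
    """Describes structure only: an ordered list of tasks."""
    def __init__(self, tasks):
        self.tasks = list(tasks)

class Storage:
    """Holds runtime data: task results, separate from the pattern."""
    def __init__(self):
        self.results = {}

    def save(self, task_name, result):
        self.results[task_name] = result

class SerialEngine:
    """Executes tasks: walks the pattern, records results in storage."""
    def __init__(self, pattern, storage):
        self.pattern = pattern
        self.storage = storage

    def run(self):
        for task in self.pattern.tasks:
            self.storage.save(task.__name__, task())

def fetch():
    return "data"

def process():
    return "processed"

engine = SerialEngine(LinearPattern([fetch, process]), Storage())
engine.run()
print(engine.storage.results)  # {'fetch': 'data', 'process': 'processed'}
```

The point is that <code>LinearPattern</code> knows nothing about execution or results, and could be handed to a very different engine unchanged.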
 
 
== Why ==

If a pattern specifies implementation details, it is not a "pattern" but something else.

We need a clear understanding of the basic concepts to move forward.
 
 
== Patterns ==

A pattern is a tool to describe '''structure'''.

Possible pattern examples are:

* Linear -- run one task after another;
* Parallel -- just run all the tasks, in any order or even simultaneously;
* DAG -- run tasks with dependency-driven ordering, with no cycles;
* Generic graph -- run tasks with dependency-driven ordering, potentially with cycles;
* Blocks -- combine all of the above into a more complicated structure.
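The "Blocks" idea can be sketched as patterns that nest inside one another while remaining pure structure descriptions (class names here are illustrative, not the real taskflow API):

```python
# Hypothetical sketch: patterns are plain structure descriptions that
# can be nested ("blocks"); nothing here executes anything.

class Linear:
    def __init__(self, *items):
        self.items = items

class Parallel:
    def __init__(self, *items):
        self.items = items

def flatten(pattern):
    """Walk a nested pattern and collect task names in one valid order.

    This is a serial interpretation; an engine would be free to run the
    items of a Parallel block in any order or simultaneously.
    """
    if isinstance(pattern, (Linear, Parallel)):
        names = []
        for item in pattern.items:
            names.extend(flatten(item))
        return names
    return [pattern]

# A linear flow whose middle step is a parallel block.
flow = Linear("download", Parallel("resize", "tag"), "upload")
print(flatten(flow))  # ['download', 'resize', 'tag', 'upload']
```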
 
 
The idea is that graph flow (based on topological sort) and threaded flow (work in progress as of this writing, https://review.openstack.org/34488) are the '''same flow pattern''': a graph is built from task dependencies and analysed to get the task ordering. You can run the same tasks via a distributed flow on celery, and it will be the '''same flow'''.

Because what matters is the code that runs; everything else is details (though important ones).

It would be cool to be able to specify how a flow is run at runtime or in a configuration file: simple stuff for debugging tasks, distributed for large-scale deployments, etc. This is how we come to...
 
 
 
== Engine ==

Engine is what really runs the tasks. It should take the flow structure (described by patterns) and use it to decide which task to run and when.

There may be different implementations of engines. Some may be easier to use (e.g. require no setup) and to understand; others might require a more complicated setup but provide better scalability. The idea and ideal is that deployers of a service that uses taskflow can select an engine that suits their setup best, without modifying the code of said service. This allows starting off with a simpler implementation and scaling out the service that is powered by taskflow as the service grows.

In concept, all engines should implement the same interface, to make it easy to replace one engine with another, and provide the same guarantees on how patterns are interpreted -- for example, if an engine runs a linear flow, the tasks should be run one after another, in order.
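A shared interface with that ordering guarantee could be sketched like this (the <code>Engine</code> base class and its methods are hypothetical, not the real taskflow API):

```python
# Sketch of a shared engine interface: both engines accept the same
# flow description and expose the same run() method, so one can be
# swapped for the other without touching service code. All names are
# hypothetical.
import abc
import concurrent.futures

class Engine(abc.ABC):
    def __init__(self, tasks):
        self.tasks = tasks  # a linear flow: an ordered list of callables

    @abc.abstractmethod
    def run(self):
        """Run all tasks and return their results in flow order."""

class SerialEngine(Engine):
    def run(self):
        return [task() for task in self.tasks]

class ThreadedEngine(Engine):
    # For a *linear* flow the ordering guarantee still forces one task
    # at a time; threads would only pay off for independent tasks.
    def run(self):
        results = []
        with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
            for task in self.tasks:
                results.append(pool.submit(task).result())
        return results

tasks = [lambda: 1, lambda: 2]
# Same flow, same guarantees, different engines:
assert SerialEngine(tasks).run() == ThreadedEngine(tasks).run() == [1, 2]
```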
  
 
Possible engines include:

* Simple -- just takes e.g. a linear flow and runs its tasks one after another; should be useful for debugging tasks and simple use cases;
* Threaded -- runs tasks in separate threads, enabling them to run in parallel (even several implementations are possible);
* Distributed -- loads tasks into celery (or some other external service) that uses task dependencies to determine ordering.

Engines might have different capabilities and different configuration, but overall the interface should remain the same.
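Config-driven engine selection might then look roughly like the following (the registry and config keys are invented for illustration; the real mechanism would live in taskflow itself):

```python
# Hypothetical sketch of picking an engine by name from deployment
# configuration, so service code never hard-codes an engine.

def run_serial(tasks):
    return [task() for task in tasks]

def run_distributed(tasks):
    # Placeholder: a real implementation would hand the tasks off to
    # celery or another external service.
    raise NotImplementedError("requires an external service")

ENGINES = {"serial": run_serial, "distributed": run_distributed}

def load_engine(conf):
    """Select an engine by name, defaulting to the simple one."""
    return ENGINES[conf.get("engine", "serial")]

conf = {"engine": "serial"}  # e.g. read from a configuration file
run = load_engine(conf)
print(run([lambda: "ok"]))  # ['ok']
```

Swapping <code>"serial"</code> for <code>"distributed"</code> in the config would change how the flow runs without any change to the service code.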
  
 
== How ==

Blueprint: https://blueprints.launchpad.net/taskflow/+spec/patterns-and-engines

== Storage ==

Storage is out of scope of [https://blueprints.launchpad.net/taskflow/+spec/patterns-and-engines the blueprint], but it is still worth pointing out its role here.
  
We already have storage in taskflow -- that's the logbook. But it should be emphasized that the logbook should become the authoritative and, preferably, the '''only''' source of runtime state information. When a task returns a result, it should be written directly to the logbook. When a task or flow state changes in any way, the logbook is the first to know. The flow should '''not''' store task results -- there is the logbook for that.

The logbook and a backend are together responsible for storing the actual data -- they specify the persistence mechanism (how data is saved and where -- memory, database, whatever) and the persistence policy (when data is saved -- every time it changes, at some particular moments, or simply never).
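One way to picture the logbook-plus-backend arrangement, with a simple "persist on every change" policy (all names are hypothetical, not the actual taskflow logbook API):

```python
# Hypothetical sketch: the logbook is the single source of runtime
# state; a backend plus a policy decide how and when it is persisted.

class Logbook:
    def __init__(self, backend=None, persist_on_change=True):
        self.records = {}                    # task name -> (state, result)
        self.backend = backend               # e.g. memory or database writer
        self.persist_on_change = persist_on_change  # the persistence policy

    def record(self, task_name, state, result=None):
        # Every state change goes to the logbook first...
        self.records[task_name] = (state, result)
        # ...and the policy decides whether it hits the backend now.
        if self.backend is not None and self.persist_on_change:
            self.backend.append((task_name, state, result))

saved = []                       # stands in for a real storage backend
book = Logbook(backend=saved)
book.record("boot", "RUNNING")
book.record("boot", "SUCCESS", result=42)
print(book.records["boot"])      # ('SUCCESS', 42)
print(len(saved))                # 2: every change was persisted
```

Note that the flow/engine would only ever call <code>record()</code>; task results never live anywhere else.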
 

Revision as of 05:48, 14 September 2013
