Difference between revisions of "TaskFlow/Checkpointing"

Revision as of 15:47, 23 October 2013

Checkpointing is an idea we might want to see implemented in taskflow in the future.

A checkpoint identifies a state of a flow or, in other words, marks a point in a flow. Checkpoints are associated with a particular point of the flow graph: before a particular task, after a particular task, or "between" certain tasks or subflows. Checkpoints are part of the flow definition.

Checkpoints may have many uses, including:

* manage how a flow is run
** tell the engine to stop at a particular checkpoint
** revert the flow to a particular checkpoint
** revert the flow to a particular checkpoint and then retry
* manage how state (history) is saved
** associate data with a checkpoint
** discard history up to a particular checkpoint

Reversion Strategies

Blueprint: bp:reversion-strategies

In the simplest case, when an error occurs, the flow should be reverted to its initial state, as if it had never run. But other cases exist:

* one may want to revert several tasks and then try to run them again:
** maybe right now, to take advantage of some kind of HA setup
** maybe later, to give the operator a chance to fix things before the retry
* for long-running flows (like distributed flows running some actions periodically) it makes no sense to revert everything; instead, the flow should be reverted to some checkpoint (e.g. the last one) and then considered reverted.

These (and possibly other) reverting policies should be specified in the flow definition. Checkpoints might be the tool for that.
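
The difference between reverting everything and reverting only to the last checkpoint can be sketched in plain Python. This is only an illustration, not taskflow code: the Task and Checkpoint classes, the run function, and the policy names "revert_all" and "revert_to_last_checkpoint" are all assumptions made for this sketch.

```python
class Task:
    """Hypothetical task that can be executed and reverted."""

    def __init__(self, name, fail=False):
        self.name = name
        self.fail = fail

    def execute(self, log):
        if self.fail:
            raise RuntimeError(f"{self.name} failed")
        log.append(f"run {self.name}")

    def revert(self, log):
        log.append(f"revert {self.name}")


class Checkpoint:
    """Hypothetical marker node placed between tasks."""

    def __init__(self, name):
        self.name = name


def run(nodes, policy="revert_all"):
    """Run nodes in order; on failure, revert according to policy.

    Returns (log, index of the node from which a later run would resume).
    """
    log = []
    done = []            # indices of completed tasks, in execution order
    last_checkpoint = 0  # index just after the most recently passed checkpoint
    for i, node in enumerate(nodes):
        if isinstance(node, Checkpoint):
            last_checkpoint = i + 1
            continue
        try:
            node.execute(log)
            done.append(i)
        except RuntimeError:
            stop_at = last_checkpoint if policy == "revert_to_last_checkpoint" else 0
            # Revert completed tasks in reverse order, down to the stop point.
            while done and done[-1] >= stop_at:
                nodes[done.pop()].revert(log)
            return log, stop_at
    return log, len(nodes)
```

With nodes a, checkpoint, b, c (where c fails), "revert_all" reverts b and a, while "revert_to_last_checkpoint" reverts only b and leaves the flow positioned right after the checkpoint.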

Go to Checkpoint

We can go further in providing manual or, more generally, external control of a flow. One possible way to do that is to allow a user (or any other external entity) to order the engine to get to a particular checkpoint in the flow. To get there, some tasks are executed, some other tasks are reverted, or both. When the checkpoint is reached, the engine stops, leaving the flow in an interrupted state.

One possible use case for that is Anvil. The whole process from bootstrapping to an installed and running (and/or tested) OpenStack can be represented as one big flow, with anvil actions just directing an internal engine to go to a particular checkpoint.
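
A minimal sketch of the "go to checkpoint" idea for a linear flow might look like the following. The Engine class, its go_to method, the "INTERRUPTED" state string, and the task and checkpoint names are all hypothetical and exist only for this illustration; they are not part of taskflow or Anvil.

```python
class Engine:
    """Hypothetical engine over a linear flow with named checkpoints."""

    def __init__(self, tasks, checkpoints):
        # tasks: task names in execution order.
        # checkpoints: checkpoint name -> number of tasks that are
        # complete when that checkpoint is reached.
        self.tasks = tasks
        self.checkpoints = checkpoints
        self.position = 0  # number of tasks currently completed
        self.state = "PENDING"
        self.log = []

    def go_to(self, checkpoint):
        target = self.checkpoints[checkpoint]
        # Moving forward: execute the tasks before the checkpoint.
        while self.position < target:
            self.log.append(f"execute {self.tasks[self.position]}")
            self.position += 1
        # Moving backward: revert the tasks after the checkpoint.
        while self.position > target:
            self.position -= 1
            self.log.append(f"revert {self.tasks[self.position]}")
        # The engine stops at the checkpoint and leaves the flow interrupted.
        self.state = "INTERRUPTED"
        return self.position


engine = Engine(
    tasks=["bootstrap", "install", "start", "test"],
    checkpoints={"bootstrapped": 1, "installed": 2, "running": 3},
)
engine.go_to("running")       # executes bootstrap, install, start
engine.go_to("bootstrapped")  # reverts start, then install
```

In this sketch the caller never says whether to execute or revert; the direction follows from where the flow currently is relative to the requested checkpoint.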

Discarding State

Blueprints: bp:checkpointing bp:book-retention

For long-running flows, keeping all the history might become a problem, in a way similar to what Heat has hit. While the scale of the problem is smaller for taskflow (at least in the short term), some measures should be taken for infinitely running flows, like the ones running periodic tasks.

Checkpoints might be helpful there. As Jessica Lucci pointed out:

20:51:16 <jlucci> I'm not sure if it's feasible, but the idea is that once you get to a checkpoint,
20:51:26 <jlucci> you can sort of consolidate the data up to that point
20:51:37 <jlucci> You don't have to keep track of the states of any previous tasks or anything like that

Discarding the state may be subject to policy. A checkpoint is a natural place to attach such a policy. For example, a user might want to keep the whole history for the last 5 checkpoints, or for checkpoints from the last 3 days. When the flow next reaches a checkpoint, older state is discarded.
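
The "keep history for the last N checkpoints" policy could be sketched as below. The History class and its methods are hypothetical names for this illustration, and the sketch covers only the count-based policy, not the time-based one.

```python
from collections import deque


class History:
    """Hypothetical flow history that keeps detail for the last N checkpoints."""

    def __init__(self, keep_checkpoints):
        if keep_checkpoints < 1:
            raise ValueError("keep_checkpoints must be at least 1")
        # Each retained segment is (checkpoint name, records written before it).
        self.segments = deque(maxlen=keep_checkpoints)
        self.current = []   # records written since the last checkpoint
        self.discarded = 0  # number of records dropped so far

    def record(self, entry):
        self.current.append(entry)

    def checkpoint(self, name):
        # When the deque is full, appending evicts the oldest segment,
        # so count its records as discarded first.
        if len(self.segments) == self.segments.maxlen:
            self.discarded += len(self.segments[0][1])
        self.segments.append((name, self.current))
        self.current = []
```

Discarding happens only when a checkpoint is reached, which matches the idea that older state is dropped the next time the flow gets to a checkpoint.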

Implementation Notes

One option is to attach a checkpoint to a subflow:

 linear_flow.Flow(name="root").add(
      DoSomethingTask(),
      # Proposed API: the checkpoint argument and Checkpoint class do not exist yet.
      linear_flow.Flow(name="subflow", checkpoint=Checkpoint(name="Something is DONE", retry=5)).add(
           DoSomethingElseTask(),
           LastTask()
      )
 )

If LastTask in the example fails, DoSomethingElseTask should be reverted, and then DoSomethingElseTask and LastTask should be attempted again.
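
The retry behaviour described above can be sketched without taskflow. The run_subflow function is a hypothetical helper, and reading retry=5 as "up to 5 additional attempts after the first failure" is an assumption of this sketch; the page does not define it.

```python
def run_subflow(tasks, retry):
    """Run tasks in order; on failure, revert completed ones and try again.

    tasks: list of (name, execute) pairs, where execute() may raise RuntimeError.
    retry: maximum number of additional attempts after the first failure
           (an assumed meaning of the retry parameter).
    Returns the event log; re-raises the last error when attempts run out.
    """
    log = []
    for attempt in range(retry + 1):
        completed = []
        try:
            for name, execute in tasks:
                execute()
                log.append(f"run {name}")
                completed.append(name)
            return log
        except RuntimeError:
            # Revert what this attempt completed, in reverse order.
            for name in reversed(completed):
                log.append(f"revert {name}")
            if attempt == retry:
                raise
    return log


failures_left = [2]  # LastTask fails twice, then succeeds


def last_task():
    if failures_left[0] > 0:
        failures_left[0] -= 1
        raise RuntimeError("LastTask failed")


events = run_subflow(
    [("DoSomethingElseTask", lambda: None), ("LastTask", last_task)],
    retry=5,
)
```

Here DoSomethingElseTask is run and reverted twice before the third attempt completes both tasks; DoSomethingTask, which sits before the checkpoint, is never touched.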

Another option is to represent checkpoints as a special type of execution graph node and add them to the flow in the same way as tasks:

 linear_flow.Flow(name="same as above").add(
               DoSomethingTask(),
               Checkpoint(name="Something is DONE", retry=5),
               DoSomethingElseTask(),
               LastTask())

@@ Line 3: / Line 3: @@
 future.
-''Checkpoint'' identifies a state of a flow. Checkpoints may be associated with
+''Checkpoint'' identifies a state of a flow, or, so to say, marks a point in a flow.
-particular point of flow graph, before the particular task, or after the
+Checkpoints are associated with particular point of flow graph, before the
-particular task, or "between" certain blocks. Checkpoints are part of flow
+particular task, or after the particular task, or "between" certain tasks or subflows.
-definition.
+Checkpoints are part of flow definition.
-Checkpoints may be helpful to:
+Checkpoints may have many uses, including:
 * manage how flow is run
 ** tell the engine to stop at particular checkpoint
 ** revert the flow to particular checkpoint
+** revert the flow to particular checkpoint and then retry
 * manage how state (history) is saved
 ** associate data with a checkpoint
 ** discard history up to particular checkpoint
-== Reverting Policies ==
+== Reversion Strategies ==
+'''Blueprint''': [https://blueprints.launchpad.net/taskflow/+spec/reversion-strategies bp:reversion-strategies]
 In simplest case, when error occurs, the flow should be reverted to initial
@@ Line 28: / Line 31: @@
 This (and possibly other) reverting policies should be specified at flow
-definition. Checkpoints should be the tool for it.
+definition. Checkpoints might be the tool for it.
-Operators or users (through external services) may also need some kind of
-external control on how particular checkpoint works, e.g. to stop flow cycling
-around and revert it completely if recovery turned out to be  impossible.
 == Go to Checkpoint ==
@@ Line 48: / Line 47: @@
 == Discarding State ==
+'''Blueprints''': [https://blueprints.launchpad.net/taskflow/+spec/checkpointing bp:checkpointing] [https://blueprints.launchpad.net/taskflow/+spec/book-retention bp:book-retention]
 For long-running flows keeping all the history might become a problem, in a way
@@ Line 61: / Line 62: @@
 :51:37 <jlucci> You don't have to keep track of the states of any previous tasks or anything like that
-Discarding the state may be subject to policy specified by logbook or engine.
+Discarding the state may be subject to policy. Checkpoint is natural place to attach such policy to.
 For example, user might want to save the whole history for last 5 checkpoints,
 or checkpoints for last 3 days. When flow reaches checkpoint next time,
@@ Line 68: / Line 69: @@
 == Implementation Notes ==
-Checkpoints can be represented as a special type of blocks, and added to flow
+One of the options is to attach checkpoint to a subflow:
-in the same way as tasks:
-  my_flow = (LinearFlow(name='my flow')
-                .append(DoSomethingTask())
-                .append(Checkpoint(name="Something is DONE", on_revert='retry'))
-                .append(DoSomethingElseTask())
-                .append(LastTask()))
+  linear_flow.Flow(name="root").add(
+       DoSomethingTask(),
+       linear_flow.Flow(name="subflow", checkpoint=Checkpoint(name="Something is DONE", retry=5)).add(
+            DoSomethingElseTask(),
+            LastTask())
+       )
+  )
 If LastTask in the example fails, DoSomethingElseTask() should be reverted,
 and than DoSomethingElseTask and LastTask should be attempted again.
+Another option is to represent checkpoints as a special type of execution graph nodes, and have them added to flow
+in the same way as tasks:
+  linear_flow.Flow(name="same as above").add(
+                DoSomethingTask(),
+                Checkpoint(name="Something is DONE", retry=5),
+                DoSomethingElseTask(),
+                LastTask())