Difference between revisions of "TaskFlow/Checkpointing"
Revision as of 15:47, 23 October 2013
Checkpointing is a nice idea we might want to see implemented in taskflow in the future.
A checkpoint identifies a state of a flow or, so to speak, marks a point in a flow. Checkpoints are associated with a particular point of the flow graph: before a particular task, after a particular task, or "between" certain tasks or subflows. Checkpoints are part of the flow definition.
Checkpoints may have many uses, including:
- manage how the flow is run
  - tell the engine to stop at a particular checkpoint
  - revert the flow to a particular checkpoint
  - revert the flow to a particular checkpoint and then retry
- manage how state (history) is saved
  - associate data with a checkpoint
  - discard history up to a particular checkpoint
Reversion Strategies
Blueprint: bp:reversion-strategies (https://blueprints.launchpad.net/taskflow/+spec/reversion-strategies)
In the simplest case, when an error occurs, the flow should be reverted to its initial state, as if it had never run. But other cases exist:
- one may want to revert several tasks and then try to run them again:
  - maybe right away, to take advantage of some kind of HA setup;
  - maybe later, to give the operator a chance to fix things before the retry;
- for long-running flows (like distributed flows that run some actions periodically) it makes no sense to revert them entirely; instead, the flow should be reverted to some checkpoint (e.g. the last one) and then considered reverted.
These (and possibly other) reverting policies should be specified in the flow definition. Checkpoints might be the tool for that.
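The difference between the two reverting policies above can be sketched in a few lines. This is only an illustration under stated assumptions: `RevertPolicy`, `tasks_to_revert` and the `("task", name)` / `("checkpoint", name)` history entries are hypothetical names invented for this example, not taskflow API.

```python
from enum import Enum


class RevertPolicy(Enum):
    """Hypothetical policy names, used only for this illustration."""
    ALL = "all"                          # revert to the initial state
    LAST_CHECKPOINT = "last_checkpoint"  # revert only back to the last checkpoint


def tasks_to_revert(history, policy):
    """Return names of completed tasks to revert, most recent first.

    history is a list of ("task", name) or ("checkpoint", name) entries
    in the order the flow passed them.
    """
    to_revert = []
    for kind, name in reversed(history):
        if kind == "checkpoint":
            if policy is RevertPolicy.LAST_CHECKPOINT:
                break  # everything before the checkpoint is left alone
            continue   # full revert: checkpoints are simply skipped
        to_revert.append(name)
    return to_revert


history = [("task", "a"), ("checkpoint", "c1"), ("task", "b"), ("task", "c")]
print(tasks_to_revert(history, RevertPolicy.ALL))              # ['c', 'b', 'a']
print(tasks_to_revert(history, RevertPolicy.LAST_CHECKPOINT))  # ['c', 'b']
```

With the second policy, task "a" keeps its result, which is the behaviour suggested above for long-running flows.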
Go to Checkpoint
We can go further in providing manual or, more generally, external control of a flow. One possible idea is to allow the user (or any other external entity) to order the engine to get to a particular checkpoint in the flow. To get there, some tasks are executed, or some other tasks are reverted, or both. When the checkpoint is reached, the engine stops, leaving the flow in an interrupted state.
One possible use case is Anvil. The whole process from bootstrapping to an installed and running (and/or tested) OpenStack can be represented as one large flow, with Anvil actions just directing an internal engine to go to a particular checkpoint.
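A toy model may make the "go to checkpoint" idea more concrete. The sketch below assumes a purely linear flow; `LinearEngine`, its `go_to` method, the `"INTERRUPTED"` return value and the node names are all made up for illustration and are not part of taskflow or Anvil.

```python
class LinearEngine:
    """Toy engine over a linear flow; not the taskflow engine API."""

    def __init__(self, nodes):
        # nodes: ("task", name) or ("checkpoint", name) tuples in flow order
        self.nodes = nodes
        self.position = 0  # number of nodes already passed
        self.log = []      # actions performed, kept for inspection

    def go_to(self, checkpoint_name):
        """Execute or revert tasks until the named checkpoint is reached."""
        target = self.nodes.index(("checkpoint", checkpoint_name)) + 1
        while self.position < target:   # checkpoint is ahead: execute tasks
            kind, name = self.nodes[self.position]
            if kind == "task":
                self.log.append(("execute", name))
            self.position += 1
        while self.position > target:   # checkpoint is behind: revert tasks
            self.position -= 1
            kind, name = self.nodes[self.position]
            if kind == "task":
                self.log.append(("revert", name))
        return "INTERRUPTED"            # the engine stops at the checkpoint


engine = LinearEngine([
    ("task", "bootstrap"), ("checkpoint", "bootstrapped"),
    ("task", "install"), ("checkpoint", "installed"),
    ("task", "start"), ("checkpoint", "running"),
])
engine.go_to("installed")     # executes bootstrap, then install
engine.go_to("bootstrapped")  # reverts install
print(engine.log)
```

In a linear flow a single request either executes or reverts tasks; the "or both" case would only arise in a graph flow, which this sketch does not model.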
Discarding State
Blueprints: bp:checkpointing (https://blueprints.launchpad.net/taskflow/+spec/checkpointing), bp:book-retention (https://blueprints.launchpad.net/taskflow/+spec/book-retention)
For long-running flows, keeping all the history might become a problem, in a way similar to what Heat has hit. While the scale of the problem is smaller for taskflow (at least in the short term), some measures should be taken for infinitely running flows, such as ones running periodic tasks.
Checkpoints might be helpful there. As Jessica Lucci pointed out:
    20:51:16 <jlucci> I'm not sure if it's feasible, but the idea is that once you get to a checkpoint,
    20:51:26 <jlucci> you can sort of consolidate the data up to that point
    20:51:37 <jlucci> You don't have to keep track of the states of any previous tasks or anything like that
Discarding the state may be subject to a policy, and a checkpoint is a natural place to attach such a policy. For example, a user might want to keep the whole history for the last 5 checkpoints, or for checkpoints from the last 3 days. The next time the flow reaches a checkpoint, older state is discarded.
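The two example policies (keep the last N checkpoints, or keep checkpoints newer than some age) could be applied each time a checkpoint is reached, roughly as follows. The function `prune_history`, its parameters and the record layout are assumptions made for this sketch; nothing here reflects an existing taskflow persistence API.

```python
from datetime import datetime, timedelta


def prune_history(records, now, keep_last=None, max_age=None):
    """Return the checkpoint records that survive the retention policy.

    records   -- list of (checkpoint_name, reached_at) tuples, oldest first
    keep_last -- keep at most this many of the most recent checkpoints
    max_age   -- keep only checkpoints reached within this timedelta of now
    """
    kept = list(records)
    if max_age is not None:
        kept = [r for r in kept if now - r[1] <= max_age]
    if keep_last is not None:
        # kept[-0:] would return everything, so zero is handled explicitly
        kept = kept[-keep_last:] if keep_last > 0 else []
    return kept


now = datetime(2013, 10, 23)
records = [("cp%d" % i, now - timedelta(days=5 - i)) for i in range(6)]
recent = prune_history(records, now, max_age=timedelta(days=3))
print([name for name, _ in recent])  # ['cp2', 'cp3', 'cp4', 'cp5']
```

History belonging to checkpoints that are dropped from the returned list is what the engine would discard.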
Implementation Notes
One option is to attach a checkpoint to a subflow:
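    from taskflow.patterns import linear_flow

    # Checkpoint and the checkpoint= argument are the proposed API, not existing taskflow.
    linear_flow.Flow(name="root").add(
        DoSomethingTask(),
        linear_flow.Flow(name="subflow",
                         checkpoint=Checkpoint(name="Something is DONE", retry=5)).add(
            DoSomethingElseTask(),
            LastTask()
        )
    )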
If LastTask in the example fails, DoSomethingElseTask should be reverted, and then DoSomethingElseTask and LastTask should be attempted again.
Another option is to represent checkpoints as a special type of execution graph node and add them to a flow in the same way as tasks:
    linear_flow.Flow(name="same as above").add(
        DoSomethingTask(),
        Checkpoint(name="Something is DONE", retry=5),
        DoSomethingElseTask(),
        LastTask())
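To show how the second option might behave at run time, here is a self-contained toy runner for a linear list of nodes. Everything in it is an assumption made for illustration: the `Checkpoint` and `Task` classes, the `fail_times` knob and the `run` function are stand-ins written for this sketch, not taskflow classes, and `retry` is interpreted here as the number of extra attempts allowed.

```python
class Checkpoint:
    """Hypothetical checkpoint node; retry is the number of extra attempts."""

    def __init__(self, name, retry=0):
        self.name = name
        self.retry = retry


class Task:
    """Minimal task whose execute/revert calls are recorded in a shared log."""

    def __init__(self, name, log, fail_times=0):
        self.name = name
        self.log = log
        self.fail_times = fail_times  # fail this many times, then succeed

    def execute(self):
        if self.fail_times > 0:
            self.fail_times -= 1
            self.log.append("fail " + self.name)
            raise RuntimeError(self.name + " failed")
        self.log.append("execute " + self.name)

    def revert(self):
        self.log.append("revert " + self.name)


def run(nodes):
    """Run nodes in order; on failure revert to the last checkpoint and retry."""
    retries_left = {}
    i = 0
    while i < len(nodes):
        node = nodes[i]
        if isinstance(node, Checkpoint):
            retries_left.setdefault(i, node.retry)
            i += 1
            continue
        try:
            node.execute()
            i += 1
        except RuntimeError:
            # Locate the nearest checkpoint before the failed task, if any.
            cp = i - 1
            while cp >= 0 and not isinstance(nodes[cp], Checkpoint):
                cp -= 1
            # Revert the tasks completed since that checkpoint, newest first.
            for done in reversed(nodes[cp + 1:i]):
                done.revert()
            if cp < 0 or retries_left[cp] == 0:
                raise  # no checkpoint to retry from, or retries exhausted
            retries_left[cp] -= 1
            i = cp + 1
    return True


log = []
run([
    Task("DoSomethingTask", log),
    Checkpoint(name="Something is DONE", retry=5),
    Task("DoSomethingElseTask", log),
    Task("LastTask", log, fail_times=1),
])
print(log)
```

When LastTask fails once, the log shows DoSomethingElseTask being reverted and both tasks after the checkpoint being attempted again, while DoSomethingTask is never touched. When retries run out, this sketch reverts only back to the checkpoint and re-raises the error; reverting the whole flow at that point would be an equally plausible design.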