Checkpointing is a nice idea that we might want to see implemented in taskflow in the future.
A checkpoint identifies a state of a flow or, so to speak, marks a point in a flow. Checkpoints are associated with a particular point of the flow graph: before a particular task, after a particular task, or "between" certain tasks or subflows. Checkpoints are part of the flow definition.
Checkpoints may have many uses, including:
- manage how the flow is run
  - tell the engine to stop at a particular checkpoint
  - revert the flow to a particular checkpoint
  - revert the flow to a particular checkpoint and then retry
- manage how state (history) is saved
  - associate data with a checkpoint
  - discard history up to a particular checkpoint
In the simplest case, when an error occurs, the flow should be reverted to its initial state, as if it had never run. But other cases exist:
- one may want to revert several tasks and then try to run them again:
  - maybe right now, to take advantage of some kind of HA setup
  - maybe later, to give the operator a chance to fix things before the retry
- for long-running flows (such as distributed flows that run some actions periodically) it makes no sense to revert everything; instead, the flow should be reverted to some checkpoint (e.g. the last one) and then be considered reverted.
These (and possibly other) revert policies should be specified in the flow definition. Checkpoints might be the tool for that.
Go to Checkpoint
We can go further in providing manual or, more generally, external control of a flow. One possible way to do that is to allow a user (or any other external entity) to order the engine to get to a particular checkpoint in the flow. To get there, some tasks are executed, some other tasks are reverted, or both. When the checkpoint is reached, the engine stops, leaving the flow in an interrupted state.
One possible use case for this is Anvil. The whole process from bootstrapping to an installed and running (and/or tested) OpenStack can be represented as one big flow, with Anvil actions just directing an internal engine to go to a particular checkpoint.
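The "go to checkpoint" behaviour can be sketched with a toy linear engine. This is only an illustration, not taskflow's engine API: the ToyEngine class, its go_to method, the task names and the "INTERRUPTED" state string are all hypothetical, and checkpoints are modelled simply as the number of tasks that must be applied when the checkpoint is reached.

```python
from dataclasses import dataclass, field


@dataclass
class ToyEngine:
    """Toy linear engine (hypothetical; not taskflow's engine API)."""
    tasks: list        # task names, in execution order
    checkpoints: dict  # checkpoint name -> number of tasks applied at that point
    position: int = 0  # how many tasks are currently applied
    state: str = "PENDING"
    log: list = field(default_factory=list)

    def go_to(self, checkpoint_name):
        target = self.checkpoints[checkpoint_name]
        # Behind the checkpoint: execute tasks forward.
        while self.position < target:
            self.log.append(("execute", self.tasks[self.position]))
            self.position += 1
        # Past the checkpoint: revert tasks in reverse order.
        while self.position > target:
            self.position -= 1
            self.log.append(("revert", self.tasks[self.position]))
        # The engine stops here and leaves the flow interrupted.
        self.state = "INTERRUPTED"


# Task and checkpoint names loosely follow the Anvil example; they are made up.
engine = ToyEngine(
    tasks=["bootstrap", "install", "start", "test"],
    checkpoints={"installed": 2, "running": 3, "tested": 4},
)
engine.go_to("running")    # executes bootstrap, install, start
engine.go_to("installed")  # reverts start
```

The same call moves the flow forward or backward depending on where it currently is, which is what lets an external tool express its actions purely as checkpoint names.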
For long-running flows, keeping all the history might become a problem, similar to the one Heat has hit. While the scale of the problem is smaller for taskflow (at least in the short term), some measures should be taken for infinitely running flows, such as ones that run periodic tasks.
Checkpoints might be helpful there. As Jessica Lucci pointed out:
20:51:16 <jlucci> I'm not sure if it's feasible, but the idea is that once you get to a checkpoint,
20:51:26 <jlucci> you can sort of consolidate the data up to that point
20:51:37 <jlucci> You don't have to keep track of the states of any previous tasks or anything like that
Discarding state may be subject to a policy, and a checkpoint is a natural place to attach such a policy. For example, a user might want to keep the whole history for the last 5 checkpoints, or for checkpoints from the last 3 days. The next time the flow reaches a checkpoint, older state is discarded.
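A "keep history for the last N checkpoints" policy could look roughly like the sketch below. It is a minimal illustration under assumptions: the CheckpointedHistory class and its record and reach_checkpoint methods are hypothetical names, not part of taskflow, and history is modelled as one list of task records per checkpoint.

```python
from collections import deque


class CheckpointedHistory:
    """Toy history store (hypothetical; not part of taskflow)."""

    def __init__(self, keep_last):
        # Each segment is (checkpoint_name, [task records up to that checkpoint]).
        # A deque with maxlen drops the oldest segment automatically.
        self.segments = deque(maxlen=keep_last)
        self.current = []

    def record(self, task_name, state):
        # Remember a task state change since the previous checkpoint.
        self.current.append((task_name, state))

    def reach_checkpoint(self, name):
        # Close the current segment; segments beyond keep_last are discarded.
        self.segments.append((name, self.current))
        self.current = []


history = CheckpointedHistory(keep_last=2)
for cycle in range(4):
    history.record(f"poll-{cycle}", "SUCCESS")
    history.reach_checkpoint(f"cycle-{cycle}")

print([name for name, _ in history.segments])  # ['cycle-2', 'cycle-3']
```

A time-based policy (for example, checkpoints from the last 3 days) would need a timestamp per segment and an explicit pruning step instead of a fixed-length deque.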
One option is to attach a checkpoint to a subflow:
linear_flow.Flow(name="root").add(
    DoSomethingTask(),
    linear_flow.Flow(name="subflow",
                     checkpoint=Checkpoint(name="Something is DONE",
                                           retry=5)).add(
        DoSomethingElseTask(),
        LastTask()))
If LastTask in the example fails, DoSomethingElseTask should be reverted, and then DoSomethingElseTask and LastTask should be attempted again.
Another option is to represent checkpoints as a special type of execution graph node and add them to a flow in the same way as tasks:
linear_flow.Flow(name="same as above").add(
    DoSomethingTask(),
    Checkpoint(name="Something is DONE", retry=5),
    DoSomethingElseTask(),
    LastTask())
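To make the intended revert-and-retry semantics concrete, here is a self-contained toy runner for the "checkpoint as a node" variant. It does not use taskflow at all: the Task and Checkpoint classes and the run function are hypothetical stand-ins. It also rests on two assumptions that the proposal does not settle: the failed task itself is not reverted, and once the retries are used up the whole flow is reverted to its initial state.

```python
class Checkpoint:
    """Hypothetical checkpoint node, mirroring the proposed Checkpoint(name, retry)."""
    def __init__(self, name, retry=0):
        self.name = name
        self.retry = retry


class Task:
    """Toy task that fails a fixed number of times before succeeding."""
    def __init__(self, name, failures=0):
        self.name = name
        self.failures = failures

    def execute(self, log):
        if self.failures > 0:
            self.failures -= 1
            log.append(f"fail {self.name}")
            raise RuntimeError(self.name)
        log.append(f"run {self.name}")

    def revert(self, log):
        log.append(f"revert {self.name}")


def run(nodes):
    log = []
    retries_left = {}  # checkpoint index -> retries still allowed
    last_cp = -1       # index of the most recently passed checkpoint (-1: none)
    i = 0
    while i < len(nodes):
        node = nodes[i]
        if isinstance(node, Checkpoint):
            retries_left.setdefault(i, node.retry)
            last_cp = i
            i += 1
            continue
        try:
            node.execute(log)
            i += 1
        except RuntimeError:
            can_retry = retries_left.get(last_cp, 0) > 0
            # Revert back to the checkpoint, or to the very start if no retry is left.
            stop = last_cp if can_retry else -1
            for done in reversed(nodes[stop + 1:i]):
                if isinstance(done, Task):
                    done.revert(log)
            if not can_retry:
                return "REVERTED", log
            retries_left[last_cp] -= 1
            i = last_cp + 1  # resume right after the checkpoint
    return "SUCCESS", log


status, log = run([
    Task("DoSomethingTask"),
    Checkpoint(name="Something is DONE", retry=5),
    Task("DoSomethingElseTask"),
    Task("LastTask", failures=1),  # fails once, then succeeds
])
```

With LastTask failing once, the log shows DoSomethingElseTask being reverted and then both tasks after the checkpoint being run again, while DoSomethingTask is left untouched.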