TaskFlow/Checkpointing

Checkpointing is an idea we might want to see implemented in TaskFlow in the future.

A checkpoint identifies a state of a flow. A checkpoint may be associated with a particular point of the flow graph: before a particular task, after a particular task, or "between" certain blocks. Checkpoints are part of the flow definition.

Checkpoints may be helpful to:

manage how a flow is run:
- tell the engine to stop at a particular checkpoint
- revert the flow to a particular checkpoint
manage how state (history) is saved:
- associate data with a checkpoint
- discard history up to a particular checkpoint

Reverting Policies

In the simplest case, when an error occurs, the flow should be reverted to its initial state, as if it had never been run. But other cases exist:

one may want to revert several tasks and then try to run them again:
- maybe right now, to take advantage of some kind of HA setup
- maybe later, to give the operator a chance to fix things before the retry
for long-running flows (like distributed flows that run some actions periodically) it makes no sense to revert everything; instead, the flow should be reverted to some checkpoint (e.g. the last one) and then considered reverted.

These (and possibly other) reverting policies should be specified in the flow definition. Checkpoints should be the tool for that.
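To make the difference between the policies concrete, here is a minimal sketch for a linear flow. Nothing in it is existing TaskFlow API: the Task and Checkpoint classes, the tasks_to_revert helper, and the policy names "all" and "last_checkpoint" are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str


@dataclass(frozen=True)
class Checkpoint:
    name: str


def tasks_to_revert(items, failed_index, policy="all"):
    """Return names of completed tasks to revert, most recent first.

    items        -- linear flow as a list of Task and Checkpoint objects
    failed_index -- position of the task that raised the error
    policy       -- "all": revert to the initial state (hypothetical name);
                    "last_checkpoint": stop at the nearest earlier checkpoint
    """
    if policy not in ("all", "last_checkpoint"):
        raise ValueError("unknown policy: %r" % (policy,))
    reverted = []
    # Walk backwards over everything that completed before the failure.
    for item in reversed(items[:failed_index]):
        if isinstance(item, Checkpoint):
            if policy == "last_checkpoint":
                break  # the flow is considered reverted at this point
            continue  # "all" ignores checkpoints and keeps going
        reverted.append(item.name)
    return reverted


flow = [Task("a"), Task("b"), Checkpoint("cp1"), Task("c"), Task("d")]
print(tasks_to_revert(flow, 4, "all"))              # ['c', 'b', 'a']
print(tasks_to_revert(flow, 4, "last_checkpoint"))  # ['c']
```

In a real implementation the policy would more likely be attached to the checkpoint itself rather than passed in by the caller.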

Operators or users (through external services) may also need some external control over how a particular checkpoint works, e.g. to stop a flow from cycling and revert it completely if recovery turns out to be impossible.

Go to Checkpoint

We can go further in providing manual or, more generally, external control of a flow. One possible way to do that is to allow a user (or any other external entity) to order the engine to get to a particular checkpoint in the flow. To get there, some tasks are executed, or some other tasks are reverted, or both. When the checkpoint is reached, the engine stops, leaving the flow in an interrupted state.

One possible use case for this is Anvil. The whole process from bootstrapping to an installed and running (and/or tested) OpenStack can be represented as one large flow, with Anvil actions just directing an internal engine to go to a particular checkpoint.
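As a rough illustration of "go to checkpoint", the sketch below plans which tasks to execute or revert in a linear flow. The plan_go_to function, the stage names, and the "cp:" prefix are hypothetical and are not taken from TaskFlow or Anvil. In a linear flow the plan is always all-execute or all-revert; the "both" case would only show up in graph flows, which this sketch does not cover.

```python
def plan_go_to(items, checkpoints, position, target):
    """Plan the actions needed to reach checkpoint `target`.

    items       -- linear flow as a list of task and checkpoint names
    checkpoints -- set of the names in `items` that are checkpoints
    position    -- number of leading items the engine has already passed
    target      -- name of the checkpoint to stop at
    Returns a list of ("execute" | "revert", task_name) tuples.
    """
    if target not in checkpoints:
        raise ValueError("not a checkpoint: %r" % (target,))
    goal = items.index(target) + 1  # position just after the checkpoint
    if goal >= position:
        # Moving forward: execute pending tasks up to the checkpoint.
        pending = items[position:goal]
        return [("execute", n) for n in pending if n not in checkpoints]
    # Moving backward: revert completed tasks after the checkpoint,
    # most recent first.
    done = reversed(items[goal:position])
    return [("revert", n) for n in done if n not in checkpoints]


# Hypothetical stages of a deployment flow.
stages = ["download", "cp:downloaded", "install", "cp:installed",
          "start", "cp:running"]
cps = {"cp:downloaded", "cp:installed", "cp:running"}

print(plan_go_to(stages, cps, 0, "cp:installed"))
# [('execute', 'download'), ('execute', 'install')]
print(plan_go_to(stages, cps, 6, "cp:downloaded"))
# [('revert', 'start'), ('revert', 'install')]
```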

Discarding State

For long-running flows, keeping all the history might become a problem, similar to the one Heat has hit. While the scale of the problem is smaller for TaskFlow (at least in the short term), some measures should be taken for infinitely running flows, such as ones running periodic tasks.

Checkpoints might be helpful there. As Jessica Lucci pointed out:

20:51:16 <jlucci> I'm not sure if it's feasible, but the idea is that once you get to a checkpoint,
20:51:26 <jlucci> you can sort of consolidate the data up to that point
20:51:37 <jlucci> You don't have to keep track of the states of any previous tasks or anything like that

Discarding the state may be subject to a policy specified by the logbook or the engine. For example, a user might want to keep the whole history for the last 5 checkpoints, or for the checkpoints of the last 3 days. The next time the flow reaches a checkpoint, older state is discarded.
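A "keep the last N checkpoints" policy could look roughly like the sketch below. It assumes the history is a flat, chronological list of (kind, name) records; that layout and the prune_history name are illustrative only and do not describe how a logbook actually stores data.

```python
def prune_history(history, keep_checkpoints=5):
    """Drop records older than the last `keep_checkpoints` checkpoint passes.

    history -- chronological list of (kind, name) tuples, where kind is
               "task" or "checkpoint" (an assumed record layout)
    """
    if keep_checkpoints < 1:
        raise ValueError("keep_checkpoints must be at least 1")
    # Indexes of every checkpoint pass recorded so far.
    marks = [i for i, (kind, _) in enumerate(history) if kind == "checkpoint"]
    if len(marks) <= keep_checkpoints:
        return list(history)  # nothing is old enough to discard
    # Everything up to and including this older checkpoint pass is dropped.
    cutoff = marks[-keep_checkpoints - 1]
    return history[cutoff + 1:]


# A periodic flow that has passed the same checkpoint three times.
history = [
    ("task", "poll#1"), ("checkpoint", "tick"),
    ("task", "poll#2"), ("checkpoint", "tick"),
    ("task", "poll#3"), ("checkpoint", "tick"),
]
print(prune_history(history, keep_checkpoints=2))
# [('task', 'poll#2'), ('checkpoint', 'tick'),
#  ('task', 'poll#3'), ('checkpoint', 'tick')]
```

An age-based policy (e.g. the last 3 days) would need a timestamp on each record and could pick the cutoff in the same way.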

Implementation Notes

Checkpoints can be represented as a special type of block and added to a flow in the same way as tasks:

 my_flow = (LinearFlow(name='my flow')
               .append(DoSomethingTask())
               .append(Checkpoint(name="Something is DONE", on_revert='retry'))
               .append(DoSomethingElseTask())
               .append(LastTask()))

If LastTask in the example fails, DoSomethingElseTask should be reverted, and then DoSomethingElseTask and LastTask should be attempted again.
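The retry behaviour described above can be simulated with a toy engine. This is only a sketch under assumptions: the Task and Checkpoint classes, the run function, the failures counter used to simulate an error, and the max_retries limit are all invented here and are not part of TaskFlow.

```python
class Task:
    def __init__(self, name, failures=0):
        self.name = name
        self.failures = failures  # number of times execute() fails first

    def execute(self):
        if self.failures > 0:
            self.failures -= 1
            raise RuntimeError("%s failed" % self.name)

    def revert(self):
        pass  # a real task would undo its work here


class Checkpoint:
    def __init__(self, name, on_revert="retry"):
        self.name = name
        self.on_revert = on_revert


def run(items, max_retries=3):
    """Run a linear flow; return (log, succeeded)."""
    log, i, retries = [], 0, 0
    while i < len(items):
        item = items[i]
        if isinstance(item, Checkpoint):
            i += 1
            continue
        try:
            item.execute()
        except RuntimeError:
            log.append("fail " + item.name)
            # Revert completed tasks back to the nearest checkpoint.
            j = i - 1
            while j >= 0 and not isinstance(items[j], Checkpoint):
                items[j].revert()
                log.append("revert " + items[j].name)
                j -= 1
            if (j >= 0 and items[j].on_revert == "retry"
                    and retries < max_retries):
                retries += 1
                i = j + 1  # resume right after the checkpoint
                continue
            # No retry possible: revert the rest of the flow and give up.
            while j >= 0:
                if not isinstance(items[j], Checkpoint):
                    items[j].revert()
                    log.append("revert " + items[j].name)
                j -= 1
            return log, False
        log.append("execute " + item.name)
        i += 1
    return log, True


log, ok = run([
    Task("DoSomethingTask"),
    Checkpoint("Something is DONE", on_revert="retry"),
    Task("DoSomethingElseTask"),
    Task("LastTask", failures=1),  # fails once, then succeeds
])
print(ok)  # True
print(log)
```

With these inputs the log shows DoSomethingTask executed once, LastTask failing, DoSomethingElseTask reverted, and then DoSomethingElseTask and LastTask executed again. The retry limit is there so that a flow cannot cycle forever around one checkpoint.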
