
TaskFlow/Patterns and Engines/Persistence

Revised on: 9/26/2013 by Harlowja

Big Picture

How can we persist the flow so that it can be resumed, restarted or rolled-back on engine failure?

Since a flow is a set of tasks and relations between tasks, we need a model and corresponding information that allows us to persist just enough to preserve, resume, and roll back a flow on software/hardware failure. When a flow is loaded into an engine, a translation step associates that flow with persistent data in a storage backend. In particular, for each flow there is a FlowDetail record, and for each task there is a TaskDetail record. These form the basic level of information about how a flow will be persisted (see below for their schemas).

To allow for resumption, TaskFlow must be able to re-create the flow and re-connect the links between tasks in order to revert or resume them in the correct order. For that to happen, a factory function should typically be provided. The fully qualified name of the function is saved in the FlowDetail; we can then import and call this function again, e.g. after a service restart.

Note: Putting flow creation in a separate function should not be too much of a burden -- the pattern is already in use; see, for example, [cinder code].
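The save-and-reimport mechanism can be sketched in plain Python. This is a minimal illustration, not the actual TaskFlow API: the factory here returns a placeholder list of task names, and `rebuild_flow` is a hypothetical helper showing how a fully qualified name stored in a FlowDetail could be turned back into a flow after a restart.

```python
import importlib


def make_backup_flow():
    # Hypothetical factory: in real code this would build and return a
    # flow object; a list of task names keeps the sketch self-contained.
    return ["create_snapshot", "upload_snapshot"]


# The fully qualified name is what would be saved in the FlowDetail record.
factory_name = "%s.%s" % (make_backup_flow.__module__,
                          make_backup_flow.__qualname__)


def rebuild_flow(qualified_name):
    # After a service restart, re-import the factory module and call the
    # function again to re-create the flow with the same structure.
    module_name, func_name = qualified_name.rsplit(".", 1)
    module = importlib.import_module(module_name)
    return getattr(module, func_name)()
```

Because only the function's name is persisted (not the flow structure itself), any later change to the factory is automatically picked up the next time the flow is rebuilt.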

Requirements

  • A task is associated with a TaskDetail by name. This requires that task names be unique within a particular flow.

First-Time Loading

When a new flow is loaded into an engine, there is no persisted data for it yet, so a corresponding FlowDetail object is created, along with a TaskDetail object for each task; these are immediately saved into the configured persistence backend (if any).
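First-time loading can be sketched as follows. This is an illustrative model only: `first_time_load` is a hypothetical helper, and a plain dict stands in for a real storage backend; the actual records carry more fields (see the schemas below).

```python
import uuid


def first_time_load(flow_name, task_names, backend):
    # Create a FlowDetail-like record plus one TaskDetail-like record per
    # task, then save everything into the backend (here a plain dict).
    flow_detail = {
        "name": flow_name,
        "uuid": str(uuid.uuid4()),
        "state": "PENDING",
        "tasks": {},
    }
    for name in task_names:
        # Each task starts out PENDING with no results recorded yet.
        flow_detail["tasks"][name] = {
            "name": name,
            "uuid": str(uuid.uuid4()),
            "state": "PENDING",
            "results": None,
        }
    backend[flow_detail["uuid"]] = flow_detail
    return flow_detail
```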

Resuming Same Flow

When we resume a flow from the database (for example, if the flow was interrupted and the engine destroyed to save resources, or if the service was restarted), we need to re-create the flow. For that, we call the function that was saved on first-time loading and that builds the flow for us (the flow factory function described above).

Then we load the flow into the engine. For each task there must already be a TaskDetail object; tasks are re-associated with their TaskDetail records by task name, and the engine resumes. In this scenario, task states and results should be the only information needed to resume the flow.
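The name-based re-association can be sketched with a small hypothetical helper. Assuming (as the engine does on resumption) that completed tasks keep their saved results, only tasks that have not yet succeeded need to run again:

```python
# Illustrative saved state, keyed by task name as the persistence
# layer would key TaskDetail records.
saved_details = {
    "create_snapshot": {"state": "SUCCESS", "result": "snap-1"},
    "upload_snapshot": {"state": "PENDING", "result": None},
}


def tasks_to_run(task_names, details):
    # Match each task in the re-created flow with its persisted record
    # by name; anything that already SUCCEEDed is skipped on resume.
    return [name for name in task_names
            if details.get(name, {}).get("state") != "SUCCESS"]
```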

Resuming Changed Flow

The upgrade use case is probably more interesting and challenging. While there are several options, the best (and recommended) practice for upgrades is to load a new (changed) flow with the state of the older flow. This allows the new/updated flow to take advantage of preexisting (potentially partially) completed state, or to clean up that preexisting state using a well-behaved and well-defined process.

A couple of options are possible here.

The simplest uses the same process as loading an unchanged flow: tasks are associated with saved state by name when the flow is loaded into the engine.

Note: This will likely not work out of the box for all use cases: sometimes data/task migrations (much like database migrations, but at a higher level) will be needed.

Let's consider several use-cases.

Task was added

This is the simplest use case.

As there is no state for the new task, a new TaskDetail record will be created for it automatically.
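This behavior can be sketched with a hypothetical helper that fills in missing records. The dict-based records and the `ensure_task_details` name are illustrative, not TaskFlow API:

```python
def ensure_task_details(task_names, details):
    # For any task missing a persisted record (e.g. a task newly added
    # in an upgraded flow), create a fresh PENDING entry; tasks that
    # already have saved state are left untouched.
    for name in task_names:
        details.setdefault(name, {"state": "PENDING", "results": None})
    return details
```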

Task was removed

Nothing needs to be done -- the flow structure is reloaded from the factory function, and the removed task is not in it -- so the flow will run as if the task was never there, and any results it returned (if it completed before) will be ignored.

Task code was changed

Task was split in two

Flow structure was changed

If manual links were added to or removed from the graph, task requirements were changed, or the flow was refactored (a task moved into or out of a subflow, a linear flow replaced with a graph flow, tasks reordered in a linear flow, etc.), nothing in particular needs to be done. All task states are loaded as usual.

Design rationales

Flow Factory

How do we upgrade?

  1. We change the code that creates the flow, then we restart the service.

Then, when the flow is restored from storage, what we really want is to load the new flow, with its updated structure and updated tasks, while preserving task states and results. At least, that is the most common case.

So, the code should be run to put the tasks into patterns and re-create the flow, but the logbook should be consulted to load the states and results of tasks.

Therefore, creation of the flow should be put into a separate function.

Using Names to Associate Tasks with State

The engine gets a flow and flow details, and should reconstruct its own internal state.

A task must somehow be matched with its existing TaskDetail(s). The match should:

  • be stable if tasks are added or removed;
  • not change when the service is restarted or upgraded;
  • be the same across all server instances in HA setups.

One option is to match tasks with TaskDetail(s) by task name. This has several implications:

  • the names of tasks must be unique within a flow;
  • it becomes hard to change the name of a task.
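The uniqueness requirement can be enforced with a simple check when the flow is built. `check_unique_names` is a hypothetical helper, not part of the TaskFlow API:

```python
def check_unique_names(task_names):
    # Reject flows whose task names collide, since a duplicate name
    # would make the task-to-TaskDetail match ambiguous on resumption.
    seen = set()
    for name in task_names:
        if name in seen:
            raise ValueError("duplicate task name: %r" % name)
        seen.add(name)
    return True
```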


Types

Task detail

Stores all of the information associated with one specific run instance of a task.

Field        Description
Name         Name of the task
Type         Type of the task (mod.cls format)
UUID         Unique identifier for the task
State        State of the task
Results      Results that the task may have produced
Exception    Serialized exception that the task may have produced
Stack trace  Stack trace of the exception that the task may have produced
Version      Version of the task that was run
Meta         JSON blob of non-indexable associated task information
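The schema above can be mirrored as a small data class. This is an illustrative model of the fields in the table, not the actual TaskFlow persistence class:

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class TaskDetail:
    # Field names follow the table above; a new record starts PENDING
    # with no results, exception, or stack trace recorded.
    name: str
    uuid: str
    type: Optional[str] = None        # mod.cls format
    state: str = "PENDING"
    results: Any = None
    exception: Optional[str] = None   # serialized exception, if any
    stack_trace: Optional[str] = None
    version: Optional[str] = None
    meta: dict = field(default_factory=dict)
```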

Flow detail

  • Stores a collection of task details, metadata about the flow and potentially any task relationships.
  • Persistence representation of a specific run instance of a flow.
  • Provides all of the details necessary for automatic reconstruction of a flow object.
Field  Description
Name   Name of the flow
Type   Type of the flow (mod.cls format)
UUID   Unique identifier for the flow
State  State of the flow
Meta   JSON blob of non-indexable associated flow information

Logbook

  • Stores a collection of flow details + any metadata about the logbook (last_updated, deleted, name...).
  • Typically connected to a job, with which the logbook has a one-to-one relationship.
  • Provides all of the data necessary to automatically reconstruct a job object.
Field  Description
Name   Name of the logbook
UUID   Unique identifier for the logbook
Meta   JSON blob of non-indexable associated logbook information
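The containment described above (a logbook holds flow details, a flow detail holds task details) can be sketched the same way. Again, these are illustrative data classes, not the real TaskFlow types:

```python
from dataclasses import dataclass, field


@dataclass
class FlowDetail:
    # One specific run instance of a flow; task_details would hold the
    # per-task records (names are used here to stay self-contained).
    name: str
    uuid: str
    state: str = "PENDING"
    meta: dict = field(default_factory=dict)
    task_details: list = field(default_factory=list)


@dataclass
class LogBook:
    # A collection of flow details plus logbook-level metadata.
    name: str
    uuid: str
    meta: dict = field(default_factory=dict)
    flow_details: list = field(default_factory=list)
```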

Open Questions

Are there any good alternatives to using task names?

If a task was removed from the flow, it will be stuck in whatever state it was in before the older version of the flow was interrupted.