Revision as of 22:38, 28 May 2014

Convergence Design Notes by Mike

In the Juno design summit etherpad (https://etherpad.openstack.org/p/heat-workflow-vs-convergence) there are several design problems and solutions discussed, with the final selection unclear. In this page I offer one opinion. The contentious issues include: whether to use a workflow engine, where to store observed state, and whether (and if so, how) to chunk the work. In another dimension, there is the question of how to roadmap the work.

Workflow Engine or Not

The desires to support very large stacks (size of 1E6 resources was mentioned), react quickly to new stack operations, and efficiently support a hypothetical new incremental stack update operation appear to have side-tracked the idea of using a workflow engine. Here is my net of the issues. Suppose a workflow engine with a clean interrupt operation: it stops the launch of new actions for the interrupted workflow and waits for completion of the actions currently in progress. Heat could use such a workflow engine. When some differences between target and observed state are detected, heat would compose a workflow to heal the differences and launch it. If more differences are detected before that workflow completes, that workflow would be interrupted and then a new workflow composed and launched to handle all the current differences. This would probably require yet another copy of state: in addition to target and observed state there would be a third kind of state, namely observed state overwritten by the goals of the workflow (if any) currently in progress; let's call that "anticipated state". It would actually be anticipated state, rather than observed state, that is compared with target state to drive workflow composition and launch.
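The anticipated-state bookkeeping can be sketched in a few lines (a toy model with illustrative names, not Heat code): states are maps from resource name to status, anticipated state is the overlay of the in-progress workflow's goals on observed state, and the diff against target state yields the differences not already being healed.

```python
def anticipated_state(observed, in_progress_goals):
    """Observed state overwritten by the goals of the running workflow."""
    state = dict(observed)
    state.update(in_progress_goals or {})
    return state

def differences(target, anticipated):
    """Resource names whose anticipated status differs from the target."""
    names = set(target) | set(anticipated)
    return {n for n in names if target.get(n) != anticipated.get(n)}

observed = {"server1": "ACTIVE", "volume1": "ERROR"}
goals = {"volume1": "ACTIVE"}  # a healing workflow is already underway
target = {"server1": "ACTIVE", "volume1": "ACTIVE", "volume2": "ACTIVE"}

todo = differences(target, anticipated_state(observed, goals))
# only volume2 still needs a workflow composed; volume1 is being healed
```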

There is a desire to add an incremental stack update operation, which would not take a whole revised template+effective_environment+parameters but rather some description of an incremental change. There is a desire for an efficient implementation of this hypothetical operation, particularly in the case of a large stack and a small change. It might be possible for the implementation to interrupt a workflow in progress and incrementally compute the needed revised workflow, and possibly even --- as an optimization --- detect the special case where there is no intersection between the new delta and the workflow(s) currently in progress and in that case compose and launch an additional independent workflow. But this is going to be pretty complex logic.
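The special-case optimization amounts to an intersection test between the resources in the new delta and the resource sets of the workflows in flight. A hedged sketch (the function and names are illustrative, not an actual interface):

```python
def plan_incremental_update(delta_resources, inflight_workflows):
    """Decide how to apply an incremental change while workflows run.

    delta_resources: set of resource names touched by the new delta.
    inflight_workflows: workflow id -> set of resource names it operates on.
    """
    overlapping = {wf for wf, touched in inflight_workflows.items()
                   if touched & delta_resources}
    if not overlapping:
        # the easy special case: compose and launch an independent workflow
        return ("launch_independent", set())
    # otherwise: interrupt the overlapping workflows and recompose
    return ("interrupt_and_recompose", overlapping)
```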

The alternative is an approach that does not use a workflow engine. Rather, multiple heat engines conspire to do the actions themselves. This appears preferable to me, although I am not happy with the degree to which this involves duplicating functionality of a workflow system.

Where to Store Observed State

I saw mention of three approaches: (1) store it in the DB, (2) store it in memcached, and (3) read it whenever needed from the authoritative source. Approach (3) was roundly dismissed. Approach (2) raises questions about consistency. It is not clear to me whether we actually have a consistency problem; that depends on other parts of the design. For now let us assume approach (1), and revisit this question later.

To Chunk or Not

Some of the discussion was around the idea of breaking a large stack operation into smallish batches of individual resource operations. There is even a reference to an academic paper on graph partitioning, which could be used to do that breaking into batches. Even with batching (chunking) there remains the problem of doing each operation only after its dependencies are satisfied. With multiple engines there is also the problem of avoiding duplicated or inconsistent work (remember the desire for the ability to start working on a new stack operation before the old one is finished). Suppose one large stack operation arrives and is broken into chunks, the chunks are distributed, and execution begins. While that execution is going on, a stack update arrives that adds and removes resources. The set of chunks is now necessarily different; how to coordinate with the chunks in progress?

The alternative, which I favor, is to "atomize" the work: focus on individual resource operations. As each becomes enabled it is anycast ("cast" in oslo RPC terminology) to the heat engines, one of which picks it up and works on it. The work would be done while holding a lock on the individual resource, to prevent a concurrent engine doing redundant or inconsistent work on that resource. This would be a lock that prevents concurrent execution but not concurrent update of target state (so that a new stack operation can be received and its target state persisted while an old one is in progress). When an engine completes work on one resource, it would compute which differences are newly enabled to be executed and do the corresponding anycasts. This outline needs to be adjusted to accommodate target state changes that arrive during execution.
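A minimal in-process sketch of the atomized approach (illustrative only: a real deployment would anycast over oslo RPC and take per-resource locks in the DB; here a deque stands in for the message queue, and a map of remaining-dependency sets drives enablement):

```python
import collections

def run_atomized(deps, work):
    """deps: resource -> set of resources it depends on.
    work: callable invoked once per resource, in dependency order."""
    remaining = {r: set(d) for r, d in deps.items()}
    dependents = collections.defaultdict(set)
    for resource, its_deps in deps.items():
        for dep in its_deps:
            dependents[dep].add(resource)
    # resources with no unsatisfied dependencies are enabled immediately
    queue = collections.deque(r for r, d in remaining.items() if not d)
    done = []
    while queue:
        resource = queue.popleft()   # an engine "picks up" the cast
        work(resource)               # done under a per-resource lock in Heat
        done.append(resource)
        for dep in dependents[resource]:
            remaining[dep].discard(resource)
            if not remaining[dep]:   # newly enabled: anycast it
                queue.append(dep)
    return done
```

For example, with a network, a subnet depending on it, and a server depending on the subnet, the operations run in that order, each enabled only when its predecessor completes.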

Engine and Resource Manager Failure

Heat engines and resource managers (the nova api process, etc.) can fail. We suppose systemd or some such thing detects failures and launches replacement processes as needed.

With resource operations atomized and anycast, recovery from a heat engine crash is relatively easy. Upon startup, a heat engine queries the DB for a resource that is locked by a heat engine that is no longer running (sub-problem: how to determine these), and if one is found then does recovery/resume of that work in progress. Repeat until no orphaned work is found, then go into regular operation.
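The recovery loop might look like the following sketch; find_orphaned_lock, try_steal_lock, and resume are hypothetical stand-ins for queries against a resource-lock table, and FakeDB is an in-memory illustration:

```python
class FakeDB:
    """In-memory stand-in for the resource-lock table (illustration only)."""
    def __init__(self, locks):
        self.locks = locks       # resource name -> engine id holding its lock
        self.resumed = []
    def find_orphaned_lock(self, exclude):
        for resource, owner in self.locks.items():
            if owner not in exclude:
                return resource  # locked by an engine no longer running
        return None
    def try_steal_lock(self, resource, new_owner):
        self.locks[resource] = new_owner
        return True
    def resume(self, resource):
        self.resumed.append(resource)

def recover_orphans(db, my_engine_id, live_engine_ids):
    """On startup, adopt work locked by engines that are no longer running."""
    while True:
        resource = db.find_orphaned_lock(exclude=live_engine_ids)
        if resource is None:
            break                       # no orphaned work; go into regular operation
        if db.try_steal_lock(resource, my_engine_id):
            db.resume(resource)         # re-drive the interrupted operation

db = FakeDB({"r1": "dead-engine", "r2": "engine-2"})
recover_orphans(db, "engine-1", live_engine_ids={"engine-1", "engine-2"})
# r1 is adopted and resumed; r2 is left to the still-live engine-2
```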

Fail-stop is the easy case; other possible failures include network partitions, partial wedges, and astounding general slowness in a process (including interactive usage of a debugger). We have to cope with false positives in the failure detector. In particular, a process thought to be failed might stumble back to life and do some more work before it notices that it is supposed to be dead. Similarly, network packets might be delayed. A TCP peer does not consider an extended period of inaction to be a connection break.
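One standard way to cope with a "dead" process that stumbles back to life is a fencing token (a general technique, not something settled in the etherpad): each time a lock is stolen, a generation counter is bumped, and writes carrying a stale generation are rejected. A toy sketch:

```python
class FencedResource:
    """Lock record with a generation counter that fences out stale holders."""
    def __init__(self):
        self.generation = 0
        self.value = None
    def steal_lock(self):
        self.generation += 1          # replacement engine fences out the old one
        return self.generation        # token the holder must present on writes
    def write(self, token, value):
        if token != self.generation:  # stale holder came back to life
            return False
        self.value = value
        return True

res = FencedResource()
old = res.steal_lock()                # original engine takes the lock
new = res.steal_lock()                # recovery engine steals it after a presumed failure
res.write(old, "stale")               # rejected: the old engine is fenced out
res.write(new, "fresh")               # accepted
```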

The universal pattern of resource managers is that there is a creation operation that both creates a resource and allocates its one and only unique identifier, returning that identifier. This is inherently problematic because it prevents idempotent usage.

When there is a failure between (1) the time a heat engine persists its intent to request creation of a resource and (2) the time the heat engine persists the UUID of the created resource, it is unclear whether the resource was created. Heat has to assume the resource was not created. This can lead to orphaned resources; fixing this problem involves changing resource creation APIs to be idempotent.
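An idempotent creation API would typically take a client-supplied token, so that a retried request returns the originally created resource instead of minting an orphan. A toy sketch of the server side (FakeResourceManager and its token scheme are purely illustrative):

```python
import uuid

class FakeResourceManager:
    """Illustrative server side of an idempotent create API."""
    def __init__(self):
        self.by_token = {}  # client token -> resource UUID
    def create(self, client_token):
        # replaying the same token returns the original resource
        # instead of creating an orphaned duplicate
        if client_token not in self.by_token:
            self.by_token[client_token] = str(uuid.uuid4())
        return self.by_token[client_token]

mgr = FakeResourceManager()
token = "heat-intent-1234"     # persisted by the engine before the call
first = mgr.create(token)
retry = mgr.create(token)      # retried after an ambiguous failure
assert first == retry          # no orphaned resource
```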

Roadmap

The first step is changing the DB schema to support the new design.

Once that is done, it will be possible to fix the worst problem with a small patch. The worst problem is that a stack update that fails partway through has no correct DB update it can make; with the DB schema changed, this goes away; we can make stack UPDATE properly update observed state as it goes along, so a failure partway through leaves the DB in a state from which further UPDATE or DELETE operations can correctly proceed.

With the worst problem fixed, we can then turn our attention to the more complete solution.