Convergence Design Notes by Mike

In the Juno design summit etherpad (https://etherpad.openstack.org/p/heat-workflow-vs-convergence) several design problems and solutions are discussed, with the final selection unclear. In this page I offer one opinion. The contentious issues include whether to use a workflow engine, and whether (and, if so, how) to chunk the work.

Workflow Engine or Not

The desires to support very large stacks (size of 1E6 resources was mentioned), react quickly to new stack operations, and efficiently support a hypothetical new incremental stack update operation appear to have side-tracked the idea of using a workflow engine. Here is my net of the issues. Suppose a workflow engine with a clean interrupt operation: it stops the launch of new actions for the interrupted workflow, and waits for completion of the actions currently in progress. Heat could use such a workflow engine. When some differences between target and observed state are detected, heat would compose a workflow to heal the differences and launch it. If later more differences are detected before that workflow completes, that workflow would be interrupted and then a new workflow composed and launched to handle all the current differences. This would probably require yet another copy of state: in addition to target and observed state there would be another kind of state that is observed state overwritten by the goals of the workflow (if any) currently in progress; let's call that "anticipated state". The kind of difference that causes a new/revised workflow is a difference between anticipated state and target state.
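
For concreteness, here is a minimal sketch of that interrupt-and-recompose control loop. Every name in it (compose_workflow, interrupt, anticipated_state, and so on) is invented for this sketch; none of it is an existing Heat or workflow-library API.

 # Hypothetical sketch of the control loop described above.  None of these
 # helpers exist; they only name the concepts.
 
 def on_new_target_state(stack):
     target = stack.target_state()            # from template + environment + parameters
     anticipated = stack.anticipated_state()  # observed state overwritten by the goals
                                              # of the workflow currently in progress
     if not diff(anticipated, target):
         return                               # the in-progress workflow already covers it
 
     running = stack.workflow_in_progress()
     if running is not None:
         running.interrupt()                  # stop launching new actions
         running.wait_for_inflight_actions()  # let the current actions finish
 
     # Compose a new workflow that heals *all* current differences.
     new_workflow = compose_workflow(diff(stack.observed_state(), target))
     stack.set_anticipated_state(target)      # the new workflow's goals
     new_workflow.launch()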

There is a desire to add an incremental stack update operation, which would not take a whole revised template+effective_environment+parameters but rather some description of an incremental change. There is a desire for an efficient implementation of this hypothetical operation, particularly in the case of a large stack and a small change. It might be possible for the implementation to interrupt a workflow in progress and incrementally compute the needed revised workflow, and possibly even, as an optimization, detect the special case where there is no intersection between the new delta and the workflow(s) currently in progress and in that case compose and launch an additional independent workflow. But this is going to be pretty complex logic.
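
If that optimization were attempted, the special-case test might look roughly like the following; again, every name is made up for illustration, reusing the hypothetical helpers from the sketch above.

 # Hypothetical sketch of the incremental-update special case: if the new delta
 # touches no resource that any workflow in progress touches, launch an
 # independent workflow; otherwise fall back to interrupt-and-recompose.
 
 def apply_incremental_update(stack, delta):
     touched = set(delta.touched_resources())
     in_progress = stack.workflows_in_progress()
 
     if all(touched.isdisjoint(wf.resources()) for wf in in_progress):
         compose_workflow(delta).launch()     # disjoint: runs alongside the others
         return
 
     for wf in in_progress:                   # overlapping: re-plan everything
         wf.interrupt()
         wf.wait_for_inflight_actions()
     target = stack.target_state()
     compose_workflow(diff(stack.observed_state(), target)).launch()
     stack.set_anticipated_state(target)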

The alternative is an approach that does not use a workflow engine. Rather, multiple heat engines conspire to do the actions themselves. This appears preferable to me, although I am not happy with the degree to which this involves duplicating functionality of a workflow system.

Where to Store Observed State

I saw mention of three approaches: (1) store it in the DB, (2) store it in memcached, and (3) read it whenever needed from the authoritative source. Approach (3) was roundly dismissed. Approach (2) raises questions about consistency. It is not clear to me that we have a problem with consistency; that depends on other parts of the design. For now let us assume approach (1), and later revisit this question.
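
As a strawman for approach (1), observed state could be kept as a per-resource row in the Heat database, along the following lines. SQLAlchemy is what Heat already uses for its DB layer, but the table name and columns here are invented for illustration, not Heat's actual schema.

 # Illustrative only; not Heat's actual schema.
 from sqlalchemy import Column, DateTime, String, Text
 from sqlalchemy.ext.declarative import declarative_base
 
 Base = declarative_base()
 
 class ResourceObservedState(Base):
     __tablename__ = 'resource_observed_state'
 
     resource_id = Column(String(36), primary_key=True)   # UUID of the resource
     stack_id = Column(String(36), index=True)
     observed_properties = Column(Text)   # JSON blob of the last observation
     observed_at = Column(DateTime)       # when the observation was made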

To Chunk or Not

Some of the discussion was around the idea of breaking a large stack operation into smallish batches of individual resource operations. There is even a reference to an academic paper on graph partitioning, which could be used to do that breaking into batches. Even with batching (chunking) there remains the problem of doing each operation only after its dependencies are satisfied. With multiple engines there is also the problem of avoiding duplicated or inconsistent work (remember the desire for the ability to start working on a new stack operation before the old one is finished). Suppose one large stack operation arrives, is broken into chunks, and the chunks are distributed and begin executing. While that execution is going on, a stack update arrives that adds and removes resources. The set of chunks is now necessarily different; how do we coordinate with the chunks already in progress?
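
For concreteness, the simplest form of chunking would batch resource operations by topological level of the dependency graph. The referenced graph-partitioning paper presumably does something smarter; the crude sketch below is only meant to show where the dependency and coordination problems come from.

 # Crude illustration of chunking: batch resource operations by topological
 # level of the dependency graph.  Not the algorithm from the paper.
 
 def chunk_by_level(deps):
     """deps maps each resource to the set of resources it depends on."""
     remaining = dict(deps)
     done = set()
     chunks = []
     while remaining:
         ready = [r for r, d in remaining.items() if d <= done]
         if not ready:
             raise ValueError("dependency cycle")
         chunks.append(ready)
         done.update(ready)
         for r in ready:
             del remaining[r]
     return chunks
 
 # Example: a server depends on a port, which depends on a network.
 print(chunk_by_level({'network': set(), 'port': {'network'}, 'server': {'port'}}))
 # -> [['network'], ['port'], ['server']]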

The alternative, which I favor, is to "atomize" the work: focus on individual resource operations. As each becomes enabled, it is anycast ("cast" in oslo RPC terminology) to the heat engines, one of which picks it up and works on it. The work would be done while holding a lock on the individual resource, to prevent a concurrent engine from doing redundant or inconsistent work on that resource. This would be a lock that prevents concurrent execution but not concurrent update of target state (so that a new stack operation can be received and its target state persisted while an old one is in progress). When an engine completes work on one resource, it would compute which differences are newly enabled to be executed and do the corresponding anycasts. This outline needs to be adjusted to accommodate target state changes that arrive during execution.
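
A rough sketch of what one engine would do when it picks up such an anycast follows. rpc_cast, try_lock, converge, and newly_enabled are names invented for this sketch, not existing Heat or oslo APIs; the real casts would go through oslo.messaging to a topic shared by all the engines.

 # Hypothetical sketch of the atomized scheme; all helper names are invented.
 
 def handle_resource_op(resource_id):
     """Runs in whichever engine picked this cast off the shared topic."""
     lock = try_lock(resource_id)      # execution lock on this one resource;
     if lock is None:                  # it does NOT block target-state updates
         return                        # another engine already has it
     try:
         target = load_target_state(resource_id)
         observed = load_observed_state(resource_id)
         converge(resource_id, observed, target)   # create/update/delete as needed
         store_observed_state(resource_id)
     finally:
         lock.release()
 
     # Fan out: anycast every resource operation that this completion newly
     # enables; one of the engines will pick each of them up.
     for dependent in newly_enabled(resource_id):
         rpc_cast('handle_resource_op', resource_id=dependent)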

Engine Failure

Heat engines can fail. We suppose systemd or some such thing detects failures and launches replacement engines as needed.

Fail-stop is the easy case; other possible failures include network partitions, partial wedges, and astounding general slowness in a process (including interactive usage of a debugger). We have to cope with false positives in the failure detector. In particular, an engine thought to be failed might stumble back to life and do some more work before it notices that it is supposed to be dead.