
Heat/ConvergenceDesignNotesByMike


Convergence Design Notes by Mike

In the Juno design summit etherpad (https://etherpad.openstack.org/p/heat-workflow-vs-convergence) several design problems and solutions are discussed, with the final selection unclear. In this page I offer one opinion. The contentious issues include: the kinds of agents, whether to use a workflow engine, where to store observed state, whether and (if so) how to chunk the work, and how to handle stack adopt. In another dimension, there is the question of how to roadmap the work.

Kinds of Agents

I like to think of three kinds of agents: desired state setters, observers, and convergence engines. Today's heat engines can evolve to set desired state, and possibly to also serve as convergence engines.

Workflow Engine or Not

The desires to support very large stacks (a size of 1E6 resources was mentioned), react quickly to new stack operations, and efficiently support a hypothetical new incremental stack update operation appear to have side-tracked the idea of using a workflow engine. Here is my summary of the issues. Suppose a workflow engine with a clean interrupt operation: it stops the launch of new actions for the interrupted workflow and waits for completion of the actions currently in progress. Heat could use such a workflow engine. When some differences between desired and observed state are detected, heat would compose a workflow to heal the differences and launch it. If more differences are detected before that workflow completes, that workflow would be interrupted and then a new workflow composed and launched to handle all the current differences. This would probably require yet another copy of state: in addition to desired and observed state there would be a third kind, namely observed state overwritten by the goals of the workflow (if any) currently in progress; let's call that "anticipated state". It would actually be anticipated state, rather than observed state, that is compared with desired state to drive workflow composition and launch.
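
To make the "anticipated state" idea concrete, here is a minimal, self-contained sketch (plain Python, no Heat APIs) of how anticipated state could be derived and compared against desired state to decide what the next workflow must heal. The data layout (dicts of resource properties) is purely illustrative.

    def anticipated_state(observed, in_progress_goals):
        """Observed state overlaid with the goals of actions already launched."""
        state = dict(observed)
        state.update(in_progress_goals)
        return state

    def compute_differences(desired, observed, in_progress_goals):
        """Resources whose desired properties differ from anticipated state."""
        anticipated = anticipated_state(observed, in_progress_goals)
        return {name: props for name, props in desired.items()
                if anticipated.get(name) != props}

    # Example: the in-progress workflow is already resizing the server, so
    # only the newly requested volume produces a difference to plan for.
    desired = {'server': {'flavor': 'm1.large'}, 'volume': {'size': 10}}
    observed = {'server': {'flavor': 'm1.small'}}
    in_progress = {'server': {'flavor': 'm1.large'}}
    print(compute_differences(desired, observed, in_progress))
    # -> {'volume': {'size': 10}}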

There is a desire to add an incremental stack update operation, which would not take a whole revised template+effective_environment+parameters but rather some description of an incremental change. There is a desire for an efficient implementation of this hypothetical operation, particularly in the case of a large stack and a small change. It might be possible for the implementation to interrupt a workflow in progress and incrementally compute the needed revised workflow, and possibly even --- as an optimization --- detect the special case where there is no intersection between the new delta and the workflow(s) currently in progress and in that case compose and launch an additional independent workflow. But this is going to be pretty complex logic.
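
As an illustration of the special case mentioned above, the "no intersection" test might look roughly like the following sketch. The function and its inputs are hypothetical; a real implementation would have to work from the persisted dependency graph.

    def can_run_independently(delta_resources, in_progress_resources, dependencies):
        """dependencies maps each resource to the set of resources it depends on."""
        affected = set(delta_resources)
        # Grow the set with everything that (transitively) depends on the delta,
        # since ordering constraints could couple it to the running workflow.
        changed = True
        while changed:
            changed = False
            for resource, deps in dependencies.items():
                if resource not in affected and deps & affected:
                    affected.add(resource)
                    changed = True
        return affected.isdisjoint(in_progress_resources)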

An alternative approach does not use a workflow engine; rather, multiple convergence engines conspire to do the actions themselves. This appears preferable to me, although I am not happy with the degree to which this involves duplicating functionality of a workflow system.

Where to Store Observed State

I saw mention of three approaches: (1) store it in the DB, (2) store it in memcached, and (3) read it whenever needed from the authoritative source. Approach (3) was roundly dismissed. Approach (2) raises questions about consistency. It is not clear to me that we have a problem with consistency; that depends on other parts of the design. For now let us assume approach (1), and revisit this question later.

To Chunk or Not

Some of the discussion was around the idea of breaking a large stack operation into smallish batches of individual resource operations. There is even a reference to an academic paper on graph partitioning, which could be used to do that partitioning into batches. Even with batching (chunking) there remains the problem of doing each operation only after its dependencies are satisfied. With multiple engines there is also the problem of avoiding duplicated or inconsistent work (remember the desire for the ability to start working on a new stack operation before the old one is finished). Suppose one large stack operation arrives, is broken into chunks, and the chunks are distributed and their execution begins. While that execution is going on, a stack update arrives that adds and removes resources. The set of chunks is now necessarily different; how do we coordinate with the chunks already in progress?

The alternative, which I favor, is to "atomize" the work: focus on individual resource operations. As each becomes enabled it is anycast ("cast" in oslo RPC terminology) to the convergence engines, one of which picks it up and works on it. The work would be done while holding a lock on the individual resource, to prevent a concurrent engine from doing redundant or inconsistent work on that resource. This would be a lock that prevents concurrent execution but not concurrent update of desired state (so that a new stack operation can be received and its desired state persisted while an old one is in progress). When an engine completes work on one resource, it would compute which differences are newly enabled for execution and do the corresponding anycasts. This outline needs to be adjusted to accommodate desired state changes that arrive during execution.
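
A rough sketch of the per-resource flow follows. The db object, the anycast() helper (an oslo.messaging cast on a topic shared by all convergence engines), and converge() (standing in for the resource plugin's create/update/delete logic) are all hypothetical names, not existing Heat interfaces.

    def handle_converge_resource(resource_id, engine_id, db, anycast, converge):
        # The execution lock blocks concurrent execution on this resource,
        # but not concurrent updates to its desired state.
        if not db.try_acquire_execution_lock(resource_id, engine_id):
            return  # another engine is already working on this resource
        try:
            desired = db.get_desired(resource_id)
            observed = db.get_observed(resource_id)
            if desired != observed:
                new_observed = converge(resource_id, desired, observed)
                db.set_observed(resource_id, new_observed)
        finally:
            db.release_execution_lock(resource_id, engine_id)
        # Any resource whose dependencies are now all satisfied becomes
        # enabled; anycast it so that exactly one engine picks it up.
        for dependent in db.newly_enabled_dependents(resource_id):
            anycast('converge_resource', resource_id=dependent)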

Engine and Resource Manager Failure

Convergence engines and resource managers (the nova api process, etc.) can fail. We suppose systemd or some such thing detects failures and launches replacement processes as needed. This health subsystem forces exit of a failed process if that has not already happened, and invokes a cleanup utility. The DB schema is expanded to hold a set of active convergence engine IDs. The cleanup utility does a DB transaction that removes the failed process ID from the active set and recovers the work that the failed process was in the process of doing (details below). Engine IDs are composed in such a way that they are not re-used (e.g., host name + PID + timestamp). These things can be done by appropriate configuration of systemd.
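
For illustration, a sketch of how the engine ID could be composed and what the cleanup utility's transaction might look like; the db methods and tables here are assumptions, not an existing schema.

    import os
    import socket
    import time

    def make_engine_id():
        # Host name + PID + timestamp, so an ID is never re-used.
        return '%s-%d-%d' % (socket.gethostname(), os.getpid(), int(time.time()))

    def clean_up_after_engine(db, dead_engine_id, anycast):
        # One transaction: drop the dead engine from the active set and find
        # the execution locks it still holds, i.e. the work it was doing.
        with db.transaction():
            db.remove_active_engine(dead_engine_id)
            orphaned = db.resources_locked_by(dead_engine_id)
        # Hand the orphaned operations back to the pool of live engines.
        for resource_id in orphaned:
            anycast('converge_resource', resource_id=resource_id)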

With resource operations atomized and anycast, recovery from a convergence engine crash is relatively easy. The work that was in progress when a convergence engine crashed can be found by querying the DB for execution locks held by the crashed engine. The cleanup utility does this query, and anycasts the operations found. When an engine receives an anycast and tries to lock a resource for execution, if that lock is held by a no-longer-active engine then it is implicitly broken so that the active engine can take that lock.
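
The lock-breaking rule could be expressed as a single conditional UPDATE, so that two live engines can never both end up holding the lock. The table and column names below are illustrative, and the thin execute() wrapper and its parameter-binding style are generic assumptions rather than a specific driver's API.

    # Succeeds only if the lock row is free or its holder is no longer in the
    # active-engine set (assuming one lock row exists per resource).
    ACQUIRE_OR_BREAK = """
    UPDATE resource_execution_lock
       SET engine_id = :me
     WHERE resource_id = :res
       AND (engine_id IS NULL
            OR engine_id NOT IN (SELECT engine_id FROM active_engine))
    """

    def try_acquire_execution_lock(connection, resource_id, my_engine_id):
        result = connection.execute(ACQUIRE_OR_BREAK,
                                    {'me': my_engine_id, 'res': resource_id})
        return result.rowcount == 1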

The universal pattern of resource managers is that there is a creation operation that both creates a resource and allocates its one and only unique identifier, returning that identifier. This is inherently problematic because it prevents idempotent usage.

When there is a failure between (1) the time a convergence engine persists its intent to request creation of a resource and (2) the time the engine persists the UUID of the created resource, it is unclear whether the resource was created. Heat has to assume the resource was not created. This can lead to orphaned resources, and fixing this problem involves changing resource creation APIs to be idempotent.
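
A hedged sketch of how the window could be closed if a creation API accepted a client-supplied idempotency token (most do not today, which is exactly the problem described above); every name here is hypothetical.

    import uuid

    def create_with_persisted_intent(db, client, resource_name):
        # Persist the intent, including a client-generated token, BEFORE
        # calling the resource manager; re-reading after a crash yields the
        # same token.
        token = db.get_or_create_creation_token(resource_name, uuid.uuid4().hex)
        # With an idempotent API, retrying the create with the same token must
        # return the originally created resource instead of making a second one.
        physical_id = client.create(client_token=token)
        db.set_physical_id(resource_name, physical_id)
        return physical_id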

Stack Adopt

In the ADOPT operation, the input indicates which underlying or "physical" resource corresponds to a given template entry (resource). That means that the association between stack resource and underlying resource is not only a subject of observed state but also a subject of desired state, though only in the case of stack adopt.
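
For illustration only (a simplified shape, not the exact adopt-data schema), the adopt input binds a template resource name to an existing physical resource roughly like this:

    adopt_data = {
        'resources': {
            'my_server': {
                'type': 'OS::Nova::Server',
                # the pre-existing physical resource to adopt rather than create
                'resource_id': '<existing-nova-server-uuid>',
            },
        },
    }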

Roadmap

The first step is changing the DB schema to support the new design.
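
A hedged sketch, in SQLAlchemy declarative form, of the kind of additions the design above implies (per-resource observed state, the active-engine set, and per-resource execution locks); the table and column names are illustrative, not a proposed migration.

    from sqlalchemy import Column, String, Text
    from sqlalchemy.ext.declarative import declarative_base

    Base = declarative_base()

    class ActiveEngine(Base):
        __tablename__ = 'active_engine'
        engine_id = Column(String(255), primary_key=True)   # host+pid+timestamp

    class ResourceObservedState(Base):
        __tablename__ = 'resource_observed_state'
        resource_id = Column(String(36), primary_key=True)
        physical_id = Column(String(255))    # ID in the underlying service
        observed_properties = Column(Text)   # JSON blob of last-observed state

    class ResourceExecutionLock(Base):
        __tablename__ = 'resource_execution_lock'
        resource_id = Column(String(36), primary_key=True)
        engine_id = Column(String(255), nullable=True)   # NULL means unlocked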

Once that is done, it will be possible to fix the worst problem with a small patch. The worst problem is that a stack update that fails partway through has no correct DB update it can make. With the DB schema changed, this problem goes away: we can make stack UPDATE properly update observed state as it goes along, so a failure partway through leaves the DB in a state from which further UPDATE or DELETE operations can correctly proceed.

With the worst problem fixed, we can then turn our attention to the more complete solution.