Convergence Design Notes by Mike
In the Juno design summit etherpad (https://etherpad.openstack.org/p/heat-workflow-vs-convergence) there are several design problems and solutions discussed, with final selection unclear. In this page I offer one opinion. The contentious issues include: kinds of agents, to use a workflow engine or not, where to store observed state, whether and (if so) how to chunk the work, and how to handle stack adopt. In another dimension, there is the question of how to roadmap the work.
Kinds of Agents
I like to think of three kinds of agents: desired state setters, observers, and convergence engines. Today's heat engines can evolve to set desired state, and possibly to also serve as convergence engines.
Workflow Engine or Not
The desires to support very large stacks (a size of 1E6 resources was mentioned), react quickly to new stack operations, and efficiently support a hypothetical new incremental stack update operation appear to have side-tracked the idea of using a workflow engine. Here is my summary of the issues. Suppose a workflow engine with a clean interrupt operation: it stops the launch of new actions for the interrupted workflow and waits for completion of the actions currently in progress. Heat could use such a workflow engine. When some differences between desired and observed state are detected, heat would compose a workflow to heal the differences and launch it. If more differences are detected before that workflow completes, it would be interrupted and a new workflow composed and launched to handle all the current differences. This would probably require yet another copy of state: in addition to desired and observed state there would be a third kind, observed state overwritten by the goals of the workflow (if any) currently in progress; let's call that "anticipated state". It would actually be anticipated state, rather than observed state, that is compared with desired state to drive workflow composition and launch.
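The anticipated-state bookkeeping can be sketched as follows. This is a minimal illustration, assuming states are maps from resource name to spec; none of these names are real Heat code.

```python
# Minimal sketch of "anticipated state": observed state overwritten by the
# goals of the workflow (if any) currently in progress. Illustrative only.

def anticipated_state(observed, in_progress_goals):
    state = dict(observed)
    state.update(in_progress_goals or {})
    return state

def compute_divergence(desired, anticipated):
    """It is anticipated state, not observed state, that is compared with
    desired state to drive workflow composition."""
    to_update = {r: spec for r, spec in desired.items()
                 if anticipated.get(r) != spec}
    to_delete = [r for r in anticipated if r not in desired]
    return to_update, to_delete
```

Resources already targeted by the running workflow drop out of the diff, so the composed healing workflow covers only the genuinely new differences.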
There is a desire to add an incremental stack update operation, which would not take a whole revised template+effective_environment+parameters but rather some description of an incremental change. There is a desire for an efficient implementation of this hypothetical operation, particularly in the case of a large stack and a small change. It might be possible for the implementation to interrupt a workflow in progress and incrementally compute the needed revised workflow, and possibly even --- as an optimization --- detect the special case where there is no intersection between the new delta and the workflow(s) currently in progress and in that case compose and launch an additional independent workflow. But this is going to be pretty complex logic.
An alternative approach does not use a workflow engine; rather, multiple convergence engines conspire to do the actions themselves. This appears preferable to me, although I am not happy with the degree to which this involves duplicating functionality of a workflow system.
Where to Store Observed State
I saw mention of three approaches: (1) store it in the DB, (2) store it in memcached, and (3) read it whenever needed from the authoritative source. Approach (3) was roundly dismissed. Approach (2) raises questions about consistency. It is not clear to me that we have a consistency problem; that depends on other parts of the design. For now let us assume approach (1), and revisit this question later.
To Chunk or Not
Some of the discussion was around the idea of breaking a large stack operation into smallish batches of individual resource operations. There is even a reference to an academic paper on partitioning an undirected graph, which could be used to do that breaking into batches. Even with batching (chunking) there remains the problem of doing each operation only after its dependencies are satisfied. In general there can be many dependencies that run from the midst of one chunk into the midst of another; a chunk cannot be executed independently of the other chunks. With an alternate partitioning algorithm that takes a DAG as input and produces a partitioning with no cycles, we could coarsen the dependencies to whole partitions and execute each partition in isolation, but that (a) involves a harder partitioning problem and (b) introduces delays in execution.
With multiple engines there is also the problem of avoiding duplicated or inconsistent work (remember the desire for the ability to start healing a new divergence before an old one is fully healed). It would not be hard to ensure that each chunk is processed by exactly one convergence engine.
Suppose one large stack operation arrives, is broken into chunks, and those chunks are distributed and begin executing. While that execution is going on, a stack update or relevant change in observed state arrives. The set of chunks may now be different; how do we coordinate with the chunks in progress? The likely answer is to interrupt all the old chunks and then compose and launch a new set of chunks. (This itself is not a differentiator; it is the essence of the easy thing to do in all approaches.)
The alternative, which I favor, is to "atomize" the work: focus on individual resource operations. As each becomes enabled, it is anycast ("cast" in oslo RPC terminology) to the convergence engines, one of which picks it up and works on it. The work would be done while holding a lock on the individual resource, to prevent another engine from doing redundant or inconsistent work on that resource. This lock prevents concurrent execution but not concurrent update of desired state (so that a new stack operation can be received and its desired state persisted while an old one is in progress). When an engine completes work on one resource, it computes which differences are newly enabled to be executed and does the corresponding anycasts. When a divergence (i.e., a set of differences between desired and observed state) is first noted, the agent that noted it anycasts the initial set of operations (those that do not need to wait on any dependencies).
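A per-resource execution lock of the kind described might look like this. It is an in-memory stand-in for what would really be a DB-backed lock, and all names are hypothetical.

```python
import threading

# In-memory stand-in for the DB-backed per-resource execution lock: it
# serializes execution on a resource without blocking updates to that
# resource's desired state, which lives in separate tables.

class ResourceExecutionLocks:
    def __init__(self):
        self._held = {}                  # resource_id -> engine_id
        self._mutex = threading.Lock()

    def try_acquire(self, resource_id, engine_id):
        with self._mutex:
            if resource_id in self._held:
                return False             # another engine is executing this one
            self._held[resource_id] = engine_id
            return True

    def release(self, resource_id, engine_id):
        with self._mutex:
            if self._held.get(resource_id) == engine_id:
                del self._held[resource_id]
```

Note that the lock guards execution only; a new stack operation can still persist fresh desired state for the locked resource while the old operation runs.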
There needs to be a way to tell when all the work to heal the last detected divergence of a given stack has terminated. Keeping critical execution state in message queues is problematic because they do not support the needed querying. Therefore the set of enabled resource operations is kept in a DB table (including a column that identifies the stack), and what is anycast is the ID of the stack that has work waiting in that table. When that table holds no rows for a given stack and the stack's divergence_waiting field is set to null (see below), that stack's last detected divergence has been healed.
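The enabled-operations table and the completion test could look roughly like this. This is a sqlite3 sketch; the schema and names are my assumptions, not Heat's actual schema.

```python
import sqlite3

# Hypothetical schema: enabled resource operations keyed by stack, plus the
# stack's divergence_waiting flag ('WAIT', 'CAST', or NULL).
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE enabled_operation (
    stack_id    TEXT NOT NULL,
    resource_id TEXT NOT NULL,
    action      TEXT NOT NULL
);
CREATE TABLE stack (
    id                 TEXT PRIMARY KEY,
    divergence_waiting TEXT
);
""")

def last_divergence_healed(db, stack_id):
    # Healed when the stack has no enabled-operation rows and its
    # divergence_waiting flag is null.
    (pending,) = db.execute(
        "SELECT COUNT(*) FROM enabled_operation WHERE stack_id = ?",
        (stack_id,)).fetchone()
    (waiting,) = db.execute(
        "SELECT divergence_waiting FROM stack WHERE id = ?",
        (stack_id,)).fetchone()
    return pending == 0 and waiting is None
```

Because the pending work is rows in a table rather than messages in a queue, this completion question is a plain query.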
If a new divergence is detected before completion of the work triggered by an old divergence, we interrupt the old work (this includes waiting for it to be fully stopped) and then start working on the now-current set of differences. Following is a way to interrupt the old work and know when it is fully stopped. We add to the DB a table that maps convergence engine ID to the ID of the stack on which that engine is currently working (or null when the engine is idle); an engine updates this at the start and end of each operation. We also add a column to the stack table indicating whether the stack has a new divergence waiting; it holds one of three values: WAIT, CAST, or null. When an agent notes a new divergence for a given stack, it deletes all that stack's existing records (if any) in the enabled-operations table and sets the stack's divergence_waiting field to WAIT. It then waits until the DB shows that no engine is working on that stack. Next, in one transaction, it sets divergence_waiting to CAST and creates the initially-enabled set of records for that divergence; it then does the anycasts for the new records and finally, in another transaction, sets divergence_waiting back to null. At startup, an agent that does this sort of thing first recovers by looking for stacks whose divergence_waiting field is WAIT or CAST and proceeding from there.
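Put together, the interrupt sequence reads as below. This is a self-contained sketch with in-memory stand-ins for the DB tables and the RPC anycast; in the real design the CAST step would be a single DB transaction.

```python
import time

# In-memory stand-ins for the DB tables and oslo RPC anycast; illustrative.
enabled_ops = {}         # stack_id -> list of enabled resource operations
divergence_waiting = {}  # stack_id -> 'WAIT' | 'CAST' | None
engine_working_on = {}   # engine_id -> stack_id or None
cast_log = []            # what was anycast (the stack ID, per the design)

def anycast(stack_id):
    cast_log.append(stack_id)

def handle_new_divergence(stack_id, initial_ops):
    # 1. Discard the old divergence's pending work; mark the stack WAIT.
    enabled_ops[stack_id] = []
    divergence_waiting[stack_id] = "WAIT"
    # 2. Wait until no convergence engine is working on this stack.
    while any(s == stack_id for s in engine_working_on.values()):
        time.sleep(0.05)
    # 3. In one step (one DB transaction in the real design): set CAST and
    #    create the initially-enabled operation records.
    divergence_waiting[stack_id] = "CAST"
    enabled_ops[stack_id] = list(initial_ops)
    # 4. Anycast the stack ID for each new record, then clear the flag.
    for _ in initial_ops:
        anycast(stack_id)
    divergence_waiting[stack_id] = None
```

An agent that crashes mid-sequence leaves the stack flagged WAIT or CAST, which is exactly what the startup recovery scan looks for.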
Engine and Resource Manager Failure
Convergence engines and resource managers (the nova api process, etc.) can fail. We suppose systemd or some such thing detects failures and launches replacement processes as needed. This health subsystem forces exit of a failed process if that has not already happened, and invokes a cleanup utility. The DB schema is expanded to hold a set of active convergence engine IDs. The cleanup utility does a DB transaction that removes the failed process ID from the active set and recovers the work that the failed process was in the process of doing (details below). Engine IDs are composed in such a way that they are not re-used (e.g., host name + PID + timestamp). These things can be done by appropriate configuration of systemd.
With resource operations atomized and anycast, recovery from a convergence engine crash is relatively easy. The work that was in progress when a convergence engine crashed can be found by querying the DB for execution locks held by the crashed engine. The cleanup utility does this query, and anycasts the operations found. When an engine receives an anycast and tries to lock a resource for execution, if that lock is held by a no-longer-active engine then it is implicitly broken so that the active engine can take that lock.
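The implicit lock break might be sketched as follows; all names are hypothetical, and the active-engine set is the one the expanded DB schema holds.

```python
# Sketch of implicit lock breaking: an execution lock held by an engine
# that is no longer in the active set may be taken over by a live engine.
# Engine IDs are never re-used, so a stale holder is unambiguous.

active_engines = {"engine-a"}            # set of live convergence engines
execution_locks = {"res-1": "engine-b"}  # engine-b has crashed

def acquire_or_break(resource_id, engine_id):
    holder = execution_locks.get(resource_id)
    if holder is None or holder not in active_engines:
        # Free, or held by a dead engine: take the lock (implicit break).
        execution_locks[resource_id] = engine_id
        return True
    return holder == engine_id           # already ours, or refused
```

The non-reuse of engine IDs (host name + PID + timestamp) is what makes "holder not in active set" a safe test: a recycled ID could otherwise masquerade as live.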
The universal pattern of resource managers is that there is a creation operation that both creates a resource and allocates its one and only unique identifier, returning that identifier. This is inherently problematic because it prevents idempotent usage.
When there is a failure between (1) the time a convergence engine persists its intent to request creation of a resource and (2) the time the engine persists the UUID of the created resource, it is unclear whether the resource was created. Heat has to assume the resource was not created. This can lead to orphaned resources, and fixing this problem involves changing resource creation APIs to be idempotent.
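One common shape for such a fix is a caller-chosen idempotency token; a retried request then returns the same resource instead of creating an orphan. The API below is hypothetical, shown only to illustrate what a changed creation API could look like.

```python
import uuid

# Hypothetical idempotent creation API: the caller supplies a token of its
# own choosing; a retry with the same token returns the same resource.

_created = {}  # token -> resource UUID (kept by the resource manager)

def create_resource(token):
    if token in _created:            # retry of an earlier request
        return _created[token]
    resource_id = str(uuid.uuid4())  # allocate the one unique identifier
    _created[token] = resource_id
    return resource_id

# Heat would persist the token *before* calling create; after a crash it
# retries with the same token and learns the resource's identity either way.
```

This closes the failure window in the text: persisting the token is step (1), and the create call can then be repeated safely until step (2) succeeds.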
Stack Adopt
In the ADOPT operation, the input indicates which underlying or "physical" resource corresponds to a given template entry (resource). That means that the association between stack resource and underlying resource is not only a subject of observed state but also a subject of desired state --- though only in the case of stack adopt.
Roadmap
The first step is changing the DB schema to hold both desired and observed state.
Once that is done, it will be possible to fix the worst problem with a small patch. The worst problem is that a stack update that fails partway through has no correct DB update it can make. With the DB schema changed, this goes away: we can make stack UPDATE properly update observed state as it goes along, so a failure partway through leaves the DB in a state from which further UPDATE or DELETE operations can correctly proceed.
With the worst problem fixed, we can then turn our attention to the more complete solution.