StructuredStateManagement

Drafter: Harlowja

Rationale
Move away from ad-hoc states and state transitions for resource acquisition and modification toward a well-defined, organized state management system. This new state management system will bring benefits such as greater stability, automatic recovery mechanisms, and greater scalability than what currently exists (and more!).

Definitions

 * State
 * The particular condition that someone or something is in at a specific time: "the state of the instance request".


 * State transition
 * Altering a state by applying a function to that state (the function may take inputs and provide outputs), resulting in a new state.


 * Task
 * The application of a state transition on a given state.


 * Workflow
 * The sequence of administrative/other processes through which a piece of work passes from initiation to completion.


 * State management engine
 * An individual or entity that arranges or controls the elements of a system so as to achieve a desired overall effect.
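
To make the definitions above concrete, here is a minimal sketch of how they could map onto code. It is only an illustration under stated assumptions: the class names (Task, Workflow, Engine), the state names and the two example tasks are hypothetical and are not taken from any existing library or from nova.

```python
from dataclasses import dataclass
from typing import Callable, List

# State: the condition a resource is in at a specific time (hypothetical names).
State = str

# State transition: a function applied to a state, resulting in a new state.
Transition = Callable[[State], State]


@dataclass
class Task:
    """The application of one state transition to a given state."""
    name: str
    transition: Transition

    def apply(self, state: State) -> State:
        return self.transition(state)


@dataclass
class Workflow:
    """The ordered sequence of tasks a piece of work passes through."""
    tasks: List[Task]


class Engine:
    """The single entity that applies the tasks of a workflow, in order."""

    def run(self, workflow: Workflow, initial: State) -> State:
        state = initial
        for task in workflow.tasks:
            state = task.apply(state)
        return state


# Hypothetical two-step workflow for an instance request.
workflow = Workflow([
    Task("allocate_network", lambda state: "NETWORKED"),
    Task("boot", lambda state: "ACTIVE"),
])
final_state = Engine().run(workflow, "REQUESTED")
```

In this sketch only the engine ever changes the state, which is the isolation the definitions aim for.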

What problems is this attempting to solve

 * Increasing the ['extendability', 'recoverability', 'reliability', 'stability'] of states and state transitions.
 * Making it easier to ['add', 'debug', 'review', 'test', 'understand', 'verify'] existing & new states and state transitions.
 * Removing hard to discover state & transition dependencies and interactions.
 * Ensuring state transitions are done ['correctly', 'reliably'] by isolating those transitions to an entity whose exclusive responsibility is to ['correctly', 'reliably'] perform said transitions.
 * Fixing a variety of problems that previously had piecemeal patches applied to solve them.
 * Eliminating the inherent fragility of the current ad-hoc workflows.
 * Allows for upgrading a cloud or its software while actions are in flight, without needing later manual cleanup of said actions.
 * Makes it possible to audit & track the state transitions performed on a given resource in a unified manner.
 * Note: notifications, logging and event reporting currently exist as separate mechanisms (which are not used in a uniform manner).
 * Removes the need for certain types of periodic tasks which start to consume more and more resources as you grow your cluster larger.
 * Moves toward a path where actions automatically recover from disruptive events (node failure, resource failure, network failure...).
 * Removes the need for periodic tasks to cleanup garbage (orphaned instances/resources/tasks...) left behind.
 * Creates the foundation for a more reliable and automatic recovery process when errors do occur.
 * Encourages the ['altering', 'extension'] of default workflows with more ['complex', 'custom', 'experimental'] workflows.
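
Two of the goals above, unified auditing of transitions and automatic cleanup on failure, can be sketched together. The following is a rough illustration, assuming each task carries both an execute and a revert function and that a single runner records every event in one journal. All names here (Task, run, reserve_ip, attach_volume and so on) are hypothetical and chosen only for the example.

```python
from typing import Callable, Dict, List, Tuple

Resource = Dict[str, str]
Journal = List[Tuple[str, str]]  # (task name, event)


class Task:
    """A unit of work that knows how to undo itself."""

    def __init__(self, name: str,
                 execute: Callable[[Resource], None],
                 revert: Callable[[Resource], None]) -> None:
        self.name = name
        self.execute = execute
        self.revert = revert


def run(tasks: List[Task], resource: Resource, journal: Journal) -> bool:
    """Run tasks in order; on failure, revert completed tasks in reverse."""
    completed: List[Task] = []
    for task in tasks:
        journal.append((task.name, "STARTED"))
        try:
            task.execute(resource)
        except Exception:
            journal.append((task.name, "FAILED"))
            for done in reversed(completed):
                done.revert(resource)
                journal.append((done.name, "REVERTED"))
            return False
        journal.append((task.name, "DONE"))
        completed.append(task)
    return True


def reserve_ip(resource: Resource) -> None:
    resource["ip"] = "10.0.0.5"


def release_ip(resource: Resource) -> None:
    resource.pop("ip", None)


def attach_volume(resource: Resource) -> None:
    # Simulated failure, to show the cleanup path.
    raise RuntimeError("volume backend unavailable")


def detach_volume(resource: Resource) -> None:
    resource.pop("volume", None)


instance: Resource = {"id": "instance-1"}
journal: Journal = []
ok = run(
    [Task("reserve_ip", reserve_ip, release_ip),
     Task("attach_volume", attach_volume, detach_volume)],
    instance,
    journal,
)
```

After the simulated failure the reserved IP has been released again, so nothing is left behind for a periodic task to garbage collect, and the journal holds the full sequence of events for that resource in one place.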

Fixing a known need

 * https://blueprints.launchpad.net/nova/+spec/compute-instance-cleanup-service
 * https://blueprints.launchpad.net/cinder/+spec/cinder-state-machine
 * https://bugs.launchpad.net/nova/+bug/1173408
 * https://bugs.launchpad.net/nova/+bug/1173413
 * https://bugs.launchpad.net/nova/+bug/1173417
 * https://bugs.launchpad.net/nova/+bug/1173420
 * https://bugs.launchpad.net/nova/+bug/1050979
 * https://bugs.launchpad.net/nova/+bug/1061024
 * https://bugs.launchpad.net/nova/+bug/1082414
 * https://bugs.launchpad.net/nova/+bug/1173429
 * https://bugs.launchpad.net/nova/+bug/1173430
 * (and more)

Blueprints

 * https://blueprints.launchpad.net/nova/+spec/structured-state-management
 * https://blueprints.launchpad.net/cinder/+spec/cinder-state-machine

Related papers

 * http://www.netdb.cis.upenn.edu/papers/tropic_tr.pdf
 * http://research.microsoft.com/pubs/64604/osr2007.pdf

Related wikis

 * https://wiki.openstack.org/wiki/Convection
 * https://wiki.openstack.org/wiki/NovaOrchestration/WorkflowEngines (old)

Potential Requirements
https://etherpad.openstack.org/task-system

Summit Discussions
Havana summit:
 * https://etherpad.openstack.org/the-future-of-orch
 * https://etherpad.openstack.org/Summit-Havana-Cinder-Safe-Shutdown

Step 1
Create prototype


 * 1) Create core workflow/task library and prototype using said library in nova for run_instance action.
 * 2) Split this action (refactored) into small atomic task chunks (don't aim for perfection just yet, since it's a prototype).
 * 3) Organize chunks into a workflow and test workflow.
 * 4) Show working prototype at summit session (with associated docs...)
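
As a rough picture of what items 2) and 3) could look like, the sketch below splits a made-up run_instance action into three small chunks, organizes them into a workflow, and checks one chunk in isolation. It is an assumption-laden illustration, not nova's actual run_instance code: the chunk names, the context keys and the IP address are all hypothetical.

```python
from typing import Callable, Dict, List

Context = Dict[str, object]
Chunk = Callable[[Context], Context]


def reserve_quota(ctx: Context) -> Context:
    out = dict(ctx)  # copy, so each chunk leaves its input untouched
    out["quota_reserved"] = True
    return out


def allocate_network(ctx: Context) -> Context:
    if not ctx.get("quota_reserved"):
        raise RuntimeError("quota must be reserved first")
    out = dict(ctx)
    out["ip"] = "10.0.0.5"
    return out


def spawn(ctx: Context) -> Context:
    if "ip" not in ctx:
        raise RuntimeError("network must be allocated first")
    out = dict(ctx)
    out["status"] = "ACTIVE"
    return out


# The chunks organized into a workflow; the order makes dependencies explicit.
RUN_INSTANCE_WORKFLOW: List[Chunk] = [reserve_quota, allocate_network, spawn]


def run_workflow(chunks: List[Chunk], ctx: Context) -> Context:
    for chunk in chunks:
        ctx = chunk(ctx)
    return ctx


def allocate_network_requires_quota() -> bool:
    """Small chunks can be tested one at a time, without the whole action."""
    try:
        allocate_network({"instance_id": "instance-1"})
    except RuntimeError:
        return True
    return False


result = run_workflow(RUN_INSTANCE_WORKFLOW, {"instance_id": "instance-1"})
```

Because every chunk states what it needs from the context and fails loudly when that is missing, dependencies between steps that were previously hard to discover become visible in the workflow definition itself.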

Step 2

 * 1) Get feedback on prototype from people involved in making it + other interested parties.
 * 2) Get feedback from summit session.
 * 3) Get more feedback from email list + other interested parties.
 * 4) Form a group of interested folks that would like to help move forward the prototype's design principles & usage in other core projects.

Step 3

 * 1) Select single first target project to use new taskflow library.
 * 2) Incubate taskflow library using parts from prototype inside said project.
 * 3) Keep other active projects involved from the start (so that said library can be easily used there).
 * 4) Prove library in said single first target project by using it to do a key workflow.
 * 5) Adjust workflow in chunks in said target project to use said library.
 * 6) Adjust tests for each small chunk (depending on what it changes) and add new ones for new functionality.
 * 7) Submit chunks into http://review.openstack.org (disabling the whole component, or pieces of it, until ready to turn on?).

Step 4

 * 1) Move taskflow library (and associated tests) into oslo.

Step 5

 * 1) Pick another flow and refactor it in said first project and/or pick another interested project for said flow.
 * 2) Split this other flow into small chunks using said library.
 * 3) Adjust unit tests for each small chunk (depending on what it changes) and add new ones for new functionality.
 * 4) Submit chunks into http://review.openstack.org (disabling the whole component, or pieces of it, until ready to turn on?).
 * 5) Rinse & repeat.

Prototype
https://github.com/Yahoo/NovaOrc

More Details!
See: StructuredStateManagementDetails

See: StructuredWorkflowPrimitives

Get involved: StructuredWorkflows