Difference between revisions of "StructuredStateManagement"

Latest revision as of 20:16, 25 May 2013

Revised on: 5/25/2013 by Harlowja

Move away from ad-hoc states and state transitions for resource acquisition and modification to a well-defined, organized state management system. This new state management system is intended to provide greater stability, automatic recovery mechanisms and greater scalability than what currently exists (and more!).

Definitions

State: The particular condition that someone or something is in at a specific time: "the state of the instance request".
State transition: Altering a state by applying a function to that state (said function may take inputs and provide outputs), resulting in a new state.
Task: The application of a state transition on a given state.
Workflow: The sequence of administrative/other processes through which a piece of work passes from initiation to completion.
State management engine: An individual or entity that arranges or controls elements so as to achieve a desired overall effect.
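
The definitions above can be illustrated with a minimal sketch. It is not the prototype's or any library's API: `State`, `Task`, `Engine`, and the two example transitions are hypothetical names chosen for illustration, and a state is assumed to be a plain dictionary.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Assumption: a state is modelled as a plain mapping of attribute -> value.
State = Dict[str, str]


@dataclass
class Task:
    """The application of one state transition to a given state."""
    name: str
    transition: Callable[[State], State]

    def apply(self, state: State) -> State:
        # Work on a copy so the previous state is left untouched.
        return self.transition(dict(state))


@dataclass
class Engine:
    """Hypothetical state management engine: the only place transitions run."""
    history: List[str] = field(default_factory=list)

    def run(self, workflow: List[Task], state: State) -> State:
        # A workflow is the ordered sequence of tasks from start to completion.
        for task in workflow:
            state = task.apply(state)
            self.history.append(task.name)
        return state


def allocate_network(state: State) -> State:
    state["network"] = "allocated"
    return state


def boot(state: State) -> State:
    state["status"] = "active"
    return state


workflow = [Task("allocate_network", allocate_network), Task("boot", boot)]
engine = Engine()
final = engine.run(workflow, {"status": "building"})
```

Because every transition passes through `Engine.run`, `engine.history` doubles as a simple record of what was applied and in which order.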

What problems is this attempting to solve

Increasing the ['extendability', 'recoverability', 'reliability', 'stability'] of states and state transitions.
Making it easier to ['add', 'debug', 'review', 'test', 'understand', 'verify'] existing & new states and state transitions.
Removing hard-to-discover state & transition dependencies and interactions.
Ensuring state transitions are done ['correctly', 'reliably'] by isolating those transitions to an entity whose exclusive responsibility is to ['correctly', 'reliably'] perform said transitions.
Fixing a variety of problems that previously had piecemeal patches applied to solve them.
Eliminating the inherent fragility of the current ad-hoc workflows.
Allowing a cloud/software to be upgraded with in-flight actions without needing later manual cleanup of said actions.
Making it possible to audit & track the state transitions performed on a given resource in a unified manner.
- Note: notifications, logging and event reporting currently exist as separate mechanisms (which are not used in a uniform manner).
Removing the need for certain types of periodic tasks, which consume more and more resources as you grow your cluster larger.
Moving toward a path where actions automatically recover from cutting events (node failure, resource failure, network failure...).
Removing the need for periodic tasks to clean up garbage (orphaned instances/resources/tasks...) left behind.
Creating the foundation for a more reliable and automatic recovery process when errors do occur.
Encouraging the ['altering', 'extension'] of default workflows with more ['complex', 'custom', 'experimental'] workflows.
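
As a rough illustration of the recovery and audit points above, the following sketch records each completed step in a journal so that a workflow interrupted by a cutting event can be resumed without repeating finished work. Everything here is an assumption made for illustration: `NodeFailure`, `resume`, the step names, and the in-memory `journal` list (which stands in for some persistent store) do not come from the prototype.

```python
from typing import Callable, Dict, List, Tuple


class NodeFailure(Exception):
    """Hypothetical stand-in for a cutting event (node/resource/network failure)."""


Step = Tuple[str, Callable[[Dict[str, str]], None]]


def resume(steps: List[Step], state: Dict[str, str], journal: List[str]) -> None:
    """Run the steps in order, skipping any already recorded in the journal."""
    for name, action in steps:
        if name in journal:
            continue  # completed before the failure, do not repeat it
        action(state)
        journal.append(name)  # record only after the step has succeeded


# Simulate a single failure on the first attempt to boot.
failures_left = {"boot": 1}


def reserve(state: Dict[str, str]) -> None:
    state["quota"] = "reserved"


def boot(state: Dict[str, str]) -> None:
    if failures_left["boot"] > 0:
        failures_left["boot"] -= 1
        raise NodeFailure("compute node died")
    state["status"] = "active"


steps = [("reserve", reserve), ("boot", boot)]
state = {"status": "building"}
journal: List[str] = []

try:
    resume(steps, state, journal)
except NodeFailure:
    pass  # the journal still holds everything that completed

journal_after_failure = list(journal)
resume(steps, state, journal)  # a second attempt picks up at "boot"
```

The journal also serves as a single, uniform trail of the transitions performed on the resource, which is the kind of audit record the list above asks for.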

Fixing a known need

Blueprints

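Related papers

http://www.netdb.cis.upenn.edu/papers/tropic_tr.pdf
http://research.microsoft.com/pubs/64604/osr2007.pdf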

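Related wikis

https://wiki.openstack.org/wiki/Convection
https://wiki.openstack.org/wiki/NovaOrchestration/WorkflowEngines (old)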

Potential Requirements

https://etherpad.openstack.org/task-system

Summit Discussions

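Havana summit:

https://etherpad.openstack.org/the-future-of-orch
https://etherpad.openstack.org/Summit-Havana-Cinder-Safe-Shutdown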

Plan of record

Step 1

Create prototype

Create core workflow/task library and prototype using said library in nova for run_instance action.
Split this action (refactored) into small atomic task chunks (don't aim for perfection just yet, since it's a prototype).
Organize chunks into a workflow and test workflow.
Show working prototype at summit session (with associated docs...)
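
One way to picture the chunking described in Step 1 is sketched below. It assumes each chunk pairs an `execute` action with a `revert` action, so a failure part-way through undoes the chunks that already ran. The chunk names (`reserve_quota`, `allocate_network`, `spawn`), the `Chunk` type, and `run_flow` are hypothetical. They are not taken from nova's run_instance code or from the prototype.

```python
from typing import Callable, Dict, List, NamedTuple

Context = Dict[str, str]


class Chunk(NamedTuple):
    """One small atomic piece of a larger action, with its own undo."""
    name: str
    execute: Callable[[Context], None]
    revert: Callable[[Context], None]


def run_flow(chunks: List[Chunk], ctx: Context) -> bool:
    """Run chunks in order; on failure, revert completed chunks in reverse."""
    completed: List[Chunk] = []
    for chunk in chunks:
        try:
            chunk.execute(ctx)
        except Exception:
            for finished in reversed(completed):
                finished.revert(ctx)
            return False
        completed.append(chunk)
    return True


def spawn_fails(ctx: Context) -> None:
    # Simulated failure in the last chunk.
    raise RuntimeError("hypervisor unavailable")


chunks = [
    Chunk("reserve_quota",
          lambda ctx: ctx.update(quota="reserved"),
          lambda ctx: ctx.pop("quota", None)),
    Chunk("allocate_network",
          lambda ctx: ctx.update(network="allocated"),
          lambda ctx: ctx.pop("network", None)),
    Chunk("spawn", spawn_fails, lambda ctx: None),
]

ctx: Context = {}
ok = run_flow(chunks, ctx)  # fails at "spawn", earlier chunks are reverted

ok_partial_ctx: Context = {}
ok_partial = run_flow(chunks[:2], ok_partial_ctx)  # first two chunks succeed
```

Keeping each chunk small makes it possible to test its `execute` and `revert` in isolation, and to test the flow as a whole by injecting a failing chunk as done here.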

Step 2

Get feedback on prototype from people involved in making it + other interested parties.
Get feedback from summit session.
Get more feedback from email list + other interested parties.
Form a group of interested folks who would like to help move forward the prototype's design principles & usage in other core projects.

Step 3

Select single first target project to use new taskflow library.
Incubate taskflow library using parts from prototype inside said project.
1. Keep other active projects involved from the start (so that said library can be easily used there).
Prove library in said single first target project by using it to do a key workflow.
1. Adjust workflow in chunks in said target project to use said library.
Adjust tests for each small chunk (depending on what it changes) and add new ones for new functionality.
Submit chunks into http://review.openstack.org (disabling whole/pieces component until ready to turn on?).

Step 4

Move taskflow library (and associated tests) into oslo.

Step 5

Pick another flow and refactor it in said first project and/or pick another interested project for said flow.
Split this other flow into small chunks using said library.
Adjust unit tests for each small chunk (depending on what it changes) and add new ones for new functionality.
Submit chunks into http://review.openstack.org (disabling whole/pieces component until ready to turn on?).
Rinse & repeat.

Prototype

https://github.com/Yahoo/NovaOrc

Design

Workflow

More Details!

See: StructuredStateManagementDetails

See: StructuredWorkflowPrimitives

Get involved: StructuredWorkflows

@@ Line 1: / Line 1: @@
-== Structured state management   ==
+Drafter:  [[Harlowja]]
-Please note that this is a PROPOSAL ONLY. This is not yet 100% implemented.
+Revised on: {{REVISIONMONTH1}}/{{REVISIONDAY}}/{{REVISIONYEAR}} by {{REVISIONUSER}}
 === Rationale ===
-Move away from ad-hoc states and state transitions for resource acquisition and modification to a more concrete  structured & organized state management system. This new state management system will have advanced new & shiny features such as greater stability, automatic recovery mechanisms and greater scalability than what currently exists  (and more!).
+Move away from ad-hoc states and state transitions for resource acquisition and modification to a more well-defined organized state management system. This new state management system will have advanced new & shiny features such as greater stability, automatic recovery mechanisms and greater scalability than what currently exists  (and more!).
 === Definitions ===
@@ Line 17: / Line 17: @@
 ;Workflow
 : The sequence of administrative/other processes through which a piece of work passes from initiation to completion.
-;Orchestrator
+;State management engine
 : An individual or entity that arranges or control the elements of, as to achieve a desired overall effect.
-=== What problems does this solve in general ===
+=== What problems is this attempting to solve ===
-* Increases the ['extendability', 'recoverability', 'reliability', 'stability'] of states and state transitions.
+* Increasing the ['extendability', 'recoverability', 'reliability', 'stability'] of states and state transitions.
-* Makes it easier to [ 'add-new', 'debug', 'review', 'test', 'understand', 'verify'] existing & new states and state transitions.
+* Making it easier to [ 'add', 'debug', 'review', 'test', 'understand', 'verify'] existing & new states and state transitions.
-* Removes hard to discover state & transition dependencies and interactions.
+* Removing hard to discover state & transition dependencies and interactions.
-* Ensures state transitions are done ['correctly', 'reliably'] by isolating those transitions to a entity whose exclusive responsibility is to ['correctly', 'reliably'] perform those transitions.
+* Ensuring state transitions are done ['correctly', 'reliably'] by isolating those transitions to a entity whose exclusive responsibility is to ['correctly', 'reliably'] perform said transitions.
-* Fixes a variety of problems that previously had piecemeal like patches applied to solve them.
+* Fixing a variety of problems that previously had piecemeal like patches applied to solve them.
-* Eliminates the inherent ''fragility'' of the current ad-hoc workflows.
+* Eliminating the inherent ''fragility'' of the current ad-hoc workflows.
-** They are by there ad-hoc nature hard to debug, hard to verify/modify, hard to adjust, hard to understand (just hard in general)...
+* Allows for upgrading a cloud/software with inflight actions without needing later manual cleanup of said  actions.
-* Allows for upgrading a cloud with inflight actions without needing manual cleanup of those inflight actions.
 * Makes it possible to audit & track the state transitions performed on a given resource in a unified manner.
 ** '''Note:''' that there currently exists notifications, logging, event reporting as different mechanisms (which are not used in a uniform manner).
-* Removes the need for certain types of periodic tasks
+* Removes the need for certain types of periodic tasks which start to consume more and more resources as  you grow your cluster larger.
-** This by side-effect increases scale  since said periodic tasks start to consume more and more resources as  you get bigger.
+* Moves toward the path where actions will automatically recover on ''cutting'' events (node failure, resource failure, network failure..).
-* Addresses the underlying key point of http://www.slideshare.net/harlowja/nova-states-summit/9 where states will now be fully & automatically recovered from on ''cutting'' events (node failure, resource failure, network failure..).
+* Removes the need for periodic tasks to cleanup ''garbage'' (orphaned instances/resources/tasks...) left behind.
+* Creates the foundation for a more reliable and automatic recovery process when errors do occur.
+* Encourages the ['altering', 'extension'] of default workflows with a more ['complex', 'custom', 'experimental'] workflows.
-==== What problems does this solve in nova (on-top of the general ones) ====
+==== Fixing a known need ====
-* Removes the need for periodic tasks to cleanup ''garbage'' (orphaned instances/resources...) left by nova's ad-hoc states (which are currently ''repaired'' on a case-by-case basis, instead of by repairing the foundation).
-* Removes the usage of the overused ''set_instance_error_state'' function in nova (or at least decreases its usage).
-* Creates the foundation for a more reliable and automatic recovery process when errors do occur
-** This by side-effect increases how large people can grow openstack clusters (since with scale comes the potential for more errors).
-* Creates the path for ''smart'' resource scheduling by allowing the altering and/or replacement of the scheduling workflow with a more complex workflow.
-* Makes it possible to do ['live migration', 'resizing'] in a more secure and manageable manner.
-==== Issues that would likely not have happened with a better state management system ====
 * https://blueprints.launchpad.net/nova/+spec/compute-instance-cleanup-service
+* https://blueprints.launchpad.net/cinder/+spec/cinder-state-machine
 * https://bugs.launchpad.net/nova/+bug/1173408
 * https://bugs.launchpad.net/nova/+bug/1173413
@@ Line 62: / Line 55: @@
 * https://blueprints.launchpad.net/nova/+spec/structured-state-management
-** https://blueprints.launchpad.net/nova/+spec/structured-state-management-core-library
+* https://blueprints.launchpad.net/cinder/+spec/cinder-state-machine
-** https://blueprints.launchpad.net/nova/+spec/structured-state-management-run-instance-path
+=== Related papers ===
+* http://www.netdb.cis.upenn.edu/papers/tropic_tr.pdf
+* http://research.microsoft.com/pubs/64604/osr2007.pdf
 === Related wikis ===
 * https://wiki.openstack.org/wiki/Convection
-* https://wiki.openstack.org/wiki/NovaOrchestration/WorkflowEngines
+* https://wiki.openstack.org/wiki/NovaOrchestration/WorkflowEngines ('''old''')
 === Potential Requirements ===
@@ Line 76: / Line 73: @@
 === Summit Discussions ===
-Havana summit: https://etherpad.openstack.org/the-future-of-orch
+'''Havana summit:'''
+* https://etherpad.openstack.org/the-future-of-orch
+* https://etherpad.openstack.org/Summit-Havana-Cinder-Safe-Shutdown
 === Plan of record ===
@@ Line 85: / Line 84: @@
 # Create core workflow/task library and prototype using said library in nova for ''run_instance'' action.
-# Split this other action (refactored) into small atomic task chunks (don't aim for perfection just yet, since its a prototype).
+# Split this action (refactored) into small atomic task chunks (don't aim for perfection just yet, since its a prototype).
 # Organize chunks into a workflow and test workflow.
 # Show working prototype at summit session (with associated docs...)
@@ Line 91: / Line 90: @@
 ==== Step 2 ====
-# Get feedback  on prototype from people involved in making it.
+# Get feedback on prototype from people involved in making it + other interested parties.
 # Get feedback from summit session.
 # Get more feedback from email list + other interested parties.
+# Form group of interested folks that would like to help move forward the prototypes design principles & usage in other core projects.
 ==== Step 3 ====
-# Adjust nova prototype as needed from feedback.
+# Select single first target project to use new ''taskflow'' library.
-# Split nova prototype into small chunks.
+# Incubate ''taskflow'' library using parts from prototype inside said project.
+## Keep other active projects involved from the start (so that said library can be easily used there).
+# Prove library in said single first target project by using it to do a key workflow.
+## Adjust workflow in chunks in said target project to use said library.
 # Adjust tests for each small chunks (depending on what it changes) and add new ones for new functionality.
 # Submit chunks into http://review.openstack.org (disabling whole/pieces component until ready to turn on?).
@@ Line 104: / Line 107: @@
 ==== Step 4 ====
-# Pick another nova action and refactor it to use design from prototype.
+# Move ''taskflow'' library (and associated tests) into oslo.
-# Split this other action (refactored) into small atomic task chunks.
-# Organize chunks into a workflow and test workflow.
+==== Step 5 ====
+# Pick another flow and refactor it in said first project ''and/or'' pick another interested project for said flow.
+# Split this other flow into small chunks using said library.
 # Adjust unit tests for each small chunks (depending on what it changes) and add new ones for new functionality.
 # Submit chunks into http://review.openstack.org (disabling whole/pieces component until ready to turn on?).
 # Rinse & repeat.
 == Prototype ==
@@ Line 117: / Line 121: @@
 https://github.com/Yahoo/NovaOrc
-=== Prototype Design ===
+=== Design ===
 [[File:New-arch.png|thumbnail|center]]
-== Prototype Workflow ==
+===  Workflow ===
 [[File:Run_workflow.png|thumbnail|center]]
 == More  Details! ==
 See: [[StructuredStateManagementDetails]]
+See: [[StructuredWorkflowPrimitives]]
+Get involved: [[StructuredWorkflows]]