StructuredStateManagement
Latest revision as of 20:16, 25 May 2013
Drafter: Harlowja
Revised on: 5/25/2013 by Harlowja
Rationale
Move away from ad-hoc states and state transitions for resource acquisition and modification toward a well-defined, organized state management system. This new state management system is intended to provide greater stability, automatic recovery mechanisms and greater scalability than what currently exists (and more!).
Definitions
- State
- The particular condition that someone or something is in at a specific time: "the state of the instance request".
- State transition
- Altering a state by applying a function to that state (the function may take inputs and provide outputs), resulting in a new state.
- Task
- The application of a state transition on a given state.
- Workflow
- The sequence of administrative/other processes through which a piece of work passes from initiation to completion.
- State management engine
- An individual or entity that arranges or controls the elements of a workflow so as to achieve a desired overall effect.
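The definitions above can be read as a small data model. The following is a minimal sketch under stated assumptions, not the API of any existing library: the names `Task`, `Workflow`, `Engine`, `mark_building` and `mark_active` are all hypothetical and exist only to show how the terms relate to each other.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# A state is modelled here as a plain dict (an assumption for illustration).
State = Dict[str, object]
# A state transition is a function that takes a state and returns a new state.
Transition = Callable[[State], State]


@dataclass
class Task:
    """The application of one state transition to a given state."""
    name: str
    transition: Transition

    def apply(self, state: State) -> State:
        # Work on a copy so the previous state is left untouched.
        return self.transition(dict(state))


@dataclass
class Workflow:
    """An ordered sequence of tasks, from initiation to completion."""
    name: str
    tasks: List[Task] = field(default_factory=list)


class Engine:
    """Hypothetical state management engine: the one place transitions run."""

    def run(self, workflow: Workflow, state: State) -> State:
        for task in workflow.tasks:
            state = task.apply(state)
        return state


def mark_building(state: State) -> State:
    state["status"] = "building"
    return state


def mark_active(state: State) -> State:
    state["status"] = "active"
    return state


initial = {"status": "requested"}
flow = Workflow("run_instance", [Task("mark_building", mark_building),
                                 Task("mark_active", mark_active)])
final = Engine().run(flow, initial)
print(initial["status"], "->", final["status"])  # requested -> active
```

Keeping every transition behind a single `Engine.run` call is what makes it possible to later add logging, persistence or recovery in one place rather than in each caller.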
What problems is this attempting to solve
- Increasing the ['extendability', 'recoverability', 'reliability', 'stability'] of states and state transitions.
- Making it easier to ['add', 'debug', 'review', 'test', 'understand', 'verify'] existing & new states and state transitions.
- Removing hard-to-discover state & transition dependencies and interactions.
- Ensuring state transitions are done ['correctly', 'reliably'] by isolating those transitions to an entity whose exclusive responsibility is to ['correctly', 'reliably'] perform said transitions.
- Fixing a variety of problems that previously had piecemeal patches applied to solve them.
- Eliminating the inherent fragility of the current ad-hoc workflows.
- Allows for upgrading a cloud/software with in-flight actions without needing later manual cleanup of said actions.
- Makes it possible to audit & track the state transitions performed on a given resource in a unified manner.
  - Note: notifications, logging and event reporting currently exist as different mechanisms (which are not used in a uniform manner).
- Removes the need for certain types of periodic tasks which start to consume more and more resources as you grow your cluster larger.
- Moves toward the path where actions will automatically recover on cutting events (node failure, resource failure, network failure...).
- Removes the need for periodic tasks to clean up garbage (orphaned instances/resources/tasks...) left behind.
- Creates the foundation for a more reliable and automatic recovery process when errors do occur.
- Encourages the ['altering', 'extension'] of default workflows with more ['complex', 'custom', 'experimental'] workflows.
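Two of the goals listed above, a unified audit trail and a foundation for recovery when errors occur, can be illustrated together. This is a minimal sketch, assuming each step carries its own revert function and that a plain list serves as the journal; `run_with_revert`, the step names and the IP address are hypothetical and not taken from nova.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, object]
# Each step is (name, apply, revert); both callables mutate the state in place.
Step = Tuple[str, Callable[[State], None], Callable[[State], None]]


def run_with_revert(steps: List[Step], state: State,
                    journal: List[Tuple[str, str]]) -> bool:
    """Run steps in order; on failure, revert completed steps in reverse."""
    done: List[Tuple[str, Callable[[State], None]]] = []
    try:
        for name, apply, revert in steps:
            journal.append((name, "started"))
            apply(state)
            journal.append((name, "finished"))
            done.append((name, revert))
    except Exception as exc:
        # 'name' is the step that was running when the error happened.
        journal.append((name, "failed: %s" % exc))
        for prev_name, prev_revert in reversed(done):
            prev_revert(state)
            journal.append((prev_name, "reverted"))
        return False
    return True


def allocate_ip(state: State) -> None:
    state["ip"] = "10.0.0.5"


def release_ip(state: State) -> None:
    state.pop("ip", None)


def boot(state: State) -> None:
    raise RuntimeError("no valid host")


def noop(state: State) -> None:
    pass


state: State = {}
journal: List[Tuple[str, str]] = []
ok = run_with_revert([("allocate_ip", allocate_ip, release_ip),
                      ("boot", boot, noop)], state, journal)
print(ok, state)
for entry in journal:
    print(entry)
```

Because every transition passes through one function, the journal records each start, finish, failure and revert in the same format, and nothing allocated by an earlier step is left behind for a periodic task to collect.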
Fixing a known need
- https://blueprints.launchpad.net/nova/+spec/compute-instance-cleanup-service
- https://blueprints.launchpad.net/cinder/+spec/cinder-state-machine
- https://bugs.launchpad.net/nova/+bug/1173408
- https://bugs.launchpad.net/nova/+bug/1173413
- https://bugs.launchpad.net/nova/+bug/1173417
- https://bugs.launchpad.net/nova/+bug/1173420
- https://bugs.launchpad.net/nova/+bug/1050979
- https://bugs.launchpad.net/nova/+bug/1061024
- https://bugs.launchpad.net/nova/+bug/1082414
- https://bugs.launchpad.net/nova/+bug/1173429
- https://bugs.launchpad.net/nova/+bug/1173430
- (and more)
Blueprints
- https://blueprints.launchpad.net/nova/+spec/structured-state-management
- https://blueprints.launchpad.net/cinder/+spec/cinder-state-machine
Related papers
- http://www.netdb.cis.upenn.edu/papers/tropic_tr.pdf
- http://research.microsoft.com/pubs/64604/osr2007.pdf
Related wikis
- https://wiki.openstack.org/wiki/Convection
- https://wiki.openstack.org/wiki/NovaOrchestration/WorkflowEngines (old)
Potential Requirements
https://etherpad.openstack.org/task-system
Summit Discussions
Havana summit:
- https://etherpad.openstack.org/the-future-of-orch
- https://etherpad.openstack.org/Summit-Havana-Cinder-Safe-Shutdown
Plan of record
Step 1
Create prototype
- Create core workflow/task library and prototype using said library in nova for run_instance action.
- Split this action (refactored) into small atomic task chunks (don't aim for perfection just yet, since it's a prototype).
- Organize chunks into a workflow and test workflow.
- Show working prototype at summit session (with associated docs...)
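One possible way to organize chunks into a workflow is to have each chunk declare what it requires and what it provides, and derive the order from those declarations. This is a sketch under assumptions: the chunk names and their inputs/outputs are invented for illustration and do not reflect the actual run_instance code, and dependency-based ordering is only one option, not something this plan prescribes. It uses `graphlib` from the standard library (Python 3.9+).

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Dict, List, Set, Tuple

# Hypothetical chunks: name -> (names it requires, names it provides).
CHUNKS: Dict[str, Tuple[Set[str], Set[str]]] = {
    "spawn_instance": ({"host", "network_info"}, {"instance"}),
    "allocate_network": ({"host"}, {"network_info"}),
    "schedule_host": ({"reservation"}, {"host"}),
    "reserve_quota": (set(), {"reservation"}),
}


def order_chunks(chunks: Dict[str, Tuple[Set[str], Set[str]]]) -> List[str]:
    """Return chunk names in an order that satisfies every 'requires'."""
    # Map each provided item to the chunk that produces it.
    providers: Dict[str, str] = {}
    for name, (_, provides) in chunks.items():
        for item in provides:
            providers[item] = name

    # Build chunk -> set of chunks it depends on.
    graph: Dict[str, Set[str]] = {}
    for name, (requires, _) in chunks.items():
        missing = requires - set(providers)
        if missing:
            raise ValueError("%s requires unprovided %s" % (name, sorted(missing)))
        graph[name] = {providers[item] for item in requires}

    # static_order() yields dependencies before the chunks that need them.
    return list(TopologicalSorter(graph).static_order())


order = order_chunks(CHUNKS)
print(order)
```

Declaring dependencies explicitly makes them reviewable in one place, and a chunk that needs something nobody provides fails at ordering time rather than halfway through a run.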
Step 2
- Get feedback on prototype from people involved in making it + other interested parties.
- Get feedback from summit session.
- Get more feedback from email list + other interested parties.
- Form a group of interested folks that would like to help move forward the prototype's design principles & usage in other core projects.
Step 3
- Select single first target project to use new taskflow library.
- Incubate taskflow library using parts from prototype inside said project.
  - Keep other active projects involved from the start (so that said library can be easily used there).
- Prove library in said single first target project by using it to do a key workflow.
  - Adjust workflow in chunks in said target project to use said library.
- Adjust tests for each small chunk (depending on what it changes) and add new ones for new functionality.
- Submit chunks into http://review.openstack.org (disabling whole/pieces component until ready to turn on?).
Step 4
- Move taskflow library (and associated tests) into oslo.
Step 5
- Pick another flow and refactor it in said first project and/or pick another interested project for said flow.
- Split this other flow into small chunks using said library.
- Adjust unit tests for each small chunk (depending on what it changes) and add new ones for new functionality.
- Submit chunks into http://review.openstack.org (disabling whole/pieces component until ready to turn on?).
- Rinse & repeat.
Prototype
https://github.com/Yahoo/NovaOrc
Design
Workflow
More Details!
See: StructuredStateManagementDetails
See: StructuredWorkflowPrimitives
Get involved: StructuredWorkflows