Revision as of 23:37, 26 April 2013

Structured state management

Please note that this is a PROPOSAL ONLY. This is not yet 100% implemented.

Goal: Move away from ad-hoc states and state transitions for resource acquisition and modification to a more concrete, structured & organized state management system in nova. This new state management system will have advanced new & shiny features such as greater stability, automatic recovery mechanisms and greater scalability than what currently exists in nova (and more!).

Definitions

State
The particular condition that someone or something is in at a specific time: "the state of the instance request".
State transition
Altering a state by applying a function on top of that state (the function may take inputs and provide outputs), resulting in a new state.
Task
The application of a state transition on a given state.
Workflow
The sequence of administrative/other processes through which a piece of work passes from initiation to completion.
Orchestrator
An individual or entity that arranges or controls the elements of something so as to achieve a desired overall effect.
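
The definitions above can be read as a small set of types. The sketch below is one hypothetical way to express them in Python. None of the names (`Task`, `Workflow`, `Orchestrator`, `reserve`, `power_on`) come from nova or from the prototype; they are assumptions made purely for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# A state is the condition of something at a point in time; modelling it
# as a plain dictionary is an assumption made for this sketch.
State = Dict[str, object]

# A state transition is a function that takes a state and returns a new one.
Transition = Callable[[State], State]


@dataclass
class Task:
    """The application of one state transition to a given state."""
    name: str
    transition: Transition

    def apply(self, state: State) -> State:
        # Work on a copy so the previous state is never mutated in place.
        return self.transition(dict(state))


@dataclass
class Workflow:
    """An ordered sequence of tasks, from initiation to completion."""
    name: str
    tasks: List[Task] = field(default_factory=list)


class Orchestrator:
    """The single entity that arranges and runs the tasks of a workflow."""

    def run(self, workflow: Workflow, state: State) -> State:
        for task in workflow.tasks:
            state = task.apply(state)
        return state


# Two hypothetical transitions, used only to exercise the types above.
def reserve(state: State) -> State:
    state["status"] = "reserved"
    return state


def power_on(state: State) -> State:
    state["status"] = "active"
    return state


boot = Workflow("boot", [Task("reserve", reserve), Task("power_on", power_on)])
initial = {"instance": "i-1", "status": "requested"}
final = Orchestrator().run(boot, initial)
```

Because every task works on a copy, `initial` is left untouched while `final` carries the result of the last transition.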

What problems does this solve in general

  • Increases the [stability, extensibility, reliability, recoverability] of states and state transitions in nova.
  • Makes it easier to [debug, test, understand, verify, review] states and state transitions in nova.
  • Removes hard-to-discover state+transition dependencies and interactions.
  • Ensures state transitions are done reliably and correctly by isolating those transitions to a single place/entity.
  • Fixes a variety of problems that previously had piecemeal patches applied in an attempt to solve them.
  • Eliminates the inherent fragility of the current ad-hoc workflows that exist in nova.
    • They are by their ad-hoc nature hard to debug, hard to verify, hard to adjust, hard to understand (just hard in general)...
  • Makes it possible to audit & track the state transitions performed on a given resource in a unified manner (note that there currently exist notifications, logging, event reporting...).
  • Removes the need for certain types of periodic tasks and the overused set_instance_error_state in nova.
  • Addresses the underlying key point of http://www.slideshare.net/harlowja/nova-states-summit/9: states will now be fully & automatically recovered after cutting events (node failure, resource failure, network failure...).
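
To make "isolating those transitions to a single place/entity" and "audit & track" a bit more concrete, here is a minimal sketch, assuming a hand-written table of allowed transitions. The state names, the `ALLOWED` table and the `StateManager` class are all hypothetical and are not nova's actual states or APIs.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical state names and allowed transitions; not nova's real ones.
ALLOWED: Dict[str, Set[str]] = {
    "requested": {"building", "error"},
    "building": {"active", "error"},
    "active": {"deleted", "error"},
    "error": {"deleted"},
    "deleted": set(),
}


class InvalidTransition(Exception):
    """Raised when a requested transition is not listed in ALLOWED."""


class StateManager:
    """The one place where a resource's state is allowed to change."""

    def __init__(self) -> None:
        self._states: Dict[str, str] = {}
        # Audit trail of (resource, old state, new state) tuples.
        self.history: List[Tuple[str, str, str]] = []

    def register(self, resource: str, state: str = "requested") -> None:
        self._states[resource] = state

    def state_of(self, resource: str) -> str:
        return self._states[resource]

    def transition(self, resource: str, new_state: str) -> None:
        old_state = self._states[resource]
        if new_state not in ALLOWED[old_state]:
            raise InvalidTransition(f"{resource}: {old_state} -> {new_state}")
        self._states[resource] = new_state
        self.history.append((resource, old_state, new_state))


manager = StateManager()
manager.register("instance-1")
manager.transition("instance-1", "building")
manager.transition("instance-1", "active")

try:
    manager.transition("instance-1", "building")  # not allowed from "active"
    rejected = False
except InvalidTransition:
    rejected = True
```

Since every change goes through `transition`, invalid moves are rejected in one spot and `history` doubles as a unified audit trail.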

What problems does this solve in nova (on-top of the general ones)

  • Removes the need for periodic tasks to clean up garbage (orphaned instances, orphaned resources...) left by nova's ad-hoc states (which are currently repaired on a case-by-case basis, instead of by repairing the foundation).
    • Note: This makes it possible to have a more reliable and automatic cleanup process, which will as a side effect increase how large people can grow OpenStack clusters (since the larger you get, the more often you will hit said errors).
  • Creates the path for smart resource scheduling by allowing the altering and/or replacement of the scheduling workflow with a more complex workflow.
  • Makes it possible to do [resizing, live migration] in a more secure and manageable manner.
  • Makes it possible for nova to have multi-stage booting where an instance and its dependent resources are first reserved, the resources configured, the instance configured, and then finally the instance is powered-on (thus completing the instance provisioning process).
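
The multi-stage boot and the automatic cleanup of orphaned resources can be sketched together: each stage carries its own undo, and a failure rolls back whatever already completed. This is only an illustration under stated assumptions; the step names, the `run_with_revert` helper and the simulated failure are hypothetical and do not reflect how nova or the prototype implements booting.

```python
from typing import Callable, List, Tuple

# Each step is (name, apply, revert); both callables take the shared context.
Step = Tuple[str, Callable[[dict], None], Callable[[dict], None]]


def run_with_revert(steps: List[Step], context: dict) -> bool:
    """Run steps in order; on failure undo the completed ones in reverse."""
    done: List[Step] = []
    for step in steps:
        name, apply, _revert = step
        try:
            apply(context)
            done.append(step)
        except Exception as exc:
            context["failed_at"] = name
            context["error"] = str(exc)
            for _name, _apply, revert in reversed(done):
                revert(context)
            return False
    return True


def reserve(ctx: dict) -> None:
    ctx["resources"] = ["volume", "port"]


def release(ctx: dict) -> None:
    ctx["resources"] = []


def configure_resources(ctx: dict) -> None:
    ctx["resources_ready"] = True


def unconfigure_resources(ctx: dict) -> None:
    ctx["resources_ready"] = False


def configure_instance(ctx: dict) -> None:
    # Simulated failure, standing in for e.g. a node or network failure.
    raise RuntimeError("hypervisor unreachable")


def unconfigure_instance(ctx: dict) -> None:
    ctx["instance_ready"] = False


def power_on(ctx: dict) -> None:
    ctx["powered_on"] = True


def power_off(ctx: dict) -> None:
    ctx["powered_on"] = False


boot_steps: List[Step] = [
    ("reserve", reserve, release),
    ("configure_resources", configure_resources, unconfigure_resources),
    ("configure_instance", configure_instance, unconfigure_instance),
    ("power_on", power_on, power_off),
]

context: dict = {}
ok = run_with_revert(boot_steps, context)
```

In this run the third stage fails, so the reservation and resource configuration are undone immediately and nothing is left behind for a periodic task to find later.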

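Issues that would likely not have happened with a better state management system

  • https://blueprints.launchpad.net/nova/+spec/compute-instance-cleanup-service
  • https://bugs.launchpad.net/nova/+bug/1173408
  • https://bugs.launchpad.net/nova/+bug/1173413
  • https://bugs.launchpad.net/nova/+bug/1173417
  • https://bugs.launchpad.net/nova/+bug/1173420
  • https://bugs.launchpad.net/nova/+bug/1050979
  • https://bugs.launchpad.net/nova/+bug/1061024
  • https://bugs.launchpad.net/nova/+bug/1082414
  • (and more)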

Blueprints

Related wikis

https://wiki.openstack.org/wiki/Convection

Potential Requirements

https://etherpad.openstack.org/task-system

Summit Discussions

Havana summit: https://etherpad.openstack.org/the-future-of-orch

Plan of record

Step 1

Create prototype

  1. Create a core workflow/task library and prototype using said library in nova for the run_instance action.
  2. Split this action (refactored) into small atomic task chunks (don't aim for perfection just yet, since it's a prototype).
  3. Organize chunks into a workflow and test the workflow.
  4. Adjust unit tests for each small chunk (depending on what it changes) and add new ones.
  5. Show working prototype at summit session (with associated docs...).

Step 2

  1. Get feedback on prototype from people involved in making it.
  2. Get feedback from summit session.
  3. Get more feedback from email list + other interested parties.

Step 3

  1. Adjust nova prototype as needed from feedback.
  2. Split nova prototype into small chunks.
  3. Adjust tests for each small chunk (depending on what it changes) and add new ones for new functionality.
  4. Submit chunks into http://review.openstack.org (disabling the whole component, or pieces of it, until ready to turn on?).
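
One possible reading of "disabling ... until ready to turn on" in item 4 is a switch that keeps the existing code path as the default while the new chunks land. The sketch below assumes a simple boolean flag; `use_workflow` and both stand-in functions are hypothetical, and nova's real configuration mechanism is not shown.

```python
def run_instance_legacy(request: dict) -> str:
    # Stand-in for the existing ad-hoc code path.
    return "legacy:" + request["name"]


def run_instance_workflow(request: dict) -> str:
    # Stand-in for the new task/workflow based code path.
    return "workflow:" + request["name"]


def run_instance(request: dict, use_workflow: bool = False) -> str:
    """Dispatch to the new path only when it has been switched on."""
    if use_workflow:
        return run_instance_workflow(request)
    return run_instance_legacy(request)


default_result = run_instance({"name": "vm-1"})
enabled_result = run_instance({"name": "vm-1"}, use_workflow=True)
```

With the flag off by default, merged chunks stay dormant until the whole workflow is ready to be turned on.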

Step 4

  1. Pick another nova action and refactor it to use design from prototype.
  2. Split this other action (refactored) into small atomic task chunks.
  3. Organize chunks into a workflow and test workflow.
  4. Adjust unit tests for each small chunk (depending on what it changes) and add new ones for new functionality.
  5. Submit chunks into http://review.openstack.org (disabling the whole component, or pieces of it, until ready to turn on?).
  6. Rinse & repeat.

Prototype

https://github.com/Yahoo/NovaOrc

Design

New-arch.png

Details

See: StructuredStateManagementDetails