Revision as of 02:27, 27 April 2013
Structured state management
Please note that this is a PROPOSAL ONLY. This is not yet 100% implemented.
Rationale
Move away from ad-hoc states and state transitions for resource acquisition and modification, towards a structured & organized state management system. Compared to what currently exists, this new system is intended to provide greater stability, automatic recovery mechanisms and greater scalability (and more!).
Definitions
- State
- The particular condition that someone or something is in at a specific time: "the state of the instance request".
- State transition
- Altering a state by applying a function to that state (the function may take inputs and provide outputs), resulting in a new state.
- Task
- The application of a state transition on a given state.
- Workflow
- The sequence of administrative/other processes through which a piece of work passes from initiation to completion.
- Orchestrator
- An individual or entity that arranges or controls the elements of something so as to achieve a desired overall effect.
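The definitions above can be sketched in code. This is a minimal illustration only and not part of the proposal or of nova: the class names (Task, Workflow, Orchestrator), the transition functions and the "status" key are all hypothetical, chosen to mirror the terms defined above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# A state: the condition of a resource at a specific time (hypothetical shape).
State = Dict[str, object]


@dataclass
class Task:
    """The application of one state transition to a given state."""
    name: str
    transition: Callable[[State], State]

    def run(self, state: State) -> State:
        # Work on a copy so the transition never mutates the caller's state.
        return self.transition(dict(state))


@dataclass
class Workflow:
    """An ordered sequence of tasks, from initiation to completion."""
    name: str
    tasks: List[Task]


class Orchestrator:
    """The one entity whose responsibility is to perform the transitions."""

    def run(self, workflow: Workflow, state: State) -> State:
        for task in workflow.tasks:
            state = task.run(state)
        return state


def mark_building(state: State) -> State:
    state["status"] = "building"
    return state


def mark_active(state: State) -> State:
    state["status"] = "active"
    return state


workflow = Workflow("boot", [Task("build", mark_building),
                             Task("activate", mark_active)])
initial = {"status": "requested"}
final = Orchestrator().run(workflow, initial)
print(final["status"])  # prints "active"
```

Keeping every transition behind the orchestrator is what makes the later points (auditing, recovery) possible, since there is a single place through which all state changes pass.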
What problems does this solve in general
- Increases the ['extendability', 'recoverability', 'reliability', 'stability'] of states and state transitions.
- Makes it easier to ['add', 'debug', 'review', 'test', 'understand', 'verify'] existing & new states and state transitions.
- Removes hard to discover state & transition dependencies and interactions.
- Ensures state transitions are done ['correctly', 'reliably'] by isolating those transitions to an entity whose exclusive responsibility is to ['correctly', 'reliably'] perform those transitions.
- Fixes a variety of problems that previously had piecemeal patches applied to solve them.
- Eliminates the inherent fragility of the current ad-hoc workflows.
- They are by their ad-hoc nature hard to debug, hard to verify/modify, hard to adjust, hard to understand (just hard in general)...
- Allows for upgrading a cloud with inflight actions without needing manual cleanup of those inflight actions.
- Makes it possible to audit & track the state transitions performed on a given resource in a unified manner.
- Note: notifications, logging and event reporting currently exist as separate mechanisms (which are not used in a uniform manner).
- Removes the need for certain types of periodic tasks (as a side-effect this improves scalability, since said periodic tasks consume more and more resources as you get bigger).
- Addresses the underlying key point of http://www.slideshare.net/harlowja/nova-states-summit/9 where states will now be fully & automatically recovered after cutting events (node failure, resource failure, network failure...).
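The auditing and recovery points above can be illustrated with a small sketch. It assumes a journal that records every completed transition; here the journal is a plain in-memory list standing in for durable storage, and the class JournalingOrchestrator, the step names and the simulated failure are all hypothetical rather than anything that exists in nova.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, object]
Step = Tuple[str, Callable[[State], State]]


class JournalingOrchestrator:
    """Records each transition so that work can be audited and resumed."""

    def __init__(self, journal: List[dict]):
        # Assumption: in a real system this would be durable storage.
        self.journal = journal

    def run(self, resource_id: str, steps: List[Step], state: State) -> State:
        entries = [e for e in self.journal if e["resource"] == resource_id]
        done = {e["task"] for e in entries}
        if entries:
            # Resume from the last recorded state instead of starting over.
            state = dict(entries[-1]["after"])
        for name, transition in steps:
            if name in done:
                continue  # already performed before the interruption
            before = dict(state)
            state = transition(dict(state))
            self.journal.append({"resource": resource_id, "task": name,
                                 "before": before, "after": dict(state)})
        return state


attempts = {"count": 0}


def reserve(state: State) -> State:
    state["reserved"] = True
    return state


def flaky_boot(state: State) -> State:
    attempts["count"] += 1
    if attempts["count"] == 1:
        raise RuntimeError("simulated node failure")
    state["status"] = "active"
    return state


steps = [("reserve", reserve), ("boot", flaky_boot)]
journal: List[dict] = []
try:
    JournalingOrchestrator(journal).run("instance-1", steps,
                                        {"status": "requested"})
except RuntimeError:
    pass  # the first orchestrator "died" midway through the workflow

# A fresh orchestrator picks up where the journal left off.
recovered = JournalingOrchestrator(journal).run("instance-1", steps,
                                                {"status": "requested"})
print([entry["task"] for entry in journal])  # ['reserve', 'boot']
```

The same journal serves as a unified audit trail: every entry names the resource, the task, and the state before and after the transition.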
What problems does this solve in nova (on-top of the general ones)
- Removes the need for periodic tasks to clean up garbage (orphaned instances/resources...) left by nova's ad-hoc states (which are currently repaired on a case-by-case basis, instead of by repairing the foundation).
- Removes the usage of the overused set_instance_error_state function in nova (or at least decreases its usage).
- Creates the foundation for a more reliable and automatic recovery process when errors do occur.
- As a side-effect, this will increase how large people can grow OpenStack clusters (since with scale comes the potential for more errors).
- Creates the path for smart resource scheduling by allowing the altering and/or replacement of the scheduling workflow with a more complex workflow.
- Makes it possible to do ['live migration', 'resizing'] in a more secure and manageable manner.
Issues that would likely not have happened with a better state management system
- https://blueprints.launchpad.net/nova/+spec/compute-instance-cleanup-service
- https://bugs.launchpad.net/nova/+bug/1173408
- https://bugs.launchpad.net/nova/+bug/1173413
- https://bugs.launchpad.net/nova/+bug/1173417
- https://bugs.launchpad.net/nova/+bug/1173420
- https://bugs.launchpad.net/nova/+bug/1050979
- https://bugs.launchpad.net/nova/+bug/1061024
- https://bugs.launchpad.net/nova/+bug/1082414
- https://bugs.launchpad.net/nova/+bug/1173429
- https://bugs.launchpad.net/nova/+bug/1173430
- (and more)
Blueprints
Related wikis
https://wiki.openstack.org/wiki/Convection
Potential Requirements
https://etherpad.openstack.org/task-system
Summit Discussions
Havana summit: https://etherpad.openstack.org/the-future-of-orch
Plan of record
Step 1
Create prototype
- Create core workflow/task library and prototype using said library in nova for run_instance action.
- Split the run_instance action (refactored) into small atomic task chunks (don't aim for perfection just yet, since it's a prototype).
- Organize chunks into a workflow and test workflow.
- Show working prototype at summit session (with associated docs...)
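As a rough picture of what "small atomic task chunks organized into a workflow" could look like, here is a sketch in which each chunk pairs an apply step with a revert step, and a failure undoes the completed chunks in reverse order. The task names (allocate_network, create_volume, spawn) and the AtomicTask and run_workflow helpers are assumptions made for illustration; they are not taken from nova's run_instance code or from the prototype.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

State = Dict[str, object]


@dataclass(frozen=True)
class AtomicTask:
    """One small chunk of an action, with a way to undo it."""
    name: str
    apply: Callable[[State], None]
    revert: Callable[[State], None]


def run_workflow(tasks: List[AtomicTask], state: State) -> bool:
    """Run tasks in order; on failure, revert completed tasks in reverse."""
    completed: List[AtomicTask] = []
    for task in tasks:
        try:
            task.apply(state)
        except Exception:
            for done in reversed(completed):
                done.revert(state)
            return False
        completed.append(task)
    return True


# Hypothetical chunks of a run_instance-like action.
def allocate_network(state: State) -> None:
    state["network"] = "allocated"


def release_network(state: State) -> None:
    state.pop("network", None)


def create_volume(state: State) -> None:
    state["volume"] = "created"


def delete_volume(state: State) -> None:
    state.pop("volume", None)


def spawn(state: State) -> None:
    raise RuntimeError("simulated hypervisor failure")


def destroy(state: State) -> None:
    state.pop("vm", None)


tasks = [
    AtomicTask("allocate_network", allocate_network, release_network),
    AtomicTask("create_volume", create_volume, delete_volume),
    AtomicTask("spawn", spawn, destroy),
]

state: State = {"instance": "instance-1"}
ok = run_workflow(tasks, state)
print(ok, state)  # False {'instance': 'instance-1'}
```

Because the failed spawn triggers the reverts, no orphaned network or volume is left behind, which is the kind of garbage the periodic cleanup tasks currently have to collect.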
Step 2
- Get feedback on prototype from people involved in making it.
- Get feedback from summit session.
- Get more feedback from email list + other interested parties.
Step 3
- Adjust nova prototype as needed from feedback.
- Split nova prototype into small chunks.
- Adjust tests for each small chunk (depending on what it changes) and add new ones for new functionality.
- Submit chunks to http://review.openstack.org (disabling the whole component, or pieces of it, until ready to turn on?).
Step 4
- Pick another nova action and refactor it to use design from prototype.
- Split this other action (refactored) into small atomic task chunks.
- Organize chunks into a workflow and test workflow.
- Adjust unit tests for each small chunk (depending on what it changes) and add new ones for new functionality.
- Submit chunks to http://review.openstack.org (disabling the whole component, or pieces of it, until ready to turn on?).
- Rinse & repeat.
Prototype
https://github.com/Yahoo/NovaOrc