Difference between revisions of "StructuredStateManagement"

Revision as of 01:04, 25 April 2013

Summary

Move nova away from ad-hoc states and state transitions for resource acquisition and modification, toward a more concrete, organized, and structured state management system.

What problems does this solve in general

Increases the [stability, extensibility, reliability] of the various OpenStack projects.
Makes it easier to [debug, test, understand, verify, review] the projects which have a workflow-like concept.
Replaces hard-to-discover state-transition dependencies and interactions with clearly defined ones.
Ensures state transitions are done reliably and correctly by isolating those transitions to a single place/entity.
Fixes a variety of problems that were previously addressed with piecemeal patches (which avoided fixing the larger problem).
Eliminates the inherent fragility of the current ad-hoc workflows that exist in the OpenStack projects.
- By their ad-hoc nature they are hard to debug, hard to verify, hard to adjust, and hard to understand (just hard in general)...
Makes it possible to audit & track the state transitions performed on a given resource.
- This kind of functionality has started to appear in nova, but the ad-hoc nature was preserved :-(
  - See: https://github.com/openstack/nova/blob/stable/grizzly/nova/compute/utils.py#L305
Addresses the underlying key point of http://www.slideshare.net/harlowja/nova-states-summit/9 where states will now be fully recovered from on cutting.

What problems does this solve in nova (+ the general ones)

Removes the need for periodic tasks to clean up garbage (orphaned instances, orphaned resources...) left by nova's ad-hoc states.
Creates the path for smart resource scheduling.
Makes it possible to do [resizing, live migration] in a more secure and manageable manner.
- Discussions about how this can be done correctly indicate that an intermediary is required to orchestrate this ownership transfer.
Makes it possible for nova to have multi-stage booting, where an instance and its dependent resources are first reserved, the resources are configured, the instance is configured, and finally the instance is powered on (thus completing the instance provisioning process).

Issues that would likely not have happened with a better state management system

Plan of record

Step 1

Create prototype
1. Create a library and use it in nova for the run_instance action.

Step 2

Get feedback on prototype.
1. Get feedback from summit session.
2. Get more feedback from email list + other interested parties.

Step 3

Adjust nova prototype as needed from feedback.
Split nova prototype into small chunks.
Adjust tests for each small chunk (depending on what it changes).
Submit chunks into http://review.openstack.org (disabling whole/pieces component until ready to turn on?).

Step 4

Pick another nova action and refactor it to use the design from the prototype.
Split this other (refactored) action into small chunks.
Adjust tests for each small chunk (depending on what it changes).
Submit chunks into http://review.openstack.org (disabling whole/pieces component until ready to turn on?).
Repeat

Prototype

https://github.com/Yahoo/NovaOrc

Contents

Summary

What problems does this solve in general

What problems does this solve in nova (+ the general ones)

Issues that would likely not have happened with a better state management system

Connected blueprints

Connected wikis

Requirements

Discussions

Plan of record

Step 1

Step 2

Step 3

Step 4

Prototype

Design

@@ Line 86: / Line 86: @@
 == Design ==
 [[File:New-arch.png|thumbnail|center]]
-=== Design details ===
-In order to implement this new orchestration layer, the following key concepts must be built into the design from the start.
-# An '''atomic''' task abstraction.
-# Combining '''atomic''' tasks so that they can be organized into a unit, run, and on failure reconciled via rollbacks.
-## To start, a linear unit is fine.
-# Task resumption.
-# Task rollback.
-# Task tracking.
-# Resource locking.
-# Workflow sharding/ownership.
-# Simplicity (allowing for extension and verifiability).
-# Tolerant to upgrades.
-==== Atomic task abstraction ====
-===== Why it matters =====
-Tasks that are created (either via code or other operation) must be atomic, so that each task as a unit can be said to have either completed or failed. This allows the task to be rolled back as a unit. It is also useful to be able to accurately track exactly which tasks have been applied to a given workflow, which is needed for correct status tracking (and is directly tied to how resumption is done).
-===== How it will be addressed =====
-Tasks which previously were very unorganized in the ''run_instance'' path of nova will need to be refactored into clearly defined tasks (likely with an ''apply()'' method and a ''rollback()'' method). Where possible, these tasks will be split up so that each one performs a single clear piece of work in an atomic manner (i.e. not one big task that does many different things). This will also make each task easier to test, since it will have a clear set of inputs and a clear set of expected outputs/side-effects, which the ''rollback()'' method should undo.
-'''For example''', this could be a task/state base class:
- <nowiki>
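-import abc
-
-# NOTE: 'base' is assumed to be imported from nova (it provides Base).
-class State(base.Base):
-    __metaclass__ = abc.ABCMeta
-    def __init__(self):
-        super(State, self).__init__()
-    def __str__(self):
-        return "State: %s" % (self.__class__.__name__)
-    @abc.abstractmethod
-    def apply(self, context, *args, **kwargs):
-        # Perform this state's single unit of work and return its result.
-        raise NotImplementedError()
-    def revert(self, context, result, chain, excp, cause):
-        # Undo whatever apply() did; by default there is nothing to undo.
-        pass</nowiki>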
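As a rough illustration of what one such atomic task might look like, the sketch below pairs an ''apply()'' with a ''revert()'' that undoes exactly what ''apply()'' did. Everything in it is an assumption made for illustration: the ''State'' class is a simplified stand-in with no nova dependency, and ''ReserveAddress'' and its in-memory pool are hypothetical, not part of nova or the prototype.

```python
import abc


class State(abc.ABC):
    """Simplified stand-in for the base class (no nova dependency)."""

    @abc.abstractmethod
    def apply(self, context, *args, **kwargs):
        raise NotImplementedError()

    def revert(self, context, result, chain, excp, cause):
        pass


class ReserveAddress(State):
    """Hypothetical task: reserve one address from an in-memory pool."""

    def __init__(self, pool):
        self.pool = pool

    def apply(self, context, *args, **kwargs):
        # One clearly scoped piece of work: take a single free address.
        address = self.pool.pop()
        return {"address": address}

    def revert(self, context, result, chain, excp, cause):
        # Undo exactly what apply() did, using only its recorded result.
        self.pool.append(result["address"])


pool = ["10.0.0.2", "10.0.0.1"]
task = ReserveAddress(pool)
result = task.apply({})
pool_after_apply = list(pool)
task.revert({}, result, chain=None, excp=None, cause=None)
print(result, pool_after_apply, pool)
```

Because the task has one input (the pool) and one side-effect (removing an address), a test only needs to check the returned result and that ''revert()'' restores the pool.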
-==== Combining '''atomic''' tasks into a workflow ====
-===== Why it matters =====
-===== How it will be addressed =====
- <nowiki>
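-import copy
-from collections import OrderedDict
-
-# NOTE: excutils, excp, LOG, _ and the STARTING/COMPLETED/ERRORED constants
-# are not defined here; they are assumed to come from the surrounding nova code.
-class StateChain(object):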
-    def __init__(self, name, tolerant=False, parents=None):
-        self.reversions = []
-        self.name = name
-        self.tolerant = tolerant
-        self.states = OrderedDict()
-        self.results = OrderedDict()
-        self.parents = parents
-        self.result_fetcher = None
-        self.change_tracker = None
-        self.listeners = []
-    def __setitem__(self, name, performer):
-        self.states[name] = performer
-    def __getitem__(self, name):
-        return self.results[name]
-    def run(self, context, *args, **kwargs):
-        for (name, performer) in self.states.items():
-            try:
-                self._on_state_start(context, performer, name)
-                # See if we have already run this... (resumption!)
-                result = None
-                if self.result_fetcher:
-                    result = self.result_fetcher(context, name, self)
-                if result is None:
-                    result = performer.apply(context, *args, **kwargs)
-                # Keep a pristine copy of the result in the results table
-                # so that if said result is altered by other further states
-                # the one here will not be.
-                #
-                # Note: Python passes objects by reference, so someone else could
-                # modify this result, which would be bad if we need to roll back
-                # and a result we created was changed by someone else...
-                self.results[name] = copy.deepcopy(result)
-                self._on_state_finish(context, performer, name, result)
-            except Exception as ex:
-                with excutils.save_and_reraise_exception():
-                    try:
-                        self._on_state_error(context, name, ex)
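-                    except Exception:
-                        # Never let a failing tracker/listener mask the
-                        # original error or prevent the rollback below.
-                        pass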
-                    cause = (name, performer, (args, kwargs))
-                    self.rollback(context, name, self, ex, cause)
-        return self
-    def _on_state_error(self, context, name, ex):
-        if self.change_tracker:
-            self.change_tracker(context, ERRORED, name, self)
-        for i in self.listeners:
-            i.notify(context, ERRORED, name, self, error=ex)
-    def _on_state_start(self, context, performer, name):
-        if self.change_tracker:
-            self.change_tracker(context, STARTING, name, self)
-        for i in self.listeners:
-            i.notify(context, STARTING, name, self)
-    def _on_state_finish(self, context, performer, name, result):
-        # If a future state fails we need to ensure that we
-        # revert the one we just finished.
-        self.reversions.append((name, performer))
-        if self.change_tracker:
-            self.change_tracker(context, COMPLETED, name, self,
-                                result=result.to_dict())
-        for i in self.listeners:
-            i.notify(context, COMPLETED, name, self, result=result)
-    def rollback(self, context, name, chain=None, ex=None, cause=None):
-        if chain is None:
-            chain = self
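-        # Use a distinct loop variable so the 'name' argument is not
-        # overwritten before it is passed on to the parent chains below.
-        for (i, (state_name, performer)) in enumerate(reversed(self.reversions)):
-            try:
-                performer.revert(context, self.results[state_name], chain, ex, cause)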
-            except excp.NovaException:
-                # Ex: WARN: Failed rolling back stage 1 (validate_request) of
-                # chain validation due to nova exception
-                # WARN: Failed rolling back stage 2 (create_db_entry) of
-                # chain init_db_entry due to nova exception
-                msg = _("Failed rolling back stage %s (%s)"
-                        " of chain %s due to nova exception.")
-                LOG.warn(msg, (i + 1), performer.name, self.name)
-                if not self.tolerant:
-                    # This will log a msg AND re-raise the Nova exception if
-                    # the chain does not tolerate exceptions
-                    raise
-            except Exception:
-                # Ex: WARN: Failed rolling back stage 1 (validate_request) of
-                # chain validation due to unknown exception
-                # WARN: Failed rolling back stage 2 (create_db_entry) of
-                # chain init_db_entry due to unknown exception
-                msg = _("Failed rolling back stage %s (%s)"
-                        " of chain %s, due to unknown exception.")
-                LOG.warn(msg, (i + 1), performer.name, self.name)
-                if not self.tolerant:
-                    # Log a msg AND re-raise the generic Exception if the
-                    # Chain does not tolerate exceptions
-                    raise
-        if self.parents:
-            # Roll back any parent chains
-            for p in self.parents:
-                p.rollback(context, name, chain, ex, cause)</nowiki>
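The core behaviour of the chain (run tasks in insertion order, and on failure revert the completed ones in reverse order) can be shown with a much smaller sketch. This is a simplified illustration under stated assumptions, not the prototype's implementation: ''MiniChain'' and ''Task'' are hypothetical names, and tracking, listeners, parent chains and resumption are deliberately left out.

```python
from collections import OrderedDict


class Task:
    """Hypothetical task that records what it did in a shared log."""

    def __init__(self, name, fail=False):
        self.name = name
        self.fail = fail

    def apply(self, context):
        if self.fail:
            raise RuntimeError("%s failed" % self.name)
        context["log"].append("apply:" + self.name)
        return self.name

    def revert(self, context, result):
        context["log"].append("revert:" + result)


class MiniChain:
    """Linear chain: run tasks in insertion order, undo them in reverse."""

    def __init__(self):
        self.states = OrderedDict()
        self.results = OrderedDict()
        self.reversions = []

    def __setitem__(self, name, performer):
        self.states[name] = performer

    def run(self, context):
        for name, performer in self.states.items():
            try:
                self.results[name] = performer.apply(context)
                # Only tasks that completed are candidates for reversion.
                self.reversions.append((name, performer))
            except Exception:
                self.rollback(context)
                raise
        return self

    def rollback(self, context):
        for name, performer in reversed(self.reversions):
            performer.revert(context, self.results[name])


context = {"log": []}
chain = MiniChain()
chain["reserve"] = Task("reserve")
chain["configure"] = Task("configure")
chain["power_on"] = Task("power_on", fail=True)
try:
    chain.run(context)
except RuntimeError:
    pass
print(context["log"])
```

The failing third task is never added to the reversion list, so only the two completed tasks are reverted, most recent first.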
-==== Task resumption ====
-===== Why it matters =====
-===== How it will be addressed =====
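The ''result_fetcher'' hook in the ''StateChain'' code suggests one possible approach: before running a task, look up a previously saved result and skip the task if one exists. The sketch below illustrates only that idea; ''run_with_resumption'' and the plain dict standing in for a persistent store are assumptions made for illustration.

```python
def run_with_resumption(tasks, saved_results):
    """Run (name, callable) pairs, skipping any that have a saved result.

    `saved_results` is a hypothetical dict standing in for a persistent
    store that survives a crash or restart of the workflow owner.
    """
    executed = []
    for name, func in tasks:
        result = saved_results.get(name)
        if result is None:
            # No saved result: the task has not completed yet, so run it
            # and record its result before moving on.
            result = func()
            saved_results[name] = result
            executed.append(name)
    return executed


tasks = [
    ("reserve", lambda: "reserved"),
    ("configure", lambda: "configured"),
    ("power_on", lambda: "running"),
]
# Pretend an earlier run finished "reserve" before being interrupted.
saved = {"reserve": "reserved"}
executed = run_with_resumption(tasks, saved)
print(executed)
```

As in the chain code, a task whose result is ''None'' cannot be told apart from one that never ran, so this scheme only works if every task returns a non-''None'' result.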
-==== Task rollback ====
-===== Why it matters =====
-===== How it will be addressed =====
-==== Task tracking ====
-===== Why it matters =====
-===== How it will be addressed =====
-==== Resource locking ====
-===== Why it matters =====
-===== How it will be addressed =====
-==== Workflow sharding/ownership ====
-===== Why it matters =====
-===== How it will be addressed =====
-==== Simplicity ====
-===== Why it matters =====
-===== How it will be addressed =====
-==== Tolerant to upgrades ====
-===== Why it matters =====
-===== How it will be addressed =====