
NovaOrchestration

Revision as of 18:12, 23 September 2011 by SandyWalsh (talk)

Long Running Transactions in Nova

What is a long running transaction? A long running transaction is a business activity that may have to wait a long time between steps. For example, when you join a company as a new employee, the HR department might have to perform the following operations before your first day of work:

  1. allocate you a parking place
  2. get you an access card
  3. add you to the HR system
  4. get you set up with health care and benefits
  5. find you a desk, chair, etc.
  6. determine whom you will report to
  7. order you a PC
  8. etc, etc.

A long period of time could pass between the start and the end of each task. Each task might itself be composed of many other long running transactions. Many of these tasks can also be performed in parallel, which means we have to be able to fork many sub-transactions and join them at some point before continuing.
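The fork and join shape described above can be sketched in Python. This is a minimal, in-process illustration only: the function names (`run_parallel`, `onboard_employee`) and the choice of which HR tasks are independent are assumptions made for the example, and a thread pool gives none of the durability a real long running transaction needs.

```python
from concurrent.futures import ThreadPoolExecutor


def run_parallel(tasks):
    """Fork each task into its own worker, then join on all of them.

    `tasks` maps a (hypothetical) task name to a zero-argument callable.
    """
    with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
        # Fork: submit every sub-transaction without waiting.
        futures = {name: pool.submit(fn) for name, fn in tasks.items()}
        # Join: result() blocks until that sub-transaction has finished.
        return {name: fut.result() for name, fut in futures.items()}


def onboard_employee(name):
    # Assumed to be independent of one another, so they may run in parallel.
    independent = {
        "parking": lambda: f"parking place allocated for {name}",
        "access_card": lambda: f"access card issued to {name}",
        "pc": lambda: f"PC ordered for {name}",
    }
    results = run_parallel(independent)
    # Work placed here runs only after the join has completed.
    results["hr_system"] = f"{name} added to HR system"
    return results
```

Calling `onboard_employee("Ada")` returns a dictionary with one entry per task. In a real system each sub-transaction could itself take days, which is why the join point cannot simply be a blocked thread.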

Between the start and the end of the process, a multitude of IT events may occur: servers may die, new servers may be added, databases may change, power may fail, and so on. We can't assume that a conventional "process" is robust enough to handle these long running operations.

Nova has a number of long running transactions that it needs to manage, most importantly the provisioning of instances. Consider a request to provision 1000 instances. We have to perform the following steps:

  1. Talk to all of the zones and come up with a build plan for the request.
  2. Delegate the "provision" operation to each host in each zone to create the instance.
  3. Wait for the provisioning to occur on all of the hosts.
  4. If a request fails, retry the request on another host (from the build plan).
  5. Periodically create a new build plan with fresher data.
  6. If all of this takes too long we may need to cancel the transaction, notifying the requester.
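Steps 4 and 6 above (retry on another host, cancel when the whole thing takes too long) can be sketched as follows. This is not Nova's API: `provision_with_retry`, `TransactionCancelled`, the `provision` callable, and the convention that a failed provision raises `RuntimeError` are all hypothetical names chosen for illustration. The build plan is assumed to be an ordered list of candidate hosts.

```python
import time


class TransactionCancelled(Exception):
    """Raised when the transaction is abandoned (hypothetical name)."""


def provision_with_retry(instance_id, build_plan, provision, deadline,
                         clock=time.monotonic):
    """Try each candidate host in build-plan order until one succeeds.

    `provision(host, instance_id)` is a caller-supplied callable that is
    assumed to raise RuntimeError on failure. `deadline` is a value on the
    same clock as `clock()`.
    """
    for host in build_plan:
        # Step 6: give up once the overall time budget is exhausted.
        if clock() > deadline:
            raise TransactionCancelled(f"{instance_id}: timed out")
        try:
            provision(host, instance_id)
            return host
        except RuntimeError:
            # Step 4: this host failed, move on to the next candidate.
            continue
    raise TransactionCancelled(
        f"{instance_id}: no host in the build plan succeeded")
```

The caller is expected to catch `TransactionCancelled` and notify the requester. Refreshing the build plan periodically (step 5) is left out of the sketch.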

From the time this operation starts to the time it fails (or completes), the scheduler that started the request could have died and restarted a dozen times. We need to be able to watch this transaction as an outside observer and "orchestrate" the transaction over time.
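One way to let an outside observer pick up a transaction after the scheduler restarts is to record progress in durable storage after every step. The sketch below assumes a JSON file as the store and hypothetical helper names (`save_state`, `load_state`, `resume`). It shows the idea only and says nothing about how Nova would actually persist this state.

```python
import json
import os
import tempfile


def save_state(path, state):
    """Write state atomically so a crash never leaves a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    # os.replace swaps the new file into place in a single step.
    os.replace(tmp, path)


def load_state(path):
    with open(path) as f:
        return json.load(f)


def resume(path, steps):
    """Run every step not yet recorded as done.

    `steps` is an ordered list of (name, callable) pairs. Because progress
    is saved after each step, calling resume() again after a crash skips
    the steps that already completed.
    """
    state = load_state(path) if os.path.exists(path) else {"done": []}
    for name, fn in steps:
        if name in state["done"]:
            continue
        fn()
        state["done"].append(name)
        save_state(path, state)
    return state
```

A step that fails partway through is re-run in full on the next call, so in this design each step would need to be safe to repeat.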