Long Running Transactions in Nova
This is a Work In Progress
What is a long running transaction? A long running transaction is a business activity that may have to wait for a long period of time between steps. For example, if you are a new employee at a company, the HR department might have to perform the following operations before your first day of work:
- allocate you a parking place
- get you an access card
- add you to the HR system
- get you set up with health care and benefits
- find you a desk, chair, etc
- determine whom you will report to
- order you a PC
- etc, etc.
There could be a long period of time from the start to the end of each task. Each task in itself might be composed of many other long-running transactions. Also, many of these tasks could be performed in parallel, which means we have to be able to fork many sub-transactions and join them together at some point before continuing on.
Between the start and the end of the process, a multitude of IT events may occur: servers may die, new servers may be added, databases may change, power may fail, etc. We can't assume that a conventional "process" is robust enough to handle these long running operations.
Nova has a number of long running transactions that it needs to manage, most importantly the provisioning of instances. Consider a request to provision 1000 instances. We have to perform the following steps:
- Talk to all of the zones and come up with a build plan for the request.
- Delegate the "provision" operation to each host in each zone to create the instance.
- Wait for the provisioning to occur on all of the hosts.
- If a request fails, retry the request on another host (from the build plan).
- Periodically create a new build plan with fresher data.
- If all of this takes too long we may need to cancel the transaction, notifying the requester.
From the time this operation starts to the time it fails (or completes), the scheduler that started the request could have died and restarted a dozen times. We need to be able to watch this transaction as an outside observer and "orchestrate" the transaction over time.
The main problem is determining the success or failure of a transaction step (aka a "work item"). Fortunately, Nova has an integrated Notification system: work item success and failure events are sent on Rabbit queues. For auditing/billing purposes, notifications are sent when an instance is started or stopped. Likewise, when an error occurs (currently in Compute nodes only), notifications are sent on the "error" queue.
Frameworks like Yagi can be used to collect these events and relay them to other systems for processing in a reliable fashion. For example, these events may be sent to a PubSubHubbub server to be relayed to interested consumers.
Additionally, for anything that is going to be resilient against server failures the solution will need to be based on some sort of state machine where the state is persisted in a database. Simply having a "monitoring thread" running in a service isn't sufficient. Simple state machines are fine when only a single state is being managed. A traffic light is a single-state machine. It's either Red, Yellow or Green. For more complex systems, particularly where concurrency is involved, many state machines may have to interact.
Consider, for example, the case of initially provisioning 100 servers. This is something that can be done in parallel. We can fire off 100 requests and monitor each of them to see that the Instance, Disk and Network were all set up correctly or not. Essentially we are spinning up 100 little state machines and then we have a master state machine overseeing each of the sub-tasks. Now, we could do this with some concept of nested single-state machines, but there are other data structures better suited to this problem.
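As a rough illustration of the master/sub-task arrangement described above, here is a minimal in-memory sketch. The names (SubTask, MasterTask, the state strings) are hypothetical and not part of Nova, and a real implementation would persist this state rather than hold it in memory.

```python
from dataclasses import dataclass, field

# Hypothetical state names, for illustration only.
PENDING, SUCCEEDED, FAILED = "pending", "succeeded", "failed"


@dataclass
class SubTask:
    """One small state machine, e.g. a single instance being provisioned."""
    name: str
    state: str = PENDING


@dataclass
class MasterTask:
    """Oversees many sub-tasks and joins on their completion."""
    subtasks: dict = field(default_factory=dict)

    def fork(self, names):
        # Start one sub-task per name; all begin in the PENDING state.
        for name in names:
            self.subtasks[name] = SubTask(name)

    def report(self, name, state):
        # Called when a success/failure notification arrives for a sub-task.
        self.subtasks[name].state = state

    def state(self):
        # The master stays PENDING until every sub-task has finished (the join).
        states = [task.state for task in self.subtasks.values()]
        if any(s == PENDING for s in states):
            return PENDING
        return FAILED if any(s == FAILED for s in states) else SUCCEEDED
```

Here the master task only joins once every sub-task has reported. Whether a single failed sub-task should fail the whole request is a policy choice this sketch simply assumes.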
The key to a successful solution is guaranteeing that transitions in the state machine are idempotent while allowing for a horizontally scalable solution. In other words, we need to be able to stand up more than one Orchestration server, yet any one server must ensure that the actions it performs are not duplicated by other Orchestration servers. Or, if they are, the net effect is the same. Because of this, we explain below why a Yagi-based solution is not the best approach and propose an alternative.
The proposal, based on many different discussions that have occurred informally or on the mailing list, is to have a new "Orchestration" Service that manages these long running transactions. When a new activity that may take a while to run (like starting/stopping a bunch of servers) is started in Nova, a new workflow process is kicked off in the Orchestration service.
The service walks through the process, talking to each of the Nova services via their related APIs to get the work started. It then listens for success/failure from these actions directly from the success/failure notification queues.
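One way to picture an idempotent, database-backed transition is a conditional update: the state changes only if the row is still in the expected state, so a second Orchestration server attempting the same transition becomes a no-op. The sketch below uses Python's sqlite3 module purely for illustration; the work_items table, its columns and the state names are assumptions, not an existing Nova schema.

```python
import sqlite3


def open_store(path=":memory:"):
    """Create a tiny, hypothetical work-item table for the sketch."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS work_items ("
        "id TEXT PRIMARY KEY, state TEXT NOT NULL)"
    )
    return db


def add_item(db, item_id, state="requested"):
    """Record a new work item in its initial state."""
    with db:  # commits on success, rolls back on error
        db.execute(
            "INSERT INTO work_items (id, state) VALUES (?, ?)",
            (item_id, state),
        )


def transition(db, item_id, from_state, to_state):
    """Move an item from from_state to to_state.

    Returns True only for the caller that actually performed the move;
    a repeated or competing attempt matches no row and returns False.
    """
    with db:
        cur = db.execute(
            "UPDATE work_items SET state = ? WHERE id = ? AND state = ?",
            (to_state, item_id, from_state),
        )
    return cur.rowcount == 1
```

A server would perform the side effects of a step only when transition() returns True, which is one possible way to keep duplicated events from producing duplicated actions.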
As the process continues, each step may be handled by other Orchestration servers ... not necessarily the one that started the process.
Why not use Yagi? Because when we talk to yagi-feed we give it an etag of the last time we spoke. This essentially says "Give me all the events that have occurred since X". We will get a blast of events and the orchestration server needs to work through them one-by-one. If the orchestration server dies, it will reprocess the same events on the next poll (there is no etag per message, just an etag per request ... arguably we could ask for only one event at a time, but that seems inefficient). Also, if we horizontally scale the orchestration server, there is no guarantee that two servers won't pick up the same feed at once.
Instead, let's let rabbit do what it does best ... assure atomic processing of events in the queue.
There is something we would need to change, however. Currently our rabbit processing immediately ACKs the message when it is pulled from the queue. If the service dies while processing the message, the message is lost. Instead, for the orchestration service we want to ACK the event after it has been processed. (Arguably, there is still a window where we could process the same event more than once, if the service died after processing the event but before ACKing the message.) So we still need some way to prevent duplicate processing in the state machine.
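The ACK-after-processing idea, together with a guard against the remaining duplicate window, can be sketched without a real broker. FakeQueue below is a hypothetical stand-in, not a RabbitMQ client: it only models "unACKed messages get redelivered". The processed_ids set is likewise an assumption; in practice it would need to live in the database alongside the state machine.

```python
from collections import deque


class FakeQueue:
    """Hypothetical stand-in for a broker queue with explicit ACKs."""

    def __init__(self, messages):
        self._ready = deque(messages)
        self._unacked = {}

    def get(self):
        # Hand out the next message and remember it until it is ACKed.
        if not self._ready:
            return None
        msg = self._ready.popleft()
        self._unacked[msg["id"]] = msg
        return msg

    def ack(self, msg_id):
        self._unacked.pop(msg_id, None)

    def requeue_unacked(self):
        # Models redelivery after a consumer dies without ACKing.
        self._ready.extend(self._unacked.values())
        self._unacked.clear()


def consume(queue, handler, processed_ids):
    """Process each message, ACKing only after the handler has finished."""
    while True:
        msg = queue.get()
        if msg is None:
            return
        if msg["id"] not in processed_ids:  # skip redelivered duplicates
            handler(msg)
            processed_ids.add(msg["id"])
        queue.ack(msg["id"])
```

If the consumer dies between handler() and ack(), the message is redelivered but skipped on the second pass, provided processed_ids survived the crash.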
Note: I don't know if orchestration needs to be a standalone service or can belong to the existing scheduler (already responsible for routing) ... seems quite likely.
The first step in this implementation would be passing the Build Plan from the scheduler and forwarding it to the Orchestration service for execution. This stub can be expanded out for customers that wish to utilize their own workflow solutions.
Tie in a lean and mean workflow solution to handle the state machine management. This should be a Petri-net based state machine so that it properly supports concurrent workflows. Perhaps something like Spiff Workflow, which is a pure-Python solution with no DB or UI layers or other cruft.
Or, perhaps a simple single-state machine could be used to get things moving?
Model some basic workflows to see how they work under load. Then iterate, iterate, iterate. Add more workflows. Remove more and more hard-coded business logic from Nova core and into whatever grammar is deemed best. Add more traps for failing/success conditions. Focus on generating events, not instance-specific retry/error handling code.
Fail Fast vs. Automation
One concern I've heard about this proposal is that retry is hard. How do we know that we won't do more harm to a failing host by retrying? That risk is real, but retrying the same server (or retrying at all) isn't mandatory.
The rationale for moving this hard-coded retry logic (for example) into orchestration is that other groups can then manage the workflows that occur on success/failure conditions.
Automating these processes should be a primary concern of anyone running OpenStack. The more manual inspection, correction and adjustments the operators have to perform, the less efficient they will be. We want to externalize all of this business logic into a place that's:
- Easily changed
- Easily understood
Having complicated retry/notification/error handling business logic nestled deep in Python code doesn't seem like a very efficient way to run an operations center. Currently Nova is fail-fast. If something goes wrong, the operation ends and we're left sorting through logs to find out what happened. The level of abstraction needs to be raised ever-so-slightly such that process automation can occur. Note: don't confuse this with sysop monitoring, such as load-monitoring, disk-monitoring, network-monitoring, etc., which are all controlled by other systems.
Scheduler improvement vs Separate Service
1) Scheduler can fail-fast, orchestrator can retry
- Add the reason for failure: for instance, a compute node that cannot allocate resources vs. a catastrophic failure where the reason may or may not be available.
- Retry logic configurable
2) Example Orchestrator Retry logic
- Define MAX_RETRY_COUNT, MAX_SAME_HOST_RETRY_COUNT (< MAX_RETRY_COUNT)
- Retry with another host if resource allocation failed (until MAX_RETRY_COUNT)
- In case of catastrophic failure, retry with the same host until MAX_SAME_HOST_RETRY_COUNT, else try another host (until MAX_RETRY_COUNT)
3) Serving allowable intra-zone requests is easier
- By moving the orchestration logic away from the scheduler, intra-zone requests can potentially be handled by more than one orchestrator. With a scheduler+ implementation, this is going to be complex.
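The example Orchestrator retry logic in item 2 above could look roughly like the following. This is only a sketch: the constant values, the failure labels and the next_host function are assumptions made for illustration, and the build plan is modelled as a plain list of hosts that have not been tried yet.

```python
MAX_RETRY_COUNT = 5            # assumed value
MAX_SAME_HOST_RETRY_COUNT = 2  # assumed value; must be < MAX_RETRY_COUNT

RESOURCE_FAILURE = "resource"          # host could not allocate resources
CATASTROPHIC_FAILURE = "catastrophic"  # reason may or may not be known


def next_host(failure, current_host, total_retries, same_host_retries, build_plan):
    """Return the host to retry on, or None when retries are exhausted.

    total_retries and same_host_retries count the retries already made;
    build_plan is the ordered list of candidate hosts not yet tried.
    """
    if total_retries >= MAX_RETRY_COUNT:
        return None
    # Catastrophic failures are retried on the same host a limited number of times.
    if failure == CATASTROPHIC_FAILURE and same_host_retries < MAX_SAME_HOST_RETRY_COUNT:
        return current_host
    # Resource failures (and exhausted same-host retries) move to another host.
    return build_plan[0] if build_plan else None
```

Keeping the decision in one small function is what makes it configurable: the limits and the policy can change without touching the code that issues the provisioning requests.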
RabbitMQ based implementation
- Risks: losing ACKs if the process dies while processing a request, or a request being processed more than once.
- For every request, can we push an "in processing" state back into the queue, and ACK after it has been processed?
- If a request is in the processing state but has no ACK, the Orchestrator would know that it is being handled by someone else. This is to prevent a request being processed twice.
- Can we also add a retry-after and a time-out (time-out > retry-after) for each request? Retry-after would be the time after which the Orchestrator tries to reprocess the request EVEN if it is in the processing state or no ACK has been received yet. This is to make sure a request is not left un-handled.
- Time-out would be when the Orchestrator gives up. This is to prevent thrashing.
- Add more details to the proposed 3 steps
- Add more details to the state machine
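The retry-after and time-out bullets above amount to a small decision rule, sketched below under stated assumptions: the state names ("new", "processing", "processed"), the action names and the next_action function are all hypothetical, and times are plain seconds measured from when the request was first queued.

```python
def next_action(state, started_at, now, retry_after, time_out):
    """Decide what an Orchestrator should do with a queued request.

    All times are in seconds; time_out must be greater than retry_after.
    """
    if time_out <= retry_after:
        raise ValueError("time_out must be greater than retry_after")
    age = now - started_at
    if state == "processed":
        return "ack"       # work is done, so the message can be ACKed
    if age >= time_out:
        return "give_up"   # stop retrying to prevent thrashing
    if state == "processing" and age < retry_after:
        return "skip"      # presumably being handled by another Orchestrator
    return "process"       # new request, or processing appears to have stalled
```

For example, with retry_after=30 and time_out=120, a request seen in the processing state after 45 seconds would be picked up again, and one still unfinished after 120 seconds would be abandoned.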