DuplicateWorkCeilometer

aka "Retry Semantics"
Or ... "What happens when Step 4 of a 10 Step Pipeline fails?"

Here's a common use case in Ceilometer (and in most parts of OpenStack, for that matter):


 * 1) We have some work to do that will take several steps. We could fail anywhere along the way.
 * 2) If we fail, we may get called again by the service that initiated the action.
 * 3) We don't want to redo the steps we've already done. Instead, we'd like to retry just the part that failed. Redoing previous work could have disastrous ramifications:
   * writing duplicate records to a database,
   * sending a notice to a downstream user/service thousands of times,
   * performing an expensive calculation over and over, effectively DDoSing ourselves, etc.

Currently in CM, we have the following major components:
 * The collector
 * The dispatcher
 * The pipeline
 * Alarm states
 * and, coming soon, the trigger pipeline

The problem with the above use case manifests itself in the dispatcher and the pipeline, and it will also appear in the trigger pipeline. The collector is pretty dumb now thanks to the dispatcher.

The way I was hoping to prevent this problem is:
 * 1) the collector would re-publish to a new CM exchange for each downstream component that needs to do some unit of work. This could be a pipeline plugin, a dispatcher plugin or a trigger pipeline plugin.
   * dhellmann - Does ceilometer need to republish, or should we set up a different queue for each pipeline? That way the message broker will handle the messages and redelivery if the pipeline raises an exception, but the collector won't be a bottleneck to creating those new messages.
   * swalsh - Depends. If it's the same datatype, there would be 1 queue per notification. If the pipeline component produced a new datatype (like an Event producing 3 new Samples), there would be a publication required for each Sample.


 * 2) Each plugin would do its work and use the normal ack, reject and requeue semantics of the queuing engine to deal with failures.
 * 3) This means that each plugin is responsible for detecting whether it has already run. For some entities, such as events, this is easy since there is a unique message_id per event. But if that event generates 10 samples, we would need to repeatably produce a unique id for each sample based on that event id. For things like publishing, that gets very difficult since there is no related database record.
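
One way to repeatably produce a per-sample id from the event id is a name-based UUID. The following is only a sketch under assumptions: the namespace value, the function names and the dict standing in for the database are all hypothetical and are not part of Ceilometer.

```python
import uuid

# Hypothetical fixed namespace for this sketch; any constant UUID would do.
SAMPLE_NAMESPACE = uuid.UUID("12345678-1234-5678-1234-567812345678")


def sample_id(event_message_id, index):
    """Return the same id every time for the index-th sample of an event."""
    name = "%s:%d" % (event_message_id, index)
    return str(uuid.uuid5(SAMPLE_NAMESPACE, name))


def store_samples(event_message_id, samples, store):
    """Write only the samples that are not already in the store.

    `store` is a plain dict standing in for a database table keyed by id.
    Returns the number of samples written on this call.
    """
    written = 0
    for index, sample in enumerate(samples):
        sid = sample_id(event_message_id, index)
        if sid in store:
            continue  # an earlier attempt already wrote this sample
        store[sid] = sample
        written += 1
    return written
```

This only works if the plugin generates its samples in the same order on every attempt, and it does not help with the publishing case, where there is no record to check against.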

The downside of always going back to the queuing system is the large amount of chatter it will produce. Consider one event that produces 10 meters, each with its own 3-step pipeline doing complex calculations and outbound publishing. That gets expensive quickly.

What we need is some way to say "I've done this step already".

Some possibilities:
 * 1) Each pipeline would manage the retry semantics for the pipeline. If any step fails, the pipeline manager would retry starting at the failed step. This is tricky since we would need to have a context object to persist and pass back to the failed plugin during the retry.
 * 2) Ditch the pipeline model. The collector would call a dispatcher when a particular event comes in. The dispatcher would do some work and either:
   * ack and republish a new event to a new exchange, or
   * requeue the event, or
   * reject the event.
 * 3) For non-pipeline stuff (the alarm state), we sort of need a persistent state machine anyway. Rather than reinvent this for every component, a reusable state machine library would be preferred (perhaps like the one the orchestration/state-management team is working on). Something that has schema-defined transitions with proper timeouts and persistence handled along each transition. A collection of enums and a snarl of if-statements is not a good state machine. I think this state machine library could also be used for the pipeline management.
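
The first possibility (a pipeline manager that retries starting at the failed step) could look roughly like the sketch below. Everything here is an assumption made for illustration: the `PipelineManager` class, the JSON-serialized context and the dict used as the checkpoint store are hypothetical, and a real version would need a durable backend.

```python
import json


class PipelineManager:
    """Run steps in order, checkpointing the context after each success."""

    def __init__(self, steps, checkpoints):
        self.steps = steps              # list of callables: context -> context
        self.checkpoints = checkpoints  # dict standing in for a persistent store

    def run(self, message_id, data):
        saved = self.checkpoints.get(message_id)
        if saved is not None:
            # A previous attempt got part way; resume where it stopped.
            state = json.loads(saved)
            start, context = state["next_step"], state["context"]
        else:
            start, context = 0, data
        for index in range(start, len(self.steps)):
            # If this raises, the checkpoint still points at this step.
            context = self.steps[index](context)
            self.checkpoints[message_id] = json.dumps(
                {"next_step": index + 1, "context": context})
        return context
```

When the initiating service calls `run` again with the same message id, the steps that already succeeded are skipped and the failed step receives the context it was given the first time. The sketch assumes the context is JSON-serializable, which may not hold for real pipeline data.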


 * dhellmann - The steps of the pipeline are meant to be relatively inexpensive transformations, not full processing. So each pipeline is massaging the data before taking the final action, which is the thing that might be reasonably expected to fail (publishing, writing to a db, whatever). Do we have (or envision) any transformations that might "fail" if they see the same data twice? Is it good enough to document that they are expected to be implemented in a way so they can avoid such issues?
 * sandywalsh - In our experience the intermediate steps are generally the expensive ones, and the final steps are relatively lightweight. We've seen that there are some pretty complex queries involved when a significant event occurs (a .end or the end of a long sequence, for example). These can fail in a number of ways: db timeouts are common, as are uncaught divide-by-zero errors, etc. The final steps (writing to the db, emailing a summary, etc.) are relatively light. That said, I think we have to assume things could fail anywhere for any reason. Currently, I'm happy with it failing and the queue growing until we can figure it out, but as we get more distributed (more collectors, more pipelines to carry the load) things are going to have to self-heal.
 * dhellmann - I think I need a more concrete example of what you're doing in the middle of the pipeline, then. I don't think we anticipated any transformations that query a database, for example.
 * mdragon - Having some kind of persistent state machine will definitely be needed for the triggers work, so that will help there. I see the pipelines (both the current sample pipelines and the forthcoming event pipelines) as fairly idempotent up to the point of the final persist/publish step. Basically, you do the various transformations, altering/adding items, and until you persist those changes, they can be redone safely. Part of the idea of the triggers is to encourage this. Instead of a transformer loading a bunch of related events from a persistent store directly, the trigger waits until you have all of the needed events, loads them from the datastore, and sends them into a processing pipeline as a batch, so you have (hopefully) all of the data there that you need. That said, some of the transformations may be involved. My worry about persisting each step along the way is: What happens if step 5 fails because step 2 screwed up?
 * sandywalsh - So, what happens if the trigger pipeline service is restarted between, let's say, a .start and a .end? How can we ever get the downstream items created? I agree about idempotence of most steps, but it's the publishing steps that have me most concerned. You bring up a good use case of step 5 failing due to step 2. That may fall into the "how do we re-run on restart?" problem space. Again, perhaps that's the job of TaskFlow?

A lightweight state machine engine that supports atomic updates is what we need. This could be backed by a database with good transaction support or by something like ZooKeeper. Something like memcache wouldn't have the locking we need. Either way, the backend for this piece should be pluggable. We should chat with Josh Harlow to find out where they are with their effort and help out there if possible.
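
To make "atomic updates" a little more concrete, here is a minimal sketch of a compare-and-set transition backed by a transactional database. It uses Python's standard `sqlite3` module purely as a stand-in backend; the table layout, state names and function names are hypothetical and are not taken from Ceilometer or from the state-management effort mentioned above.

```python
import sqlite3

# Hypothetical schema-defined transitions: (from_state, to_state) pairs.
TRANSITIONS = {
    ("pending", "running"),
    ("running", "done"),
    ("running", "failed"),
    ("failed", "running"),
}


def open_store():
    """Create an in-memory store; a real backend would be pluggable."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE machine (id TEXT PRIMARY KEY, state TEXT NOT NULL)")
    return conn


def create(conn, machine_id, state="pending"):
    with conn:  # commits on success, rolls back on exception
        conn.execute(
            "INSERT INTO machine (id, state) VALUES (?, ?)",
            (machine_id, state))


def transition(conn, machine_id, old, new):
    """Return True only if this call moved the machine from old to new."""
    if (old, new) not in TRANSITIONS:
        raise ValueError("illegal transition %s -> %s" % (old, new))
    with conn:
        cursor = conn.execute(
            "UPDATE machine SET state = ? WHERE id = ? AND state = ?",
            (new, machine_id, old))
    # Zero rows updated means the machine was not in `old`, for example
    # because another worker already made this transition.
    return cursor.rowcount == 1
```

Because the `UPDATE` only matches a row that is still in the expected state, two workers racing on the same transition cannot both succeed. Timeouts and persistence of per-transition context, which the text also asks for, are left out of this sketch.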

Additional problems:
 * The problem of a single plugin generating N items before failing on the (N+1)th item. My gut says we simply need to persist for each step of the generator. :/
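
Persisting for each step of the generator might look something like the sketch below. The names (`emit_with_progress`, `publish`, `progress`) are hypothetical, the progress store is a plain dict standing in for something durable, and the approach assumes the plugin yields its items in the same order on every attempt.

```python
def emit_with_progress(message_id, items, publish, progress):
    """Publish items in order, recording how many have gone out so far.

    On a retry, items that were already published are skipped.
    """
    done = progress.get(message_id, 0)
    for index, item in enumerate(items):
        if index < done:
            continue  # published on an earlier attempt
        publish(item)  # may raise; progress still points at this item
        progress[message_id] = index + 1
```

Note that a crash between `publish` and the progress update would still resend that one item on retry, so this narrows the duplicate window to a single item rather than closing it.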