Ceilometer-batch-consumption

Launchpad Entry: bulk-message-handling
Created: 09 Jan 2014
Contributors: John Herndon <john.herndon@hp.com>

Summary

Currently, ceilometer handles notifications one at a time: each event is inserted into the storage back-end individually, each notification is handled as a meter individually, and so on. In many cases it is more efficient to handle many notifications at once, as a batch. This blueprint proposes a way to do that.

Release Note

Rationale

User stories

Assumptions

This feature will be configurable. If an operator does not wish to use batching, it will remain off.
Dependent on https://review.openstack.org/#/c/57457/

Design

The intent is to let every message collection pipeline in ceilometer use batching if desired. The initial implementation will focus on notification collection.

This (rough) drawing illustrates how messages will flow through ceilometer in a batch configuration.

Proposed Data Flow

1. oslo.messaging will call process_notifications, handing in a single notification.
2. If configured with batching, the notification will be passed into the notification batch. The Message Batch class will manage notifications and determine when the batch is full enough to pass on. "Full enough" will be based on the number of batched messages or on a provided timeout. The timeout exists so that messages cannot sit in the batch indefinitely.
3. When a batch is created, a "handler" will be assigned to it. The handler will control what to do with the batch. In the case of notifications, the batch must be handed off to both the event storage driver and the meter pipeline. I am proposing to change the order in which notifications are handled (the current implementation passes the notification to the meter pipeline before handling it as an event). This is done to prevent duplicates in the meter pipeline. Duplicates are a concern if an error in the ceilometer collector causes a batch of messages to be re-queued. When messages are re-queued, the collector will process them again on the next run (or, on failover, the new collector will process the notifications). Duplicates are not acceptable for the metering pipeline, as they will skew the numbers and are basically undetectable. In the case of event collection, duplicates are detectable, since the message_id of the notification will be stored. If the event parser handles the notifications after the meters, and event processing encounters a problem that causes the notifications to be re-queued, the metering pipeline will process the notifications a second time. However, if event processing happens first, the meter pipeline will not be affected, as it will not yet have processed the notifications.
4. The batch notification handler will pass the batch into the event processing code, which parses each notification into an event. If a notification cannot be parsed (i.e., invalid JSON payload), it will be handled by a new piece of code (called message_recovery, possibly to be defined in a separate blueprint) that will store the raw notification somewhere until a human can look at it and possibly correct whatever generated the invalid notification. Note that this message will not be re-queued, since doing so would cause it to be re-processed indefinitely. This notification will not be passed on to the meter pipeline, since it is invalid.
5. The batch of events is sent to the storage driver. At this point, if the storage driver is not accessible, the batch will be aborted, the events re-queued, and the collector will stop polling events from the message queue until the connection to the back-end is restored. Once the connection is restored, collection can resume.
6. The storage driver returns the failed messages. For individually rejected events (i.e., due to an invalid data type), the message_recovery interface will be used again, as described above. These notifications will, however, be passed to the meter pipeline (maybe the meters can deal with them?).
7. The batch of notifications (without the notifications that failed in step 4) is handled by the meter pipeline.
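
The flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the proposed implementation: the names NotificationBatch, make_handler, parse_event, store_events, meter_pipeline and recover are hypothetical, the default size and timeout values are arbitrary, and a ValueError stands in for whatever error an unparsable notification would raise. The sketch shows a batch that flushes on size or age, and a handler that runs event processing before the meter pipeline.

```python
import time


class NotificationBatch:
    """Collects notifications until a size or age limit is reached."""

    def __init__(self, handler, max_size=100, timeout=5.0,
                 clock=time.monotonic):
        self.handler = handler    # callable that receives the full batch
        self.max_size = max_size
        self.timeout = timeout
        self.clock = clock
        self._items = []
        self._started = None      # arrival time of the first batched item

    def add(self, notification):
        if not self._items:
            self._started = self.clock()
        self._items.append(notification)
        self.flush_if_ready()

    def is_ready(self):
        if not self._items:
            return False
        expired = self.clock() - self._started >= self.timeout
        return len(self._items) >= self.max_size or expired

    def flush_if_ready(self):
        # Intended to be called periodically as well, so that a quiet
        # queue still flushes once the timeout has expired.
        if self.is_ready():
            items, self._items = self._items, []
            self.handler(items)


def make_handler(parse_event, store_events, meter_pipeline, recover):
    """Build a handler that processes events before meters."""

    def handle(batch):
        parsed = []
        for notification in batch:
            try:
                parsed.append((notification, parse_event(notification)))
            except ValueError:
                # Unparsable: set aside for a human, never re-queued,
                # and not passed on to the meter pipeline.
                recover(notification)
        # Assumption: store_events returns the individually rejected
        # events and raises if the back-end is unreachable. Raising here
        # happens before any metering, so a re-queued batch cannot
        # double-count meters.
        for rejected in store_events([event for _, event in parsed]):
            recover(rejected)
        meter_pipeline([notification for notification, _ in parsed])

    return handle
```

Passing the clock in as a parameter is a design choice made only to keep the sketch testable; a real collector would also need a timer or polling loop that calls flush_if_ready when no new notifications arrive.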

Implementation

UI Changes

None

Code Changes

The main code changes will take place in the notification.py module. A Batch class will be added, and a BatchHandler interface will be defined. When batching is disabled, the batch_handler will be called directly with a single notification.
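
As a rough sketch of how these pieces might fit together: the method name handle_batch, the NotificationEndpoint class and the batch_size parameter are assumptions made for illustration, and the timeout is left out to keep the example short. The point shown is that the disabled path and the batched path share one handler interface, so a handler never needs to know whether batching is on.

```python
import abc


class BatchHandler(abc.ABC):
    """Hypothetical interface: decides what to do with a full batch."""

    @abc.abstractmethod
    def handle_batch(self, notifications):
        """Process a list of notifications."""


class Batch:
    """Minimal size-only batch (timeout omitted for brevity)."""

    def __init__(self, handler, max_size):
        self.handler = handler
        self.max_size = max_size
        self.items = []

    def add(self, notification):
        self.items.append(notification)
        if len(self.items) >= self.max_size:
            full, self.items = self.items, []
            self.handler.handle_batch(full)


class NotificationEndpoint:
    """Routes each incoming notification to the batch or the handler."""

    def __init__(self, handler, batch_size=None):
        self.handler = handler
        # batch_size=None (the default) means batching stays off.
        self.batch = Batch(handler, batch_size) if batch_size else None

    def process_notification(self, notification):
        if self.batch is None:
            # Batching disabled: call the handler directly with a
            # one-element batch.
            self.handler.handle_batch([notification])
        else:
            self.batch.add(notification)


class RecordingHandler(BatchHandler):
    """Example handler that just records the batches it receives."""

    def __init__(self):
        self.calls = []

    def handle_batch(self, notifications):
        self.calls.append(list(notifications))
```

With batching off, two notifications produce two handler calls of one item each; with batch_size=2 they produce a single call containing both.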

Migration

None

Test/Demo Plan

This need not be added or completed until the specification is nearing beta.

Unresolved issues

This should highlight any issues that should be addressed in further specifications, and not problems with the specification itself; since any specification with problems cannot be approved.

BoF agenda and discussion

Use this section to take notes during the BoF; if you keep it in the approved spec, use it for summarising what was discussed and note any options that were rejected.