RaxCeilometerRequirements

We plan to bring the following functionality to Ceilometer:
 * 1) Double-entry accounting verification of OpenStack usage before handoff to billing (finance is our primary customer).
 * 2) 90+ days storage of high-volume raw notifications (planning for at least 2 billion rows).
 * 3) Secondary aggregation/rollups of the raw data with support for third-party hooks into the notification pipeline (tricky with schema changes).
 * 4) Support for downstream consumers via PubSubHubbub/Atom mechanisms (such as AtomHopper).
 * 5) Monitoring of instance state for detailed debugging and SLA tracking.

Meeting these requirements means changing Ceilometer, specifically:


 * Once the data is collected, it appears to go back into another Rabbit queue for the Collector to process. The Collector stores the raw data potentially many times, as different metrics; for example, there are objects for CPU, IP, Disk, Bandwidth, etc. I don't really know what that buys us versus raw plus rolled-up data. Much of the raw data is then discarded (we need to keep it). Getting a single picture of an instance may require many queries (we need Request ID, Instance ID, Host, Tenant ID and time range).
 * Requires new blueprint (started doodling here: http://wiki.openstack.org/RichMeters)
 * May affect: https://blueprints.launchpad.net/ceilometer/+spec/synaps-dimensional-decomposition
 * The individual metrics are stored separately to provide a common format for all of the rest of the code using the data later. - dhellmann
 * After digging in a little more, we need to nail down how we're going to apply indices to keys in the metadata ... since this is ultimately the intent here. I'll get the video of the StackTach data model together shortly so we can show the sort of aggregations we're doing. There is still the problem of how to store the full raw event (the JSON payload) ... would this be a Meter of type String? -S
 * There is no double-entry accounting. What is stored is the raw event, but not consolidated. The question of polling the hypervisor (HV) directly as a second source is still open.
 * Requires new blueprint
 * The Ceilometer Compute Agent is not hypervisor independent. We need support for XenServer. Additionally, we feel this data can all be collected via the existing notifications (and if not, Nova should be fixed to provide the required data). This calls into question the need for the Compute Agent in the first place.
 * Affected blueprints:
 * https://blueprints.launchpad.net/ceilometer/+spec/remove-nova-imports
 * https://blueprints.launchpad.net/ceilometer/+spec/xenapi-support
 * The nova auditing events were coming less frequently than the resolution of data we wanted. For a long-lived instance, we would only get the create events and then an "exists" an hour later. - dhellmann
 * Why would you need more? You'd just be repeating the same data. For something like bandwidth (or image usage) CM should adopt the "usage" framework already in Nova or use a UDP broadcast scheme. -sandy
 * We need to do post-processing on the raw data beyond the initial collection. We need the queue after the initial collection to provide the "settling time" that multiple workers require (otherwise you have a bottleneck or staggered data).
 * Affected blueprints:
 * https://blueprints.launchpad.net/ceilometer/+spec/multi-publisher
 * https://blueprints.launchpad.net/ceilometer/+spec/cw-publish
 * https://blueprints.launchpad.net/ceilometer/+spec/synaps-alarm-evaluation
 * I'm not sure what "settling time" means here. - dhellmann
 * When you have more than one worker, there is no guarantee of the order of events, which means you could get a .end before a .start (or two .start events in a row). Like a TCP jitter buffer, you need to give the collector time for the queue to stabilize before doing work on it. Time to let the events get in line. Settle time. -Sandy
 * Need support for error notifications, and need to capture full state from all .start/.end messages. As it is today, there could be significant miscounts.
 * Requires new blueprint
 * I think this came up once before, and I thought we were discarding events for instances with errors. Do you mean that is causing an undercount, or do you mean we aren't actually discarding those events properly? - dhellmann
 * I didn't see any use of the .error queue in the existing CM agent. You could get a .start and then an event on the .error queue, but only the .start would be seen. -Sandy
 * Millisecond timing resolution regardless of database.
 * Requires new blueprint
 * Stop using the nova rpc mechanism, which ACKs immediately regardless of whether the event was properly handled.
 * Affected blueprints:
 * https://blueprints.launchpad.net/ceilometer/+spec/remove-nova-imports
 * https://blueprints.launchpad.net/ceilometer/+spec/move-listener-framework-oslo
 * The kombu driver seems to ack after the callback is invoked and returns without an error (https://github.com/openstack/ceilometer/blob/master/ceilometer/openstack/common/rpc/impl_kombu.py#L166) -- dhellmann
 * correct, which is bad. -Sandy
 * Extend the API to include StackTach-like operations for KPIs, etc.
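
The "settling time" idea discussed above (hold events briefly so that out-of-order deliveries from multiple workers can line up before processing) could be sketched roughly as follows. This is only an illustration, not Ceilometer or StackTach code: the SettleBuffer class, the five-second window and the event dict shape are all hypothetical, and it assumes event timestamps come from reasonably synchronized clocks.

```python
import heapq
import itertools
from datetime import datetime, timedelta


class SettleBuffer:
    """Hold events for a settle window, then release them in timestamp order.

    Hypothetical helper for illustration only.
    """

    def __init__(self, settle=timedelta(seconds=5)):
        self.settle = settle
        self._heap = []
        # Tie-breaker so two events with equal timestamps never compare dicts.
        self._seq = itertools.count()

    def add(self, event):
        # Assumed event shape: a dict with a "timestamp" key holding a datetime.
        heapq.heappush(self._heap, (event["timestamp"], next(self._seq), event))

    def release(self, now):
        """Return every event at least `settle` old as of `now`, oldest first."""
        cutoff = now - self.settle
        ready = []
        while self._heap and self._heap[0][0] <= cutoff:
            ready.append(heapq.heappop(self._heap)[2])
        return ready


t0 = datetime(2013, 1, 1, 12, 0, 0)
buf = SettleBuffer(settle=timedelta(seconds=5))

# Two workers deliver the .end notification before the .start.
buf.add({"event_type": "compute.instance.create.end",
         "timestamp": t0 + timedelta(seconds=2)})
buf.add({"event_type": "compute.instance.create.start",
         "timestamp": t0})

too_early = buf.release(t0 + timedelta(seconds=3))   # nothing has settled yet
ordered = buf.release(t0 + timedelta(seconds=10))    # both settled, now in order
print([e["event_type"] for e in ordered])
```

The trade-off is latency: nothing is processed until it is at least one settle window old, and an event that arrives later than the window allows would still be seen out of order.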

Other Minor Nits:


 * Ceilometer extensively uses the openstack.common library ... I'm not sure what this really buys us. It seems like there is a lot of boilerplate just to work with this. Could be a lot easier.
 * This was discussed on the mailing list http://lists.openstack.org/pipermail/openstack-dev/2013-January/004846.html - dhellmann
 * Yep, when Oslo matures it should be less heavyweight. -Sandy