RaxCeilometerRequirements
We plan to bring the following functionality to Ceilometer:
- Double-entry accounting verification of OpenStack usage before handoff to billing (finance is our primary customer).
- 90+ days storage of high-volume raw notifications (planning for at least 2 billion rows).
- Secondary aggregation/rollups of the raw data with support for third-party hooks into the notification pipeline (tricky with schema changes).
- Support for downstream consumers via PubSubHubbub/Atom mechanisms (such as AtomHopper).
- Monitoring of instance state for detailed debugging and SLA tracking.
Meeting these requirements will require changes to Ceilometer, specifically:
- It appears that once the data is collected, it goes back into another rabbit queue for the Collector to process. The Collector potentially stores the raw data many times as different metrics; for example, there are separate objects for Cpu, IP, Disk, Bandwidth, etc. I don't really know what that buys us vs. keeping raw + rolled-up data. Much of the raw data is then discarded (we need to keep it). Getting a single picture of an instance may require many queries (we need Request ID, Instance ID, Host, Tenant ID, and a time range).
- Requires new blueprint (started doodling here: http://wiki.openstack.org/RichMeters)
- May affect: https://blueprints.launchpad.net/ceilometer/+spec/synaps-dimensional-decomposition
- The individual metrics are stored separately to provide a common format for all of the rest of the code using the data later. - dhellmann
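To make the multi-key lookup concern concrete, here is a minimal sketch of the kind of query we end up needing to reconstruct one instance's history. All record fields and function names are hypothetical illustrations, not Ceilometer's actual storage schema:

```python
from datetime import datetime

# Hypothetical flattened event records, as a collector might store them.
events = [
    {"request_id": "req-1", "instance_id": "inst-a", "host": "compute1",
     "tenant_id": "t-1", "timestamp": datetime(2013, 1, 10, 12, 0), "metric": "cpu"},
    {"request_id": "req-2", "instance_id": "inst-a", "host": "compute1",
     "tenant_id": "t-1", "timestamp": datetime(2013, 1, 10, 13, 0), "metric": "disk"},
    {"request_id": "req-3", "instance_id": "inst-b", "host": "compute2",
     "tenant_id": "t-2", "timestamp": datetime(2013, 1, 10, 12, 30), "metric": "cpu"},
]

def instance_history(records, instance_id, tenant_id, start, end):
    """Collate every record for one instance within a time range."""
    return sorted(
        (r for r in records
         if r["instance_id"] == instance_id
         and r["tenant_id"] == tenant_id
         and start <= r["timestamp"] < end),
        key=lambda r: r["timestamp"])

history = instance_history(events, "inst-a", "t-1",
                           datetime(2013, 1, 10), datetime(2013, 1, 11))
print([r["metric"] for r in history])  # ['cpu', 'disk']
```

When each metric lives in a separate object, every one of these keys may imply a separate query; a single denormalized event record would let one lookup answer the whole question.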
- There is no double-entry accounting. The raw event exists, but it is not consolidated. The question of polling the hypervisor directly as a second source is still open.
- Requires new blueprint
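A minimal sketch of what we mean by double-entry verification, assuming two independently derived per-instance usage tallies (one from notifications, one from polling the hypervisor). The function and data are hypothetical illustrations:

```python
def reconcile(notification_hours, hypervisor_hours, tolerance=0.01):
    """Compare per-instance usage hours from two independent sources.

    Returns the instance ids whose totals disagree by more than
    `tolerance` hours, or that appear in only one ledger.
    """
    mismatched = set()
    for inst in set(notification_hours) | set(hypervisor_hours):
        a = notification_hours.get(inst)
        b = hypervisor_hours.get(inst)
        if a is None or b is None or abs(a - b) > tolerance:
            mismatched.add(inst)
    return mismatched

# "i-2" disagrees between the two sources; "i-3" is missing from one ledger.
print(sorted(reconcile({"i-1": 24.0, "i-2": 24.0, "i-3": 5.0},
                       {"i-1": 24.0, "i-2": 22.5})))  # ['i-2', 'i-3']
```

Anything flagged here would need investigation before the data is handed off to billing.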
- The Ceilometer Compute Agent is not hypervisor-independent; we need support for XenServer. Additionally, we feel this data can all be collected via the existing notifications (and if not, Nova should be fixed to provide the required data). This calls into question the need for the Compute Agent in the first place.
- The nova auditing events were coming less frequently than the resolution of data we wanted. For a long lived instance, we would only get the create events and then an "exists" an hour later. - dhellmann
- We need to do post-processing on the raw data beyond the initial collection. We need a queue after the initial collection to provide "settling time" and to allow for multiple workers (otherwise you have a bottleneck or staggered data).
- I'm not sure what "settling time" means here. - dhellmann
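To illustrate the intended shape of the pipeline (hypothetical names; a sketch, not the proposed implementation): the collector hands raw events off to a second queue, and several post-processing workers drain it independently so the collector never blocks on rollups:

```python
import queue
import threading

post_q = queue.Queue()     # the queue *after* initial collection
processed = []
lock = threading.Lock()

def worker():
    # Each worker performs post-processing (rollups, enrichment)
    # independently; more workers means no single bottleneck.
    while True:
        event = post_q.get()
        if event is None:              # sentinel: shut down
            post_q.task_done()
            return
        with lock:
            processed.append({"id": event["id"], "rolled_up": True})
        post_q.task_done()

workers = [threading.Thread(target=worker) for _ in range(4)]
for w in workers:
    w.start()

for i in range(10):                    # the collector hands raw events off
    post_q.put({"id": i})
for _ in workers:
    post_q.put(None)
post_q.join()
print(len(processed))  # 10
```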
- Need support for error notifications, and we need to capture full state from all .start/.end messages. As it is today, there could be significant miscounts.
- Requires new blueprint
- I think this came up once before, and I thought we were discarding events for instances with errors. Do you mean that is causing an undercount, or do you mean we aren't actually discarding those events properly? - dhellmann
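To clarify the miscount concern, here is a sketch (hypothetical event shapes, not Ceilometer's actual pipeline) of pairing .start/.end messages per request: any operation that emits a .start but never a matching .end, such as an error path, would be silently lost today instead of being flagged:

```python
def pair_events(events):
    """Match <op>.start with <op>.end per request id; flag orphans."""
    open_ops = {}
    completed, unmatched = [], []
    for ev in events:
        op, _, phase = ev["event_type"].rpartition(".")
        key = (ev["request_id"], op)
        if phase == "start":
            open_ops[key] = ev
        elif phase == "end":
            if open_ops.pop(key, None) is not None:
                completed.append(key)
            else:
                unmatched.append(key)    # .end without a .start
    # Anything still open never finished, e.g. an error path that
    # emitted no .end; these are the potential miscounts.
    return completed, list(open_ops) + unmatched

events = [
    {"request_id": "r1", "event_type": "compute.instance.create.start"},
    {"request_id": "r1", "event_type": "compute.instance.create.end"},
    {"request_id": "r2", "event_type": "compute.instance.resize.start"},
]
done, orphans = pair_events(events)
print(done)     # [('r1', 'compute.instance.create')]
print(orphans)  # [('r2', 'compute.instance.resize')]
```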
- Millisecond timing resolution regardless of database.
- Requires new blueprint
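One way to get millisecond resolution regardless of the backend (a sketch, not a settled design) is to normalize timestamps to integer epoch milliseconds before storage, so a database column that would otherwise truncate to whole seconds never gets the chance:

```python
from datetime import datetime, timezone

def to_epoch_ms(dt):
    """Store timestamps as integer epoch milliseconds so resolution
    does not depend on the database's native timestamp precision."""
    return int(round(dt.timestamp() * 1000))

def from_epoch_ms(ms):
    return datetime.fromtimestamp(ms / 1000.0, tz=timezone.utc)

t = datetime(2013, 1, 15, 12, 30, 45, 123000, tzinfo=timezone.utc)
ms = to_epoch_ms(t)
print(ms % 1000)   # 123 -- the millisecond component survives round-tripping
print(from_epoch_ms(ms))
```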
- Stop using the nova rpc mechanism, which does an ACK immediately regardless of whether the event was properly handled.
- The kombu driver seems to ack after the callback is invoked and returns without an error (https://github.com/openstack/ceilometer/blob/master/ceilometer/openstack/common/rpc/impl_kombu.py#L166) -- dhellmann
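The pattern we want, independent of transport, looks like this. The FakeChannel/FakeMessage classes below stand in for the real broker; this is not the kombu API, just an illustration of ack-after-success semantics:

```python
class FakeMessage:
    """Stand-in for a broker message (hypothetical, not kombu)."""
    def __init__(self, body):
        self.body = body
        self.state = "pending"
    def ack(self):
        self.state = "acked"
    def requeue(self):
        self.state = "requeued"

class FakeChannel:
    def __init__(self, bodies):
        self.msgs = [FakeMessage(b) for b in bodies]
    def messages(self):
        return list(self.msgs)

def consume(channel, handler):
    """Ack only after the handler succeeds; requeue on failure so a
    failing consumer redelivers instead of silently dropping data."""
    for message in channel.messages():
        try:
            handler(message.body)
        except Exception:
            message.requeue()
        else:
            message.ack()

def handler(body):
    if body == "bad":
        raise ValueError("simulated processing failure")

chan = FakeChannel(["good", "bad", "good"])
consume(chan, handler)
print([m.state for m in chan.msgs])  # ['acked', 'requeued', 'acked']
```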
- Extend the API to include StackTach-like operations for KPIs, etc.
Other Minor Nits:
- Ceilometer extensively uses the openstack.common library ... I'm not sure what this really buys us. It seems like there is a lot of boilerplate just to work with this. Could be a lot easier.
- This was discussed on the mailing list http://lists.openstack.org/pipermail/openstack-dev/2013-January/004846.html - dhellmann