Unified Instrumentation and Metering

Overview

With Ceilometer, Tach and StackTach, there are some very cool initiatives going on around instrumentation and metering/monitoring within OpenStack today.

However, due to the incredible speed of development within OpenStack, many of these efforts have been carried out in isolation from each other. We are now reaching a level of maturity that demands we stop reinventing the wheel and agree upon some shared infrastructure. This is necessary for a variety of reasons:

We want to make it easier for new projects to build on the existing OpenStack notification message bus.
Less code is good. We shouldn't need three different workers for extracting notifications from OpenStack.
Notifications are large and there are a lot of them. We only want to process and store that data once.
Archiving of data is a common problem. We shouldn't need several different ways of doing it.

Instrumentation vs Metering/Monitoring

This discussion rests on a clear understanding of the difference between Instrumentation and Metering/Monitoring.

Instrumentation

Think of instrumentation as the way electronics are tested in-circuit. While the device is running, probes are attached to the circuit board and measurements are taken.

File:UnifiedInstrumentationMetering$instrumentation.png

There are some key things to consider in this analogy:

Every technician may want to place their probes in different locations.
Probes might be placed for long-term measuring or for transient ("I wonder if ... ") scenarios.
The circuit does not have to change for the testing probes to be placed. No other groups or departments have to be involved for this instrumentation to occur.
The same probe technology can be used on other circuit boards. Likewise, our instrumentation probe-placement technology should not just be geared towards Nova. It should also work with all other parts of OpenStack.
When the circuit changes, our probe placement may have to change. We have to be aware of that.
The probes aren't perfect. They might slip off or have a spotty connection. We're looking for trends here, identifying when things are slow or flaky.
With respect to Python, we may be interested in stack traces as well, not just single-function timings and counts.
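
As a minimal sketch of the probe idea in Python, the snippet below wraps an existing callable with a timing probe at runtime and later removes it, without editing the instrumented code. The names attach_probe, measurements and FakeCompute are hypothetical and used only for illustration; this is not a description of how Tach or any other OpenStack project is implemented.

```python
import functools
import time
from collections import defaultdict

# Hypothetical in-memory sink; a real probe would forward to a collector.
measurements = defaultdict(list)


def attach_probe(owner, attr_name, label=None):
    """Wrap owner.attr_name with a timing probe; return a detach function.

    Assumes attr_name is a plain function or instance method, not a
    staticmethod or classmethod.
    """
    original = getattr(owner, attr_name)
    label = label or attr_name

    @functools.wraps(original)
    def probed(*args, **kwargs):
        start = time.perf_counter()
        try:
            return original(*args, **kwargs)
        finally:
            # Probes are best effort: a failed measurement must never
            # break the code being measured.
            try:
                measurements[label].append(time.perf_counter() - start)
            except Exception:
                pass

    setattr(owner, attr_name, probed)

    def detach():
        # Restore the original callable, removing the probe.
        setattr(owner, attr_name, original)

    return detach


class FakeCompute:
    """Stand-in for any service code we want to measure."""

    def boot(self, name):
        return "booted " + name


detach = attach_probe(FakeCompute, "boot", label="compute.boot")
result = FakeCompute().boot("vm1")   # measured call
detach()
FakeCompute().boot("vm2")            # no longer measured
```

Because the probe is attached from the outside and can be detached again, it fits both the long-term and the transient cases above. The probe is also tied to the name it wraps, so if that code is renamed or moved the placement has to follow.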

Metering / Monitoring

Metering is watching usage of the system, usually for the purposes of billing. Monitoring is watching the system for critical system changes, performance and accuracy, usually for things like SLAs.

Think of your power meter. You can go outside, watch the dial spin and confirm that your monthly bill jibes with what the meter is reporting.

File:UnifiedInstrumentationMetering$metering.png

The important aspects of metering and monitoring:

These events/measurements are critical. We cannot risk dropping an event.
We need to ensure these events are consistent between releases. Their consistency should be considered as important as OpenStack API consistency.
We have no idea how people are going to want to use these events, but we can safely assume there will be lots of other groups interested in them. We don't want these groups talking to the production OpenStack deployments directly.
These events may not be nearly as frequent as the instrumentation messages, but they will be a lot larger since the entire context of the message needs to be included (which instance, which image, which user, etc.).
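
To make the "entire context" point concrete, here is a rough sketch of a self-describing metering event. The field names, the REQUIRED_CONTEXT tuple and the build_event helper are assumptions made for illustration only; they are not the actual OpenStack notification schema.

```python
import json
import uuid
from datetime import datetime, timezone

# Hypothetical set of context fields every metering event must carry.
REQUIRED_CONTEXT = ("instance_id", "image_id", "user_id")


def build_event(event_type, context):
    """Return a self-contained event dict, refusing incomplete context."""
    missing = [key for key in REQUIRED_CONTEXT if key not in context]
    if missing:
        raise ValueError("missing context fields: " + ", ".join(missing))
    return {
        "message_id": str(uuid.uuid4()),  # lets consumers de-duplicate
        "event_type": event_type,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "payload": dict(context),         # full context travels with the event
    }


event = build_event(
    "compute.instance.create.end",
    {"instance_id": "inst-1", "image_id": "img-1", "user_id": "user-1"},
)
serialized = json.dumps(event)  # what would be published and archived
```

Carrying the whole context in each event is what makes the message large, but it also means downstream consumers can work from the event alone instead of querying the production deployment. Rejecting an event with missing fields at build time is one simple way to keep the shape consistent.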

A Proposal for a Common Infrastructure