UnifiedInstrumentationMetering

Revision as of 16:22, 1 November 2012 by SandyWalsh (talk)

Unified Instrumentation and Metering

/!\ WORK IN PROGRESS - DO NOT READ (YET)

Overview

With Ceilometer, Tach and StackTach, there are some very cool initiatives going on around instrumentation and metering/monitoring within OpenStack today.

However, due to the incredible speed of development within OpenStack, many of these efforts have been carried out in isolation from each other. We are now reaching a level of maturity that demands we stop reinventing the wheel and agree on some shared infrastructure, for a variety of reasons:

  1. We want to make it easier for new projects to build on the existing OpenStack notification message bus.
  2. Less code is good. We shouldn't need three different workers for extracting notifications from OpenStack.
  3. Notifications are large and there are a lot of them. We only want to process and store that data once.
  4. Archiving of data is a common problem; we shouldn't need several different ways of doing it.

Instrumentation vs Metering/Monitoring

At the very base of this discussion is having a clear understanding of the difference between Instrumentation and Metering/Monitoring.

Instrumentation

Think of instrumentation as the way they test electronics in-circuit. While the device is running, probes are attached to the circuit board and measurements are taken.

File:UnifiedInstrumentationMetering$instrumentation.png

There are some key things to consider in this analogy:

  • Every technician may want to place their probes in different locations.
  • Probes might be placed for long-term measuring or for transient ("I wonder if ...") scenarios.
  • The circuit does not have to change for the testing probes to be placed. No other groups or departments have to be involved for this instrumentation to occur.
  • The same probe technology can be used on other circuit boards. Likewise, our instrumentation probe-placement technology should not be geared only towards Nova. It should also work with all other parts of OpenStack.
  • When the circuit changes, our probe placement may have to change. We have to be aware of that.
  • The probes aren't perfect. They might slip off or have a spotty connection. We're looking for trends here, identifying when things are slow or flaky.
  • With respect to Python, we may be interested in stack traces as well, not just single function timings/counts.

Metering / Monitoring

Metering is watching usage of the system, usually for the purposes of billing. Monitoring is watching the system for critical system changes, performance and accuracy, usually for things like SLAs.

Think of your power meter. You can go outside and watch the dial spin and confirm your monthly bill jibes with what the meter is reporting.

File:UnifiedInstrumentationMetering$metering.png

The important aspects of metering and monitoring:

  • These events/measurements are critical. We cannot risk dropping an event.
  • We need to ensure these events are consistent between releases. Their consistency should be considered as important as OpenStack API consistency.
  • We have no idea how people are going to want to use these events, but we can safely assume there will be a lot of other groups interested in them. We don't want these groups talking to the production OpenStack deployments directly.
  • These events may not be nearly as frequent as the instrumentation messages, but they will be a lot larger, since the entire context of the message needs to be included (which instance, which image, which user, etc.)
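To make the "entire context" point concrete, here is a hedged sketch of the kind of envelope a metering event carries. The field names follow the general shape of nova notifications of this era (message_id, publisher_id, event_type, priority, timestamp, payload), but the specific keys and values shown are illustrative, not authoritative.

```python
# Sketch of a metering/monitoring event with its full context attached.
# Keys are modeled loosely on nova notifications; values are made up.
import json
import uuid
from datetime import datetime

def make_event(event_type, payload):
    """Wrap a payload in a notification envelope."""
    return {
        "message_id": str(uuid.uuid4()),
        "publisher_id": "compute.host-01",   # hypothetical publisher
        "event_type": event_type,
        "priority": "INFO",
        "timestamp": datetime.utcnow().isoformat(),
        "payload": payload,
    }

event = make_event("compute.instance.create.end", {
    "instance_id": "inst-123",   # which instance
    "image_ref": "image-456",    # which image
    "user_id": "user-789",       # which user
    "tenant_id": "tenant-abc",
})
print(json.dumps(event, indent=2))
```

Note how much of the message is context rather than measurement; this is why these events are large compared to instrumentation samples.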

Instrumentation Today

Rackspace has been using a cocktail of Tach, Statsd, Graphite and Nagios with great success for close to a year.

File:UnifiedInstrumentationMetering$instrumentation arch.gif

Tach is a library that collects timing/count data from anywhere in a Python program. It's not specific to OpenStack, but it has pre-canned config files for the main OpenStack services. Tach hooks into a program using monkey-patching and has a concept of Metrics and Notifiers. The Metrics are user-extensible hooks that pull data from the code. The Notifiers take the collected data and send it somewhere. Currently there are Metrics drivers for execution time and counts, as well as Notifiers for Statsd, Graphite (directly), print and log files. (SandyWalsh has been working on a replacement for Tach, called Scrutinize, which adds cProfile support and easier configuration. It's almost ready for prime-time.)
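The monkey-patching idea can be sketched in a few lines: a timing "Metric" wraps an existing function and reports each call to a pluggable "Notifier", without the target code changing. The names below (instrument, PrintNotifier) are illustrative and are not Tach's actual API.

```python
# Minimal Tach-style probe: time a function via monkey-patching and
# hand the measurement to a pluggable notifier. Illustrative only.
import functools
import time

class PrintNotifier:
    def notify(self, name, value):
        print("%s: %.6f sec" % (name, value))

def instrument(func, notifier):
    """Return a wrapper that times each call to func."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.time()
        try:
            return func(*args, **kwargs)
        finally:
            notifier.notify(func.__name__, time.time() - start)
    return wrapper

def slow_call():
    time.sleep(0.01)

# The "probe placement": rebind the name, the circuit never changes.
slow_call = instrument(slow_call, PrintNotifier())
slow_call()
```

Swapping PrintNotifier for a statsd or Graphite notifier is purely a configuration concern, which is what makes the probe placement reusable across services.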

Tach is launched as: tach tach.conf nova-compute nova.conf ... so it easily integrates with existing deployments.

The powerful features of statsd are:

  • UDP based messaging, so production is not at risk if the collectors die.
  • In-memory rollup/aggregate of measurements that get relayed to Graphite. This greatly enhances scalability.
  • Written in node.js = fast, fast, fast.
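The UDP point is worth a sketch: the sender fires a datagram and moves on, so a dead collector never blocks or crashes production code. The "name:value|ms" wire format below is statsd's actual timer format; the host and port are assumptions (8125 is statsd's conventional default).

```python
# Fire-and-forget a statsd timer over UDP. If the collector is down,
# the datagram is simply lost and production carries on.
import socket

def send_timer(name, ms, host="127.0.0.1", port=8125):
    """Send a statsd timer metric; returns the wire message."""
    msg = "%s:%d|ms" % (name, ms)
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(msg.encode("utf-8"), (host, port))
    except socket.error:
        pass  # no collector listening? not our problem
    finally:
        sock.close()
    return msg

print(send_timer("nova.compute.run_instance", 142))
```

This is the trade-off named above: we accept occasional lost samples (fine for trend-spotting) in exchange for zero risk to the production path.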

Monitoring/Metering Today

Certainly the face of Monitoring and Metering within OpenStack today is Ceilometer ... I'll just refer you to that project for more details.

But there are others. Within Rackspace we use YAGI to consume the notifications and send them to our internal billing system. Specifically, this data is sent to AtomHopper, where it is turned into an Atom feed for other consumers (one of which is billing). YAGI used to have PubSubHubBub support, but that's gone dormant due to other motivators. Now, AtomHopper is the redistribution system of choice. Sadly, AtomHopper is Java-based, so it may not work well within the OpenStack ecosystem, per se. The YAGI worker uses [1] and has been highly reliable in all of our environments, but there has been discussion of moving to kombu.

StackTach is a debugging/monitoring tool based on OpenStack notifications, and it too has its own worker. It is kombu-based and is currently used in production. We've had lots of problems making the StackTach worker reliable, but we think the problem has been with combining a threading model with eventlet. Our new scheme uses the multiprocessing library with per-rabbit workers. This is currently being stress tested, stay tuned. We've tried a variety of other schemes and library combinations with little success (more detail if needed).
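The per-rabbit multiprocessing scheme can be sketched as one OS process per rabbit host, each running its own blocking consumer, with no threads mixed into eventlet. The host names are hypothetical and consume() is a stand-in for a real kombu consumer loop.

```python
# Sketch of per-rabbit workers: one process per rabbit host, so a slow
# or wedged broker connection can't stall the others.
import multiprocessing

RABBIT_HOSTS = ["rabbit-1:5672", "rabbit-2:5672"]  # hypothetical hosts

def consume(host, out_queue):
    """Stand-in for a blocking kombu consumer bound to one rabbit."""
    # A real worker would connect to `host` and drain the notification
    # queue forever; here we just report that we started.
    out_queue.put("consuming from %s" % host)

if __name__ == "__main__":
    results = multiprocessing.Queue()
    procs = [multiprocessing.Process(target=consume, args=(h, results))
             for h in RABBIT_HOSTS]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    while not results.empty():
        print(results.get())
```

Process isolation is the point of the design: each consumer gets its own interpreter, sidestepping the threading-plus-eventlet interactions described above.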

StackTach is quickly moving into Metrics, SLA and Monitoring territory with version 2 and the inclusion of Stacky (the command-line interface to StackTach).

File:UnifiedInstrumentationMetering$notifications.gif

Attempting to Merge Instrumentation with Metering/Monitoring

This shouldn't be done at the low level. There may be commonality at the high level (traffic-lighting, reporting, graphing, alerts, etc.), but at the low level these are very different animals.

A Proposal for a Common Monitoring/Metering Infrastructure

There is a lot of common infrastructure between each of these efforts that we should unify. This includes:

The Low Level

  • The worker used to pull the data from OpenStack. Using a list of queues in the rabbit notifier is not a solution.
  • The database used to collect the data.
  • A redistribution system for feeding other consumers (and getting the data away from production as quickly as possible)

A Common Notification Worker

This is the low-hanging fruit: creating a scalable worker that can work in a multi-rabbit (multi-cell, multi-region) deployment. It should support a pluggable scheme for handling the collected data and support failover/redundancy. The worker has to be fast/reliable enough to keep the notification queue empty, since this is easily the fastest growing queue (neck and neck with the cells capacity update queue :)
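A rough sketch of the "pluggable scheme" idea: the worker drains notifications from the bus and fans each event out to whatever handlers are registered (a database writer, billing, monitoring, ...). All class and method names here are illustrative, not a proposed API.

```python
# Sketch of a notification worker with pluggable handlers. In production
# process() would be driven by a rabbit consumer; handlers must be fast
# enough, in aggregate, to keep the notification queue empty.
class NotificationWorker:
    def __init__(self):
        self.handlers = []

    def register(self, handler):
        """Plug in anything with a handle(event) method."""
        self.handlers.append(handler)

    def process(self, event):
        for handler in self.handlers:
            handler.handle(event)

class CountingHandler:
    """Trivial example handler: counts events it has seen."""
    def __init__(self):
        self.seen = 0

    def handle(self, event):
        self.seen += 1

worker = NotificationWorker()
counter = CountingHandler()
worker.register(counter)
worker.process({"event_type": "compute.instance.create.end"})
print(counter.seen)  # 1
```

The key property is that consumers extend the worker by registering handlers rather than by attaching their own workers to the production queues.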

A Common Database for Collected Data

A Common Event Redistribution System

Once

The High Level

  • The presentation/business-specific processing of the collected data

Other Improvements

  • Remove the Compute service that Ceilometer uses and integrate the existing fanout compute notifications into the data collected by the workers. There's no need for yet-another-worker.