Ceilometer/AlarmImprovements

Terminology

 * Meter – a measurement taken from some data source. CPU utilization, disk space, new-instances-per-hour would all be meters.
 * Sample – a single measurement of a meter. A meter could have metadata associated with it, but this doesn't need to be copied to the sample. The sample is just the value difference and the time of the measurement.
 * Notification – a JSON message sent from an OpenStack service via the notification bus. This is just a nested dict of key-value pairs. Notifications are atomic units that indicate something important happened. They are not transactional.
 * Event – a stripped-down notification. A notification has a lot of extra context data associated with it. Events are optimized for quick access and only contain the most important key-value data (Traits).
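
As a rough illustration of the terms above, the four concepts could be modeled as follows. The class names, field names and the `to_event` helper are assumptions made for this sketch and are not Ceilometer's actual data model; the notification is treated as a plain nested dict with `event_type`, `timestamp` and `payload` keys.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Dict, List


@dataclass
class Meter:
    """A named measurement source. Metadata lives here, not on each sample."""
    name: str
    unit: str
    metadata: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Sample:
    """One measurement of a meter: just a value and a time."""
    meter_name: str
    value: float
    timestamp: datetime


@dataclass
class Trait:
    """A typed key-value pair attached to an event."""
    name: str
    value: Any


@dataclass
class Event:
    """A notification stripped down to its most important traits."""
    event_type: str
    generated: datetime
    traits: List[Trait]


def to_event(notification: Dict[str, Any], wanted: List[str]) -> Event:
    """Hypothetical helper: keep only the wanted payload keys as traits."""
    payload = notification.get("payload", {})
    traits = [Trait(key, payload[key]) for key in wanted if key in payload]
    return Event(notification["event_type"], notification["timestamp"], traits)
```

Keeping metadata on the Meter rather than on each Sample is what keeps samples cheap to store in this sketch.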

Part 1 - The Sample Pipeline
Let's do a quick review of the CM architecture:




 * (1) Notifications come in via the queuing layer from the various OpenStack services.
 * (2) (3) The Notification Managers look for particular notifications by name and turn these into Samples. One notification may result in several samples.
 * (4) These samples are passed into one of many Sample Pipelines depending on the type of sample (wildcard matching). The pipeline is a series of transformers followed by a series of publishers. A transformer may alter the sample before passing it along to the next transformer.
 * The number of active pipelines fired from a single notification should be small: typically just one, and rarely more than five.
 * (5) The publishers take the final sample and send it somewhere. The most common publisher would be the RPC publisher, which does an RPC cast back to the collector services (via collector.service.record_metering_data).
 * The Dispatchers (6) take samples and events and do stuff with them in a user defined fashion. By default they are stored to the db, but they could optionally get sent to the log file or any other destination. This is all done in the collector.
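
The flow in steps (4) and (5) can be sketched in a few lines. This is a minimal illustration only: the `Pipeline` class, the `dispatch` function and the dict-based sample are assumptions for the example and do not mirror Ceilometer's real classes. Wildcard matching is done with the standard library's `fnmatchcase`.

```python
from fnmatch import fnmatchcase


class Pipeline:
    """A series of transformers followed by a series of publishers."""

    def __init__(self, pattern, transformers, publishers):
        self.pattern = pattern            # wildcard on the sample name
        self.transformers = transformers  # callables: sample -> sample
        self.publishers = publishers      # callables: sample -> None

    def matches(self, sample):
        return fnmatchcase(sample["name"], self.pattern)

    def process(self, sample):
        # Each transformer may alter the sample before the next one sees it.
        for transform in self.transformers:
            sample = transform(sample)
        # Every publisher receives the final sample.
        for publish in self.publishers:
            publish(sample)


def dispatch(pipelines, sample):
    """Hand the sample to every matching pipeline; return how many fired."""
    fired = [p for p in pipelines if p.matches(sample)]
    for pipeline in fired:
        pipeline.process(dict(sample))  # copy so pipelines stay independent
    return len(fired)


def to_percent(sample):
    """Example transformer: convert a 0..1 ratio into a percentage."""
    return dict(sample, volume=sample["volume"] * 100)


published = []  # stands in for a real publisher such as the RPC publisher
pipelines = [
    Pipeline("cpu*", [to_percent], [published.append]),
    Pipeline("disk.*", [], [published.append]),
]
dispatch(pipelines, {"name": "cpu_util", "volume": 0.5})
```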

Areas for Improvement

 * 1. Any time the RPC methods are used, especially cast, we lose all control over critical events. We have no idea if a call succeeded or failed and if we want to deal with retries we have to code it ourselves. We'd be better off using the notify methods with ack/requeue semantics. Let the queuing system deal with the hard stuff.
 * 2. Notification->Sample conversion requires code & changes to setup.cfg. This should be data-driven like the way the notification->event mapping is done (with plugins for complex cases). These samples should be generated from events rather than raw notifications. The Samples generated would be passed on a side-channel down the pipeline, and forwarded to the regular sample pipeline machinery when the pipeline is done. For more, see (https://blueprints.launchpad.net/ceilometer/+spec/event-sample-plugins)
 * 3. There's no clear reason for the distinction between transformers and publishers. Why can't they just do some unit of work?
 * 4. Theoretically the separation of Meter and Sample makes sense, but the code doesn't enforce this abstraction, which is the source of a lot of confusion.
 * 5. If a processing failure should occur anywhere in the sample pipeline, the notification is lost. There are no retries. http://lists.openstack.org/pipermail/openstack-dev/2013-August/013710.html
 * 6. If there are multiple collectors, there are no provisions for temporal ordering. No one collector will see all notifications from a given source. Accumulating transformers will give inaccurate results across multiple collectors.
 * 7. Idempotence must be socially enforced; it is not guaranteed by the framework. Our code reviews should consider this, and existing transformers and publishers should be re-evaluated.
 * 8. CM only listens on the .info queues right now. It needs to listen to .error as well.
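
Points 1, 5 and 7 are related: once failed messages are requeued instead of dropped, handlers may see the same message more than once. The sketch below simulates ack/requeue semantics with an in-memory queue. It is an assumption-laden stand-in for what a real message broker would do, not the notification API itself; `consume` and `max_attempts` are hypothetical names.

```python
from collections import deque


def consume(queue, handler, max_attempts=3):
    """Drain the queue, acking on success and requeuing on failure.

    Queue items are (message, attempts_so_far) tuples. A message that
    fails max_attempts times is set aside instead of being lost silently.
    """
    acked, dead = [], []
    while queue:
        message, attempts = queue.popleft()
        try:
            handler(message)
        except Exception:
            if attempts + 1 < max_attempts:
                queue.append((message, attempts + 1))  # requeue for retry
            else:
                dead.append(message)                   # give up, but keep it
        else:
            acked.append(message)                      # ack: work is done
    return acked, dead
```

Because a handler can run several times for one message, it has to be idempotent for the retry to be safe.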

Part 2 – The Trigger Pipeline
The coming trigger pipeline is a system for dealing with notifications as events (vs samples) and utilizing the relationships between events.



At first glance, the flow would appear very similar to the sample pipeline. It's the same in the sense that it's a pipeline, but the semantics of how it works are very different.


 * (1) as with the sample pipeline, a notification comes in from an OpenStack service
 * The Event Converter (2) is a data-driven process for extracting fields from a notification and turning them into Event (3) + Trait objects (a Trait is a typed key-value pair).
 * The event is handed to the Dispatchers for processing (4).
 * The trigger pipeline (5) is given the event for processing. This will either be done through the DB Dispatcher or the Collector Service itself (tbd).
 * Trigger pipelines are somewhat long-lived streams that are created once a triggering event is observed. A triggering event could be an event originating from an API service, an event that ends in .start or .end, or events that share a common Request ID trait. The trigger criteria are determined by a new configuration file designed for this purpose.
 * Once created, the stream will persist and share state across multiple collectors.
 * As new notifications/events arrive that match the criteria of any active pipelines, the events will be handed to them for inclusion in the stream. As you can tell, there could be as many active streams as there are key values. For example, there could be one active stream for each active Request ID being processed in the system.
 * Streams can die out from inactivity or be explicitly closed when an ending event is processed. For example, when compute.instance.run_instance.end arrives.
 * Once a stream is closed, the collected events are passed to the Operation plugins for processing as a whole. This could result in the generation of new notifications and/or new samples for storage. While events are mutable, the generation of new events or samples is preferred (with a back-link to the underlying event or stream).
 * Like the sample pipeline, operations are socially idempotent. They may be run many times on the same stream, but should not duplicate any new data they generate. For example, if an operation sends an email, it must do so only once. While this isn't actively enforced by the framework, notifications do support re-queuing on an error (cast rpc calls do not).
 * Temporal ordering is a core feature of the trigger pipeline. Even with multiple collectors, the final stream will contain all related events and in the correct order.
 * Event pipelines can be immediate, like the sample pipeline, processing events as they come in (they'd sit between the event converter and the dispatcher). This is useful for filtering. Or they can be attached to a trigger. Triggers can have two pipelines attached: a fire pipeline for when the trigger criteria are met, and an optional expire pipeline for when the trigger's TTL expires (this lets you generate events on timeouts for some activity).
 * At the end of the event pipeline, the data can be passed back to a dispatcher to save any newly synthesized events, or update changes (some changes required for this).
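
The stream lifecycle described above (collect by a key trait, close on an ending event or on inactivity, hand over the events in temporal order) might look roughly like this. It is a single-process sketch under stated assumptions: `StreamManager`, the dict-shaped events, the numeric timestamps and the `.end` suffix rule are illustrative, and the shared state across collectors is not modeled.

```python
class StreamManager:
    """Groups events into streams keyed by one trait (e.g. a Request ID)."""

    def __init__(self, key_trait="request_id", ttl=60.0):
        self.key_trait = key_trait
        self.ttl = ttl          # seconds of inactivity before a stream expires
        self.streams = {}       # key -> list of events, in arrival order
        self.last_seen = {}     # key -> newest event timestamp in the stream

    def add(self, event):
        """Add an event; return the closed stream if this event ends it."""
        key = event["traits"][self.key_trait]
        self.streams.setdefault(key, []).append(event)
        stamp = event["timestamp"]
        self.last_seen[key] = max(self.last_seen.get(key, stamp), stamp)
        if event["event_type"].endswith(".end"):
            return self._close(key)   # the "fire" case
        return None

    def expire(self, now):
        """Close every stream idle for longer than the TTL (the "expire" case)."""
        stale = [k for k, seen in self.last_seen.items() if now - seen > self.ttl]
        return [self._close(k) for k in stale]

    def _close(self, key):
        del self.last_seen[key]
        # Events may arrive out of order, so sort before handing them over.
        return sorted(self.streams.pop(key), key=lambda e: e["timestamp"])
```

The list returned on close is what an Operation plugin would receive for processing as a whole.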

Areas for Improvement
Since the trigger pipeline is currently under active development, please update the related blueprints:
 * https://blueprints.launchpad.net/ceilometer/+spec/notifications-triggers
 * https://blueprints.launchpad.net/ceilometer/+spec/notification-pipelines

From here we can see how events will be able to come into the system and how complex samples might be extracted from the event stream. Also, we've shown how events can produce samples and vice-versa. This is a rich feedback loop capable of producing some valuable information from OpenStack services.

Part 3 – The Alarm Framework
So, let's talk about the alarm framework in its current form ...



Currently, the alarm infrastructure lives as two external services that consume and publish CM data via the CM public api.


 * The user (1) can publish alarm definitions to CM via the CM API.
 * The Alarm Evaluation Service (2) periodically pulls the alarm definitions from CM and evaluates them all (this could result in many other calls to CM for sample state)
 * If an alarm condition is raised (3) an rpc call is sent back to the Alarm Notification Service for processing
 * The alarm notification service picks up these alarm conditions and notifies third-party applications as defined by the user.

Short Term Areas for Improvement
If we want to continue with this approach there are some minor tweaks we could do:


 * The alarm service pulls down all alarms on each query for evaluation. These should be cached, and only the alarms that have changed since the last query should be returned.
 * The alarm evaluator really needs a refactoring to use better state management. The TaskFlow effort could really help here.
 * Again, the use of rpc cast means alarm notifications could easily be lost and there are no means for retries.
 * Alarms that require heavy statistical data can put quite a load on CM for each evaluation. Much can be done in the way of precomputation/caching to help this. This was discussed previously here: https://wiki.openstack.org/wiki/Ceilometer/Alerting#Precalculation_of_aggregate_values
 * We should unify the alarm and sample pipeline publisher mechanisms. How many email/rpc/udp implementations do we need?
 * It is still an open question whether alarming belongs in the main CM packaging (http://lists.openstack.org/pipermail/openstack-dev/2013-August/012862.html).
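
The first tweak in the list (cache alarm definitions and fetch only what changed) could be sketched as below. Everything here is hypothetical: `AlarmCache`, the `fetch_changed(since)` callable and the `deleted` marker are assumptions for illustration, since the CM API described above does not necessarily offer a changed-since query.

```python
class AlarmCache:
    """Keeps alarm definitions locally and asks only for recent changes."""

    def __init__(self, fetch_changed):
        # fetch_changed(since) -> list of alarm dicts changed after `since`;
        # since=None means "return everything". Hypothetical interface.
        self._fetch_changed = fetch_changed
        self._alarms = {}
        self._last_sync = None

    def refresh(self, now):
        """Merge changes since the last sync; return all current alarms."""
        for alarm in self._fetch_changed(self._last_sync):
            if alarm.get("deleted"):
                self._alarms.pop(alarm["id"], None)
            else:
                self._alarms[alarm["id"]] = alarm
        self._last_sync = now
        return list(self._alarms.values())
```

Only the first call transfers the full set of definitions; later calls cost in proportion to what changed.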

Again, these are minor alterations.

Let's look at another way to tackle this problem ...

Part 4 – Moving Alarms into the Pipelines


We can solve the same problem in the following way:


 * Notifications come into the system as before (1)
 * The raw notifications are converted to samples and/or events using the methods described previously (2)
 * We would have sample pipeline (3) or trigger pipeline (4) plug-ins for alarm evaluation. These plug-ins would perform the same function as the alarm evaluation service.
 * There would be a common alarm evaluation library used by both sets of plug-ins, but they would look for alarm conditions either by high-low marks from the sample pipeline or critical events or series of events from the trigger pipeline.
 * When an alarm is raised, a new notification would be published to the queue topic (5) and picked up by the collector where a different set of pipelines might handle it.
 * We would be under the assumption that there is no difference between a “transformer” and a “publisher” and that any pipeline plugin could notify outside the system if so configured.
 * By unifying publishers and alarm notifiers we reduce code while extending the overall functionality of the system.

Smarter Evaluations
Alarms should be modeled internally as a directed acyclic graph (DAG), such that we only have to watch for nodes (events or samples) that might affect an alarm rule. If we see such an event/sample we have a very small set of alarm rules to process.

This is the opposite of the current scheme of running through all the alarms, fetching values and looking for alarm conditions. Instead, we look at samples and events and see if they're interesting or not.

Consider the following diagram:



Here, we have two alarm rules (host_busy and host_very_busy). The host_busy rule is triggered when the cpu and disk on a given host are over 80% capacity. This is a classic sample pipeline evaluation. When we see the host.cpu or host.disk samples coming through we would evaluate this rule. Otherwise we would never check it. When the host is busy, we generate a new notification called “$host.busy” (where $host is the hostname).

The second rule depends on seeing this $host.busy event. When one of these events comes in, we look to see if we've seen more than 10 of them within the last hour. This is something the trigger pipeline can handle. If we do see a lot of .busy events, we generate a $host.very_busy event. Again, we only evaluate this rule when we see a new .busy event come through.

We could have more sample/trigger pipelines that look for .very_busy events and have them send emails or talk to any of the stock notifiers.

Internally, this could be as simple as a map of sample names to the list of dependent alarm rules:

 {
   "$host.cpu": ["host_busy"],
   "$host.disk": ["host_busy"],
   "$host.host_busy": ["host_very_busy"],
   "$host.host_very_busy": ["trip_nagios", "email_ops", "warn_heat"],
   ...
 }
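
To make the idea concrete, here is a small sketch of the two rules driven by such a map. It is a simplification under stated assumptions: `AlarmGraph`, `DEPENDENTS` and the short names (`cpu`, `disk`, `busy`, `very_busy`, without the `$host` prefix) are invented for the example, the host is passed separately, and the downstream notifier rules are left out.

```python
from collections import defaultdict, deque

# Name of an incoming sample/event -> alarm rules that depend on it.
DEPENDENTS = {
    "cpu": ["host_busy"],
    "disk": ["host_busy"],
    "busy": ["host_very_busy"],
}


class AlarmGraph:
    def __init__(self):
        self.latest = defaultdict(dict)        # host -> {"cpu": pct, "disk": pct}
        self.busy_times = defaultdict(deque)   # host -> times of busy events
        self.evaluated = []                    # trace of the rules actually run

    def observe(self, host, name, value, now):
        """Feed one sample or event; return the names of generated events."""
        generated = []
        pending = deque([(name, value)])
        while pending:
            name, value = pending.popleft()
            # Only rules that depend on this name are evaluated at all.
            for rule in DEPENDENTS.get(name, []):
                self.evaluated.append(rule)
                new_event = getattr(self, rule)(host, name, value, now)
                if new_event:
                    generated.append(f"{host}.{new_event}")
                    pending.append((new_event, None))  # may trigger more rules
        return generated

    def host_busy(self, host, name, value, now):
        """Fires when both cpu and disk on the host are over 80%."""
        self.latest[host][name] = value
        levels = self.latest[host]
        if levels.get("cpu", 0) > 80 and levels.get("disk", 0) > 80:
            return "busy"
        return None

    def host_very_busy(self, host, name, value, now):
        """Fires on more than 10 busy events within the last hour."""
        times = self.busy_times[host]
        times.append(now)
        while times and now - times[0] > 3600:
            times.popleft()
        return "very_busy" if len(times) > 10 else None
```

A sample that no rule depends on costs one dictionary lookup and evaluates nothing, which is the point of the scheme. In this sketch the busy rule fires on every qualifying sample; a real implementation would likely track state transitions instead.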

Points to Ponder ...

 * 1. There is still the question of the CM API knowing about alarm specific concerns and the CM database having to contain alarm-related tables. We can discuss further if this merits being moved out into a separate package.
 * 2. The caching and building of the alarm DAG data structure could be handled by a dedicated pipeline. alarm_change notifications should be issued by the API and the evaluation DAG updated accordingly. This would have to be a fan-out event so that all collectors are updated.
 * 3. Does the trigger pipeline save the list of events that constituted the stream (so we don't have to re-compute it again later if needed)? Note: the pipeline could generate an event that contains the list of events in the payload. This way we get the retries and storage just like any other event.
 * 4. Could we not use riemann.io?
 * 5. With per-meter topics does that mean we need one collector per meter?
 * 6. Events may want to use statistics as well … tbd.
 * 7. Any meter/event statistic caching should be moved down into the storage layer (or externally) and kept away from the api layer so that plugins (that don't use the api) can benefit.
 * 8. We may want to separate internal notifications from OpenStack service notifications so we can monitor queue activity and assist operations in finding problems.
 * 9. A concern has been raised that alarms require access to the stats module, which is only available from the API and should be done in real-time (vs. precomputed). I think this is a misplaced concern. When dealing with Samples, there are tools like statsd which were created exactly to solve this problem because depending on the database doesn't scale. We should be sending our Samples to statsd for in-memory aggregation and relaying. Also, the performance of these statistical operations will largely depend on how the underlying schemas are represented in the nosql/sql databases. A better approach is to remove those from the equation and pre-compute.
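
As a rough picture of what pre-computing means in point 9, the sketch below keeps per-interval aggregates in memory so that an alarm rule can read a statistic without querying the database. It only imitates the general idea behind statsd-style aggregation; `Aggregator` and its methods are hypothetical and are not the statsd API.

```python
from collections import defaultdict


class Aggregator:
    """Keeps count/sum/min/max per meter and per fixed time interval."""

    def __init__(self, interval=60):
        self.interval = interval  # bucket width in seconds
        self.buckets = defaultdict(
            lambda: {"count": 0, "sum": 0.0, "min": None, "max": None}
        )

    def _bucket_start(self, timestamp):
        return int(timestamp // self.interval) * self.interval

    def record(self, meter, value, timestamp):
        """Fold one sample into the bucket covering its timestamp."""
        bucket = self.buckets[(meter, self._bucket_start(timestamp))]
        bucket["count"] += 1
        bucket["sum"] += value
        bucket["min"] = value if bucket["min"] is None else min(bucket["min"], value)
        bucket["max"] = value if bucket["max"] is None else max(bucket["max"], value)

    def stats(self, meter, timestamp):
        """Return the precomputed statistics for that interval, or None."""
        bucket = self.buckets.get((meter, self._bucket_start(timestamp)))
        if bucket is None:
            return None
        return dict(bucket, avg=bucket["sum"] / bucket["count"])
```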

TL;DR
The power of Ceilometer comes from being a highly efficient tool for collecting events, generating samples and making this data available to external systems. The sample and trigger pipelines provide a clear path for achieving this goal. Alarm processing is an important feature of a monitoring system, but alarm evaluation can be built to leverage these efforts instead of being something orthogonal code-wise.