Ceilometer/blueprints/monitoring


 * Launchpad Entry: CeilometerSpec:monitoring
 * Created: 28 Nov 2012
 * Contributors: Angus Salkeld

Summary
Note: this is a big spec. Where possible it is broken down into sub-specs to make it easier to share the work.

User stories
The purpose of Alarms is to notify a user when a meter matches certain criteria.

Some examples

 * "Tell me when the maximum disk utilization exceeds 90%"
 * "Tell me when the average CPU utilization exceeds 80% over 120 seconds"
 * "Tell me when my web app is becoming unresponsive" (loadbalancer latency meter)
 * "Tell me when my httpd daemon dies" (custom user script that checks daemon health)

How can you use Alarms
Create an alarm

 {
   'period': '300',
   'eval_periods': '2',
   'meter': 'CPUUtilization',
   'function': 'average',
   'operator': 'gt',
   'threshold': '50',
   'resource_id': 'inst-002',
   'source': 'OS/compute',
   'alarm_actions': ['rpc/my_notify_topic', 'http://bla.com/bla'],
   'ok_actions': ['rpc/my_notify_topic']
 }

This checks the "CPUUtilization" meter every 300 seconds. If the average CPUUtilization for inst-002 was above 50% in both of the last two 300-second periods, it sends an rpc notification on the "my_notify_topic" topic and posts the alarm details to http://bla.com/bla.

When the alarm condition no longer holds (here, when the average falls back below the threshold), the "ok_actions" are run.
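The evaluation described above can be sketched roughly as follows. This is only an illustration, not the planned implementation: the function names, the operator table, the state names ("alarm", "ok", "insufficient data") and the shape of the per-period input are all assumptions made for the example.

```python
import operator

# Comparison names as used in the example alarm; this mapping is an assumption.
OPERATORS = {
    "gt": operator.gt,
    "ge": operator.ge,
    "lt": operator.lt,
    "le": operator.le,
}


def evaluate_alarm(alarm, period_values):
    """Return the alarm state for the most recent evaluation periods.

    period_values holds one statistic value (e.g. the average) per
    `period`-second window, oldest first. This input format is hypothetical.
    """
    eval_periods = int(alarm["eval_periods"])
    threshold = float(alarm["threshold"])
    compare = OPERATORS[alarm["operator"]]

    recent = period_values[-eval_periods:]
    if len(recent) < eval_periods:
        # Not enough periods have been collected yet.
        return "insufficient data"
    if all(compare(value, threshold) for value in recent):
        # Every one of the last eval_periods windows breached the threshold.
        return "alarm"
    return "ok"


def actions_for_transition(alarm, old_state, new_state):
    """Pick the action list to run when the state changes."""
    if old_state == new_state:
        return []
    if new_state == "alarm":
        return alarm.get("alarm_actions", [])
    if new_state == "ok":
        return alarm.get("ok_actions", [])
    return []
```

With the example alarm, two consecutive 300-second averages above 50 would move the state to "alarm" and select the 'alarm_actions'; a later period at or below 50 would move it back to "ok" and select the 'ok_actions'.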

Assumptions
 * 1) We are trying to deliver CloudWatch-like functionality, but in an "openstack way" that can be extended.
 * 2) Kinds of metrics to monitor: http://docs.amazonwebservices.com/AmazonCloudWatch/latest/DeveloperGuide/CW_Support_For_AWS.html (these are really the same kinds of meters that ceilometer currently samples).
 * 3) Sample at intervals of between 10s and 60s, and transmit at intervals of between 1min and 5min.
 * 4) Try to reuse as much of the current ceilometer code as possible, so that the features we add can also be used by the metering side of ceilometer.

Design
The idea is to use most of ceilometer as-is, so the program flow is:

Data Insertion

The publisher has an option to emit samples at a faster rate (say every 60sec). It does so through a different transport that is more efficient than rpc and doesn't interfere with metering.

We could run a different collector (or the same one, to be decided) that inserts the samples into the db as is done now; only the transport differs.

"Blueprint: https://blueprints.launchpad.net/ceilometer/+spec/multi-publisher"
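One way to picture the faster emission path is a publisher that fans each sample out to several transports, each flushing at its own interval. The sketch below is purely illustrative: the class name `MultiPublisher`, its methods and the callable-based transports are hypothetical and do not reflect the multi-publisher blueprint's actual design.

```python
class MultiPublisher:
    """Fan samples out to several transports, each with its own interval.

    Hypothetical helper for illustration only.
    """

    def __init__(self):
        self._routes = []

    def add_transport(self, interval, send):
        """Register a transport.

        interval: minimum number of seconds between flushes.
        send: callable that receives a list of buffered samples.
        """
        self._routes.append(
            {"interval": interval, "send": send, "last": None, "buffer": []}
        )

    def publish(self, sample, now):
        """Buffer the sample on every route and flush the routes that are due."""
        for route in self._routes:
            route["buffer"].append(sample)
            due = route["last"] is None or now - route["last"] >= route["interval"]
            if due:
                route["send"](list(route["buffer"]))
                route["buffer"].clear()
                route["last"] = now
```

For example, a monitoring transport registered with an interval of 60 and a metering transport registered with an interval of 300 would both receive every sample, but the metering transport would be invoked far less often, so the faster path does not add load to it.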

API Auth

The API needs to be accessible to non-admin users, so that they can get their own data and control their own alarms. This should not be a problem, and work is planned for it.

"Blueprint: https://blueprints.launchpad.net/ceilometer/+spec/user-api"

Data Query

To handle aggregate queries (for autoscaling groups) we need to extend the query mechanism to be able to get statistics over a defined set of resources (usually the info is in the metadata).

"Blueprint: https://blueprints.launchpad.net/ceilometer/+spec/multi-dimensions"
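As a rough illustration of an aggregate query, the sketch below selects samples by a metadata key and averages them across the matching resources. The sample layout (dicts with "resource_id", "resource_metadata" and "volume" keys), the "autoscaling_group" metadata key and both function names are assumptions made for the example; the real query mechanism is what the blueprint above is meant to define.

```python
def select_by_metadata(samples, **wanted):
    """Keep samples whose resource metadata matches every wanted key/value."""
    return [
        sample
        for sample in samples
        if all(
            sample.get("resource_metadata", {}).get(key) == value
            for key, value in wanted.items()
        )
    ]


def average_volume(samples):
    """Average the 'volume' field, or return None when there are no samples."""
    if not samples:
        return None
    return sum(sample["volume"] for sample in samples) / len(samples)


# Hypothetical samples from three instances, two of them in the same group.
samples = [
    {"resource_id": "inst-001", "volume": 40.0,
     "resource_metadata": {"autoscaling_group": "web"}},
    {"resource_id": "inst-002", "volume": 60.0,
     "resource_metadata": {"autoscaling_group": "web"}},
    {"resource_id": "inst-003", "volume": 95.0,
     "resource_metadata": {"autoscaling_group": "db"}},
]

web_average = average_volume(select_by_metadata(samples, autoscaling_group="web"))
```

An alarm for an autoscaling group could then be evaluated against the group-wide statistic rather than against a single resource_id.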

We need to extend the API to be able to list the meter types across the resources in a tenant.

We need to support more statistics functions: max, min, average, count within a defined period.

"Blueprint: https://blueprints.launchpad.net/ceilometer/+spec/api-aggregate-average"
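The per-period statistics could be computed along the following lines. This is a minimal in-memory sketch, assuming samples arrive as (timestamp_in_seconds, value) pairs; the function name and the result keys are hypothetical, and a real implementation would more likely push this work down into the storage backend.

```python
from collections import defaultdict


def period_statistics(samples, period, start=0):
    """Group (timestamp, value) pairs into fixed-size periods.

    Returns one dict per non-empty period, ordered by period start,
    with the min, max, average and count of the values in that period.
    """
    buckets = defaultdict(list)
    for timestamp, value in samples:
        if timestamp < start:
            continue  # ignore samples before the query window
        index = int((timestamp - start) // period)
        buckets[index].append(value)

    return [
        {
            "period_start": start + index * period,
            "min": min(buckets[index]),
            "max": max(buckets[index]),
            "average": sum(buckets[index]) / len(buckets[index]),
            "count": len(buckets[index]),
        }
        for index in sorted(buckets)
    ]
```

Called with a period of 300, this yields one entry per 300-second window, which is the kind of input an alarm with 'period': '300' would need for each of its evaluation periods.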

Support Posting new sample data

"Blueprint: https://blueprints.launchpad.net/ceilometer/+spec/meter-post-api"

Alarm Detection

TODO

Alarm Notification

TODO

Implementation
This section should describe a plan of action (the "how") to implement the changes discussed. Could include subsections like:

API Changes

 * new alarm rest resource
 * new alarm history rest resource
 * need changes to make statistics aggregation more flexible
 * need a new post meter data API
 * need a new list meters API

Code Changes
Code changes should include an overview of what needs to change, and in some cases even the specific details.

Migration
Include:
 * data migration, if any
 * redirects from old URLs to new ones, if any
 * how users will be pointed to the new way of doing things, if necessary.

Test/Demo Plan
This need not be added or completed until the specification is nearing beta.

Unresolved issues
This should highlight any issues that should be addressed in further specifications, not problems with the specification itself, since any specification with problems cannot be approved.

BoF agenda and discussion
Use this section to take notes during the BoF; if you keep it in the approved spec, use it for summarising what was discussed and note any options that were rejected.