
Ceilometer/Alerting

This page will provide the future specification of ceilometer alarming

WIP


Alarm definition

Proposed alarm definition; not yet reviewed, remarks are welcome.

   {   
       "id": "123456789",
       "name": "SwiftObjectAlarm",
       "description": "A alarm is raise when 2 aggregates of 240 seconds are greater than 2.0 swift objects",
       "timestamp": "2013-04-08T05:17:13.698331", # timestamp of last update 
       "counter_name": "storage.objects",
       "user_id": "a3b70c53b94648438b442b485606e7cf", # owner of the alarm
       "project_id": "bfe523aebf2f4e5e9a997a9452927811", # project owner of the alarm
       "aggregate_period": 240.0,
       "evaluation_period": 2,
       "statistic": "average",
       "metadatas": {"project_id": "c96c887c216949acbdfbd8b494863567"}, # the project_id used to match a sample or not        "comparison_operator": "gt",
       "threshold": 2.0,
       "state" : 1,
       "state_timestamp": "2013-04-08T05:17:13.698331",
       "alarm_actions": [ "http://site:8000/%(state_text)s" ],
       "ok_actions": [ "http://site:8000/%(state_text)s" ],
       "insufficient_data_actions": [ "http://site:8000/%(state_text)s" ]
   }       

(from: https://review.openstack.org/#/c/22671/)
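To illustrate how these fields fit together, here is a minimal, hypothetical evaluation sketch in Python; the get_statistics helper and the shape of the aggregates it returns are assumptions for illustration only, not part of the proposal.

   # Hypothetical sketch only: shows how the alarm fields above could drive
   # threshold evaluation. get_statistics() is an assumed helper returning one
   # dict of aggregates (keyed by statistic name) per aggregate_period, newest last.
   import operator

   OPERATORS = {"gt": operator.gt, "ge": operator.ge, "lt": operator.lt,
                "le": operator.le, "eq": operator.eq}

   def evaluate_alarm(alarm, get_statistics):
       """Return 'alarm', 'ok' or 'insufficient data' for one evaluation cycle."""
       aggregates = get_statistics(alarm["counter_name"],
                                   alarm["aggregate_period"],
                                   alarm["metadatas"])
       needed = alarm["evaluation_period"]
       if len(aggregates) < needed:
           return "insufficient data"
       compare = OPERATORS[alarm["comparison_operator"]]
       recent = [agg[alarm["statistic"]] for agg in aggregates[-needed:]]
       # every one of the last `evaluation_period` aggregates must breach the threshold
       if all(compare(value, alarm["threshold"]) for value in recent):
           return "alarm"
       return "ok"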

API definition

Proposition for an initial API definition, a classic CRUD REST API; not yet reviewed, remarks are welcome. An illustrative client sketch follows the endpoint list below.

  • [GET ] /alarms -- list the alarms

Returns a list of alarms (like the one described above) and 200 on success.

  • [GET ] /alarms/<alarm> -- get the alarm description

Returns the alarm with id <alarm> (like the one described above) and 200 on success.

  • [POST ] /alarms -- add an alarm

Adds an alarm with the content of the request body. Returns no body and 201 on success.

  • [PUT ] /alarms/<alarm> -- update the alarm

Updates the alarm with id <alarm> with the content of the request body. Returns no body and 200 on success.

  • [DELETE] /alarms/<alarm> -- delete the alarm

Deletes the alarm with id <alarm>, along with all precalculated metric aggregates. Returns no body and 200 on success.

In addition, we must allow:

  • to retrieve the alarm state change history
  • to retrieve the aggregated_metrics that match an alarm.
  • ...
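
As an illustration of the endpoints above, the following client sketch uses the Python requests library; the base URL and the payload are assumptions for the sake of example, not a reviewed part of the API.

   # Illustrative calls against the proposed CRUD API; the base URL below is an
   # assumption, and the payload reuses fields from the alarm definition above.
   import requests

   BASE = "http://ceilometer-api:8777/v2"   # hypothetical API endpoint

   new_alarm = {
       "name": "SwiftObjectAlarm",
       "counter_name": "storage.objects",
       "aggregate_period": 240.0,
       "evaluation_period": 2,
       "statistic": "average",
       "comparison_operator": "gt",
       "threshold": 2.0,
       "alarm_actions": ["http://site:8000/%(state_text)s"],
   }

   # POST /alarms -- add an alarm; expect 201 and no body
   resp = requests.post(BASE + "/alarms", json=new_alarm)
   assert resp.status_code == 201

   # GET /alarms -- list the alarms; expect 200 and a list of alarm definitions
   alarms = requests.get(BASE + "/alarms").json()
   alarm_id = alarms[0]["id"]

   # PUT /alarms/<alarm> -- update; expect 200 and no body
   requests.put(BASE + "/alarms/" + alarm_id, json=dict(new_alarm, threshold=5.0))

   # DELETE /alarms/<alarm> -- delete; expect 200 and no body
   requests.delete(BASE + "/alarms/" + alarm_id)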

Alarm Action Endpoint definition

The alarm action may be either 'log://', in which case the alarm will be logged to your Ceilometer log file, or an http(s) endpoint (i.e., a URL), in which case the contents of the alarm will be POSTed, JSON-formatted, to the URL specified.

The three alarm states of ALARM, OK and Insufficient Data POST the following data to the URL:

Status is ALARM

   {"current": "alarm", "alarm_id": "742873f0-97f0-4d99-87da-b5f7c7829b7f", "reason": "Remaining as alarm due to 1 samples outside threshold, most recent: 0.138333333333", "previous": "alarm"}

Status is OK

   {"current": "ok", "alarm_id": "742873f0-97f0-4d99-87da-b5f7c7829b7f", "reason": "Remaining as ok due to 1 samples inside threshold, most recent: 0.138333333333", "previous": "ok"}

Status is Insufficient Data

   {"current": "insufficient data", "alarm_id": "742873f0-97f0-4d99-87da-b5f7c7829b7f", "reason": "1 datapoints are unknown", "previous": "ok"}

Metric Storage definition

Precalculation of aggregate values

One subject of recent discussion is whether aggregates (sum, min, max, average etc.) of metric datapoints should be aggressively precalculated as each individual datapoint is ingested, then stored in the main metric store so that aggregate values are always available when alarm threshold evaluation occurs.

While this approach seems attractive at first blush, there are a number of potential issues to consider:

  • how long to keep the aggregate values for? (queries based on recent datapoints will strongly dominate for alarm threshold evaluation, though less so for charting/trending applications)
  • what period to precalculate for? (query period may be different from the "natural" cadence of the metric)
  • which aggregates to precalculate? (need to be aware a priori of which aggregates will be queried)
  • how to handle the pathological case for wide-dimension metrics? (e.g. 50 instances in an autoscaling group reporting backfilled metrics would lead to recalculation of the average a total of 50 times instead of once)
  • is the overhead prohibitive to recalculate more exotic future aggregates such as percentiles?


A variation on the precalculation-on-ingestion approach is to delay the aggregate calculation until the evaluation period is considered "finished". However, this is difficult to determine accurately, as we need to allow some margin for backfilling of late metric datapoints.

An alternative approach is to calculate the aggregate for a specific period only if and when this is actually queried, and then persist the value in a distributed cache available to the alarm threshold evaluation layer (e.g. based on memcached or redis). Since this cache would be intended only for use by the alarm threshold evaluation layer, the retention period can be tuned specifically for this purpose, where the time window of interest generally relates to the recent past. The cached aggregate value(s) would need to be marked as dirty if a further backfilled datapoint is received for the period in question. The on-demand nature of the aggregate calculation in this approach would ensure that we avoid precalculating unnecessarily for an aggregate or period that's never queried.
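
A rough sketch of this query-time calculation with a dirty-on-backfill cache is given below; the in-process dict stands in for a distributed cache such as memcached or redis, and the compute() callback is an assumption.

   # Sketch of the "calculate on query, then cache" approach; an in-process dict
   # stands in for memcached/redis and the compute() callback is assumed to
   # aggregate the raw samples for the given period.
   import time

   CACHE_TTL = 3600   # retention tuned for the recent past used by evaluation
   _cache = {}        # (counter_name, period_start, period_len) -> (expires_at, value)

   def get_aggregate(counter_name, period_start, period_len, compute):
       """Return the aggregate for a period, computing it only when first queried."""
       key = (counter_name, period_start, period_len)
       entry = _cache.get(key)
       if entry and entry[0] > time.time():
           return entry[1]
       value = compute(counter_name, period_start, period_len)
       _cache[key] = (time.time() + CACHE_TTL, value)
       return value

   def on_backfilled_sample(counter_name, sample_timestamp, period_len):
       """Mark the cached aggregate dirty when a late datapoint arrives for its period."""
       period_start = sample_timestamp - (sample_timestamp % period_len)
       _cache.pop((counter_name, period_start, period_len), None)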