Ceilometer/blueprints/alarm-audit-api-group-by

The user stories

  • User wants a health chart of Nova instances (health is represented by the number of alarms that fired)

- Each instance is represented by the number of alarms that fired in a given time period.

- Instances can be grouped by the metadata resource_id, by meter_name, and by other alarm attributes (all of these need to be saved in the alarm event, so that a later change to the alarm won't affect them).

- Instances can be queried by a regular Ceilometer query, and also by state transition (the user is interested only in transitions to the 'alarm' state)

The result will be something like:

 [ {
    'resource_id': 'x',
    'meter_name': 'y',
    'count': 'z',
    'timestamp': ...
    },
    ... ]

- Here the result is grouped by resource_id and meter_name, but it could equally be grouped by resource_id only (a sketch of this aggregation is shown below).
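
A minimal sketch of this aggregation (illustrative only, not part of the current Ceilometer API; the count_transitions name and the transition record fields are assumptions):

 from collections import Counter

 def count_transitions(transitions, group_by=('resource_id', 'meter_name'),
                       to_state='alarm'):
     # Count transitions into `to_state`, grouped by the given alarm-event
     # attributes; `transitions` is assumed to be an iterable of dicts with
     # 'resource_id', 'meter_name', 'state' and 'timestamp' keys.
     counts = Counter(
         tuple(t[attr] for attr in group_by)
         for t in transitions
         if t['state'] == to_state
     )
     return [dict(zip(group_by, key), count=n) for key, n in counts.items()]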


  • User wants to display the health of multiple instances on a timeline

Same as 1., but the results would additionally be grouped by time (with a limit); a sketch follows the example below.

The result would be something like this:

[[ {
    'resource_id': 'x',
    'meter_name': 'y',
    'count': 1,
    'timestamp': ...
   },
   {
    'resource_id': 'x',
    'meter_name': 'y',
    'count': 2,
    'timestamp': ...
   }],
  [{
    'resource_id': 'x2',
    'meter_name': 'y',
    'count': 3,
    'timestamp': ...
   },
   {
    'resource_id': 'x2',
    'meter_name': 'y',
    'count': 4,
    'timestamp': ...
   }], ...]
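
A rough sketch of this time-bucketed variant (again illustrative only; the period handling and field names are assumptions, reusing the hypothetical grouping idea from story 1):

 from collections import Counter, defaultdict
 from datetime import datetime, timedelta

 def count_by_period(transitions, period=timedelta(hours=1),
                     group_by=('resource_id', 'meter_name'), to_state='alarm'):
     counts = Counter()
     for t in transitions:
         if t['state'] != to_state:
             continue
         # Truncate the (datetime) timestamp to the start of its period.
         bucket = int(t['timestamp'].timestamp() // period.total_seconds())
         counts[(tuple(t[attr] for attr in group_by), bucket)] += 1
     # Emit one inner list per group, one entry per period, as in the example above.
     grouped = defaultdict(list)
     for (key, bucket), n in sorted(counts.items()):
         start = datetime.fromtimestamp(bucket * period.total_seconds())
         grouped[key].append(dict(zip(group_by, key), count=n, timestamp=start))
     return list(grouped.values())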
  • User wants to see a health map of projects

Instead of a resource ID, the query would use a project ID. For this to work also for alarms defined on a resource ID, an alarm on a resource would always have to carry a project as well.
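
For illustration only, a hypothetical stored alarm-event record carrying both identifiers (field names are assumptions, not the exact schema):

 alarm_event = {
     'alarm_id': 'a1',
     'resource_id': 'x',
     'project_id': 'p1',   # always recorded, even for resource-scoped alarms
     'meter_name': 'y',
     'state': 'alarm',
     'timestamp': '2013-09-12T20:04:00Z',
 }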

  • User wants to be able to order the results by any result attribute (especially count) and to limit the number of results
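
A minimal sketch of ordering and limiting applied to the grouped results from the earlier sketches (the order_and_limit name and parameters are assumptions, not a proposed API):

 def order_and_limit(rows, order_by='count', descending=True, limit=None):
     # Sort the aggregated rows by any result attribute, then optionally truncate.
     rows = sorted(rows, key=lambda r: r[order_by], reverse=descending)
     return rows if limit is None else rows[:limit]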

API definition


Discussion

Just playing devil's advocate here, but I have a number of concerns about the direction this is going in.

1. this approach seems to presuppose a strong and static link between alarms on the one hand and resources on the other, whereas no such link exists in the current implementation. Instead alarms are simply defined on the basis of some criteria that are used to bound a statistics query (the so-called "matching metadata") - that's it, no more, no less, and deliberately so. These criteria can result in the statistics compared to the alarm threshold deriving from one or many resources, and the set of resources that feed into a particular alarm can change over time.
We specifically wanted to avoid the scenario found in AWS CloudWatch Alarms, whereby widely dimensioned alarms can only be defined in very specific, limited ways. Instead for ceilometer, the idea was to leave the dimensioning completely free-format, so that anything that could be expressed as an equality-based metadata query could be used as the alarm matching metadata rule.
So I'd be concerned that this approach is attempting to retrofit a static resource<-->alarms mapping that we (IIUC) tried to avoid in the original design.
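
(For illustration only: a rough sketch of an equality-based matching-metadata rule of the kind described above; the field names are indicative rather than the exact alarm API.)

 alarm = {
     'name': 'cpu_high',
     'meter_name': 'cpu_util',
     'comparison_operator': 'gt',
     'threshold': 90.0,
     'matching_metadata': {
         # Arbitrary equality constraints on sample metadata; these may match
         # one resource, many resources, or a changing set of resources.
         'metadata.user_metadata.stack': 'my_autoscaling_group',
     },
 }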


2. this approach seems to conflate the state of some set of alarms with the overall "health" of the resource, which seems like a faulty assumption to me, for several reasons ...
  • it depends on the alarm rule how bad a reflection on the resource's health each transition to the alarm state actually is. For example, it would make perfect sense to define an IdleCpu alarm with a rule that required max CPU util to remain under 10% for some period. Does the fact that this alarm fires indicate anything "bad" about the health of the resource?
  • it may be the case that the alarm does indeed indicate some badness has occurred, but that may be aggregated over many resources and not necessarily a reflection on the health of an individual resource. Say for example an alarm is based on the max CPU utilization across an autoscaling group being in excess of 90%. If this alarm fires because a single outlier in the autoscaling group is extremely over-loaded (due to, say, some unlucky load-balancing), does that say anything bad about the many other instances in the group that might be ticking along happily with CPU utilization below 50%?
So essentially it seems to me that a simple count of the number of alarms associated with a particular resource that fired over some period is not a very useful measure of the health of that individual resource. In fact it may be quite misleading - the instance (image, volume, whatever) could have a high alarm count yet still be very healthy, or vice versa.

Just some food for thought ...