Ceilometer/blueprints/alarm-audit-api-group-by

The user stories
- Each instance is represented by the number of alerts in a given time period.
 * User wants a health chart of Nova instances (health is represented by the number of alerts)

- Instances can be grouped by metadata resource_id, meter names and other alarm attributes (all of these need to be saved in the alarm event, so that a later change to the alarm won't affect it).

- Instances can be queried with a regular ceilometer query, and also by state transition (e.g. the user is interested only in transitions to the 'alarm' state)

Result will be something like

[ { 'resource_id': 'x', 'meter_name': 'y', 'count': 'z', 'timestamp'... }, ... ] - here it was grouped by resource_id and meter_name, but I could also group by resource_id only
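The grouping described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the proposed API: the function name `count_alarm_events`, the in-memory list of event dicts, and the `state` field are all hypothetical.

```python
from collections import Counter

def count_alarm_events(events, group_by=("resource_id", "meter_name"), state=None):
    """Count alarm events per combination of the group_by attributes.

    `events` is a hypothetical list of dicts; `state`, if given, keeps only
    events whose transition ended in that state (e.g. 'alarm').
    """
    if state is not None:
        events = [e for e in events if e["state"] == state]
    counts = Counter(tuple(e[k] for k in group_by) for e in events)
    # One result row per group, carrying the grouping attributes and the count.
    return [dict(zip(group_by, key), count=n) for key, n in sorted(counts.items())]

events = [
    {"resource_id": "x", "meter_name": "cpu_util", "state": "alarm"},
    {"resource_id": "x", "meter_name": "cpu_util", "state": "ok"},
    {"resource_id": "x", "meter_name": "cpu_util", "state": "alarm"},
    {"resource_id": "x2", "meter_name": "cpu_util", "state": "alarm"},
]
by_resource_and_meter = count_alarm_events(events, state="alarm")
by_resource = count_alarm_events(events, group_by=("resource_id",), state="alarm")
```

Passing a shorter `group_by` tuple gives the "group by resource_id only" variant.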

Same as 1., but grouped by time (limit)
 * User wants to display the health of multiple instances on a timeline

Result would be something like this

[
  [ { 'resource_id': 'x',  'meter_name': 'y', 'count': 1, 'timestamp'... },
    { 'resource_id': 'x',  'meter_name': 'y', 'count': 2, 'timestamp'... } ],
  [ { 'resource_id': 'x2', 'meter_name': 'y', 'count': 3, 'timestamp'... },
    { 'resource_id': 'x2', 'meter_name': 'y', 'count': 4, 'timestamp'... } ],
  ...
]
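A minimal Python sketch of the time-grouped variant follows. It assumes event timestamps are integer seconds and that `period` is the bucket width in seconds; the function name and the event layout are hypothetical, chosen only to mirror the result shape above.

```python
from collections import defaultdict

def alarm_counts_over_time(events, period, group_by=("resource_id", "meter_name")):
    """Return one time series (list of rows) per group, bucketed by `period` seconds."""
    series = defaultdict(lambda: defaultdict(int))
    for e in events:
        key = tuple(e[k] for k in group_by)
        bucket = e["timestamp"] - e["timestamp"] % period  # start of the bucket
        series[key][bucket] += 1
    return [
        [dict(zip(group_by, key), count=n, timestamp=ts)
         for ts, n in sorted(buckets.items())]
        for key, buckets in sorted(series.items())
    ]

events = [
    {"resource_id": "x", "meter_name": "y", "timestamp": 0},
    {"resource_id": "x", "meter_name": "y", "timestamp": 10},
    {"resource_id": "x", "meter_name": "y", "timestamp": 70},
    {"resource_id": "x2", "meter_name": "y", "timestamp": 5},
]
timeline = alarm_counts_over_time(events, period=60)
```

Each inner list is the series for one group, ordered by bucket start time, which is the shape a timeline chart would consume.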


 * User wants to see health map of projects

Instead of a resource ID in the query, there will be a project ID. To make this work for alarms defined on a resource ID as well, an alarm for a resource would always have to contain a project too.


 * User wants to be able to order the results by any of the result attributes (especially count) and to limit the results
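Ordering and limiting can be illustrated on rows of the shape shown earlier. This is a sketch only; `order_and_limit` and its parameters are hypothetical names, and a real implementation would presumably push the sort and limit down to the storage backend.

```python
def order_and_limit(rows, order_by="count", descending=True, limit=None):
    """Sort result rows by one attribute and optionally keep only the first `limit`."""
    ordered = sorted(rows, key=lambda row: row[order_by], reverse=descending)
    return ordered if limit is None else ordered[:limit]

rows = [
    {"resource_id": "x", "count": 3},
    {"resource_id": "x2", "count": 7},
    {"resource_id": "x3", "count": 5},
]
top_two = order_and_limit(rows, limit=2)
```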

Discussion
[eglynn | 12.10.2013]

Just playing devil's advocate here, but I have a number of concerns about the direction this is going in.


 * 1. this approach seems to pre-suppose a strong and static link between alarms on the one hand and resources on the other, whereas no such link exists in the current implementation. Instead alarms are simply defined on the basis of some criteria that are used to bound a statistics query (the so-called "matching metadata") - that's it, no more, no less, and deliberately so. These criteria can result in the statistics compared to the alarm threshold deriving from one or many resources; also, the set of resources that feed into a particular alarm can change over time.


 * We specifically wanted to avoid the scenario found in AWS CloudWatch Alarms, whereby widely dimensioned alarms can only be defined in very specific, limited ways. Instead for ceilometer, the idea was to leave the dimensioning completely free-format, so that anything that could be expressed as an equality-based metadata query could be used as the alarm matching metadata rule.


 * So I'd be concerned that this approach is attempting to retrofit a static resource<-->alarms mapping that we (IIUC) tried to avoid in the original design.


 * 2. this approach seems to conflate the state of some set of alarms with the overall "health" of the resource, which seems like a faulty assumption to me, for several reasons ...


 * how bad a reflection of the resource health each transition to the alarm state actually represents depends on the alarm rule. For example, it would make perfect sense to define an IdleCpu alarm with a rule that required max CPU util to remain under 10% for some period. Does the fact this alarm fires indicate anything "bad" about the health of the resource?
 * it may be the case that the alarm does indeed indicate some badness has occurred, but that may be aggregated over many resources and not necessarily a reflection on the health of an individual resource. Say for example an alarm is based on the max CPU utilization across an autoscaling group being in excess of 90%. If this alarm fires because a single outlier in the autoscaling group is extremely over-loaded (due to say some unlucky load-balancing), does that say anything bad about the many other instances in the group that might be tipping happily along with CPU utilization below 50%?


 * So essentially it seems to me that a simple count of the number of alarms associated with a particular resource that fired over some period is not a very useful measure of the health of that individual resource. In fact it may be quite misleading - the instance (image, volume, whatever) could have a high alarm count yet still be very healthy, or vice versa.

Just some food for thought ...

[lsmola | 13.10.2013]

I should probably change the name of this blueprint; it should say alarm-events-api (similar to how the sample-api bp was created)

I would try to reduce the concerns a bit in a few points:
 * 1. A general alarm-events-api


 * There should be a way to query the alarm events across the alarms. This does not create any assumptions about the data.


 * 2. Identifying a bad behavior  (Undercloud use)


 * Identifying bad behavior won't be an easy task; I do realize it could be hard or impossible to achieve with the current meters. However, the main use case for this is not monitoring VMs in the Overcloud, but the actual bare-metal machines in the Undercloud that host the VMs (check out Tuskar and TripleO for more). A Hardware Agent for collecting bare-metal metrics is in progress. We will probably need to add some more metrics to be able to detect truly bad behavior.


 * All of this relies on my being able to query the events in a general way, though.


 * You can check out the wireframe of what the health chart could look like in the future: http://file.brq.redhat.com/~jcoufal/openstack-m/user_stories/racks_detail-overview.pdf


 * 3. The useful queries over events


 * To be honest, I am not yet sure how we want to display the alarms for Overcloud users in Horizon. And I do not want to make assumptions about which alarms users will want to track and how they will connect them to the resources or some aggregates, nor about whether they want to track them because they point to bad, interesting or even good behavior. This will be a kind of brain teaser for the Horizon community in the near future.


 * The alarm-events-api should allow general queries. So once we decide that some alarms of some meter (or meters) are interesting for us to display, we should be able to do that. The health chart is just one example. It could track e.g. cpu_util alarms with condition > 90%, so I could find out with one query that there were 1000 alarms of this type in one week. But again this is just an example.


 * 4. The group by


 * The group by should be optional. Without it, the query should return all the alarm events matching the query. E.g. if I want a list of alarm events (that had an actual alarm transition) for cpu_util over the last week, I should be able to get that in one query.


 * I wonder whether we need more aggregate methods for group-by than just a COUNT

Thank you very much for the feedback. I am expecting a bigger conversation about this once there are first wireframes for Horizon that contain some charts based on alarms. Then we will have something more concrete for Horizon (right now, there is at least the use case for the Undercloud). I will change the user stories after our discussion, because they don't fully show what this bp should be about.

[eglynn | 15.10.2013]

OK, so I've been mulling this over for a while, and here's a radical thought to consider - alarms actually get in the way of what you want to achieve; instead, you'd be able to extract a more accurate (and actionable) picture directly from the underlying statistics.

This is primarily because you've no direct control over the criteria associated with these user-alarms and no good way of determining the signal-to-noise ratio from these alarms in the aggregate.

If I understand correctly, the idea here is to draw attention to the resources that have had questionable health status over the recent past, say the last week. This questionable health status can be thought of in terms of some well-understood measure such as CPU util > 90%. So why not simply cut out the middleman (i.e. the user-defined alarms) and go straight for the underlying statistics?

So instead of asking the question: "how many cpu_util-related alarms fired against this resource in the past week", you would ask more direct questions such as "for how long was the CPU util for this instance up over 90%" or "which instances had CPU util in excess of 90% at some point over the past week".
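The two direct questions above can be sketched over raw samples. This is an illustration under stated assumptions rather than a ceilometer API call: samples are taken to be `(timestamp_seconds, value)` pairs sorted by time, each interval between consecutive samples is attributed to the earlier sample's value, and both function names are hypothetical.

```python
def time_above_threshold(samples, threshold):
    """Approximate for how many seconds the value stayed above `threshold`."""
    total = 0
    for (t0, v0), (t1, _) in zip(samples, samples[1:]):
        if v0 > threshold:
            total += t1 - t0  # interval is charged to the earlier sample
    return total

def resources_exceeding(samples_by_resource, threshold):
    """Return the resources whose value exceeded `threshold` at least once."""
    return sorted(
        resource
        for resource, samples in samples_by_resource.items()
        if any(value > threshold for _, value in samples)
    )

cpu_util = {
    "instance-a": [(0, 50), (60, 95), (120, 97), (180, 40)],
    "instance-b": [(0, 20), (60, 30), (120, 25), (180, 35)],
}
seconds_hot = time_above_threshold(cpu_util["instance-a"], 90)
hot_instances = resources_exceeding(cpu_util, 90)
```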

[lsmola | 16.10.2013]

Interesting point. That could actually work, and the time-series chart could be more informative if it showed e.g. the real cpu_util over time with a line at 90%, instead of just alarm/ok over time. Also, using the condition defined in an alarm, I should be able to query the statistics directly and add any aggregate query to this, right?

The only thing I don't know is how to handle logical combinations of alarms. I am not sure, but the queries do not support all logical combinations, right? So in this case, the alarms serve as wrappers for conditions that could be hard or impossible to get directly from the stats (at least it would be a costly computation, I think): https://blueprints.launchpad.net/ceilometer/+spec/alarming-logical-combination. I might be missing something there, so this assumption can be wrong from the start :-)

We will probably allow the user to add his own meter to the Health measurement (probably even to build his own stats pages). Because customers are allowed to define their own pollsters and metrics, they should be able to view what they need. So this probably can't be hardcoded; we should count on any combination of alarms.

I am in the process of collecting all the metrics and alerts we need for now. So I will check what can be obtained from statistics and what would need this BP.

[eglynn | 15.10.2013]

> Also, using the condition defined in an alarm, I should be able to query the statistics directly and add any aggregate query to this, right?

Yes, it's straight-forward to determine from the threshold evaluation logs what the statistics query was. The difference is that this query would be executed continually over a short time frame (the sliding evaluation window, which depends on the evaluation periods and period configured for the alarm). Whereas you'd be more interested in the maxima over a longer duration such as the past week.

> I am not sure, but the queries do not support all logical combinations, right?

For normal alarms no, but a new iteration of the API enabling logical combination of alarms has just been proposed.

> We will probably allow user to add his own meter to the Health measurement

That's an interesting idea, so would you envisage this user-defined meter being distinguished in some way, e.g. on the basis of a naming convention? (e.g. meter name having a 'health_' prefix)

Would the user also be responsible for POSTing the samples associated with their custom meter?

[lsmola | 16.9.2013]

> I am not sure, but the queries do not support all logical combinations, right?

> > For normal alarms no, but a new iteration of the API enabling logical combination of alarms has just been proposed.

The question was more about whether I am able to do the same query over statistics that I am able to do over the Alarm Logical Combination. When I have a composite alarm that fires upon the condition composite_alarm = (alarm_1 OR alarm_2) AND alarm_3, I will not be able to skip the alarm layer and use just the statistics query to show the real values of composite_alarm. For this, it seems I can do it only via the alarm-events-api. Right? Or am I missing something?

> We will probably allow user to add his own meter to the Health measurement

> > That's an interesting idea, so would you envisage this user-defined meter being distinguished in some way, e.g. on the basis of a naming convention? (e.g. meter name having a 'health_' prefix)

> > Would the user also be responsible for POSTing the samples associated with their custom meter?

We will probably add a whole bunch of meters with their own pollsters, getting samples from bare-metal machines, so yes, the whole process including getting samples. Not sure about distinguishing them, though. Some naming convention could be handy.

Also, I am thinking that distinguishing system alarms from user alarms could be a good idea. If the alarm is used for some automatic decisions (like auto-scaling with Heat alarms), I would take it as a system one and treat it carefully. So maybe some way to tag alarms? Then I could e.g. set my own alarms, tag them as network-error and show a chart of network errors by querying them. What do you think?

[eglynn | 17.9.2013]

> I will not be able to skip the alarm layer and use just the statistics query to show the real values of composite_alarm. For this, it seems I can do it only via the alarm-events-api. Right? Or am I missing something?

One way of viewing alarms would be as a simple convenience layer over the statistics API. Instead of querying directly & constantly, an alarm allows the user to have a certain condition against a statistics query checked frequently, with notifications generated when a threshold is crossed. However at the core of the mechanism is always a query on the statistics API (or multiple queries in the case of combination alarms, which always eventually map onto underlying threshold-oriented alarms).

So my point is simply that these equivalent statistics queries can always be issued directly, without the alarming system getting in the way. The real value provided by alarms is around constancy and currency - the alarm threshold evaluation service frequently invokes the equivalent statistics query on the most recent samples available.
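To make the idea of issuing the equivalent queries directly more concrete, here is a hedged Python sketch of combining the results of several per-meter statistics queries on the client side. Everything in it is an assumption for illustration: the per-period dicts stand in for the results of separate statistics queries, and the thresholds and the function name are hypothetical.

```python
def periods_matching(cpu_util, memory, swap):
    """Return period start times where (alarm_1 OR alarm_2) AND alarm_3 holds.

    Each argument maps a period start time to the statistic for that period,
    as if it came from one statistics query per meter. The hypothetical rules:
      alarm_1: cpu_util > 90, alarm_2: memory > 90, alarm_3: swap > 30
    """
    # Only periods present in all three result sets can be evaluated.
    periods = sorted(set(cpu_util) & set(memory) & set(swap))
    return [
        p for p in periods
        if (cpu_util[p] > 90 or memory[p] > 90) and swap[p] > 30
    ]

cpu_util = {0: 95, 60: 50, 120: 92}
memory = {0: 40, 60: 95, 120: 91}
swap = {0: 35, 60: 40, 120: 10}
matching = periods_matching(cpu_util, memory, swap)
```

The logical combination happens after the queries return, so it does not require the statistics query itself to support OR.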

However in your use-case, IIUC, the time horizon is much longer. Instead of wanting an up-to-date notification that some condition has been met for the most recent statistics, you're more interested in knowing how the statistic trend evolved over a longer time-span, e.g. the past week - how much of that time the statistic had some value that could be considered anomalous for instance. Is that a correct interpretation? If so, then it seems to me that deriving this from the primary source (i.e. the statistics API) is a surer approach than relying on alarms defined outside your direct control.

> Also, I am thinking that distinguishing system alarms from user alarms could be a good idea.

That is certainly an interesting idea. We currently have no good way of capturing that distinction. As far as ceilometer is concerned, an alarm is just an alarm, regardless of whether it was created by the Heat engine or a normal user. What would be the major distinguishing factor in your mind - that a "system alarm" was created by an administrative user, or by a service that forms part of the openstack "infrastructure", or something else?

[lsmola | 18.9.2013]

> However in your use-case, IIUC, the time horizon is much longer. Instead of wanting an up-to-date notification that some condition has been met for the most recent statistics, you're more interested in knowing how the statistic trend evolved over a longer time-span, e.g. the past week - how much of that time the statistic had some value that could be considered anomalous for instance. Is that a correct interpretation? If so, then it seems to me that deriving this from the primary source (i.e. the statistics API) is a surer approach than relying on alarms defined outside your direct control.

Yes, though a statistics query doesn't allow the same queries that you can achieve with composite alarms, right? You can't do e.g. (project_1.cpu_util > 90% AND (project_1.memory > 90% OR project_1.swap > 30%)) with one statistics query, right? You could only do (project_1.cpu_util > 90% AND project_1.memory > 90% AND project_1.swap > 30%), if I understand correctly how querying is implemented.

> That is certainly an interesting idea. We currently have no good way of capturing that distinction. As far as ceilometer is concerned, an alarm is just an alarm, regardless of whether it was created by the Heat engine or a normal user. What would be the major distinguishing factor in your mind - that a "system alarm" was created by an administrative user, or by a service that forms part of the openstack "infrastructure", or something else?

Oh yeah, if e.g. the Heat alarms are created by the Heat user, I think that could be enough. Otherwise we would need to tag them somehow.

[eglynn | 19.9.2013]

IRC log of further discussion.