Ceilometer/blueprints/alarm-audit-api-group-by

Revision as of 21:00, 15 September 2013

The user stories

  • User wants a health chart of nova instances (health is represented by the number of alarms)

- Each instance is represented by the number of alarms in a given time period.

- Instances can be grouped by the metadata resource_id, meter_name and other alarm attributes (all of these need to be saved in the Alarm event, so that a later change of the Alarm won't affect them).

- Instances can be queried by a regular ceilometer query, and also by state transition (the user is interested only in transitions to the 'alarm' state)

Result will be something like

 [ {
    'resource_id': 'x',
    'meter_name': 'y',
    'count': 'z',
    'timestamp': '...'
    }, ... ]

- Here the result was grouped by resource_id and meter_name, but I could also group by resource_id only.
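
As a purely illustrative sketch (the /v2/alarm-events endpoint and the groupby/aggregate parameter names are hypothetical here, since this API is only being proposed), such a query could look like:

    GET /v2/alarm-events?q.field=state&q.op=eq&q.value=alarm
                        &q.field=timestamp&q.op=ge&q.value=2013-10-07T00:00:00
                        &groupby=resource_id&groupby=meter_name&aggregate=count

i.e. a normal ceilometer-style query over the alarm events, restricted to transitions into the 'alarm' state, plus group-by and aggregate parameters producing the counts shown above.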


  • User wants to display the health of multiple instances on a timeline

Same as 1., but it would also be grouped by time (with a limit).

Result would be something like this

[[ {
    'resource_id': 'x',
    'meter_name': 'y',
    'count': 1,
    'timestamp': '...'
   },
   {
    'resource_id': 'x',
    'meter_name': 'y',
    'count': 2,
    'timestamp': '...'
   }],
  [{
    'resource_id': 'x2',
    'meter_name': 'y',
    'count': 3,
    'timestamp': '...'
   },
   {
    'resource_id': 'x2',
    'meter_name': 'y',
    'count': 4,
    'timestamp': '...'
   }], ...]
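
Purely as a sketch of the same hypothetical endpoint, the nested per-period result above could be requested by adding a period (and limit) to the group-by query, for example:

    GET /v2/alarm-events?q.field=state&q.op=eq&q.value=alarm
                        &groupby=resource_id&groupby=meter_name
                        &aggregate=count&period=86400&limit=100

where period=86400 would bucket the counts per day; the parameter names are again only assumptions for illustration.
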
  • User wants to see a health map of projects

Instead of a resource ID, the query will contain a project ID. To make this work also for alarms defined on a resource ID, an alarm for a resource would always have to contain a project as well.

  • User wants to be able to order the results by any of the result attributes (especially count) and to limit the results

API definition


Discussion

[eglynn | 12.10.2013] Just playing devil's advocate here, but I have a number of concerns about the direction this is going in.

1. This approach seems to presuppose a strong and static link between alarms on the one hand and resources on the other, whereas no such link exists in the current implementation. Instead, alarms are simply defined on the basis of some criteria that are used to bound a statistics query (the so-called "matching metadata") - that's it, no more, no less, and deliberately so. These criteria can result in the statistics compared to the alarm threshold deriving from one or many resources, and the set of resources that feed into a particular alarm can change over time.
We specifically wanted to avoid the scenario found in AWS CloudWatch Alarms, whereby widely dimensioned alarms can only be defined in very specific, limited ways. Instead, for ceilometer the idea was to leave the dimensioning completely free-format, so that anything that could be expressed as an equality-based metadata query could be used as the alarm's matching-metadata rule.
So I'd be concerned that this approach is attempting to retrofit a static resource<-->alarms mapping that we (IIUC) tried to avoid in the original design.
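
Just to make the free-format dimensioning concrete (this fragment is only an example, and the metadata field name is made up), an alarm's matching criteria can be any equality-based metadata query, for instance:

    'query': [
        {'field': 'metadata.user_metadata.server_group', 'op': 'eq', 'value': 'web_tier'}
    ]

which may draw its statistics from one resource, many resources, or a set of resources that changes over time.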


2. This approach seems to conflate the state of some set of alarms with the overall "health" of the resource, which seems like a faulty assumption to me, for several reasons ...
  • it depends on the alarm rule how bad a reflection of the resource's health each transition to the alarm state actually is. For example, it would make perfect sense to define an IdleCpu alarm with a rule that requires max CPU util to remain under 10% for some period (a sketch of such an alarm follows below). Does the fact that this alarm fires indicate anything "bad" about the health of the resource?
  • it may be the case that the alarm does indeed indicate some badness has occurred, but that badness may be aggregated over many resources and is not necessarily a reflection on the health of an individual resource. Say, for example, an alarm is based on the max CPU utilization across an autoscaling group being in excess of 90%. If this alarm fires because a single outlier in the autoscaling group is extremely over-loaded (due to, say, some unlucky load-balancing), does that say anything bad about the many other instances in the group that might be ticking happily along with CPU utilization below 50%?
So essentially it seems to me that a simple count of the number of alarms associated with a particular resource that fired over some period is not a very useful measure of the health of that individual resource. In fact it may be quite misleading - the instance (image, volume, whatever) could have a high alarm count yet still be very healthy, or vice versa.
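
As a rough sketch of the IdleCpu example above (the payload follows my understanding of the Havana-era v2 threshold-alarm format, so treat the exact field names as an assumption rather than a definitive API reference), such an alarm might be created with something like:

    POST /v2/alarms
    {
        'name': 'idle_cpu',
        'type': 'threshold',
        'threshold_rule': {
            'meter_name': 'cpu_util',
            'statistic': 'max',
            'comparison_operator': 'lt',
            'threshold': 10.0,
            'period': 600,
            'evaluation_periods': 6,
            'query': [{'field': 'resource_id', 'op': 'eq', 'value': '<instance-uuid>'}]
        }
    }

Every transition of this alarm into the 'alarm' state would bump the resource's "unhealthy" count in the proposed view, even though an idle CPU says nothing bad about the instance.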

Just some food for thought ...

[lsmola | 13.10.2013] I should probably change the name of this blueprint; it should say alarm-events-api (similar to the way the sample-api bp was created).

I will try to address the concerns in a few points:

1. A general alarm-events-api
  • There should be a way to query the alarm events across all alarms. This does not make any assumptions about the data.
2. Identifying bad behavior (Undercloud use)
  • Identifying bad behavior won't be an easy task; I realize it could be hard or impossible to achieve with the current meters. However, the main use case for this is not monitoring VMs in the Overcloud, but the actual bare-metal machines in the Undercloud that host the VMs (check out Tuskar and TripleO for more). There is a Hardware Agent for collecting bare-metal metrics in progress. We will probably need to add some more metrics to be able to detect really bad behavior.
  • All of this, though, relies on being able to query the events in a general way.
3. Useful queries over the events
  • To be honest, I am not yet sure how we want to display the alarms for Overcloud users in Horizon, and I do not want to make assumptions about what alarms users will want to track, how they will connect them to resources or some aggregates, or whether they will track them because they point to bad, interesting, or even good behavior. This will be a kind of brain teaser for the Horizon community in the near future.
  • The alarm-events-api should allow general queries, so that once we decide that some alarms of some meter (or meters) are interesting for us to display, we are able to do that. The health chart is just one example: it could track e.g. cpu_util alarms with the condition > 90%, so I could find out with one query that there were 1000 alarms of this type in one week. But again, this is just an example.
4. The group by
  • The group-by should be optional. Without it, the query should simply return all the alarm events matching the query. E.g. if I want a list of alarm events (those with an actual alarm transition) for the last week for cpu_util, I should be able to get that in one query (a sketch of both variants follows below this list).
  • I wonder whether we need more aggregate methods for group-by than just a COUNT
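
Purely for illustration (again assuming the hypothetical /v2/alarm-events endpoint and parameter names from the user stories above), the two variants could look like:

    # all cpu_util alarm events with a transition to 'alarm' over the last week
    GET /v2/alarm-events?q.field=meter_name&q.op=eq&q.value=cpu_util
                        &q.field=state&q.op=eq&q.value=alarm
                        &q.field=timestamp&q.op=ge&q.value=2013-10-08T00:00:00

    # the same query, but grouped and counted per resource
    GET /v2/alarm-events?...&groupby=resource_id&aggregate=count

The second form is what would answer "there were 1000 cpu_util alarms of this type in one week" in a single call.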

Thank you very much for the feedback. I am expecting a bigger conversation about this once there are first wireframes for Horizon that contain some charts based on alarms; then we will have something more concrete for Horizon. (Right now, there is at least the use case for the Undercloud.) I will change the user stories after our discussion, because they don't fully show what this bp should be about.

[eglynn | 15.10.2013]

OK, so I've been mulling this over for a while, and here's a radical thought to consider - alarms actually get in the way of what you want to achieve; instead, you'd be able to extract a more accurate (and actionable) picture directly from the underlying statistics.

This is primarily because you have no direct control over the criteria associated with these user-defined alarms and no good way of determining the signal-to-noise ratio from these alarms in the aggregate.

If I understand correctly, the idea here is to draw attention to the resources that have had questionable health status over the recent past, say the last week. This questionable health status can be thought of in terms of some well-understood measure such as CPU util > 90%. So why not simply cut out the middleman (i.e. the user-defined alarms) and go straight for the underlying statistics?

So instead of asking the question "how many cpu_util-related alarms fired against this resource in the past week?", you would ask more direct questions such as "for how long was the CPU util for this instance over 90%?" or "which instances had CPU util in excess of 90% at some point over the past week?".
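
As a sketch of that statistics-based approach (this targets the existing /v2/meters/<meter>/statistics resource and assumes the statistics group-by support added in Havana; treat the exact parameters as an approximation), the second question could be answered roughly as follows:

    GET /v2/meters/cpu_util/statistics?q.field=timestamp&q.op=ge&q.value=2013-10-08T00:00:00
                                       &period=3600&groupby=resource_id

and then, on the client side, keep only those per-resource, per-hour buckets whose max (or avg) exceeds 90 - the count of such buckets for an instance also gives a rough answer to "for how long".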