Difference between revisions of "Vitrage/Blueprints/templates"

Revision as of 11:09, 11 October 2015

Introduction

In Vitrage we plan to analyze patterns of alarms and other system events, and to perform actions when a pattern is detected. To support complex algorithms such as root cause analysis (RCA), deduced alarms and alarm aggregation, we need a way to express the conditions (triggers) for the calculation and the resulting actions. We should define a language with a logical representation of rules such as the following:

If
- Ceilometer agent is down on host1 and
- host1 contains vm1 and
- we failed to get metrics on vm1
Then
- Determine that the root cause of {failure to get metrics on vm1} is {Ceilometer agent down on host1}

Another example:

If
- There is high CPU on host2 and
- host2 contains several vms
Then
- Deduce that the hosted vms have sub-optimal performance
- Raise alarms on these vms
- Set their states to sub-optimal

The defined language should let us describe conditions on different kinds of resources, their properties and the relations between them. We should then be able to define which actions are taken when the conditions are met, e.g. determine RCA, raise/disable alarms, set resource states, etc. The model should be generic and flexible, so that defining a new rule does not require code changes.
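
To make the "no code change" goal concrete, the first example above (Ceilometer agent down on host1) could be written as plain data and evaluated by a generic matcher. This is only a minimal sketch under assumptions: the dictionary layout, the alarm names `ceilometer_agent_down` and `metrics_unavailable`, the `on` relation and the `matches` helper are all hypothetical and are not a Vitrage format or API. The matcher is a brute-force search, suitable only for illustration.

```python
from itertools import permutations

# Hypothetical rule: conditions (entities + relations) and actions are data.
RULE = {
    "entities": {
        "host": {"type": "host"},
        "vm": {"type": "vm"},
        "agent_down": {"type": "alarm", "name": "ceilometer_agent_down"},
        "no_metrics": {"type": "alarm", "name": "metrics_unavailable"},
    },
    "relations": [
        ("host", "contains", "vm"),
        ("agent_down", "on", "host"),
        ("no_metrics", "on", "vm"),
    ],
    "actions": [
        {"do": "mark_causes", "source": "agent_down", "target": "no_metrics"},
    ],
}

# Hypothetical system state: vertex id -> properties, plus labeled edges.
VERTICES = {
    "host1": {"type": "host"},
    "vm1": {"type": "vm"},
    "a1": {"type": "alarm", "name": "ceilometer_agent_down"},
    "a2": {"type": "alarm", "name": "metrics_unavailable"},
}
EDGES = {
    ("host1", "contains", "vm1"),
    ("a1", "on", "host1"),
    ("a2", "on", "vm1"),
}


def matches(rule, vertices, edges):
    """Yield each binding of rule entity names to vertex ids that satisfies the rule."""
    names = list(rule["entities"])
    # Brute force: try every assignment of distinct vertices to entity names.
    for combo in permutations(vertices, len(names)):
        binding = dict(zip(names, combo))
        props_ok = all(
            vertices[binding[name]].get(key) == value
            for name in names
            for key, value in rule["entities"][name].items()
        )
        rels_ok = all(
            (binding[src], label, binding[dst]) in edges
            for src, label, dst in rule["relations"]
        )
        if props_ok and rels_ok:
            yield binding
```

Adding a second rule would then mean adding another dictionary like `RULE`, while `matches` stays unchanged.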

Use Cases

Direct Causal Relationship Calculation

In this use case, we want to indicate that, given a specific configuration of related resources and the alarms on them, one or more alarms are caused by another alarm.

Condition:
- A combination of physical resources, virtual resources and alarms, optionally with conditions on their properties.
Resulted Action:
- Determine and mark the "causes" relation between the alarms.
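
As an illustration of this condition/action pair, the sketch below marks a "causes" relation from an alarm on a host to an alarm on a resource that the host contains. It is a hedged example, not Vitrage code: the function `mark_causes`, the alarm names and the `(source, label, target)` edge tuples are assumptions made for the example.

```python
def mark_causes(alarms, edges, cause_name, effect_name):
    """Return 'causes' edges from each cause alarm to each effect alarm
    raised on a resource contained by the cause alarm's resource."""
    new_edges = set()
    for cause_id, cause in alarms.items():
        if cause["name"] != cause_name:
            continue
        for effect_id, effect in alarms.items():
            if effect["name"] != effect_name:
                continue
            # Condition on the topology: the cause's resource contains the effect's.
            if (cause["resource"], "contains", effect["resource"]) in edges:
                new_edges.add((cause_id, "causes", effect_id))
    return new_edges


# Hypothetical state: a3 is on a VM hosted elsewhere, so it is not linked.
alarms = {
    "a1": {"name": "ceilometer_agent_down", "resource": "host1"},
    "a2": {"name": "metrics_unavailable", "resource": "vm1"},
    "a3": {"name": "metrics_unavailable", "resource": "vm9"},
}
edges = {("host1", "contains", "vm1"), ("host2", "contains", "vm9")}

causes = mark_causes(alarms, edges, "ceilometer_agent_down", "metrics_unavailable")
```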

Deduced Alarms and States

Based on the state of the physical/virtual resources and/or the alarms that were raised, we can deduce that there must be other problems in the system, even if we received no specific alarms about them. In that case, we would like to raise the relevant alarms and/or modify the states of the relevant resources.

Condition:
- A combination of physical resources, virtual resources and alarms, optionally with conditions on their properties.
Resulted Action:
- Raise alarms
- Modify resource states
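
The high-CPU example from the introduction fits this use case. The following sketch, under stated assumptions, deduces an alarm and a new state for every VM on a host that has a high-CPU alarm. The names `deduce`, `high_cpu` and `suboptimal_performance` are hypothetical, and returning the results instead of applying them is a choice made only for this example.

```python
def deduce(alarms, resources, contains):
    """Return (deduced_alarms, new_states) for VMs hosted on a host with high CPU.

    alarms:    list of (alarm_name, resource_id) pairs
    resources: resource_id -> resource type
    contains:  host_id -> list of hosted vm ids
    """
    deduced_alarms = []
    new_states = {}
    for name, resource_id in alarms:
        if name != "high_cpu" or resources.get(resource_id) != "host":
            continue
        for vm_id in contains.get(resource_id, []):
            deduced_alarms.append(("suboptimal_performance", vm_id))
            new_states[vm_id] = "sub-optimal"
    return deduced_alarms, new_states


# Hypothetical state: host2 has high CPU and hosts two VMs.
resources = {"host2": "host", "vm1": "vm", "vm2": "vm"}
contains = {"host2": ["vm1", "vm2"]}
alarms = [("high_cpu", "host2")]

deduced_alarms, new_states = deduce(alarms, resources, contains)
```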

Alarm Aggregation

When there are many alarms in the system, identifying the most important ones can be difficult. We would like to aggregate the alarms based on certain criteria, for example:

- Aggregate by root cause; by default show only the root cause alarm, and allow drilling down to all other alarms
- Aggregate by a specific resource
- Aggregate by alarm type

Condition:
- A combination of physical resources, virtual resources and alarms, optionally with conditions on their properties.
Resulted Action:
- Mark aggregation relations on the existing alarms
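
Aggregation by root cause can be sketched as follows: each alarm is followed along its "caused by" link until an alarm with no known cause is reached, and alarms are grouped under that root. This is an illustrative sketch only. It assumes that each alarm has at most one direct cause and that the links contain no cycles; `aggregate_by_root_cause` and the `caused_by` mapping are hypothetical names.

```python
def aggregate_by_root_cause(alarm_ids, caused_by):
    """Group alarms under their root cause.

    caused_by maps an alarm id to the id of its direct cause.
    Returns root alarm id -> list of alarms it (transitively) caused.
    """
    groups = {}
    for alarm in alarm_ids:
        root = alarm
        # Walk up the chain of causes (assumed acyclic) to the root alarm.
        while root in caused_by:
            root = caused_by[root]
        groups.setdefault(root, [])
        if root != alarm:
            groups[root].append(alarm)
    return groups


# Hypothetical chain: a1 causes a2, a2 causes a3; a4 is unrelated.
groups = aggregate_by_root_cause(
    ["a1", "a2", "a3", "a4"],
    {"a2": "a1", "a3": "a2"},
)
```

A default view would list only the keys of `groups`, and a drill-down on a key would show the alarms grouped under it.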

@@ Line 21: / Line 21: @@
 The defined language should let us describe conditions on different kinds of resources, their properties and the relations between them. We should then be able to define which actions are taken when the conditions are met, e.g. determine RCA, raise/disable alarms, set resource states, etc. The model should be generic and flexible, so that defining a new rule does not require code changes.
+== Use Cases ==
+=== Direct Causal Relationship Calculation ===
+In this use case, we want to indicate that, given a specific configuration of related resources and the alarms on them, one or more alarms are caused by another alarm.
+* Condition:
+** A combination of physical resources, virtual resources and alarms, optionally with conditions on their properties.
+* Resulted Action:
+** Determine and mark the "causes" relation between the alarms.
+=== Deduced Alarms and States ===
+Based on the state of the physical/virtual resources and/or the alarms that were raised, we can deduce that there must be other problems in the system, even if we received no specific alarms about them. In that case, we would like to raise the relevant alarms and/or modify the states of the relevant resources.
+* Condition:
+** A combination of physical resources, virtual resources and alarms, optionally with conditions on their properties.
+* Resulted Action:
+** Raise alarms
+** Modify resource states
+=== Alarm Aggregation ===
+When there are many alarms in the system, identifying the most important ones can be difficult. We would like to aggregate the alarms based on certain criteria, for example:
+* Aggregate by root cause; by default show only the root cause alarm, and allow drilling down to all other alarms
+* Aggregate by a specific resource
+* Aggregate by alarm type
+* Condition:
+** A combination of physical resources, virtual resources and alarms, optionally with conditions on their properties.
+* Resulted Action:
+** Mark aggregation relations on the existing alarms