Difference between revisions of "Vitrage/Blueprints/templates"

Revision as of 11:16, 11 October 2015

Introduction

In Vitrage we plan to analyze patterns of alarms and other system events, and to perform actions when a pattern is detected. To support complex algorithms such as root cause analysis (RCA), deduced alarms and alarm aggregation, we need a way to express the conditions (triggers) for the calculation and the resulting actions. We should define a language with a logical representation of, for example:

  • If
    • Ceilometer agent is down on host1 and
    • host1 contains vm1 and
    • we failed to get metrics on vm1
  • Then
    • Determine that the root cause of {failure to get metrics on vm1} is {Ceilometer agent down on host1}


Another example:

  • If
    • There is high CPU on host2 and
    • host2 contains several vms
  • Then
    • Deduce that the hosted vms have sub-optimal performance
    • Raise alarms on these vms
    • Set their states to sub-optimal


The defined language should let us describe conditions on different kinds of resources, their properties and the relations between them. We should then be able to define which actions are taken if the conditions are met, e.g. determine RCA, raise/disable alarms, set resource states, etc. The model should be generic and flexible, so that defining a new rule does not require code changes.
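As a rough illustration only, the first example above could be written as data and evaluated by a generic matcher, so that adding a rule means adding data rather than code. This is a minimal sketch, not a proposed syntax: the `RULE` layout, the alarm names, the extra resource `vm2`, and the helpers `holds` and `match_rule` are all assumptions made up for this example.

```python
from itertools import permutations

# Hypothetical rule written as data. The keys and tuple layouts are
# illustrative assumptions, not a defined Vitrage syntax.
RULE = {
    "if": [
        ("alarm", "HOST", "ceilometer_agent_down"),
        ("contains", "HOST", "VM"),
        ("alarm", "VM", "metrics_unavailable"),
    ],
    "then": ("causes", ("HOST", "ceilometer_agent_down"),
             ("VM", "metrics_unavailable")),
}


def holds(condition, binding, alarms, contains):
    """Check one condition against the current state for a variable binding."""
    kind, first, second = condition
    if kind == "alarm":
        return (binding[first], second) in alarms
    if kind == "contains":
        return (binding[first], binding[second]) in contains
    raise ValueError(f"unknown condition kind: {kind}")


def match_rule(rule, resources, alarms, contains):
    """Return one relation per (HOST, VM) pair that satisfies every condition."""
    relations = []
    for host, vm in permutations(sorted(resources), 2):
        binding = {"HOST": host, "VM": vm}
        if all(holds(c, binding, alarms, contains) for c in rule["if"]):
            relation, (src, src_alarm), (dst, dst_alarm) = rule["then"]
            relations.append(
                ((binding[src], src_alarm), relation, (binding[dst], dst_alarm))
            )
    return relations


# Example state based on the first scenario above; vm2 is an extra,
# hypothetical resource that should not match.
resources = {"host1", "vm1", "vm2"}
alarms = {("host1", "ceilometer_agent_down"), ("vm1", "metrics_unavailable")}
contains = {("host1", "vm1")}

found = match_rule(RULE, resources, alarms, contains)
```

Here `found` holds a single "causes" relation from the alarm on host1 to the alarm on vm1. The brute-force search over resource pairs is only for readability; a real implementation would presumably match against the resource graph.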


Use Cases

Direct Causal Relationship Calculation

In this use case, we want to indicate that, given a specific configuration of related resources and the alarms on them, one or more alarms are caused by another alarm.

Condition:

  • A combination of physical resources, virtual resources and alarms, optionally with conditions on their properties.

Resulting Action:

  • Determine and mark the "causes" relation between the alarms.

Deduced Alarms and States

Based on the state of the physical/virtual resources and/or the alarms that were raised, we can deduce that there must be other problems in the system, even if we received no specific alarms about them. In that case, we would like to trigger the relevant alarms and/or modify the states of the relevant resources.

Condition:

  • A combination of physical resources, virtual resources and alarms, optionally with conditions on their properties.

Resulting Action:

  • Raise alarms
  • Modify resource states
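The high-CPU example from the introduction fits this use case. The sketch below shows one possible shape for the deduction step, assuming a simple in-memory view of the system; the function name `deduce`, the alarm names `high_cpu` and `suboptimal_performance`, and the state strings are hypothetical.

```python
def deduce(host_alarms, contains, vm_states):
    """Deduce vm alarms and states from high-CPU alarms on their hosts.

    host_alarms: set of (host, alarm_name) pairs
    contains:    dict mapping a host to the vms it contains
    vm_states:   dict mapping a vm to its current state
    Returns (deduced_alarms, updated_states) without mutating the inputs.
    """
    deduced_alarms = []
    updated_states = dict(vm_states)
    for host, alarm in sorted(host_alarms):
        if alarm != "high_cpu":
            continue
        for vm in sorted(contains.get(host, [])):
            # No alarm was received for the vm itself; it is deduced.
            deduced_alarms.append((vm, "suboptimal_performance"))
            updated_states[vm] = "sub-optimal"
    return deduced_alarms, updated_states


# Hypothetical state: host2 has high CPU and contains two vms.
host_alarms = {("host2", "high_cpu")}
contains = {"host2": ["vm3", "vm4"], "host3": ["vm5"]}
vm_states = {"vm3": "running", "vm4": "running", "vm5": "running"}

deduced_alarms, updated_states = deduce(host_alarms, contains, vm_states)
```

With this input, alarms are deduced for vm3 and vm4 and their states become sub-optimal, while vm5 on host3 is left untouched.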

Alarm Aggregation

When there are many alarms in the system, identifying the most important ones can be difficult. We would like to aggregate the alarms based on certain criteria, for example:

  • Aggregate by root cause; by default show only the root cause alarm, and allow drilling down to all other alarms
  • Aggregate by a specific resource
  • Aggregate by alarm type


Condition:

  • A combination of physical resources, virtual resources and alarms, optionally with conditions on their properties.

Resulting Action:

  • Mark aggregation relations on the existing alarms
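The three aggregation criteria listed above can all be seen as grouping alarms by a key. The following is a minimal sketch under that assumption; the alarm records, their field names, and the helpers `aggregate` and `root_of` are invented for illustration, and each alarm is assumed to already point directly at its root cause.

```python
from collections import defaultdict

# Hypothetical alarm records. "root_cause" is None for a root cause alarm,
# otherwise it holds the id of the root cause alarm.
alarms = [
    {"id": "a1", "resource": "host1", "type": "agent_down", "root_cause": None},
    {"id": "a2", "resource": "vm1", "type": "metrics_unavailable", "root_cause": "a1"},
    {"id": "a3", "resource": "vm2", "type": "metrics_unavailable", "root_cause": "a1"},
    {"id": "a4", "resource": "host2", "type": "high_cpu", "root_cause": None},
]


def aggregate(alarms, key):
    """Group alarm ids by the value that key(alarm) returns."""
    groups = defaultdict(list)
    for alarm in alarms:
        groups[key(alarm)].append(alarm["id"])
    return dict(groups)


def root_of(alarm):
    """A root cause alarm is grouped under its own id."""
    return alarm["root_cause"] or alarm["id"]


by_root = aggregate(alarms, root_of)
by_resource = aggregate(alarms, lambda a: a["resource"])
by_type = aggregate(alarms, lambda a: a["type"])
```

A default view could then show only the keys of `by_root` (a1 and a4) and reveal the grouped alarms when the user drills down.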


Requirements

The selected solution should support the following requirements.

  • Generic. It should support the described use cases, and be open for future use cases.
  • Extensible and Configurable. It should be easy to add new behaviors, even to an already-working environment. An end user should be able to easily define a desired behavior on their own.
  • Flexible Syntax. There are many use cases that could be considered, some simple, some complex. Examples of complex conditions:
    • If a fan stopped working, raise a warning on all related vms; if 2 fans stopped working, raise an error; if all fans stopped working, raise a critical error
    • If alarm A or alarm B was raised, then raise alarm C
    • If alarm A was raised, and alarm B was not raised, then raise alarm C
  • Overlapping conditions. Since every condition is defined independently, some conditions may overlap and even contradict one another. This is allowed; the conditions should be processed with caution, and there should be a clear definition of the expected behavior. For example:
    • Template 1: If alarm A was raised on a host, and the host contains a vm, then raise alarm C on the vm
    • Template 2: If alarm B was not raised on a host, and the host contains a vm, then disable alarm C on the vm
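To make the last two requirements more concrete, the sketch below expresses the example conditions as small composable predicates and detects when two templates ask for contradictory actions. Everything here is an assumption for illustration: the helper names, the choice that "all fans failed" takes precedence over the count-based levels, and the simplification that drops the "host contains a vm" part of the two templates. It only detects the overlap; how to resolve it is exactly the behavior that still needs a clear definition.

```python
def fan_severity(failed, total):
    """Map failed fan count to a severity (assumed precedence: all > 2+ > 1)."""
    if failed <= 0:
        return None
    if failed >= total:
        return "critical"
    return "warning" if failed == 1 else "error"


# Composable conditions over the set of currently raised alarm names.
def raised(name):
    return lambda active: name in active


def any_of(*conditions):
    return lambda active: any(c(active) for c in conditions)


def all_of(*conditions):
    return lambda active: all(c(active) for c in conditions)


def negate(condition):
    return lambda active: not condition(active)


a_or_b = any_of(raised("A"), raised("B"))
a_and_not_b = all_of(raised("A"), negate(raised("B")))

# Simplified versions of Template 1 and Template 2 above.
templates = [
    ("template1", raised("A"), ("raise", "C")),
    ("template2", negate(raised("B")), ("disable", "C")),
]


def conflicts(templates, active):
    """Return alarms that matching templates want to both raise and disable."""
    wanted = {}
    for _name, condition, (verb, alarm) in templates:
        if condition(active):
            wanted.setdefault(alarm, set()).add(verb)
    return sorted(alarm for alarm, verbs in wanted.items() if len(verbs) > 1)


overlap = conflicts(templates, {"A"})        # A raised, B not raised
no_overlap = conflicts(templates, {"A", "B"})  # only template1 matches
```

When alarm A is raised and alarm B is not, both templates match and `overlap` reports alarm C as contested; when both A and B are raised, only Template 1 applies.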