
Vitrage

Revision as of 06:50, 12 October 2015 by Ifat Afek


What is Vitrage?

Vitrage is the OpenStack RCA (Root Cause Analysis) engine for organizing, analyzing and expanding OpenStack alarms and events. It yields insights into the root cause of problems and deduces the existence of problems before they are directly detected.

Mission & Scope

Vitrage is a project dedicated to making the events and alarms in OpenStack more meaningful and helpful. The ideal we strive for is that every significant event in the system has a timely alarm or event generated for it, that alarms are raised as early as possible after the event, and that the cause-effect relationships between different events are understood and visualized.

The Vitrage project is intended to be part of the Ceilometer project. It will get events and alarms from Aodh and from other OpenStack components. Whenever a new alarm is raised, Vitrage will process it and may produce RCA information, create an alarm aggregation, or raise new alarms in Aodh.

Vitrage - proposed architecture

High Level Functionality

  1. Root Cause Analysis (RCA) for alarms/events
  2. Deduced alarms and states (i.e., raising an alarm or modifying a state based on analysis of the system, not only on direct monitoring)
  3. Alarm Aggregation (i.e., grouping alarms by categories, such as resources and severity, making them more manageable and understandable)
  4. Physical-to-Virtual entities mapping
  5. UI support for all features above
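As a rough illustration of the alarm aggregation item above, the following sketch groups alarms by resource and by severity. It is not Vitrage code: the alarm records, their field names and the `aggregate` helper are all assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical alarm records; the field names are assumptions for illustration.
alarms = [
    {"name": "link_down", "resource": "switch-1002", "severity": "critical"},
    {"name": "vm_unreachable", "resource": "vm-10041", "severity": "warning"},
    {"name": "vm_unreachable", "resource": "vm-10042", "severity": "warning"},
    {"name": "high_latency", "resource": "switch-1002", "severity": "warning"},
]


def aggregate(alarms, key):
    """Group alarm names by the given field (for example resource or severity)."""
    groups = defaultdict(list)
    for alarm in alarms:
        groups[alarm[key]].append(alarm["name"])
    return dict(groups)


by_resource = aggregate(alarms, "resource")
by_severity = aggregate(alarms, "severity")
```

Grouping by a single field keeps the sketch short; a real aggregation would probably combine several categories.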

Use Cases

Baseline

Consider the following example, in which we monitor a switch (id 1002), for instance via a Nagios test, and as a result an alarm is raised on the switch. The following image depicts the logical relationships among the different resources in the system that are related to this switch. Note the mapping between virtual (machine) and physical (host, switch) entities.

Baseline


Deduced alarms & states

Problems on the switch can at times adversely affect the VMs running on hosts attached to the switch, and we would like to have an alarm on those VMs to indicate this, as shown here:

Deduced Alarm

As can be seen, an alarm is raised on all VMs associated with the switch. Similarly, we might want the state of all VMs to be changed to "ERROR". We would like to be able to do this even if, perhaps because of the problem with the switch, we cannot directly monitor the state of the VMs, since we can deduce the problem from the state of the switch.
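The deduction described here can be sketched as a walk over the switch-host-VM topology. This is only an illustration under stated assumptions: the host IDs (2001, 2002), the dictionary layout and the alarm name are invented, while switch 1002 and machines 10041-10044 follow the example in the text.

```python
# Hypothetical topology: switch -> hosts -> VMs. The host IDs are invented;
# switch 1002 and machines 10041-10044 follow the example in the text.
ATTACHED_HOSTS = {1002: [2001, 2002]}
HOSTED_VMS = {2001: [10041, 10042], 2002: [10043, 10044]}


def deduce_vm_alarms(switch_id, alarm_name):
    """Return one deduced alarm per VM reachable from the alarmed switch."""
    deduced = []
    for host_id in ATTACHED_HOSTS.get(switch_id, []):
        for vm_id in HOSTED_VMS.get(host_id, []):
            # The VM is not monitored directly; the alarm is inferred
            # from the state of the switch it depends on.
            deduced.append({"resource": vm_id, "alarm": alarm_name, "deduced": True})
    return deduced


vm_alarms = deduce_vm_alarms(1002, "switch_problem_suspected")
```

Changing the VM state to "ERROR" would follow the same traversal, with a state update in place of the appended alarm.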

Root Cause Indicators

Furthermore, we would like to be able to track this cause and effect - that the problem in the switch caused a problem in the VMs. In the following image, we highlight a single connection between the cause and effect for clarity - but all such links will be supported.

Important Note: not all deduced alarms are caused by the trigger - the trigger might only be an indication of correlation, not causation. In the case we are examining, however, the trigger is also the cause:

Root Cause Link

Once the local "causes" links (one hop) are detected and registered, we can follow them one hop after another to track the full causal chain of a sequence of events.
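Following one-hop links to recover a full causal chain can be sketched as below. The alarm names and the `CAUSED_BY` mapping are hypothetical, and the sketch assumes each alarm has at most one registered cause, which the page does not state.

```python
# Hypothetical one-hop "causes" links, stored as effect -> cause.
# The alarm names are illustrative only.
CAUSED_BY = {
    "vm_10041_alarm": "host_alarm",
    "host_alarm": "switch_1002_alarm",
}


def causal_chain(alarm):
    """Follow one-hop links from an alarm back to its root cause."""
    chain = [alarm]
    seen = {alarm}
    while chain[-1] in CAUSED_BY:
        cause = CAUSED_BY[chain[-1]]
        if cause in seen:  # guard against accidental cycles
            break
        chain.append(cause)
        seen.add(cause)
    return chain


chain = causal_chain("vm_10041_alarm")
```

The last element of the returned list is the alarm with no registered cause, that is, the candidate root cause.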

Design & Implementation

For each triggered action, we provide a template that specifies the pattern to be detected and the relevant actions. Each template consists of a single pattern to search for in the VIM and one or more isolated actions to execute once the pattern is detected. Each action specifies which entity (or entities) in the template it is performed on. For example, "raise alarm on machine (ID = 4)", where there is a machine in the template with ID = 4. In this example, for each matching pattern, the action is performed on the entity that is mapped to machine #4 in the template (machines 10041-10044 in the example above).
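One possible way to represent such a template is sketched below. This is not the Vitrage template format: the class names, fields and the `apply_actions` helper are assumptions for illustration, as is host ID 2001. Only the "raise alarm on machine (ID = 4)" action and the IDs 1002 and 10041 come from the text.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    template_id: int  # ID local to the template, not a real resource ID
    kind: str         # e.g. "switch", "host", "machine"


@dataclass
class Action:
    name: str         # e.g. "raise_alarm"
    target_id: int    # template_id of the entity to act on


@dataclass
class Template:
    entities: list
    relations: list   # (source template_id, target template_id) pairs
    actions: list = field(default_factory=list)


template = Template(
    entities=[Entity(1, "switch"), Entity(2, "host"), Entity(4, "machine")],
    relations=[(1, 2), (2, 4)],
    actions=[Action("raise_alarm", target_id=4)],
)


def apply_actions(template, match):
    """Resolve each action's template entity to the real resource in one match."""
    return [(action.name, match[action.target_id]) for action in template.actions]


# One hypothetical match: template ID -> real resource ID (host ID invented).
resolved = apply_actions(template, {1: 1002, 2: 2001, 4: 10041})
```

Keeping actions separate from the pattern means the same pattern match can drive several actions, each resolved through the template-to-resource mapping.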

Templates

The following are sample templates, corresponding to the example above:

Deduced alarm / deduced state:

Root Cause Link

Root Cause Link

As seen here, the action is disconnected from the template. Given a matching alarm of type Y, it is only a matter of locating the machines associated with the relevant switch and host, using the template as a guide, after which an alarm is raised / the state is changed.

RCA link:

Root Cause Link

Similarly, when the template is detected after the deduced alarms have been raised using the previous templates, a causal link can be added from Y to X.
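Adding the causal link can be sketched as follows, assuming alarms are tracked as (resource ID, alarm type) pairs and that the machines behind each switch are already known. Both the data layout and the `causal_links` helper are hypothetical; only the alarm types X and Y and the resource IDs come from the text.

```python
# Hypothetical flattened topology: switch ID -> IDs of machines reachable through it.
MACHINES_BEHIND = {1002: [10041, 10042, 10043, 10044]}


def causal_links(active_alarms, cause_type="Y", effect_type="X"):
    """Return (cause, effect) pairs for each switch alarm Y with a machine alarm X behind it."""
    links = []
    for switch_id, machine_ids in MACHINES_BEHIND.items():
        if (switch_id, cause_type) not in active_alarms:
            continue  # no trigger alarm on this switch, so nothing to link
        for machine_id in machine_ids:
            if (machine_id, effect_type) in active_alarms:
                links.append(((switch_id, cause_type), (machine_id, effect_type)))
    return links


# Alarm Y on the switch, deduced alarm X on two of its machines.
active = {(1002, "Y"), (10041, "X"), (10043, "X")}
links = causal_links(active)
```

A link is produced only when both ends of the pattern are present, which mirrors the requirement that the template be detected after the deduced alarms have been raised.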

Development (Blueprints, Roadmap, Design...)


Subpages