Vitrage

What is Vitrage?
Vitrage is the OpenStack RCA (Root Cause Analysis) service for organizing, analyzing and expanding OpenStack alarms & events, yielding insights regarding the root cause of problems and deducing their existence before they are directly detected.

High Level Functionality

 * 1) Physical-to-Virtual entities mapping
 * 2) Deduced alarms and states (i.e., raising an alarm or modifying a state based on analysis of the system, instead of direct monitoring)
 * 3) Root Cause Analysis (RCA) for alarms/events
 * 4) Horizon plugin for the above features

High Level Architecture


Vitrage Data Source(s). Responsible for importing information about the state of the system from different sources. This includes information about physical and virtual resources, alarms, etc. The information is then processed into the Vitrage Graph. Currently Vitrage comes ready with data sources for the Nova, Cinder, and Aodh OpenStack projects, Nagios alarms, and a static Physical Resources data source.

Vitrage Graph. Holds the information collected by the Data Sources, as well as their inter-relations. Additionally, it implements a collection of basic graph algorithms that are used by the Vitrage Evaluator (e.g., sub-matching, BFS, DFS, etc.).
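To make the idea of an entity graph with basic traversal concrete, here is a minimal sketch in Python. It is not Vitrage's implementation: the adjacency-list layout, the entity names and the `bfs` helper are all assumptions made for illustration.

```python
from collections import deque

# Hypothetical topology (not taken from Vitrage): a switch connects to
# hosts, and each host runs instances.
GRAPH = {
    "switch": ["host-1", "host-2"],
    "host-1": ["instance-a", "instance-b"],
    "host-2": ["instance-c"],
    "instance-a": [],
    "instance-b": [],
    "instance-c": [],
}


def bfs(graph, start):
    """Return every vertex reachable from start, in breadth-first order."""
    seen = [start]
    queue = deque([start])
    while queue:
        vertex = queue.popleft()
        for neighbor in graph.get(vertex, []):
            if neighbor not in seen:
                seen.append(neighbor)
                queue.append(neighbor)
    return seen


reachable = bfs(GRAPH, "switch")
```

A traversal like this is one way an evaluator could find all entities affected by a change to a single vertex.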

Vitrage Evaluator. Coordinates the analysis of (changes to) the Vitrage Graph and processes the results of this analysis. It is responsible for executing different kinds of template-based actions in Vitrage, such as adding an RCA (Root Cause Analysis) relationship between alarms, raising a deduced alarm, or setting a deduced state.

For more information, refer to the low level design.

Baseline


Consider the following example: we are monitoring a switch (id 1002), for example via Nagios, and a problem on the switch causes a Nagios alarm (a.k.a. Nagios test) to be activated. The following image depicts the logical relationships among the different resources in the system that are related to this switch, as well as the raised alarm. Note the mapping between virtual (instance) and physical (host, switch) entities, as well as between the alarm and the switch it relates to.

Deduced alarms & states


Problems on the switch can, at times, have a negative impact on the virtual instances running on hosts attached to the switch. We would like to raise an alarm on those instances to indicate this impact, as shown here:

As can be seen, the problem on the switch should trigger an alarm on all instances associated with the switch. Similarly, we might want the state of all these instances to be changed to "ERROR" as well. This functionality should be supported even if we cannot directly monitor the state of the instances. Instances might not be monitored for all aspects of performance, or perhaps the problem in the switch makes monitoring them difficult or even impossible. Instead, we can deduce this problem exists on the instances based on the state of the switch, and raise alarms and change states accordingly.
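The deduction described above can be sketched in a few lines of Python. This is an illustrative model only, not how Vitrage implements it (Vitrage drives such actions from templates). The `Entity` class, the `deduce_from_switch_alarm` function and all entity names other than the switch id are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """Hypothetical resource record: a switch, host or instance."""
    name: str
    kind: str
    state: str = "ACTIVE"
    alarms: list = field(default_factory=list)
    children: list = field(default_factory=list)


def deduce_from_switch_alarm(switch, alarm_name):
    """Raise a deduced alarm and set a deduced ERROR state on every
    instance running on a host attached to the given switch."""
    affected = []
    for host in switch.children:
        for instance in host.children:
            instance.alarms.append(f"deduced: {alarm_name}")
            instance.state = "ERROR"
            affected.append(instance.name)
    return affected


# Example topology: switch 1002 -> one host -> two instances.
inst_a = Entity("instance-a", "instance")
inst_b = Entity("instance-b", "instance")
host = Entity("host-1", "host", children=[inst_a, inst_b])
switch = Entity("switch-1002", "switch",
                alarms=["nagios: switch problem"], children=[host])

affected = deduce_from_switch_alarm(switch, "switch problem")
```

Note that the instances are never polled here: their alarms and states are derived purely from the alarm on the switch and the topology.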

Root Cause Indicators


Furthermore, we would like to be able to track this cause and effect, namely that the problem in the switch caused the problems experienced by the instances. In the following image, we highlight a single connection between the cause and effect for clarity, but all such links should be supported.

Important Note: not all deduced alarms are caused by the trigger - the trigger might only be an indication of correlation, not causation. In the case we are examining, however, the trigger alarm is also the cause:

Once the local "causes" links (one hop) are detected and registered, we can follow them one hop after another to track the full causal chain of a sequence of events.
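Following one-hop links to recover a full causal chain can be sketched as below. This is a simplified illustration, not Vitrage's algorithm: it assumes each alarm has at most one direct cause, and the `CAUSED_BY` mapping, the intermediate host alarm and the `causal_chain` function are hypothetical.

```python
# Hypothetical one-hop links: effect alarm -> the alarm that directly caused it.
CAUSED_BY = {
    "instance-a alarm": "host-1 alarm",
    "host-1 alarm": "switch-1002 alarm",
}


def causal_chain(alarm, caused_by):
    """Follow one-hop "caused by" links from alarm back to its root cause."""
    chain = [alarm]
    seen = {alarm}
    while chain[-1] in caused_by:
        cause = caused_by[chain[-1]]
        if cause in seen:  # guard against accidental cycles in the links
            break
        chain.append(cause)
        seen.add(cause)
    return chain


chain = causal_chain("instance-a alarm", CAUSED_BY)
root_cause = chain[-1]
```

Only local links are stored; the full chain is computed on demand, so registering a new one-hop link automatically extends every chain that passes through it.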

Quick Demos (A bit outdated)

 * Vitrage Functionalities Overview
 * Vitrage Get Topology Demo
 * Vitrage Alarms Demo
 * Vitrage Deduced Alarms and RCA Demo

OpenStack Austin, April 2016

 * Project Vitrage How to Organize, Analyze and Visualize your OpenStack Cloud
 * On the Path to Telco Cloud Openness: Nokia CloudBand Vitrage & OPNFV Doctor collaboration

OPNFV Berlin, June 2016

 * Failure Inspection in Doctor utilizing Vitrage and Congress
 * Doctor: fast and dynamic fault management in OpenStack (DOCOMO, NTT, NEC, Nokia, Intel) - Telecom TV

OpenStack Barcelona, October 2016

 * OpenStack Keynotes demo with Doctor - Keeping Your Mobile Phone Calls Connected
 * Root Cause Analysis Principles and Practice in OpenStack and Beyond
 * Fault Management with OpenStack Congress and Vitrage Based on OPNFV Doctor Framework

OpenStack Boston, May 2017

 * Beyond Automation - Taking Vitrage Into the Realm of Machine Learning
 * Collectd and Vitrage Integration - An Eventful Presentation
 * The Vitrage Story - From Nothing to the Big Tent
 * Advanced Use Cases for Root Cause Analysis
 * Project Update Vitrage

OpenStack Sydney, November 2017

 * Advanced Fault Management with Vitrage and Mistral
 * Vitrage Project Updates

OpenStack Vancouver, May 2018

 * Closing the Loop: VNF end-to-end Failure Detection and Auto Healing
 * Extend Horizon Headers for easy monitoring and fault detection - and more
 * Vitrage - Project Update
 * Proactive Root Cause Analysis with Vitrage, Kubernetes, Zabbix and Prometheus

Development (Blueprints, Roadmap, Design...)

 * Vitrage Documentation
 * Vitrage in StoryBoard:
    * Main board
    * Bugs
    * StoryBoard how-to
 * Road Map
 * Source code:
    * vitrage
    * python-vitrageclient
    * vitrage-dashboard
    * vitrage-tempest-plugin

Design Discussions

 * Supporting Overlapping Templates
 * Barcelona Design Summit
 * Pike PTG
 * Queens PTG
 * Vancouver forum: Vitrage advanced use cases
 * Vancouver forum: Vitrage RCA over Kubernetes

Meetings

 * Weekly on Wednesdays at 0800 UTC in #openstack-meeting-4 on freenode
 * Check Vitrage Meetings for more details

Contact Us

 * IRC channel for regular daily discussions: #openstack-vitrage
 * Use [Vitrage] tag for Vitrage emails on OpenStack Mailing List