__NOTOC__
[[File:OpenStack_Project_Vitrage_horizontal.png|450px|thumbnail|right]]
  
 
== What is Vitrage? ==
Vitrage is the OpenStack RCA (Root Cause Analysis) service for organizing, analyzing and expanding OpenStack alarms & events, yielding insights regarding the root cause of problems and deducing their existence before they are directly detected.
=== High Level Functionality ===
# Physical-to-Virtual entities mapping
# Deduced alarms and states (i.e., raising an alarm or modifying a state based on analysis of the system, instead of direct monitoring)
# Root Cause Analysis (RCA) for alarms/events
# Horizon plugin for the above features

== Mission & Scope ==
 
Vitrage is a project dedicated to making the events and alarms in OpenStack more meaningful and helpful. The ideal to which we strive is that every significant event in the system has a timely alarm/event generated for it, that alarms are raised as early as possible after the event, and that the cause-and-effect relationships between different events are understood and visualized.
 
  
== High Level Architecture ==
[[File:Vitrage_architecture_train.png|2000px|frameless|center|Vitrage High Level Architecture]] <br />
  
'''Vitrage Data Source(s).''' Responsible for importing information from different sources regarding the state of the system. This includes information about resources, both physical and virtual, as well as alarms. The information is then processed into the Vitrage Graph. Currently, Vitrage comes with ready-made data sources for the Nova, Cinder and Aodh OpenStack projects, for Nagios alarms, and a static Physical Resources data source.
  
'''Vitrage Graph.''' Holds the information collected by the Data Sources, as well as the inter-relations between the entities. Additionally, it implements a collection of basic graph algorithms that are used by the Vitrage Evaluator (e.g., sub-matching, BFS, DFS, etc.).
  
'''Vitrage Evaluator.''' Coordinates the analysis of (changes to) the Vitrage Graph and processes the results of this analysis. It is responsible for executing the different kinds of template-based actions in Vitrage, such as adding an RCA (Root Cause Analysis) relationship between alarms, raising a deduced alarm or setting a deduced state.
  
For more information, refer to the [https://docs.openstack.org/vitrage/latest/contributor/vitrage-graph-design.html low level design].
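To make the architecture concrete, the following is a minimal Python sketch of an entity graph of the kind the Vitrage Graph maintains: typed vertices for resources and alarms, and labeled edges for their inter-relations. All class, entity and edge-label names here are invented for illustration; this is not Vitrage's actual data model or API.

```python
# Illustrative toy entity graph; names are invented, not Vitrage's API.

class EntityGraph:
    """Typed vertices with labeled, directed edges."""

    def __init__(self):
        self.nodes = {}   # entity id -> {"type": ..., "state": ...}
        self.edges = []   # (source id, label, target id)

    def add_entity(self, eid, etype, state="OK"):
        self.nodes[eid] = {"type": etype, "state": state}

    def add_edge(self, src, label, dst):
        self.edges.append((src, label, dst))

    def neighbors(self, eid, label):
        """One-hop traversal along edges with the given label."""
        return [d for s, l, d in self.edges if s == eid and l == label]

# Topology from the baseline use case below: a switch, a host attached
# to it, and an instance running on that host.
g = EntityGraph()
g.add_entity("switch-1002", "switch")
g.add_entity("host-1", "host")
g.add_entity("vm-1", "instance")
g.add_edge("switch-1002", "attached", "host-1")
g.add_edge("host-1", "contains", "vm-1")

# A monitored alarm (e.g. one reported by Nagios) is itself a vertex,
# linked by an "on" edge to the resource it relates to.
g.add_entity("alarm-42", "alarm", state="ACTIVE")
g.add_edge("alarm-42", "on", "switch-1002")

print(g.neighbors("switch-1002", "attached"))  # ['host-1']
```

Graph algorithms such as BFS or sub-matching then operate on this structure, for example to find all instances reachable from a faulty switch.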
== Use Cases ==
=== Baseline ===
[[File:Rca-baseline.jpg|400px|frameless|right|Baseline]] <br />
  
We consider the following example: we are monitoring a Switch (id 1002), for example via Nagios, and a problem on the Switch causes a Nagios alarm (a.k.a. Nagios test) to be activated. The following image depicts the logical relationships among the different resources in the system that are related to this switch, as well as the raised alarm. Note the mapping between virtual (instance) and physical (host, switch) entities, as well as between the alarm and the switch it relates to.
  
=== Deduced alarms & states ===
[[File:DeducedAlarm.jpg|500px|frameless|right|Deduced Alarm]]  <br />
  
Problems on the switch can, at times, have a negative impact on the virtual instances running on hosts attached to the switch. We would like to raise an alarm on those instances to indicate this impact, as shown here:
  
As can be seen, the problem on the switch should trigger an alarm on all instances associated with the switch. Similarly, we might want the state of all these instances to be changed to "ERROR" as well. This functionality should be supported even if we cannot directly monitor the state of the instances. Instances might not be monitored for all aspects of performance, or perhaps the problem in the switch makes monitoring them difficult or even impossible. Instead, we can '''deduce''' this problem exists on the instances based on the state of the switch, and raise alarms and change states accordingly.
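This kind of deduction can be sketched as a simple rule over the topology. The Python below is an illustrative toy, not a Vitrage template (in Vitrage itself such logic is expressed in templates executed by the Evaluator), and all names and identifiers are hypothetical:

```python
# Hypothetical sketch: deduce instance alarms/states from a switch problem.
# Topology held as plain dicts: switch -> hosts -> instances.
attached_hosts = {"switch-1002": ["host-1", "host-2"]}
host_instances = {"host-1": ["vm-1", "vm-2"], "host-2": ["vm-3"]}

def deduce_from_switch_problem(switch_id):
    """Raise a deduced alarm and set state ERROR on every instance
    running on a host attached to the problematic switch."""
    deduced = []
    for host in attached_hosts.get(switch_id, []):
        for vm in host_instances.get(host, []):
            deduced.append({"resource": vm,
                            "alarm": "instance_connectivity_suspected",
                            "state": "ERROR"})
    return deduced

alarms = deduce_from_switch_problem("switch-1002")
print([a["resource"] for a in alarms])  # ['vm-1', 'vm-2', 'vm-3']
```

Note that no monitoring data from the instances themselves is consulted; the alarms follow purely from the switch state and the topology.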
  
  
=== Root Cause Indicators ===
[[File:RootCauseExample.jpg|500px|frameless|right|Root Cause Link]]  <br />
  
Furthermore, we would like to be able to track this cause and effect - that the problem in the switch caused the problems experienced at the instances. In the following image, we highlight a single connection between the cause and effect for clarity - but all such links should be supported.
  
Important Note: not all deduced alarms are ''caused'' by the trigger - the trigger might only be an indication of correlation, not causation. In the case we are examining, however, the trigger alarm is also the cause:
  
Once the local "causes" links (one hop) are detected and registered, we can follow them one hop after another to track the full causal chain of a sequence of events.
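Assuming each deduced alarm records its direct (one-hop) cause, recovering the full causal chain is a simple walk over those links. A minimal Python sketch with invented names:

```python
# Sketch: given one-hop "causes" links, recover the full causal chain.
# Data and identifiers are invented for illustration.
causes = {                       # effect -> its direct (one-hop) cause
    "alarm-on-vm-1": "alarm-on-host-1",
    "alarm-on-host-1": "alarm-on-switch-1002",
}

def root_cause_chain(alarm):
    """Follow 'causes' links hop by hop back to the root cause."""
    chain = [alarm]
    seen = {alarm}
    while chain[-1] in causes:
        nxt = causes[chain[-1]]
        if nxt in seen:          # guard against cycles in the data
            break
        chain.append(nxt)
        seen.add(nxt)
    return chain

print(root_cause_chain("alarm-on-vm-1"))
# ['alarm-on-vm-1', 'alarm-on-host-1', 'alarm-on-switch-1002']
```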
  
  
== Demos and Presentations ==
=== Quick Demos (A bit outdated) ===
* [https://www.youtube.com/watch?v=tl5AD5IdzMo&feature=youtu.be Vitrage Functionalities Overview]
* [https://www.youtube.com/watch?v=GyTnMw8stXQ&feature=youtu.be Vitrage Get Topology Demo]
* [https://www.youtube.com/watch?v=w1XQATkrdmg Vitrage Alarms Demo]
* [https://www.youtube.com/watch?v=vqlOKTmYR4c Vitrage Deduced Alarms and RCA Demo]
  
=== Summit Sessions ===
==== OpenStack Austin, April 2016 ====
* [https://www.youtube.com/watch?v=9Qw5coTLgMo Project Vitrage How to Organize, Analyze and Visualize your OpenStack Cloud]
* [https://www.youtube.com/watch?v=ey68KNKXc5c On the Path to Telco Cloud Openness: Nokia CloudBand Vitrage & OPNFV Doctor collaboration]

==== OPNFV Berlin, June 2016 ====
* [https://www.youtube.com/watch?v=qV4eLhsFR28 Failure Inspection in Doctor utilizing Vitrage and Congress]
* [https://www.youtube.com/watch?v=xutITYoZKhE Doctor: fast and dynamic fault management in OpenStack (DOCOMO, NTT, NEC, Nokia, Intel) - Telecom TV]

==== OpenStack Barcelona, October 2016 ====
* [https://www.openstack.org/videos/video/demo-openstack-and-opnfv-keeping-your-mobile-phone-calls-connected OpenStack Keynotes demo with Doctor - Keeping Your Mobile Phone Calls Connected]
* [https://www.openstack.org/videos/video/nokia-root-cause-analysis-principles-and-practice-in-openstack-and-beyond Root Cause Analysis Principles and Practice in OpenStack and Beyond]
* [https://www.openstack.org/videos/video/fault-management-with-openstack-congress-and-vitrage-based-on-opnfv-doctor-framework Fault Management with OpenStack Congress and Vitrage Based on OPNFV Doctor Framework]

==== OpenStack Boston, May 2017 ====
* [https://www.openstack.org/videos/boston-2017/beyond-automation-taking-vitrage-into-the-realm-of-machine-learning Beyond Automation - Taking Vitrage Into the Realm of Machine Learning]
* [https://www.openstack.org/videos/boston-2017/collectd-and-vitrage-integration-an-eventful-presentation Collectd and Vitrage Integration - An Eventful Presentation]
* [https://www.openstack.org/videos/boston-2017/the-vitrage-story-from-nothing-to-the-big-tent The Vitrage Story - From Nothing to the Big Tent]
* [https://www.openstack.org/videos/boston-2017/advanced-use-cases-for-root-cause-analysis Advanced Use Cases for Root Cause Analysis]
* [https://www.openstack.org/videos/boston-2017/project-update-vitrage Project Update Vitrage]

==== OpenStack Sydney, November 2017 ====
* [https://www.openstack.org/videos/sydney-2017/advanced-fault-management-with-vitrage-and-mistral Advanced Fault Management with Vitrage and Mistral]
* [https://www.openstack.org/videos/sydney-2017/vitrage-project-updates Vitrage Project Updates]

==== OpenStack Vancouver, May 2018 ====
* [https://www.openstack.org/videos/vancouver-2018/closing-the-loop-vnf-end-to-end-failure-detection-and-auto-healing Closing the Loop: VNF end-to-end Failure Detection and Auto Healing]
* [https://www.openstack.org/videos/vancouver-2018/extend-horizon-headers-for-easy-monitoring-and-fault-detection-and-more Extend Horizon Headers for easy monitoring and fault detection - and more]
* [https://www.openstack.org/videos/vancouver-2018/vitrage-project-update Vitrage - Project Update]
* [https://www.openstack.org/videos/vancouver-2018/proactive-root-cause-analysis-with-vitrage-kubernetes-zabbix-and-prometheus Proactive Root Cause Analysis with Vitrage, Kubernetes, Zabbix and Prometheus]
  
== Development (Blueprints, Roadmap, Design...) ==
* [https://docs.openstack.org/vitrage/latest Vitrage Documentation]
* Vitrage in StoryBoard:
** [https://storyboard.openstack.org/#!/board/90 Main board]
** [https://storyboard.openstack.org/#!/board/89 Bugs]
** [https://etherpad.openstack.org/p/vitrage-storyboard-migration StoryBoard how-to]
* [https://wiki.openstack.org/wiki/Vitrage/RoadMap Road Map]
* Source code:
** [https://github.com/openstack/vitrage vitrage]
** [https://github.com/openstack/python-vitrageclient python-vitrageclient]
** [https://github.com/openstack/vitrage-dashboard vitrage-dashboard]
** [https://github.com/openstack/vitrage-tempest-plugin vitrage-tempest-plugin]
  
=== Design Discussions ===
* [https://etherpad.openstack.org/p/vitrage-overlapping-templates-support-design Supporting Overlapping Templates]
* [https://etherpad.openstack.org/p/vitrage-barcelona-design-summit Barcelona Design Summit]
* [https://etherpad.openstack.org/p/vitrage-pike-design-sessions Pike PTG]
* [https://etherpad.openstack.org/p/vitrage-ptg-queens Queens PTG]
* [https://etherpad.openstack.org/p/YVR-vitrage-advanced-use-cases Vancouver forum: Vitrage advanced use cases]
* [https://etherpad.openstack.org/p/YVR-vitrage-rca-over-k8s Vancouver forum: Vitrage RCA over Kubernetes]
  
== Communication and Meetings ==

=== Meetings ===
* Weekly on Wednesday at 0800 UTC in #openstack-meeting-4 at freenode
* Check [https://wiki.openstack.org/wiki/Meetings/Vitrage Vitrage Meetings] for more details
  
=== Contact Us ===
* IRC channel for regular daily discussions: #openstack-vitrage
* Use the [Vitrage] tag for Vitrage emails on the [http://lists.openstack.org/pipermail/openstack-discuss/ OpenStack Mailing List]

== Subpages ==
{{Special:PrefixIndex/:Vitrage/}}
