
Difference between revisions of "Monasca"

(Development Environment)
(OpenStack Monitoring Questionnaire)
(22 intermediate revisions by 6 users not shown)
Line 20: Line 20:
 
: Monasca API Specification: https://github.com/openstack/monasca-api/blob/master/docs/monasca-api-spec.md
 
: Monasca API Specification: https://github.com/openstack/monasca-api/blob/master/docs/monasca-api-spec.md
 
: Agent Documentation: https://github.com/openstack/monasca-agent
 
: Agent Documentation: https://github.com/openstack/monasca-agent
 +
 +
==== Logo ====
 +
: https://www.dropbox.com/sh/yigfsgvpzxz4t72/AABHPm5FOoBFv2Q-div_j9RXa?dl=0
  
 
==== Presentations ====
 
==== Presentations ====
 +
: '''Boston Summit 2017'''
 +
:: Making Monasca Monitor More: Extending Monasca's Data Gathering and Reporting Capabilities
 +
::: Video: https://www.openstack.org/videos/boston-2017/making-monasca-monitor-more-extending-monascas-data-gathering-and-reporting-capabilities
 +
:: Monitoring Hands-On Workshop
 +
::: Material: https://github.com/witekest/monasca-bootcamp/
 +
:: Show Me My Packet Log: Neutron Packet Logging with Monasca
 +
::: Video: https://www.openstack.org/videos/boston-2017/show-me-my-packet-log-neutron-packet-logging-with-monasca
 +
:: A Monitoring Case Study for Monasca: Smart City Infrastructure
 +
::: Video: https://www.openstack.org/videos/boston-2017/a-monitoring-case-study-for-monasca-smart-city-infrastructure
 +
: '''Barcelona Summit 2016'''
 +
:: Monasca: one year later
 +
::: Video: https://www.openstack.org/videos/barcelona-2016/monasca-one-year-later
 +
:: Fujitsu: Monasca Monitoring for Kubernetes – Delivered on OpenStack
 +
::: Video: https://www.openstack.org/videos/barcelona-2016/fujitsu-monasca-monitoring-for-kubernetes-delivered-on-openstack
 +
: '''Austin Summit 2016'''
 +
:: Monasca Bootcamp: Hands-on Workshops
 +
::: Video: https://www.openstack.org/videos/video/monasca-bootcamp
 +
:: Log Management with Monasca
 +
::: Video: https://www.openstack.org/videos/video/log-management-with-monasca
 +
::: Slides: https://github.com/witekest/presentations/blob/master/OpenStack_Summit/2016_04_Austin/Monasca_Logging_OpenStack_Summit_Austin.pdf
 +
:: Enforcing Application SLAs with Congress and Monasca
 +
::: Video: https://www.openstack.org/videos/video/enforcing-application-slas-with-congress-and-monasca
 +
:: Only You Can Prevent Forest Fires - A Proactive Approach to Monitoring Your OpenStack Cloud
 +
::: Video: https://www.openstack.org/videos/video/only-you-can-prevent-forest-fires-a-proactive-approach-to-monitoring-your-openstack-cloud
 +
:: Monitoring a Multi-Region Cloud Based on OpenStack: The FIWARE Lab Case Study
 +
::: Video: https://www.openstack.org/videos/video/monitoring-a-multi-region-cloud-based-on-openstack-the-fiware-lab-case-study
 
: '''Tokyo Summit 2015'''
 
: '''Tokyo Summit 2015'''
 
:: Congrats You Stood up an OpenStack Environment, But Now They Want You to Monitor It. Introducing Using Monasca for Production OpenStack Monitoring
 
:: Congrats You Stood up an OpenStack Environment, But Now They Want You to Monitor It. Introducing Using Monasca for Production OpenStack Monitoring
Line 30: Line 59:
 
::: Video: https://www.openstack.org/summit/tokyo-2015/videos/presentation/elk-and-monasca-crossing-logging-as-an-openstack-service
 
::: Video: https://www.openstack.org/summit/tokyo-2015/videos/presentation/elk-and-monasca-crossing-logging-as-an-openstack-service
 
::: Slides: http://www.slideshare.net/mroderus/elk-and-monasca-crossing-logging-as-an-openstack-service
 
::: Slides: http://www.slideshare.net/mroderus/elk-and-monasca-crossing-logging-as-an-openstack-service
 +
:: Auto Scaling Cloud Infrastructure and Applications
 +
::: Video: https://www.openstack.org/summit/tokyo-2015/videos/presentation/auto-scaling-cloud-infrastructure-and-applications
  
 
: '''Monasca Deep Dive (Paris Summit)'''
 
: '''Monasca Deep Dive (Paris Summit)'''
Line 50: Line 81:
 
:: Slides: https://github.com/dlfryar/monasca-meetups/blob/master/2015/08/sunnyvale/openstack/meetup-openstack-sfbay-2015.pdf
 
:: Slides: https://github.com/dlfryar/monasca-meetups/blob/master/2015/08/sunnyvale/openstack/meetup-openstack-sfbay-2015.pdf
 
:: MeetUp: http://www.meetup.com/openstack/events/215648142/?a=cr2_grp&rv=cr2&_af=event&_af_eid=215648142
 
:: MeetUp: http://www.meetup.com/openstack/events/215648142/?a=cr2_grp&rv=cr2&_af=event&_af_eid=215648142
 
==== Demo ====
 
:Docker based Image useful for demos or a running API to hit from outside: https://registry.hub.docker.com/u/monasca/demo/
 
  
 
==== Repositories ====
 
==== Repositories ====
Line 60: Line 88:
  
 
: '''Deployment'''
 
: '''Deployment'''
The following repositories are available for deploying Monasca.
+
The following repositories are available for deploying Monasca:
:: Ansible: https://github.com/search?utf8=%E2%9C%93&q=ansible-monasca
+
* Docker: https://github.com/monasca/monasca-docker
:: Puppet: https://git.openstack.org/openstack/puppet-monasca
+
* Kubernetes: https://github.com/monasca/monasca-helm
 +
* Ansible: https://github.com/search?utf8=%E2%9C%93&q=ansible-monasca
 +
* Puppet: https://git.openstack.org/openstack/puppet-monasca
 +
 
 +
==== Bugs ====
 +
 
 +
: https://storyboard.openstack.org/#!/project_group/59
  
 
== Requirements ==
 
== Requirements ==
Line 86: Line 120:
  
 
* Open-source monitoring solution built on open-source technologies.
 
* Open-source monitoring solution built on open-source technologies.
 +
 +
== Comparisons to alternatives ==
 +
[[Monasca/Comparison]]
  
 
== Architecture ==
 
== Architecture ==
Line 199: Line 236:
 
== Logging ==
 
== Logging ==
 
Support for logging in Monasca is under discussion. For more details, see [[Monasca/Logging]].
 
Support for logging in Monasca is under discussion. For more details, see [[Monasca/Logging]].
 +
 +
== Transform and Aggregation Engine ==
 +
For more details, see [[Monasca/Transform]].
  
 
== Analytics ==
 
== Analytics ==
Line 229: Line 269:
 
# The Notification Engine consumes "alarm-state-transitioned-events" from the Message Queue, evaluates whether each event has a Notification Method associated with it, and sends the appropriate notification, such as email.
 
# The Notification Engine consumes "alarm-state-transitioned-events" from the Message Queue, evaluates whether each event has a Notification Method associated with it, and sends the appropriate notification, such as email.
 
# The Persister consumes the "alarm-state-transitioned-event" from the Message Queue and stores it in the Alarm State History Store.
 
# The Persister consumes the "alarm-state-transitioned-event" from the Message Queue and stores it in the Alarm State History Store.
 +
 +
== Alarm Managers ==
 +
This section describes the new features including alarm grouping, inhibition and silencing.
 +
 +
Examples:
 +
 +
* Silence
 +
SilenceRule1 = '{"alarm-silencing-definition-created": {"name": "silence_rule_1", "matchers": {"severity": "LOW"}, "start_time": "2017-02-21 20:00:00", "end_time": "2017-02-21 22:00:00"}}'
 +
Two alarm transitions: AT1 and AT2
 +
AT1_severity = HIGH
 +
AT2_severity = LOW
 +
 +
Output: AT2 gets silenced and AT1 sends a notification.
 +
SilenceRule1 expires after "end_time"="2017-02-21 22:00:00".
 +
 +
* Inhibit
 +
InhibitionRule1 = '{"alarm-inhibition-definition-created": {"name": "inhibit_rule_1","source":{"severity":"HIGH"},"target":{"severity":"LOW"}, "equals":["tenantId"], "exclusions": {"alarm_name": "vm is dead"}}}'
 +
Three alarm transitions: AT1, AT2 and AT3
 +
AT1_tenantId = "d42bc"
 +
AT2_tenantId = "d42bc"
 +
AT3_tenantId = "d42bc"
 +
AT1_severity = HIGH
 +
AT2_severity = LOW
 +
AT3_severity = LOW
 +
AT1_alarm_name = "cpu high"
 +
AT2_alarm_name = "memory high"
 +
AT3_alarm_name = "vm is dead"
 +
AT1_state = ALARM
 +
AT2_state = ALARM
 +
AT3_state = ALARM
 +
 +
Output: AT1 is the source alarm which will send a notification. AT2 is the target alarm and will get inhibited. AT3 matches the exclusions and will send a notification immediately.
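The inhibition example above can be sketched in a few lines of Python. This is only an illustration of the rule semantics as described here, not the Monasca implementation: the function names `matches` and `is_inhibited` are hypothetical, and the sketch assumes that a source alarm must be in the ALARM state and must share the values of every label listed in "equals" with the target.

```python
import json

RULE = json.loads(
    '{"alarm-inhibition-definition-created": {"name": "inhibit_rule_1",'
    '"source":{"severity":"HIGH"},"target":{"severity":"LOW"},'
    ' "equals":["tenantId"], "exclusions": {"alarm_name": "vm is dead"}}}'
)["alarm-inhibition-definition-created"]


def matches(alarm, matchers):
    """True if the alarm carries every (key, value) pair in matchers."""
    return all(alarm.get(key) == value for key, value in matchers.items())


def is_inhibited(alarm, all_alarms, rule):
    """Hypothetical helper: decide whether `alarm` is inhibited by `rule`."""
    exclusions = rule.get("exclusions")
    if exclusions and matches(alarm, exclusions):
        return False  # excluded alarms always send their notification
    if not matches(alarm, rule["target"]):
        return False  # only target alarms can be inhibited
    for other in all_alarms:
        if other is alarm:
            continue
        # Assumption: a source must be in ALARM state and agree with the
        # target on every label named in "equals".
        if (other["state"] == "ALARM"
                and matches(other, rule["source"])
                and all(other.get(k) == alarm.get(k) for k in rule["equals"])):
            return True
    return False


AT1 = {"tenantId": "d42bc", "severity": "HIGH", "alarm_name": "cpu high", "state": "ALARM"}
AT2 = {"tenantId": "d42bc", "severity": "LOW", "alarm_name": "memory high", "state": "ALARM"}
AT3 = {"tenantId": "d42bc", "severity": "LOW", "alarm_name": "vm is dead", "state": "ALARM"}
ALARMS = [AT1, AT2, AT3]

results = {a["alarm_name"]: is_inhibited(a, ALARMS, RULE) for a in ALARMS}
```

With the three transitions from the example, only AT2 ("memory high") is reported as inhibited.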
 +
 +
* Grouping Scenario 1
 +
 +
GroupingRule1 = '{"alarm-grouping-definition-created":
 +
{"name": "group_rule_1", "matchers": ["hostname"], "id": "b7163","repeat_interval": "2h", "group_wait": "30s", "exclusions": {"alarm_name": "cpu_percent_high"}, "tenantId": " d42bc", "alarm_actions": ["cd892"], "ok_actions": ["ad892"], "undetermined_actions": ["cf892"]}}'
 +
 +
Three alarm transitions: AT1, AT2 and AT3
 +
 +
AT1_hostname = host1
 +
AT2_hostname = host1
 +
AT3_hostname = host2
 +
AT1_alarm_name = cpu_percent_high
 +
AT2_alarm_name = cpu_system_perc_high
 +
AT3_alarm_name = cpu_percent_high
 +
AT1_state = ALARM
 +
AT2_state = ALARM
 +
AT3_state = ALARM
 +
 +
Output: AT1 and AT3 match the exclusions and send notifications immediately. A grouped notification “group_notification_rule_1_host1_alarm[1]” is generated for AT2 and sent using alarm_actions ["cd892"]. There are no alarm_actions, ok_actions or undetermined_actions associated with the AT1, AT2 and AT3 alarm definitions.
 +
 +
* Grouping Scenario 2
 +
 +
GroupingRule2 = '{"alarm-grouping-definition-created":
 +
{"name": "group_rule_2", "matchers": ["hostname"], "id": "b7163","repeat_interval": "2h", "group_wait": "30s", "tenantId": " d42bc", "alarm_actions": ["cd892"], "ok_actions": ["ad892"], "undetermined_actions": ["cf892"]}}'
 +
 +
Three alarm transitions: AT1 with "alarm_actions": ["123ab"], AT2 with "alarm_actions": ["cd839"] and AT3
 +
 +
AT1_hostname = host1
 +
AT2_hostname = host1
 +
AT3_hostname = host2
 +
AT1_state = ALARM
 +
AT2_state = ALARM
 +
AT3_state = ALARM
 +
 +
Output: Two grouped notifications are generated, “group_notification_rule_2_host1_alarm[2]” and “group_notification_rule_2_host2_alarm[1]”, both using alarm_actions ["cd892"]. In addition, since AT1 and AT2 have their own alarm actions associated with them, two more notifications are sent out.
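As a rough sketch of Grouping Scenario 2, the Python below counts alarm transitions per (matcher values, state) and builds the notification names shown in the output. The naming scheme is inferred from the examples on this page, the function `group_notifications` is hypothetical, and "group_wait", "repeat_interval" and "exclusions" are deliberately ignored.

```python
from collections import Counter

GROUPING_RULE_2 = {
    "name": "group_rule_2",
    "matchers": ["hostname"],
    "alarm_actions": ["cd892"],
}


def group_notifications(transitions, rule):
    """Hypothetical helper: one grouped notification name per group."""
    counts = Counter()
    for transition in transitions:
        # A group is identified by the matcher values plus the alarm state.
        key = tuple(transition[m] for m in rule["matchers"])
        counts[key + (transition["state"].lower(),)] += 1

    # Assumption: "group_rule_2" becomes "group_notification_rule_2_...".
    label = rule["name"]
    if label.startswith("group_"):
        label = label[len("group_"):]
    return sorted(
        "group_notification_{}_{}[{}]".format(label, "_".join(key), count)
        for key, count in counts.items()
    )


transitions = [
    {"name": "AT1", "hostname": "host1", "state": "ALARM"},
    {"name": "AT2", "hostname": "host1", "state": "ALARM"},
    {"name": "AT3", "hostname": "host2", "state": "ALARM"},
]
names = group_notifications(transitions, GROUPING_RULE_2)
```

Each resulting name would then be sent using the rule's own actions for that state, independently of any actions attached to the individual alarm definitions.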
 +
 +
* Silenced and Grouped
 +
 +
SilenceRule1 = '{"alarm-silencing-definition-created": {"name": "silence_rule_1", "matchers": {"severity": "LOW"}, "start_time": "1487269470498", "end_time": "1587269470498"}}'
 +
 +
GroupingRule2 = '{"alarm-grouping-definition-created":
 +
{"name": "group_rule_2", "matchers": ["hostname"], "id": "b7163","repeat_interval": "2h", "group_wait": "30s", "tenantId": " d42bc", "alarm_actions": ["cd892"], "ok_actions": ["ad892"], "undetermined_actions": ["cf892"]}}'
 +
 +
Four alarm transitions: AT1, AT2, AT3 and AT4
 +
 +
Silencing rule:
 +
AT1_severity = HIGH
 +
AT2_severity = LOW
 +
AT3_severity = HIGH
 +
AT4_severity = HIGH
 +
 +
Grouping rule:
 +
AT1_hostname = host1
 +
AT2_hostname = host1
 +
AT3_hostname = host2
 +
AT4_hostname = host1
 +
AT1_state = ALARM
 +
AT2_state = ALARM
 +
AT3_state = OK
 +
AT4_state = ALARM
 +
 +
Output: Two grouped notifications are generated: “group_notification_rule_2_host1_alarm[2]” using alarm action "cd892" and “group_notification_rule_2_host2_ok[1]” using ok action "ad892". AT2 is silenced, so it isn't included in the group_notification_rule_2_host1 count. There are no alarm_actions, ok_actions or undetermined_actions associated with the AT1, AT2, AT3 and AT4 alarm definitions.
 +
 +
* Silenced and inhibited (source alarm get silenced)
 +
 +
SilenceRule2 = '{"alarm-silencing-definition-created": {"name": "silence_rule_2", "matchers": {"severity": "HIGH", "hostname": "host1"}, "start_time": "2017-02-21 15:00:00", "end_time": "2017-02-21 21:00:00"}}'
 +
 +
InhibitionRule1 = '{"alarm-inhibition-definition-created": {"name": "inhibit_rule_1","source":{"severity":"HIGH"},"target":{"severity":"LOW"}, "equals":["tenantId"]}}'
 +
 +
Alarm transitions: AT1, AT2
 +
 +
Inhibition rule:
 +
AT1_state = ALARM
 +
AT2_state = ALARM
 +
AT1_tenantId = "d42bc"
 +
AT2_tenantId = "d42bc"
 +
AT1_severity = HIGH
 +
AT2_severity = LOW
 +
 +
Silencing rule:
 +
AT1_hostname = host1
 +
AT2_hostname = host2
 +
AT1_severity = HIGH
 +
AT2_severity = LOW
 +
 +
Output: No notification is sent out. For inhibition, AT1 is the source alarm and AT2 is the target alarm, so AT2 is inhibited. At the same time, AT1 gets silenced because it matches the silence rule.
 +
 +
* Inhibited and grouped
 +
 +
InhibitionRule1 = '{"alarm-inhibition-definition-created": {"name": "inhibit_rule_1","source":{"severity":"HIGH"},"target":{"severity":"LOW"}, "equals":["tenantId"]}}'
 +
 +
GroupingRule1 = '{"alarm-grouping-definition-created": {"name": "group_rule_1", "matchers": ["hostname"], "id": "b7163", "repeat_interval": "2h", "group_wait": "30s", "tenantId": " d42bc", "alarm_actions": ["cd892"], "ok_actions": ["ad892"], "undetermined_actions": ["cf892"]}}'
 +
 +
Alarm transitions: AT1, AT2, AT3
 +
 +
Inhibition rule:
 +
AT1_tenantId = "d42bc"
 +
AT2_tenantId = "d42bc"
 +
AT3_tenantId = "d42bc"
 +
AT1_severity = HIGH
 +
AT2_severity = LOW
 +
AT3_severity = HIGH
 +
AT1_state = OK
 +
AT2_state = ALARM
 +
AT3_state = ALARM
 +
 +
Grouping rule:
 +
AT1_hostname = host1
 +
AT2_hostname = host2
 +
AT3_hostname = host1
 +
AT1_state = OK
 +
AT2_state = ALARM
 +
AT3_state = ALARM
 +
 +
Output: AT2 gets inhibited because its severity is LOW. AT3 is the source alarm. Since AT1 is in the OK state, it is not a source alarm. For grouping, AT1 and AT3 have the same hostname but different states, so two grouped notifications are sent out: “group_notification_rule_1_host1_ok[1]” and “group_notification_rule_1_host1_alarm[1]”.
 +
 +
* Silenced, inhibited and grouped
 +
 +
SilenceRule2 = '{"alarm-silencing-definition-created": {"name": "silence_rule_2", "matchers": {"severity": "HIGH", "hostname": "host1"}, "start_time": "2017-02-21 15:00:00", "end_time": "2017-02-21 21:00:00"}}'
 +
 +
InhibitionRule1 = '{"alarm-inhibition-definition-created": {"name": "inhibit_rule_1","source":{"severity":"HIGH"},"target":{"severity":"LOW"}, "equals":["tenantId"]}}'
 +
 +
GroupingRule1 = '{"alarm-grouping-definition-created": {"name": "group_rule_1", "matchers": ["hostname"], "id": "b7163", "repeat_interval": "2h", "group_wait": "30s", "tenantId": " d42bc", "alarm_actions": ["cd892"], "ok_actions": ["ad892"], "undetermined_actions": ["cf892"]}}'
 +
 +
Alarm transitions: AT1, AT2, AT3 and AT5
 +
 +
Silence rule:
 +
AT1_severity = HIGH AT1_hostname = host1 (silenced)
 +
AT2_severity = LOW AT2_hostname = host2
 +
AT3_severity = HIGH AT3_hostname = host1 (silenced)
 +
AT5_severity = HIGH AT5_hostname = host3
 +
 +
Inhibition rule:
 +
AT1_tenantId = "d42bc"
 +
AT2_tenantId = "d42bc"
 +
AT3_tenantId = "d42bc"
 +
AT5_tenantId = "d42bc"
 +
AT1_state = ALARM
 +
AT2_state = ALARM
 +
AT3_state = OK
 +
AT5_state = UNDETERMINED
 +
AT1_severity = HIGH (source)
 +
AT2_severity = LOW (target)
 +
AT3_severity = HIGH
 +
AT5_severity = HIGH
 +
 +
Grouping rule:
 +
AT1_hostname = host1
 +
AT2_hostname = host2
 +
AT3_hostname = host1
 +
AT5_hostname = host3
 +
 +
Output: AT1 is in "group_notification_rule_1_host1_alarm" group and silenced. AT3 is in "group_notification_rule_1_host1_ok" group and silenced.  AT2 is in "group_notification_rule_1_host2_alarm" group and inhibited. AT5 is in "group_notification_rule_1_host3_undetermined" group and will send notification “group_notification_rule_1_host3_undetermined[1]” using undetermined action "cf892".
  
 
= Development Environment =
 
= Development Environment =
Line 248: Line 473:
 
** Note, all components in Monasca, except for the Threshold Engine, have been ported to Python.
 
** Note, all components in Monasca, except for the Threshold Engine, have been ported to Python.
  
* Java: Several of the Monasca components are available as Java. OpenStack does not have any Java coding standards. We've adopted the Google Java Style at, https://google-styleguide.googlecode.com/svn/trunk/javaguide.html.
+
* Java: Several of the Monasca components are available in Java. OpenStack does not have any Java coding standards, so we've adopted the Google Java Style: https://google.github.io/styleguide/javaguide.html.
 
** The standard says either 80 or 100 length lines. We've adopted 100.
 
** The standard says either 80 or 100 length lines. We've adopted 100.
  
Line 271: Line 496:
 
* InfluxDB (http://influxdb.com/): An open-source distributed time series database with no external dependencies. InfluxDB is supported for the Metrics Database.
 
* InfluxDB (http://influxdb.com/): An open-source distributed time series database with no external dependencies. InfluxDB is supported for the Metrics Database.
  
* Vertica (http://www.vertica.com): A commercial Enterprise class SQL analytics database that is highly scalable. It offers built-in automatic high-availability and excels at in-database analytics and compressing and storing massive amounts of data. In the HP Public Cloud we use Vertica in a number of areas such as metrics and many other data streams. Currently, we process around 25 K metrics/sec and store them for > 13 month data retention periods. A free version of Vertica that can store up to 1 TB of data with no time-limit is available at, https://my.vertica.com/community/. Vertica is supported for the Metrics Database.
+
* Vertica (http://www.vertica.com): A commercial enterprise-class SQL analytics database that is highly scalable. It offers built-in automatic high availability and excels at in-database analytics and at compressing and storing massive amounts of data. A free community version of Vertica that can store up to 1 TB of data with no time limit is available at https://my.vertica.com/community/. Vertica is supported for the Metrics Database.
  
 
* Cassandra: Support for Cassandra for the Metrics Database is in progress.
 
* Cassandra: Support for Cassandra for the Metrics Database is in progress.

Revision as of 13:20, 28 June 2017

Overview

Monasca is an open-source, multi-tenant, highly scalable, performant, fault-tolerant monitoring-as-a-service solution that integrates with OpenStack. It uses a REST API for high-speed metrics processing and querying and has a streaming alarm engine and notification engine.

Project Launchpad

https://launchpad.net/monasca

Team Launchpad

https://launchpad.net/~monasca

Communication and Meetings

Documentation

Monasca API Specification: https://github.com/openstack/monasca-api/blob/master/docs/monasca-api-spec.md
Agent Documentation: https://github.com/openstack/monasca-agent

Logo

https://www.dropbox.com/sh/yigfsgvpzxz4t72/AABHPm5FOoBFv2Q-div_j9RXa?dl=0

Presentations

Boston Summit 2017
Making Monasca Monitor More: Extending Monasca's Data Gathering and Reporting Capabilities
Video: https://www.openstack.org/videos/boston-2017/making-monasca-monitor-more-extending-monascas-data-gathering-and-reporting-capabilities
Monitoring Hands-On Workshop
Material: https://github.com/witekest/monasca-bootcamp/
Show Me My Packet Log: Neutron Packet Logging with Monasca
Video: https://www.openstack.org/videos/boston-2017/show-me-my-packet-log-neutron-packet-logging-with-monasca
A Monitoring Case Study for Monasca: Smart City Infrastructure
Video: https://www.openstack.org/videos/boston-2017/a-monitoring-case-study-for-monasca-smart-city-infrastructure
Barcelona Summit 2016
Monasca: one year later
Video: https://www.openstack.org/videos/barcelona-2016/monasca-one-year-later
Fujitsu: Monasca Monitoring for Kubernetes – Delivered on OpenStack
Video: https://www.openstack.org/videos/barcelona-2016/fujitsu-monasca-monitoring-for-kubernetes-delivered-on-openstack
Austin Summit 2016
Monasca Bootcamp: Hands-on Workshops
Video: https://www.openstack.org/videos/video/monasca-bootcamp
Log Management with Monasca
Video: https://www.openstack.org/videos/video/log-management-with-monasca
Slides: https://github.com/witekest/presentations/blob/master/OpenStack_Summit/2016_04_Austin/Monasca_Logging_OpenStack_Summit_Austin.pdf
Enforcing Application SLAs with Congress and Monasca
Video: https://www.openstack.org/videos/video/enforcing-application-slas-with-congress-and-monasca
Only You Can Prevent Forest Fires - A Proactive Approach to Monitoring Your OpenStack Cloud
Video: https://www.openstack.org/videos/video/only-you-can-prevent-forest-fires-a-proactive-approach-to-monitoring-your-openstack-cloud
Monitoring a Multi-Region Cloud Based on OpenStack: The FIWARE Lab Case Study
Video: https://www.openstack.org/videos/video/monitoring-a-multi-region-cloud-based-on-openstack-the-fiware-lab-case-study
Tokyo Summit 2015
Congrats You Stood up an OpenStack Environment, But Now They Want You to Monitor It. Introducing Using Monasca for Production OpenStack Monitoring
Video: https://www.openstack.org/summit/tokyo-2015/videos/presentation/congrats-you-stood-up-an-openstack-environment-but-now-they-want-you-to-monitor-it-introducing-using-monasca-for-production-openstack-monitoring
Ceilometer+Monasca=Ceilosca
Video: https://www.openstack.org/summit/tokyo-2015/videos/presentation/ceilometer-monascaceilosca
ELK and Monasca Crossing: Logging as an OpenStack Service
Video: https://www.openstack.org/summit/tokyo-2015/videos/presentation/elk-and-monasca-crossing-logging-as-an-openstack-service
Slides: http://www.slideshare.net/mroderus/elk-and-monasca-crossing-logging-as-an-openstack-service
Auto Scaling Cloud Infrastructure and Applications
Video: https://www.openstack.org/summit/tokyo-2015/videos/presentation/auto-scaling-cloud-infrastructure-and-applications
Monasca Deep Dive (Paris Summit)
Video: https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/monasca-deep-dive-monitoring-at-scale
Slides: https://www.openstack.org/assets/presentation-media/Monasca-Deep-Dive-Paris-Summit.pdf
Colorado OpenStack 5th Birthday Monasca Operations
Hangout: https://youtu.be/YyOEU8aICiU
Slides: http://www.slideshare.net/dlfryar/colorado-openstack-5th-birthday-monasca-operations
MeetUp: http://www.meetup.com/OpenStack-Colorado/events/223495998/?a=ea1_grp&rv=ea1
Austin OpenStack OpenStack Monasca Architecture
Github: https://github.com/dlfryar/monasca-meetups/tree/master/2015/07/austin/openstack
Slides: https://github.com/dlfryar/monasca-meetups/blob/master/2015/07/austin/openstack/meetup-openstack-austin-2015.pdf
MeetUp: http://www.meetup.com/OpenStack-Austin/events/223068769/
SFBay OpenStack Advanced Track #OSSFO Topic: Monasca and Heat
Github: https://github.com/dlfryar/monasca-meetups/tree/master/2015/08/sunnyvale/openstack
Hangout: http://youtu.be/E-EEdOoMC-4
Slides: https://github.com/dlfryar/monasca-meetups/blob/master/2015/08/sunnyvale/openstack/meetup-openstack-sfbay-2015.pdf
MeetUp: http://www.meetup.com/openstack/events/215648142/?a=cr2_grp&rv=cr2&_af=event&_af_eid=215648142

Repositories

Core
Core: https://git.openstack.org/cgit/?q=monasca
Deployment

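The following repositories are available for deploying Monasca:

  • Docker: https://github.com/monasca/monasca-docker
  • Kubernetes: https://github.com/monasca/monasca-helm
  • Ansible: https://github.com/search?utf8=%E2%9C%93&q=ansible-monasca
  • Puppet: https://git.openstack.org/openstack/puppet-monasca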

Bugs

https://storyboard.openstack.org/#!/project_group/59

Requirements

Monasca/Requirements.

Features

This section describes the overall features.

  • A highly performant, scalable, reliable and fault-tolerant Monitoring as a Service (MONaaS) solution that scales to service-provider levels of metrics throughput. Performance, scalability and high availability have been designed in from the start. It can process hundreds of thousands of metrics/sec and offers data retention periods of greater than a year with no data loss while still processing interactive queries.
  • REST API for storing and querying metrics and historical information. Most monitoring solutions use special transports and protocols, such as CollectD or NSCA (Nagios). In our solution, HTTP is the only protocol used. This simplifies the overall design and also allows for a much richer way of describing the data via dimensions.
  • Multi-tenant and authenticated. Metrics are submitted and authenticated using Keystone and stored associated with a tenant ID.
  • Metrics defined using a set of (key, value) pairs called dimensions.
  • Real-time thresholding and alarming on metrics.
  • Compound alarms described using a simple expressive grammar composed of alarm sub-expressions and logical operators.
  • Monitoring agent that supports a number of built-in system and service checks and also supports Nagios checks and statsd.
  • Open-source monitoring solution built on open-source technologies.
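To make the idea of dimensions concrete, here is a minimal Python sketch that builds a metric as a JSON document. The field names ("name", "dimensions", "timestamp", "value") and the metric name are assumptions for illustration; the Monasca API Specification linked under Documentation is the authoritative reference. No request is actually sent.

```python
import json
import time


def build_metric(name, value, dimensions, timestamp_ms=None):
    """Return a metric as a plain dict; dimensions are (key, value) pairs."""
    if timestamp_ms is None:
        # Assumption: timestamps are expressed in milliseconds since the epoch.
        timestamp_ms = int(time.time() * 1000)
    return {
        "name": name,
        "dimensions": dict(dimensions),
        "timestamp": timestamp_ms,
        "value": float(value),
    }


metric = build_metric(
    "cpu.user_perc",
    42.5,
    {"hostname": "host1", "service": "monitoring"},
    timestamp_ms=1498656000000,
)

# The JSON body that would be sent over HTTP to the metrics resource.
body = json.dumps(metric, sort_keys=True)
```

Because dimensions are free-form key/value pairs, the same metric name can be reported for many hosts or services and later queried by any combination of them.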

Comparisons to alternatives

Monasca/Comparison

Architecture

Monasca Architecture Component Diagram

  • Monitoring Agent (monasca-agent): A modern Python based monitoring agent that consists of several sub-components and supports system metrics, such as cpu utilization and available memory, Nagios plugins, statsd and many built-in checks for services such as MySQL, RabbitMQ, and many others.
  • Monitoring API (monasca-api): A well-defined and documented RESTful API for monitoring that is primarily focused on the following concepts and areas:
    • Metrics: Store and query massive amounts of metrics in real-time.
    • Statistics: Query statistics for metrics.
    • Alarm Definitions: Create, update, query and delete alarm definitions.
    • Alarms: Query and delete the alarm history.
      • Simple expressive grammar for creating compound alarms composed of alarm subexpressions and logical operators.
      • Alarm severities can be associated with alarms.
      • The complete alarm state transition history is stored and queryable which allows for subsequent root cause analysis (RCA) or advanced analytics.
    • Notification Methods: Create and delete notification methods, such as email, and associate them with alarms. Supports notifying users directly via email when an alarm state transition occurs.
    • The Monasca API has both Java and Python implementations available.
  • Persister (monasca-persister): Consumes metrics and alarm state transitions from the MessageQ and stores them in the Metrics and Alarms database.
    • The Persister has both Java and Python implementations.
  • Transform and Aggregation Engine (monasca-transform): Transforms metric names and values, such as delta or time-based derivative calculations, and creates new metrics that are published to the Message Queue. The Transform Engine is not available yet.
  • Anomaly and Prediction Engine: Evaluates prediction and anomalies and generates predicted metrics as well as anomaly likelihood and anomaly scores. The Anomaly and Prediction Engine is currently in a prototype status.
  • Threshold Engine (monasca-thresh): Computes thresholds on metrics and publishes alarms to the MessageQ when exceeded. Based on Apache Storm, a free and open distributed real-time computation system.
  • Notification Engine (monasca-notification): Consumes alarm state transition messages from the MessageQ and sends notifications, such as emails for alarms. The Notification Engine is Python based.
  • Analytics Engine (monasca-analytics): Consumes alarm state transitions and metrics from the MessageQ and does anomaly detection and alarm clustering/correlation.
  • Message Queue: A third-party component that primarily receives published metrics from the Monitoring API and alarm state transition messages from the Threshold Engine, which are consumed by other components such as the Persister and Notification Engine. The Message Queue is also used to publish and consume other events in the system. Currently, a Kafka-based MessageQ is supported. Kafka is a high-performance, distributed, fault-tolerant, and scalable message queue with built-in durability. We will look at other alternatives, such as RabbitMQ. In fact, our previous implementation supported RabbitMQ, but due to its performance, scale, durability and high-availability limitations we have moved to Kafka.
  • Metrics and Alarms Database: A third-party component that primarily stores metrics and the alarm state history. Currently, Vertica and InfluxDB are supported. Support for Cassandra is in progress.
  • Config Database: A third-party component that stores a lot of the configuration and other information in the system. Currently, MySQL is supported. Support for PostgreSQL is in progress.
  • Monitoring Client (python-monascaclient): A Python command line client and library that communicates and controls the Monitoring API. The Monitoring Client was written using the OpenStack Heat Python client as a framework. The Monitoring Client also has a Python library, "monascaclient" similar to the other OpenStack clients, that can be used to quickly build additional capabilities. The Monitoring Client library is used by the Monitoring UI, Ceilometer publisher, and other components.
  • Monitoring UI: A Horizon dashboard for visualizing the overall health and status of an OpenStack cloud.
  • Ceilometer publisher: A multi-publisher plugin for Ceilometer, not shown, that converts and publishes samples to the Monitoring API.

Most of the components are described in their respective repositories. However, there aren't any repositories for the third-party components used, so we describe some of the relevant details here.

Message Schema

Monasca/Message Schema

Message Queue

A distributed, performant, scalable, HA message queue for distributing metrics, alarms and events in the monitoring system. Currently, based on Kafka.

Messages

There are several messages that are published and consumed by various components in Monasca via the MessageQ. See Message Schema.

Metrics and Alarms Database

A high-performance analytics database that can store massive amounts of metrics and alarms in real-time and also support interactive queries. Currently Vertica and InfluxDB are supported.

The SQL schema that is used by Vertica is as follows:

  • MonMetrics.Measurements: Stores the actual measurements that are sent.
    • id: An integer ID for the measurement.
    • definition_dimensions_id: A reference to DefinitionDimensions.
    • time_stamp
    • value
  • MonMetrics.DefinitionDimensions
    • id: A sha1 hash of (definition_id, dimension_set_id)
    • definition_id: A reference to the Definitions.id
    • dimension_set_id: A reference to the Dimensions.dimension_set_id
  • MonMetrics.Definitions
    • id: A sha1 hash of the (name, tenant_id, region)
    • name: Name of the metric.
    • tenant_id: The tenant_id that submitted the metric.
    • region: The region the metric was submitted under.
  • MonMetrics.Dimensions
    • dimension_set_id: A sha1 hash of the set of dimensions for a metric.
    • name: Name of dimension.
    • value: Value of dimension.
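The hashed ids in the schema above could be derived along the following lines. This is a sketch only: the way the fields are serialized before hashing (separator, ordering, hex output) is an assumption made here for illustration, and all function names and sample values are hypothetical.

```python
import hashlib


def sha1_id(*parts):
    """Hash the given string parts into a hex SHA-1 digest."""
    # Assumption: parts are joined with a NUL separator before hashing.
    joined = "\x00".join(parts)
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()


def dimension_set_id(dimensions):
    """Id for a set of dimensions, independent of insertion order."""
    flat = ["{}={}".format(key, value) for key, value in sorted(dimensions.items())]
    return sha1_id(*flat)


# Definitions.id: hash of (name, tenant_id, region).
definition_id = sha1_id("cpu.user_perc", "d42bc", "region-a")

# Dimensions.dimension_set_id: hash of the set of dimensions.
dim_set_id = dimension_set_id({"hostname": "host1", "service": "monitoring"})

# DefinitionDimensions.id: hash of (definition_id, dimension_set_id).
definition_dimensions_id = sha1_id(definition_id, dim_set_id)
```

Deriving ids from content means the same metric definition always maps to the same row, so a writer can compute the reference for a measurement without first looking it up.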

Config Database

The config database stores all the configuration information. Currently based on MySQL.

The SQL schema is as follows:

  • alarm
    • id
    • tenant_id
    • name
    • description
    • expression
    • state
    • actions_enabled
    • created_at
    • updated_at
    • deleted_at
  • alarm_action
    • alarm_id
    • alarm_state
    • action_id
  • notification_method
    • id
    • tenant_id
    • name
    • type
    • address
    • created_at
    • updated_at
  • sub_alarm
    • id
    • alarm_id
    • function
    • metric_name
    • operator
    • threshold
    • period
    • periods
    • state
    • created_at
    • updated_at
  • sub_alarm_dimension
    • sub_alarm_id
    • dimension_name
    • value

Events

Support for real-time event stream processing in Monasca is in progress. For more details, see Monasca/Events.

Logging

Support for logging in Monasca is under discussion. For more details, see Monasca/Logging.

Transform and Aggregation Engine

For more details, see Monasca/Transform.

Analytics

Support for anomaly detection and alarm clustering/correlation is in progress. For more details, see Monasca/Analytics.

Monitoring

Enablement and usage for monitoring the status of Monasca is under discussion. For more details, see Monasca/Monitoring_Of_Monasca.

UI/UX Support

Adding more support for common UI/UX queries is under discussion. For more details, see Monasca/UI_UX_Support.

Keystone Requirements

Monasca relies on Keystone to run, and the following Keystone configuration must exist.

  • The endpoint for the API must be registered in Keystone as the 'monasca' service.
  • The API must have an admin token to use when verifying the Keystone tokens it receives.
  • For each project that uses Monasca, two users must exist. One must be in the 'monasca-agent' role and is used by the monasca-agent instances running on machines. The other should not be in that role and can be used for logging into the UI, using the CLI, or making direct queries against the API.

Post Metric Sequence

This section describes the sequence of operations involved in posting a metric to the Monasca API.

Monasca Architecture Post Metric Diagram

  1. A metric is posted to the Monasca API.
  2. The Monasca API authenticates and validates the request and publishes the metric to the Message Queue.
  3. The Persister consumes the metric from the Message Queue and stores it in the Metrics Store.
  4. The Transform Engine consumes the metrics from the Message Queue, performs transform and aggregation operations on metrics, and publishes the metrics that it creates back to the Message Queue.
  5. The Threshold Engine consumes metrics from the Message Queue and evaluates alarms. If a state change occurs in an alarm, an "alarm-state-transitioned-event" is published to the Message Queue.
  6. The Notification Engine consumes "alarm-state-transitioned-events" from the Message Queue, evaluates whether each has a Notification Method associated with it, and sends the appropriate notification, such as an email.
  7. The Persister consumes the "alarm-state-transitioned-event" from the Message Queue and stores it in the Alarm State History Store.
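
The sequence above can be sketched as a small in-memory simulation. This is not Monasca code: the Python lists stand in for the Message Queue and the stores, and the topic names, the "valid-token" check, the single "cpu_high" alarm, and the fixed threshold are all assumptions made for illustration. The Transform Engine (step 4) is omitted.

```python
from collections import defaultdict

# In-memory stand-ins; names and topics are illustrative assumptions.
queue = defaultdict(list)      # Message Queue, keyed by topic
metrics_store = []             # Metrics Store
alarm_history = []             # Alarm State History Store
sent_notifications = []        # what the Notification Engine sent
alarm_state = {"cpu_high": "OK"}


def api_post_metric(metric, token):
    """Steps 1-2: authenticate, validate, and publish the metric."""
    if token != "valid-token":  # stand-in for Keystone token validation
        raise PermissionError("authentication failed")
    if "name" not in metric or "value" not in metric:
        raise ValueError("a metric needs a name and a value")
    queue["metrics"].append(metric)


def threshold_engine_run(threshold=90):
    """Step 5: evaluate alarms and publish state transitions."""
    for metric in queue["metrics"]:
        new_state = "ALARM" if metric["value"] > threshold else "OK"
        if new_state != alarm_state["cpu_high"]:
            alarm_state["cpu_high"] = new_state
            queue["alarm-state-transitions"].append(
                {"alarm": "cpu_high", "state": new_state}
            )


def notification_engine_run(methods):
    """Step 6: notify if the alarm has a notification method."""
    for event in queue["alarm-state-transitions"]:
        address = methods.get(event["alarm"])
        if address:
            sent_notifications.append((address, event["state"]))


def persister_run():
    """Steps 3 and 7: persist metrics and alarm state transitions."""
    metrics_store.extend(queue["metrics"])
    alarm_history.extend(queue["alarm-state-transitions"])


metric = {"name": "cpu.user_perc", "value": 95, "dimensions": {"hostname": "host1"}}
api_post_metric(metric, token="valid-token")
threshold_engine_run()
notification_engine_run({"cpu_high": "ops@example.com"})
persister_run()
```

Each consumer reads the queue without removing messages, which mirrors how several components consume the same metrics independently.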

Alarm Managers

This section describes the new alarm management features: alarm grouping, inhibition, and silencing.

Examples:

  • Silence

SilenceRule1 = '{"alarm-silencing-definition-created": {"name": "silence_rule_1", "matchers": {"severity": "LOW"}, "start_time": "2017-02-21 20:00:00", "end_time": "2017-02-21 22:00:00"}}'

Two alarm transitions: AT1 and AT2

AT1_severity = HIGH AT2_severity = LOW

Output: AT2 is silenced and AT1 sends a notification. SilenceRule1 expires after "end_time"="2017-02-21 22:00:00".
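
The silencing example above can be sketched as follows. The helper is_silenced is hypothetical, and the matching semantics (every matcher must equal the transition's attribute, and the current time must fall between start_time and end_time) are an assumption inferred from the example rather than the confirmed implementation.

```python
import json
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"

silence_rule_1 = json.loads(
    '{"alarm-silencing-definition-created": {"name": "silence_rule_1", '
    '"matchers": {"severity": "LOW"}, "start_time": "2017-02-21 20:00:00", '
    '"end_time": "2017-02-21 22:00:00"}}'
)["alarm-silencing-definition-created"]


def is_silenced(transition, rule, now):
    """Return True if the rule is active at `now` and all matchers match."""
    start = datetime.strptime(rule["start_time"], TIME_FORMAT)
    end = datetime.strptime(rule["end_time"], TIME_FORMAT)
    if not start <= now <= end:
        return False  # rule not yet active, or already expired
    return all(transition.get(k) == v for k, v in rule["matchers"].items())


at1 = {"severity": "HIGH"}
at2 = {"severity": "LOW"}
during = datetime(2017, 2, 21, 21, 0, 0)   # inside the silence window
after = datetime(2017, 2, 21, 23, 0, 0)    # after end_time
```

With these inputs, only AT2 is silenced during the window, and nothing is silenced once the rule has expired.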

  • Inhibit

InhibitionRule1 = '{"alarm-inhibition-definition-created": {"name": "inhibit_rule_1","source":{"severity":"HIGH"},"target":{"severity":"LOW"}, "equals":["tenantId"], "exclusions": {"alarm_name": "vm is dead"}}}'

Three alarm transitions: AT1, AT2 and AT3

AT1_tenantId = "d42bc" AT2_tenantId = "d42bc" AT3_tenantId = "d42bc" AT1_severity = HIGH AT2_severity = LOW AT3_severity = LOW AT1_alarm_name = "cpu high" AT2_alarm_name = "memory high" AT3_alarm_name = "vm is dead" AT1_state = ALARM AT2_state = ALARM AT3_state = ALARM

Output: AT1 is the source alarm which will send a notification. AT2 is the target alarm and will get inhibited. AT3 matches the exclusions and will send a notification immediately.
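
A rough sketch of the inhibition example above. The helper names are hypothetical, and the rule semantics are inferred from the examples in this section: a target is inhibited when some other transition in ALARM state matches "source", agrees with the target on every field listed in "equals", and the target does not match "exclusions".

```python
def matches(transition, matchers):
    """True if every matcher key/value equals the transition's attribute."""
    return all(transition.get(k) == v for k, v in matchers.items())


def is_inhibited(target, transitions, rule):
    """Decide whether `target` is inhibited by any source in `transitions`."""
    exclusions = rule.get("exclusions", {})
    if exclusions and matches(target, exclusions):
        return False  # excluded alarms always notify
    if not matches(target, rule["target"]):
        return False
    for source in transitions:
        if source is target or source["state"] != "ALARM":
            continue
        if matches(source, rule["source"]) and all(
            source.get(field) == target.get(field) for field in rule["equals"]
        ):
            return True
    return False


inhibit_rule_1 = {
    "name": "inhibit_rule_1",
    "source": {"severity": "HIGH"},
    "target": {"severity": "LOW"},
    "equals": ["tenantId"],
    "exclusions": {"alarm_name": "vm is dead"},
}
at1 = {"tenantId": "d42bc", "severity": "HIGH", "alarm_name": "cpu high", "state": "ALARM"}
at2 = {"tenantId": "d42bc", "severity": "LOW", "alarm_name": "memory high", "state": "ALARM"}
at3 = {"tenantId": "d42bc", "severity": "LOW", "alarm_name": "vm is dead", "state": "ALARM"}
transitions = [at1, at2, at3]

inhibited = [is_inhibited(t, transitions, inhibit_rule_1) for t in transitions]
```

Under these assumptions only AT2 is inhibited: AT1 is the source, and AT3 is protected by the exclusion.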

  • Grouping Scenario 1

GroupingRule1 = '{"alarm-grouping-definition-created": {"name": "group_rule_1", "matchers": ["hostname"], "id": "b7163","repeat_interval": "2h", "group_wait": "30s", "exclusions": {"alarm_name": "cpu_percent_high"}, "tenantId": " d42bc", "alarm_actions": ["cd892"], "ok_actions": ["ad892"], "undetermined_actions": ["cf892"]}}'

Three alarm transitions: AT1, AT2 and AT3

AT1_hostname = host1 AT2_hostname = host1 AT3_hostname = host2 AT1_alarm_name = cpu_percent_high AT2_alarm_name = cpu_system_perc_high AT3_alarm_name = cpu_percent_high AT1_state = ALARM AT2_state = ALARM AT3_state = ALARM

Output: AT1 and AT3 match the exclusions and send notifications immediately. A grouped notification “group_notification_rule_1_host1_alarm[1]” is generated and sent using alarm_actions ["cd892"]. There are no alarm_actions, ok_actions or undetermined_actions associated with the AT1, AT2 and AT3 alarm definitions.

  • Grouping Scenario 2

GroupingRule2 = '{"alarm-grouping-definition-created": {"name": "group_rule_2", "matchers": ["hostname"], "id": "b7163","repeat_interval": "2h", "group_wait": "30s", "tenantId": " d42bc", "alarm_actions": ["cd892"], "ok_actions": ["ad892"], "undetermined_actions": ["cf892"]}}'

Three alarm transitions: AT1 with "alarm_actions": ["123ab"], AT2 with "alarm_actions": ["cd839"] and AT3

AT1_hostname = host1 AT2_hostname = host1 AT3_hostname = host2 AT1_state = ALARM AT2_state = ALARM AT3_state = ALARM

Output: Two grouped notifications are generated, “group_notification_rule_2_host1_alarm[2]” and “group_notification_rule_2_host2_alarm[1]”, both using alarm_actions ["cd892"]. Also, since AT1 and AT2 have their own alarm actions associated with them, two more notifications are sent.
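
Grouping Scenarios 1 and 2 above can be sketched with one hypothetical helper. The notification naming pattern (rule name without its "group_" prefix, then the matcher values, the lower-cased state, and the count in brackets) is inferred from the example outputs and is an assumption, as is grouping by state.

```python
from collections import Counter


def group_notifications(transitions, rule):
    """Return (grouped notification names, transitions excluded from grouping)."""
    exclusions = rule.get("exclusions", {})
    suffix = rule["name"].replace("group_", "", 1)  # "group_rule_2" -> "rule_2"
    counts = Counter()
    excluded = []
    for t in transitions:
        if exclusions and all(t.get(k) == v for k, v in exclusions.items()):
            excluded.append(t)  # notified immediately, not grouped
            continue
        key = tuple(t[m] for m in rule["matchers"]) + (t["state"].lower(),)
        counts[key] += 1
    names = [
        "group_notification_{}_{}[{}]".format(suffix, "_".join(key), n)
        for key, n in sorted(counts.items())
    ]
    return names, excluded


# Scenario 1: exclusions on alarm_name.
rule_1 = {"name": "group_rule_1", "matchers": ["hostname"],
          "exclusions": {"alarm_name": "cpu_percent_high"}}
scenario_1 = [
    {"hostname": "host1", "alarm_name": "cpu_percent_high", "state": "ALARM"},
    {"hostname": "host1", "alarm_name": "cpu_system_perc_high", "state": "ALARM"},
    {"hostname": "host2", "alarm_name": "cpu_percent_high", "state": "ALARM"},
]
names_1, excluded_1 = group_notifications(scenario_1, rule_1)

# Scenario 2: no exclusions.
rule_2 = {"name": "group_rule_2", "matchers": ["hostname"]}
scenario_2 = [
    {"hostname": "host1", "state": "ALARM"},
    {"hostname": "host1", "state": "ALARM"},
    {"hostname": "host2", "state": "ALARM"},
]
names_2, excluded_2 = group_notifications(scenario_2, rule_2)
```

The sketch covers only the grouping step; the per-alarm actions of AT1 and AT2 in Scenario 2, group_wait, and repeat_interval are not modeled.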

  • Silenced and Grouped

SilenceRule1 = '{"alarm-silencing-definition-created": {"name": "silence_rule_1", "matchers": {"severity": "LOW"}, "start_time": "1487269470498", "end_time": "1587269470498"}}'

GroupingRule2 = '{"alarm-grouping-definition-created": {"name": "group_rule_2", "matchers": ["hostname"], "id": "b7163","repeat_interval": "2h", "group_wait": "30s", "tenantId": " d42bc", "alarm_actions": ["cd892"], "ok_actions": ["ad892"], "undetermined_actions": ["cf892"]}}'

Four alarm transitions: AT1, AT2, AT3 and AT4

Silencing rule: AT1_severity = HIGH AT2_severity = LOW AT3_severity = HIGH AT4_severity = HIGH

Grouping rule: AT1_hostname = host1 AT2_hostname = host1 AT3_hostname = host2 AT4_hostname = host1 AT1_state = ALARM AT2_state = ALARM AT3_state = OK AT4_state = ALARM

Output: Two grouped notifications are generated: “group_notification_rule_2_host1_alarm[2]” using alarm action "cd892" and “group_notification_rule_2_host2_ok[1]” using ok action "ad892". AT2 is silenced, so it isn't included in the group_notification_rule_2_host1 count. There are no alarm_actions, ok_actions or undetermined_actions associated with the AT1, AT2, AT3 and AT4 alarm definitions.

  • Silenced and inhibited (source alarm gets silenced)

SilenceRule2 = '{"alarm-silencing-definition-created": {"name": "silence_rule_2", "matchers": {"severity": "HIGH", "hostname": "host1"}, "start_time": "2017-02-21 15:00:00", "end_time": "2017-02-21 21:00:00"}}'

InhibitionRule1 = '{"alarm-inhibition-definition-created": {"name": "inhibit_rule_1","source":{"severity":"HIGH"},"target":{"severity":"LOW"}, "equals":["tenantId"]}}'

Alarm transitions: AT1, AT2

Inhibition rule: AT1_state = ALARM AT2_state = ALARM AT1_tenantId = "d42bc" AT2_tenantId = "d42bc" AT1_severity = HIGH AT2_severity = LOW

Silencing rule: AT1_hostname = host1 AT2_hostname = host2 AT1_severity = HIGH AT2_severity = LOW

Output: No notification is sent. For inhibition, AT1 is the source alarm and AT2 is the target alarm, so AT2 is inhibited. At the same time, AT1 is silenced because it matches the silence rule.

  • Inhibited and grouped

InhibitionRule1 = '{"alarm-inhibition-definition-created": {"name": "inhibit_rule_1","source":{"severity":"HIGH"},"target":{"severity":"LOW"}, "equals":["tenantId"]}}'

GroupingRule1 = '{"alarm-grouping-definition-created": {"name": "group_rule_1", "matchers": ["hostname"], "id": "b7163", "repeat_interval": "2h", "group_wait": "30s", "tenantId": " d42bc", "alarm_actions": ["cd892"], "ok_actions": ["ad892"], "undetermined_actions": ["cf892"]}}'

Alarm transitions: AT1, AT2, AT3

Inhibition rule: AT1_tenantId = "d42bc" AT2_tenantId = "d42bc" AT3_tenantId = "d42bc" AT1_severity = HIGH AT2_severity = LOW AT3_severity = HIGH AT1_state = OK AT2_state = ALARM AT3_state = ALARM

Grouping rule: AT1_hostname = host1 AT2_hostname = host2 AT3_hostname = host1 AT1_state = OK AT2_state = ALARM AT3_state = ALARM

Output: AT2 is inhibited because its severity is LOW. AT3 is the source alarm. Since AT1 is in the OK state, it is not a source alarm. For grouping, AT1 and AT3 have the same hostname but different states, so two grouped notifications are sent: “group_notification_rule_1_host1_ok[1]” and “group_notification_rule_1_host1_alarm[1]”.

  • Silenced, inhibited and grouped

SilenceRule2 = '{"alarm-silencing-definition-created": {"name": "silence_rule_2", "matchers": {"severity": "HIGH", "hostname": "host1"}, "start_time": "2017-02-21 15:00:00", "end_time": "2017-02-21 21:00:00"}}'

InhibitionRule1 = '{"alarm-inhibition-definition-created": {"name": "inhibit_rule_1","source":{"severity":"HIGH"},"target":{"severity":"LOW"}, "equals":["tenantId"]}}'

GroupingRule1 = '{"alarm-grouping-definition-created": {"name": "group_rule_1", "matchers": ["hostname"], "id": "b7163", "repeat_interval": "2h", "group_wait": "30s", "tenantId": " d42bc", "alarm_actions": ["cd892"], "ok_actions": ["ad892"], "undetermined_actions": ["cf892"]}}'

Alarm transitions: AT1, AT2, AT3 and AT5

Silence rule: AT1_severity = HIGH AT1_hostname = host1 (silenced) AT2_severity = LOW AT2_hostname = host2 AT3_severity = HIGH AT3_hostname = host1 (silenced) AT5_severity = HIGH AT5_hostname = host3

Inhibition rule: AT1_tenantId = "d42bc" AT2_tenantId = "d42bc" AT3_tenantId = "d42bc" AT5_tenantId = "d42bc" AT1_state = ALARM AT2_state = ALARM AT3_state = OK AT5_state = UNDETERMINED AT1_severity = HIGH (source) AT2_severity = LOW (target) AT3_severity = HIGH AT5_severity = HIGH

Grouping rule: AT1_hostname = host1 AT2_hostname = host2 AT3_hostname = host1 AT5_hostname = host3

Output: AT1 is in "group_notification_rule_1_host1_alarm" group and silenced. AT3 is in "group_notification_rule_1_host1_ok" group and silenced. AT2 is in "group_notification_rule_1_host2_alarm" group and inhibited. AT5 is in "group_notification_rule_1_host3_undetermined" group and will send notification “group_notification_rule_1_host3_undetermined[1]” using undetermined action "cf892".
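
The combined example above can be sketched by chaining the three checks. The function notify is hypothetical, and the order of evaluation is an assumption inferred from the examples: a silenced alarm can still act as an inhibition source, and silenced or inhibited transitions are left out of the grouped counts. The silence rule is assumed to be active, so its time window is not checked here.

```python
from collections import Counter


def matches(transition, matchers):
    return all(transition.get(k) == v for k, v in matchers.items())


def notify(transitions, silence, inhibit, group):
    """Return the grouped notification names that would be sent."""
    # Sources are ALARM-state transitions matching "source", silenced or not.
    sources = [
        t for t in transitions
        if t["state"] == "ALARM" and matches(t, inhibit["source"])
    ]
    counts = Counter()
    for t in transitions:
        if matches(t, silence["matchers"]):
            continue  # silenced
        if matches(t, inhibit["target"]) and any(
            s is not t and all(s.get(f) == t.get(f) for f in inhibit["equals"])
            for s in sources
        ):
            continue  # inhibited
        key = tuple(t[m] for m in group["matchers"]) + (t["state"].lower(),)
        counts[key] += 1
    suffix = group["name"].replace("group_", "", 1)
    return [
        "group_notification_{}_{}[{}]".format(suffix, "_".join(key), n)
        for key, n in sorted(counts.items())
    ]


silence_rule_2 = {"matchers": {"severity": "HIGH", "hostname": "host1"}}
inhibit_rule_1 = {"source": {"severity": "HIGH"}, "target": {"severity": "LOW"},
                  "equals": ["tenantId"]}
group_rule_1 = {"name": "group_rule_1", "matchers": ["hostname"]}

transitions = [
    {"tenantId": "d42bc", "severity": "HIGH", "hostname": "host1", "state": "ALARM"},         # AT1
    {"tenantId": "d42bc", "severity": "LOW", "hostname": "host2", "state": "ALARM"},          # AT2
    {"tenantId": "d42bc", "severity": "HIGH", "hostname": "host1", "state": "OK"},            # AT3
    {"tenantId": "d42bc", "severity": "HIGH", "hostname": "host3", "state": "UNDETERMINED"},  # AT5
]
sent = notify(transitions, silence_rule_2, inhibit_rule_1, group_rule_1)
```

Only AT5 survives both checks, so a single grouped notification for host3 in the undetermined state remains.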

Development Environment

Coding Standards

  • Java: Several of the Monasca components are written in Java. OpenStack does not have any Java coding standards, so we've adopted the Google Java Style: https://google.github.io/styleguide/javaguide.html.
    • The standard allows a line length of either 80 or 100 characters. We've adopted 100.

Technologies

Monasca uses a number of third-party technologies:

  • Apache Kafka (http://kafka.apache.org): Apache Kafka is publish-subscribe messaging rethought as a distributed commit log. Kafka is a highly performant, distributed, fault-tolerant, and scalable message queue with durability built-in.
  • Apache Storm (http://storm.incubator.apache.org/): Apache Storm is a free and open source distributed realtime computation system. Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing.
  • MySQL: MySQL is supported as a Config Database.
  • PostgreSQL: PostgreSQL is supported for the Config Database via Hibernate and SQLAlchemy.
  • Vagrant (http://www.vagrantup.com/): Vagrant provides easy to configure, reproducible, and portable work environments built on top of industry-standard technology and controlled by a single consistent workflow to help maximize the productivity and flexibility of you and your team.
  • Dropwizard (https://dropwizard.github.io/dropwizard/): Dropwizard pulls together stable, mature libraries from the Java ecosystem into a simple, light-weight package that lets you focus on getting things done. Dropwizard has out-of-the-box support for sophisticated configuration, application metrics, logging, operational tools, and much more, allowing you and your team to ship a production-quality web service in the shortest time possible.
  • InfluxDB (http://influxdb.com/): An open-source distributed time series database with no external dependencies. InfluxDB is supported for the Metrics Database.
  • Vertica (http://www.vertica.com): A commercial enterprise-class SQL analytics database that is highly scalable. It offers built-in automatic high availability and excels at in-database analytics and at compressing and storing massive amounts of data. A free community version of Vertica that can store up to 1 TB of data with no time limit is available at https://my.vertica.com/community/. Vertica is supported for the Metrics Database.
  • Cassandra: Support for Cassandra for the Metrics Database is in progress.

License

Copyright (c) 2014, 2015 Hewlett-Packard Development Company, L.P.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

   http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.


Monasca uses YourKit Profiler for Java development.

Visit the YourKit website for more information.