
Monasca/Monitoring Of Monasca


Goals and Deliverables

  • Default alarm severity and descriptions
  • Out-of-the-box general purpose monitoring metrics and alarms for all parts (services, applications, OS) that make up a Monasca installation.
  • A dashboard for monitoring the health of the Monasca-specific components.
  • Each component should have metrics that give a view of the service useful for thresholds, debugging, and capacity planning.
  • CLI tools to complement the UI, capable of displaying Monasca details:
    • monasca-collector info
    • monasca-forwarder info
  • Metrics
    • Pre-configured alarm definitions for all core services with reasonable general purpose thresholds
  • Easy to see whether the service is up or down
  • Status, capacity, throughput, and latency with reasonable defaults out of the box
  • Standard naming convention for metrics, with some names reserved by monasca-agent (see the metric structure sketch after this list)
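
As a reference for the naming-convention item above, the following is a minimal sketch of the metric envelope the Monasca API accepts (name, dimensions, timestamp, value). The metric name and dimension values shown are illustrative assumptions, not a list of the reserved names.

 # Minimal sketch of a Monasca-style metric (illustrative only).
 # The metric name and dimension values are assumptions for the example,
 # not an authoritative list of reserved monasca-agent names.
 import time

 metric = {
     "name": "monasca.collection_time_sec",   # agent self-monitoring metric
     "dimensions": {                          # key/value pairs identifying the source
         "hostname": "monasca-node-01",
         "service": "monitoring",
         "component": "monasca-agent",
     },
     "timestamp": int(time.time() * 1000),    # milliseconds since the epoch
     "value": 2.4,                            # seconds spent in the last collection run
 }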


There are exceptions for shared components such as MySQL, where other OpenStack components might influence performance or availability. The shared database would be labeled generically rather than specifically identified as a Monasca component.

User Stories

  • As an end user the first thing I want to see after installing Monasca is a dashboard showing the status, capacity, and latency of my Monasca installation.
  • As an end user deploying Monasca either individually, via CI, Vagrant, or using the installer, I want an initial dashboard showing the status of Monasca.
  • As an operator I want a simple and concise view of the health of the Monasca service.
  • As an operator or provider I want metrics for all Monasca components that will describe the status, capacity, and latency of each component.

StackForge / OpenStack

Blueprints
Bugs
Commits
  • Dashboard [1]
  • Grafana board [2]
  • Moved to setting up alarms with a role so they can be used more widely [3]
  • Added the default alarms role [4]
  • New monasca-vagrant role for global alarms [5]
  • Apache Storm and Threshold Engine StatsD monitoring [6]
  • Ansible config file update pull request for Storm/Thresh Engine [7]
  • Grafana board update for Storm and Threshold Engine [8]
  • Vertica plugin [9]
  • InfluxDB plugin [10]
  • Dropwizard plugin (API, Persister, Thresh) [11]
  • Grafana board (Vertica, Persister, InfluxDB, API) [12]

Future Feature Considerations

Support for adding dimensions as a list

Notes

  • Consistent metrics namespace. Currently classname.variablename, using the Dropwizard Metrics default.
  • Consistent metrics mechanism where it makes sense; StatsD is a candidate globally.
  • Use StatsD for all of the Monasca components (not for the off-the-shelf components). StatsD should be the primary mechanism for delivering metrics; further metrics can optionally be provided via Dropwizard or any other mechanism, but those would generally be used for debugging and development. A minimal emit sketch follows this list.
  • Metrics libraries currently used and available:
    • Java: statsd, dropwizard
    • Python: yammer metrics library
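
To make the StatsD point concrete, the following is a minimal sketch that emits a counter and a timer using the plain-text StatsD protocol over UDP. The host, port, and metric names are assumptions for illustration; a real component would normally use a StatsD client library (for example monasca-statsd) rather than raw sockets.

 # Minimal StatsD emit sketch (illustrative): sends a counter and a timer
 # using the plain-text StatsD protocol over UDP. Host, port, and metric
 # names are assumptions, not values defined by Monasca.
 import socket

 STATSD_HOST = "127.0.0.1"   # assumed local statsd daemon
 STATSD_PORT = 8125          # conventional StatsD UDP port

 def send_statsd(payload):
     """Fire-and-forget a single StatsD datagram."""
     sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
     try:
         sock.sendto(payload.encode("ascii"), (STATSD_HOST, STATSD_PORT))
     finally:
         sock.close()

 # Counter: one notification processed (format: <name>:<value>|c)
 send_statsd("monasca.notification.sent_count:1|c")

 # Timer: time taken to send an email notification, in ms (format: <name>:<value>|ms)
 send_statsd("monasca.notification.email_time_ms:42|ms")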

Measurements

Messages per second. Trigger an alarm if the rate falls below a threshold.

Architectural Components

Off-the-shelf open source components

  • Apache Kafka (message queue)
  • MySQL (alarm, notifications database)
  • InfluxDB (metrics, logging, events database)
  • Apache Storm (realtime stream processor)
  • Apache Zookeeper (resource coordinator)
  • Operating System


Monasca components

  • API
  • Agent
  • Notification engine
  • Threshold engine
  • Persister


Type is one of: status, capacity, throughput, latency. (A sketch of creating one of these alarm definitions via the API follows the table.)

# | Alarm Definition Name | Category | Provider | Component | Subcomponent | Type | Measurement
1 | HTTP Status Alarm | System | Application | Monasca | API | Status | Up / Down
2 | Host Alive Alarm | System | OS | Processor | Hardware | Status | Up / Down
3 | Disk Usage | System | OS | Disk | Hardware | Capacity | Percentage
4 | Disk Inode Usage | System | OS | Disk | Hardware | Capacity | Percentage
5 | High CPU Usage | System | OS | Processor | Hardware | Capacity | Percentage
6 | Network Errors | System | OS | Network | Hardware | Status | Count
7 | Memory Usage | System | OS | Memory | Hardware | Capacity | Percentage
8 | Kafka Consumer Lag | Monasca | Application | Message Queue | Consumer | Latency | Time
9 | Monasca Agent emit time | Monasca | Application | Monasca | Agent | Latency | Time
10 | Monasca Notification Configuration DB query time | Monasca | Application | Monasca | Notification | Latency | Time
11 | Monasca Agent collection time | Monasca | Application | Monasca | Agent | Latency | Time
12 | Zookeeper Average Latency | Monasca | Application | Resource Coordinator | ? | Latency | Time
13 | Monasca Notification email time | Monasca | Application | Monasca | Notification | Latency | Time
14 | Process not found | System | OS | Processor | Process | Status | Count
15 | VM CPU usage | OpenStack | OS | Processor | Hardware | Capacity | Percentage
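
The pre-configured alarm definitions above would be created through the Monasca API's alarm-definitions resource. Below is a minimal sketch using the requests library against the v2.0 API; the endpoint, token, metric names, dimensions, and thresholds are illustrative assumptions rather than values prescribed by this page.

 # Minimal sketch: creating one of the default alarm definitions via the
 # Monasca API (POST /v2.0/alarm-definitions). Endpoint, token, metric
 # names, dimensions, and thresholds are assumptions for illustration.
 import requests

 MONASCA_API = "http://monasca-api.example.com:8070/v2.0"   # assumed endpoint
 HEADERS = {
     "X-Auth-Token": "KEYSTONE_TOKEN_GOES_HERE",   # obtained from Keystone
     "Content-Type": "application/json",
 }

 # Example for row 1 of the table (HTTP Status Alarm for the Monasca API),
 # assuming the agent's http_check plugin reports http_status (0 = OK).
 alarm_definition = {
     "name": "HTTP Status Alarm",
     "description": "Monasca API HTTP endpoint is not responding",
     "severity": "HIGH",
     "expression": "max(http_status{service=monitoring, component=monasca-api}) > 0",
     "match_by": ["hostname"],
 }

 # A rate-below-threshold alarm (see Measurements above) could use an
 # expression such as "avg(hypothetical.messages_per_sec) < 100".

 resp = requests.post(MONASCA_API + "/alarm-definitions",
                      headers=HEADERS, json=alarm_definition)
 resp.raise_for_status()
 print(resp.json()["id"])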

Component Status

Agent

  • Collection time (existing)
  • Emit time (existing)
  • Message error rate needs to be added: add both an error count and an error rate, and alarm on the rate (only one of the two needs an alarm).
  • Possible future performance metric: number_of_messages_sent
    • Move the metric from the collector to the forwarder; it would be a much more useful measurement there.
  • Keystone auth errors need to be added; this tells us whether there is an authentication problem.

API

  • Goal: have the same metrics for the Java and Python implementations so that alarms can be shared.
  • The current Python API does not have metrics.
  • The current metric is status (UP/DOWN).

Notification engine

  • Currently using statsd

Threshold engine

  • ack-count.[COMPONENT]-bolt_default
  • ack-count.metrics-spout_default
  • emit-count.alarm-creation-stream
  • execute-count.[COMPONENT]-bolt_default
  • execute-count.event-spout_default
  • execute-count.filtering-bolt_alarm-creation-stream
  • execute-count.filtering-bolt_default
  • execute-count.metrics-spout_default
  • execute-latency.[COMPONENT]-bolt_default
  • execute-latency.event-spout_default
  • execute-latency.filtering-bolt_alarm-creation-stream
  • execute-latency.filtering-bolt_default
  • execute-latency.metrics-spout_default
  • process-latency.[COMPONENT]-bolt_default
  • process-latency.metrics-spout_default
  • transfer-count.alarm-creation-stream


Persister

  • Goal: have the same metrics for the Java and Python implementations so that alarms can be shared.
  • The current Python persister does not have metrics.
  • The current metric is status (UP/DOWN).

Operating System

  • Currently has plugin

MySQL (alarm, notifications database)

  • Lots of existing metrics so none needed
  • Currently has plugin

Apache Kafka (message queue)

  • Lots of existing metrics so none needed
  • Currently has plugin

Apache Zookeeper (resource coordinator)

  • Lots of existing metrics so none needed
  • Currently has plugin

InfluxDB (metrics, logging, events database)

  • No metrics at all
  • TBD future

Apache Storm (realtime stream processor)

  • Also has UI and metrics enabled
  • GC_ConcurrentMarkSweep.count
  • GC_ConcurrentMarkSweep.timeMs
  • GC_ParNew.count
  • GC_ParNew.timeMs
  • ack-count.system_tick
  • emit-count.default
  • emit-count.metrics
  • emit-count.system
  • execute-count.system_tick
  • execute-latency.system_tick
  • memory_heap.committedBytes
  • memory_heap.initBytes
  • memory_heap.maxBytes
  • memory_heap.unusedBytes
  • memory_heap.usedBytes
  • memory_heap.virtualFreeBytes
  • memory_nonHeap.committedBytes
  • memory_nonHeap.initBytes
  • memory_nonHeap.maxBytes
  • memory_nonHeap.unusedBytes
  • memory_nonHeap.usedBytes
  • memory_nonHeap.virtualFreeBytes
  • newWorkerEvent
  • process-latency.system_tick
  • receive.capacity
  • receive.population
  • receive.read_pos
  • receive.write_pos
  • sendqueue.capacity
  • sendqueue.population
  • sendqueue.read_pos
  • sendqueue.write_pos
  • startTimeSecs
  • transfer-count.default
  • transfer-count.metrics
  • transfer-count.system
  • transfer.capacity
  • transfer.population
  • transfer.read_pos
  • transfer.write_pos
  • uptimeSecs