Monasca/Monitoring Of Monasca
Revision as of 02:26, 8 March 2015
Goals and Deliverables
- Default alarm severity and descriptions
- Out of the box general purpose monitoring metrics and alarms available for all parts (services, applications, OS) that make up a Monasca installation.
- A dashboard for the Monasca-specific components to monitor their health.
- Each component should have metrics to give a view of the service that is useful for thresholds, debugging and capacity planning
- CLI tools that complement the UI, capable of displaying Monasca details
- monasca-collector info
- monasca-forwarder info
- Metrics
- Pre-configured Alarm definitions for all core services with reasonable general purpose thresholds
- Easy to see if the service is up or down
- Status, capacity, throughput, and latency with reasonable defaults out of the box
- Standard convention for metrics, with some names reserved for the monasca-agent
There are exceptions for shared components such as MySQL, where other OpenStack components might influence performance or availability. The shared database would be labeled generically rather than identified specifically as a Monasca component.
User Stories
- As an end user the first thing I want to see after installing Monasca is a dashboard showing the status, capacity, and latency of my Monasca installation.
- As an end user deploying Monasca either individually, via CI, Vagrant, or using the installer, I want an initial dashboard showing the status of Monasca.
- As an operator I want a simple and concise view of the health of the Monasca service.
- As an operator or provider I want metrics for all Monasca components that will describe the status, capacity, and latency of each component.
StackForge / OpenStack
Blueprints
Bugs
Commits
- Dashboard [1]
- Grafana board [2]
- Moved to setting up alarms with a role so they can be used more widely [3]
- Added the default alarms role [4]
- New monasca-vagrant role for global alarms [5]
- Apache Storm and Threshold Engine StatsD monitoring [6]
Future Feature Considerations
- Support for adding dimensions as a list
Notes
- Consistent metrics namespace. Currently classname.variablename using the dropwizard metrics default
- Consistent metrics mechanism where it makes sense. Statsd is a candidate globally.
- Use statsd for all the Monasca components (not the off-the-shelf components). Statsd should be the primary mechanism for delivering metrics; further metrics can optionally be provided via dropwizard or other mechanisms, but those would generally be used for debugging and development.
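To make the statsd convention above concrete, here is a minimal sketch of emitting a metric over the plain statsd UDP wire format (`name:value|type`). The metric name, host, and port below are illustrative assumptions, not values defined by Monasca.

```python
import socket

def statsd_payload(name, value, metric_type="c"):
    """Build a plain statsd datagram: 'name:value|type'.

    'c' is a counter; 'ms' would be a timer. This is the generic
    statsd protocol, not a Monasca-specific format.
    """
    return f"{name}:{value}|{metric_type}"

def emit_statsd(name, value, metric_type="c", host="127.0.0.1", port=8125):
    """Fire-and-forget the datagram at a statsd daemon (8125 is the
    conventional statsd port; adjust for your deployment)."""
    payload = statsd_payload(name, value, metric_type).encode("ascii")
    socket.socket(socket.AF_INET, socket.SOCK_DGRAM).sendto(payload, (host, port))

# Hypothetical counter in the classname.variablename style noted above.
emit_statsd("monasca.notification.sent_smtp_count", 1)
```

Because statsd is UDP-based, the emit call never blocks the component being measured, which is one reason it fits as the primary delivery mechanism.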
Metrics libraries currently used and available
- Java: statsd, dropwizard
- Python: yammer metrics library
Measurements
- Messages per second; trigger an alarm if it falls below a threshold.
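The measurement above reduces to a rate over a sampling interval plus a below-threshold check; a minimal sketch (the 60-second window and the threshold value are illustrative assumptions):

```python
def messages_per_second(message_count, interval_seconds):
    """Throughput over one sampling interval."""
    return message_count / interval_seconds

def below_threshold(rate, threshold):
    """True when throughput has dropped under the alarm threshold,
    e.g. because an upstream component has stalled."""
    return rate < threshold

# 300 messages observed in a 60-second window -> 5.0 msg/s
rate = messages_per_second(300, 60)
alarm = below_threshold(rate, threshold=1.0)  # hypothetical threshold
```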
Architectural Components
Off the shelf open components
- Apache Kafka (message queue)
- MySQL (alarm, notifications database)
- InfluxDB (metrics, logging, events database)
- Apache Storm (realtime stream processor)
- Apache Zookeeper (resource coordinator)
- Operating System
Monasca components
- API
- Agent
- Notification engine
- Threshold engine
- Persister
# | Alarm Definition Name | Category | Provider | Component | Subcomponent | Type (status, capacity, throughput, latency) | Measurement |
---|---|---|---|---|---|---|---|
1 | HTTP Status Alarm | System | Application | Monasca | API | Status | Up / Down |
2 | Host Alive Alarm | System | OS | Processor | Hardware | Status | Up / Down |
3 | Disk Usage | System | OS | Disk | Hardware | Capacity | Percentage |
4 | Disk Inode Usage | System | OS | Disk | Hardware | Capacity | Percentage |
5 | High CPU Usage | System | OS | Processor | Hardware | Capacity | Percentage |
6 | Network Errors | System | OS | Network | Hardware | Status | Count |
7 | Memory Usage | System | OS | Memory | Hardware | Capacity | Percentage |
8 | Kafka Consumer Lag | Monasca | Application | Message Queue | Consumer | Latency | Time |
9 | Monasca Agent emit time | Monasca | Application | Monasca | Agent | Latency | Time |
10 | Monasca Notification Configuration DB query time | Monasca | Application | Monasca | Notification | Latency | Time |
11 | Monasca Agent collection time | Monasca | Application | Monasca | Agent | Latency | Time |
12 | Zookeeper Average Latency | Monasca | Application | Resource Coordinator | ? | Latency | Time |
13 | Monasca Notification email time | Monasca | Application | Monasca | Notification | Latency | Time |
14 | Process not found | System | OS | Processor | Process | Status | Count |
15 | VM Cpu usage | OpenStack | OS | Processor | Hardware | Capacity | Percentage |
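For alarm 8 (Kafka Consumer Lag), lag is conventionally computed per partition as the broker's head offset minus the consumer group's committed offset; the table classifies it as a latency in time, which would additionally require dividing by the consume rate. A minimal sketch of the offset arithmetic, with the offset numbers below purely illustrative:

```python
def consumer_lag(head_offsets, committed_offsets):
    """Per-partition lag in messages: broker head offset minus the
    offset the consumer group has committed for that partition."""
    return {p: head_offsets[p] - committed_offsets.get(p, 0)
            for p in head_offsets}

head = {0: 1050, 1: 980}       # latest offsets on the brokers (example values)
committed = {0: 1000, 1: 980}  # offsets committed by the consumer group
lag = consumer_lag(head, committed)   # {0: 50, 1: 0}
total_lag = sum(lag.values())         # 50
```

A steadily growing total lag is the usual alarm condition: it means the consumer is falling behind the producers.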
Component Status
Agent
- Collection time (existing)
- Emit time (existing)
- Message error rate needs to be added. Add both an error count and an error rate, and alarm on the rate; only one of the two needs an alarm.
- Future possibly performance number_of_messages_sent
- Move the metric from the collector to the forwarder. Would be a much more useful measurement.
- Keystone auth errors need to be added; they tell us if there is an authentication problem.
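The error-rate note above suggests alarming on the rate rather than the raw count. A minimal sketch of deriving the rate from two counters; the 5% threshold is a hypothetical default, not one specified here:

```python
def error_rate(error_count, total_count):
    """Fraction of emitted messages that failed; 0.0 when nothing was sent,
    so an idle agent does not divide by zero or raise a false alarm."""
    return error_count / total_count if total_count else 0.0

def rate_alarm(rate, threshold=0.05):
    """Alarm when the error rate exceeds the threshold (hypothetical 5%)."""
    return rate > threshold

rate = error_rate(error_count=5, total_count=100)  # 0.05
```

Alarming on the rate keeps the alarm meaningful regardless of traffic volume, which is why only one of the two metrics needs an alarm.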
API
- Goal is to have the same metrics for the Java and Python implementations so alarms can be shared.
- Current Python API does not have metrics.
- Current metric is status (UP/DOWN)
Notification engine
- Currently using statsd
Threshold engine
- ack-count.[COMPONENT]-bolt_default
- ack-count.metrics-spout_default
- emit-count.alarm-creation-stream
- execute-count.[COMPONENT]-bolt_default
- execute-count.event-spout_default
- execute-count.filtering-bolt_alarm-creation-stream
- execute-count.filtering-bolt_default
- execute-count.metrics-spout_default
- execute-latency.[COMPONENT]-bolt_default
- execute-latency.event-spout_default
- execute-latency.filtering-bolt_alarm-creation-stream
- execute-latency.filtering-bolt_default
- execute-latency.metrics-spout_default
- process-latency.[COMPONENT]-bolt_default
- process-latency.metrics-spout_default
- transfer-count.alarm-creation-stream
Persister
- Goal is to have the same metrics for the Java and Python implementations so alarms can be shared.
- Current Python persister does not have metrics.
- Current metric is status (UP/DOWN)
Operating System
- Currently has plugin
MySQL (alarm, notifications database)
- Lots of existing metrics so none needed
- Currently has plugin
Apache Kafka (message queue)
- Lots of existing metrics so none needed
- Currently has plugin
Apache Zookeeper (resource coordinator)
- Lots of existing metrics so none needed
- Currently has plugin
InfluxDB (metrics, logging, events database)
- No metrics at all
- TBD future
Apache Storm (realtime stream processor)
- Also has UI and metrics enabled
- GC_ConcurrentMarkSweep.count
- GC_ConcurrentMarkSweep.timeMs
- GC_ParNew.count
- GC_ParNew.timeMs
- ack-count.system_tick
- emit-count.default
- emit-count.metrics
- emit-count.system
- execute-count.system_tick
- execute-latency.system_tick
- memory_heap.committedBytes
- memory_heap.initBytes
- memory_heap.maxBytes
- memory_heap.unusedBytes
- memory_heap.usedBytes
- memory_heap.virtualFreeBytes
- memory_nonHeap.committedBytes
- memory_nonHeap.initBytes
- memory_nonHeap.maxBytes
- memory_nonHeap.unusedBytes
- memory_nonHeap.usedBytes
- memory_nonHeap.virtualFreeBytes
- newWorkerEvent
- process-latency.system_tick
- receive.capacity
- receive.population
- receive.read_pos
- receive.write_pos
- sendqueue.capacity
- sendqueue.population
- sendqueue.read_pos
- sendqueue.write_pos
- startTimeSecs
- transfer-count.default
- transfer-count.metrics
- transfer-count.system
- transfer.capacity
- transfer.population
- transfer.read_pos
- transfer.write_pos
- uptimeSecs