Monitoring
This page summarizes the discussions that have happened during the Operations/Meetups:
- https://etherpad.openstack.org/p/juno-summit-ops-monitoringlogging
- https://etherpad.openstack.org/p/kilo-summit-ops-monitoring
Monitoring Categories
- Traditional service monitoring
- Tenant health monitoring
- Expose this monitoring to tenants
- Provide monitoring as a service for tenants
Tools
Numerous monitoring tools exist. See the Operations/Tools page for an inventory. Also take note of Monasca, a project aimed at creating a Monitoring-as-a-Service solution.
Monitoring Pains
- Operators are generally unhappy with monitoring OpenStack
- There are too many tools
- There are too many places in OpenStack that require monitoring
- It's not clear how and what to monitor
- The amount of monitoring will have a performance impact on the whole cloud
- How do you learn, ahead of time, about a host / service that needs to be monitored?
- Central authority for host information?
- How can this information be easily / automatically discovered?
- instance metadata for monitoring instances
- Puppet, Chef, etc
Solutions?
- Create and share best practices
- Inventory all areas in OpenStack that need to be monitored and create examples of how to monitor them
- Lots of effort but potential high reward
- Have this list available for each major release
Current Known Practices
These are items that have been collected during discussion. The actual implementation will vary. This is possibly a great starting point for the inventory mentioned above.
What To Monitor
- Service daemons
- nova-compute, glance-registry, etc
- Ensure they are running (a minimal process-check sketch follows this list)
- Ensure they are functioning
- Tempest, Rally
- Instance reachability
- iptables / nat
- attached volumes
- left-open iSCSI sessions
- RabbitMQ queue depth and health
- OVS tunnels
- Functionality of all actions which a tenant can perform
- Canary instances
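For the "ensure they are running" case, a check can be as simple as verifying that the daemon's process exists. Below is a minimal sketch in the Nagios plugin style (exit 0 = OK, 2 = CRITICAL, 3 = UNKNOWN); the process name is passed as an argument, so nothing in it is specific to one service. It only covers "running"; pair it with Tempest/Rally-style checks for "functioning".

#!/usr/bin/env python
# Minimal Nagios-style check that a service daemon has a running process.
# Usage: check_daemon.py <process-name>, e.g. check_daemon.py nova-compute
import subprocess
import sys

def main():
    if len(sys.argv) != 2:
        print('UNKNOWN: usage: check_daemon.py <process-name>')
        sys.exit(3)
    name = sys.argv[1]
    # pgrep -f matches against the full command line
    rc = subprocess.call(['pgrep', '-f', name], stdout=subprocess.DEVNULL)
    if rc == 0:
        print('OK: %s is running' % name)
        sys.exit(0)
    print('CRITICAL: %s is not running' % name)
    sys.exit(2)

if __name__ == '__main__':
    main()

The same shape works for any of the daemons listed above (glance-registry, neutron agents, and so on).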
Kernel logs
Many failure conditions on compute nodes are reported only to the kernel log:
- OOM kills of qemu (an instance goes to the 'shutdown' state without any action by the tenant)
- disk I/O errors
- network flapping (link up/down events on interfaces)
- MCEs (machine check exceptions, such as memory and processor errors)
netconsole is a convenient way to gather these logs centrally and react to them.
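To illustrate, a netconsole receiver can be a small UDP listener that matches the patterns above and hands matches to whatever alerting is in place. In this minimal sketch, the port (6666, the usual netconsole target) and the regexes are assumptions to be tuned to your netconsole configuration and kernel versions.

# Minimal netconsole receiver: listen for kernel log lines sent over UDP
# and flag the failure patterns listed above.
import re
import socket

PATTERNS = {
    'oom':       re.compile(r'Out of memory: Kill process .*qemu'),
    'io_error':  re.compile(r'I/O error'),
    'link_flap': re.compile(r'Link is (Up|Down)'),
    'mce':       re.compile(r'Machine check'),
}

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(('0.0.0.0', 6666))  # netconsole target port (assumed)

while True:
    data, (host, _) = sock.recvfrom(4096)
    line = data.decode('utf-8', errors='replace')
    for name, pattern in PATTERNS.items():
        if pattern.search(line):
            # Hook your alerting in here (email, IRC bot, ticket, ...)
            print('%s: %s matched: %s' % (host, name, line.strip()))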
Monitoring Mediums
- SNMP
- NRPE
- Consul
- Ansible / SSH
Alert Mediums
- SMS
- Email to SMS
- Jabber / XMPP
- IRC
- NOC Consoles (details?)
- Ticket systems
- VoIP
Healthchecks
Audience definition: a way to query a service and have it report back whether it is “OK” or not
- Is it trustworthy?
- How is the “OK” being determined?
- Would operators rely on it or still double-check everything that the service is checking itself?
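To make the idea concrete, the sketch below polls a healthcheck URL and maps the answer to Nagios-style exit codes. Both the URL and the /healthcheck path are assumptions; not every OpenStack service exposes such an endpoint, and whether the "OK" it returns can be trusted is exactly the open question above.

# Minimal healthcheck poller with Nagios-style exit codes.
import sys
import urllib.error
import urllib.request

URL = 'http://controller:5000/healthcheck'  # hypothetical endpoint

try:
    with urllib.request.urlopen(URL, timeout=5) as response:
        body = response.read().decode('utf-8', errors='replace')
except urllib.error.HTTPError as exc:
    print('WARNING: %s returned HTTP %d' % (URL, exc.code))
    sys.exit(1)
except Exception as exc:
    print('CRITICAL: %s unreachable (%s)' % (URL, exc))
    sys.exit(2)

print('OK: %s reports healthy: %s' % (URL, body.strip()))
sys.exit(0)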
statsd
Would operators want each OpenStack service to provide statsd metrics the way Swift does?
- “Yes”, but the majority of the crowd wasn’t even using it for Swift. Why not?
- Most users are already polling their own metrics
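For anyone who wants to see what a service's statsd support actually emits, a throwaway sink is only a few lines. This sketch assumes the conventional StatsD UDP port (8125) and that the service (e.g. Swift via its log_statsd_host option) is pointed at the machine running it.

# Minimal StatsD sink: listen on UDP 8125 and print incoming metrics.
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(('0.0.0.0', 8125))

while True:
    data, _ = sock.recvfrom(4096)
    # StatsD lines look like "swift.object-server.PUT.timing:42|ms"
    for metric in data.decode('utf-8', errors='replace').splitlines():
        name, _, value = metric.partition(':')
        print('%-50s %s' % (name, value))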
Action Items
Completion of these items won’t solve all above-mentioned Pains, but it would be a great start.
- Start creating a list of “things” that need to be monitored in each OpenStack component
- Does not need to be a _complete_ list — that will always be an ongoing project
- Create example alerts for each of those items
- Would Nagios / NRPE checks be the best form of an example? It’s probably the style that the majority of operators are familiar with and is compatible with non-Nagios monitoring systems.
- If monitoring a specific service or function becomes too complex, determine whether it could be implemented more easily within the service itself, and suggest or create a blueprint for that service.
- A lot of this already exists, but is scattered in a lot of different places.
- Can someone show an example of using Rally or Tempest wrapped in a Nagios plugin?
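As a starting point, such a wrapper might look like the minimal sketch below. The command line ("tempest run --regex ...") and the scenario test chosen are assumptions; substitute whatever your deployment uses to invoke Tempest or Rally, and keep the mapping of results to Nagios exit codes.

#!/usr/bin/env python
# Minimal sketch of a Nagios plugin wrapping a Tempest smoke test.
import subprocess
import sys

CMD = ['tempest', 'run', '--regex',
       'tempest.scenario.test_server_basic_ops']  # assumed invocation

try:
    result = subprocess.run(CMD, capture_output=True, text=True, timeout=600)
except subprocess.TimeoutExpired:
    print('CRITICAL: tempest test timed out')
    sys.exit(2)
except OSError as exc:
    print('UNKNOWN: could not run tempest (%s)' % exc)
    sys.exit(3)

if result.returncode == 0:
    print('OK: tempest scenario passed')
    sys.exit(0)
print('CRITICAL: tempest scenario failed')
print(result.stdout[-500:])  # last chunk of output for context
sys.exit(2)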