Operations/Monitoring

= Monitoring =

This page summarizes the monitoring discussions held at the Operations/Meetups:


 * https://etherpad.openstack.org/p/juno-summit-ops-monitoringlogging
 * https://etherpad.openstack.org/p/kilo-summit-ops-monitoring

Monitoring Categories

 * Traditional service monitoring
 * Tenant health monitoring
 * Expose this monitoring to tenants
 * Provide monitoring as a service for tenants

Tools
Numerous monitoring tools exist. See the Operations/Tools page for an inventory. Also note Monasca, a project that aims to create a Monitoring as a Service solution.

Monitoring Pains

 * Operators are generally unhappy with monitoring OpenStack
 * There are too many tools
 * There are too many places in OpenStack that require monitoring
 * It's not clear how and what to monitor
 * The amount of monitoring will have a performance impact on the whole cloud
 * How do you learn, beforehand, about a host / service that needs to be monitored?
 * Central authority for host information?
 * How can this information be easily / automatically discovered?
 * Instance metadata for monitoring instances
 * Puppet, Chef, etc

Solutions?

 * Create and share best practices
 * Inventory all areas in OpenStack that need to be monitored and create examples of how to monitor them
 * Lots of effort but potentially high reward
 * Have this list available for each major release

Current Known Practices
These are items that have been collected during discussion. The actual implementation will vary. This may be a good starting point for the inventory mentioned above.

Service Daemons
The following table lists the service daemons of projects which are part of the OpenStack integrated release.

What To Monitor

 * Service daemons
 * nova-compute, glance-registry, etc
 * Ensure they are running
 * Ensure they are functioning
 * Tempest, Rally
 * Instance reachability
 * iptables / nat
 * attached volumes
 * left-open iSCSI sessions
 * RabbitMQ queue depth and health
 * OVS tunnels
 * http://www.jaddog.org/2014/09/15/monitoring-ovs-tunnels/
 * Functionality of all actions which a tenant can perform
 * Canary instances

Kernel logs
Half of the bad things that happen on compute nodes are reported only in the kernel log. netconsole is a nice way to gather those logs and react to them.
 * OOM for qemu (the instance enters the 'shutdown' state without the tenant requesting it)
 * disk IO errors
 * network flapping (link up/down for interfaces)
 * MCE (machine check errors, like memory and processor errors)
 * segmentation faults for ovs-vswitchd
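
Reacting to the events listed above could be sketched roughly as follows, assuming kernel log lines (for example, as collected via netconsole) arrive as plain strings. The category names and regular expressions are illustrative guesses; real kernel message wording varies by kernel version and driver, so treat the patterns as a starting point only.

```python
import re

# Hypothetical patterns: adjust to the messages your kernels and drivers emit.
PATTERNS = {
    "oom_kill": re.compile(r"Out of memory: Kill(ed)? process|oom-killer", re.I),
    "disk_io_error": re.compile(r"I/O error", re.I),
    "link_flap": re.compile(r"link is (up|down)|link (up|down)", re.I),
    "mce": re.compile(r"machine check|\bmce\b", re.I),
    "ovs_segfault": re.compile(r"ovs-vswitchd.*segfault", re.I),
}


def classify(line):
    """Return the list of category names whose pattern matches the line."""
    return [name for name, pattern in PATTERNS.items() if pattern.search(line)]


def summarize(lines):
    """Count how many lines fall into each category."""
    counts = {}
    for line in lines:
        for category in classify(line):
            counts[category] = counts.get(category, 0) + 1
    return counts
```

The counts from `summarize` could then feed an alert threshold or a metric, depending on the monitoring system in use.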

Monitoring Mediums

 * SNMP
 * NRPE
 * Consul
 * Ansible / SSH

Alert Mediums

 * Email
 * SMS
 * Email to SMS
 * Jabber / XMPP
 * IRC
 * NOC Consoles (details?)
 * Ticket systems
 * VoIP

Healthchecks
Audience definition: a way to query a service and have it report back if it’s “OK” or not


 * Is it trustworthy?
 * How is the “OK” being determined?
 * Would operators rely on it or still double-check everything that the service is checking itself?
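
One way to address the trust questions above is for a healthcheck to report how its "OK" was determined rather than a bare yes / no. The sketch below is not an existing OpenStack API; `run_healthcheck` and the shape of its result are assumptions made for illustration. Each named check is a callable returning `(ok, detail)`.

```python
def run_healthcheck(checks):
    """Run named check callables and report overall status plus per-check detail.

    `checks` maps a name to a callable returning (ok, detail).
    """
    results = {}
    for name, check in checks.items():
        try:
            ok, detail = check()
        except Exception as exc:
            # A check that crashes is reported as a failure, not hidden.
            ok, detail = False, f"check raised {type(exc).__name__}: {exc}"
        results[name] = {"ok": bool(ok), "detail": detail}
    # With no checks configured there is nothing to vouch for, so report FAIL.
    overall = bool(results) and all(r["ok"] for r in results.values())
    return {"status": "OK" if overall else "FAIL", "checks": results}
```

Because every sub-check appears in the result, an operator can see which dependencies were tested and decide whether additional external checks are still needed.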

statsd
Would operators want each OpenStack service to provide statsd metrics, as Swift does?


 * “Yes”, but the majority of the crowd wasn’t even using it for Swift. Why not?
 * Most users are already polling their own metrics
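
For readers unfamiliar with statsd, the sketch below shows what emitting a metric looks like on the wire: a short text line of the form `name:value|type` sent over UDP, with 8125 as the conventional statsd port. The function names and the metric names used in the comments are hypothetical and do not correspond to metrics any OpenStack service is known to emit.

```python
import socket


def format_metric(name, value, metric_type="c"):
    """Build one statsd line, e.g. 'nova.api.requests:1|c'.

    Common types: 'c' (counter), 'g' (gauge), 'ms' (timer).
    """
    return f"{name}:{value}|{metric_type}"


def send_metric(name, value, metric_type="c", host="127.0.0.1", port=8125):
    """Send one metric over UDP. Fire-and-forget: delivery is not confirmed."""
    payload = format_metric(name, value, metric_type).encode("ascii")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))
    return payload
```

Since the transport is UDP, emitting metrics adds little overhead to the service, but lost packets go unnoticed, which is one reason some operators prefer to poll their own metrics.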

Action Items
Completion of these items won’t solve all above-mentioned Pains, but it would be a great start.


 * Start creating a list of “things” that need to be monitored in each OpenStack component
 * Does not need to be a _complete_ list; that will always be an ongoing project
 * Create example alerts for each of those items
 * Would Nagios / NRPE checks be the best form of an example? It’s probably the style that the majority of operators are familiar with and is compatible with non-Nagios monitoring systems.
 * If monitoring a specific service or function becomes too complex, determine whether it can be implemented more easily from within the service and suggest or create a blueprint for that service.
 * A lot of this already exists, but is scattered in a lot of different places.
 * Can someone show an example of using Rally or Tempest wrapped in a Nagios plugin?
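
As a partial answer to the last question, here is a minimal sketch of wrapping an arbitrary functional-test command in Nagios plugin conventions (exit code 0 = OK, 2 = CRITICAL, 3 = UNKNOWN, plus a one-line message). It assumes the wrapped command exits non-zero on failure; the exact Rally or Tempest command line is left to the operator, and `run_as_nagios_check` is a name invented for this example.

```python
import subprocess

# Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def run_as_nagios_check(command, timeout=300, label="functional test"):
    """Run a command and map its result to (nagios_exit_code, one-line message)."""
    try:
        proc = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        # A hung test run is treated as a failure of the thing being tested.
        return CRITICAL, f"CRITICAL: {label} timed out after {timeout}s"
    except OSError as exc:
        # The command itself could not be started (missing binary, permissions).
        return UNKNOWN, f"UNKNOWN: could not run {label}: {exc}"
    if proc.returncode == 0:
        return OK, f"OK: {label} passed"
    lines = (proc.stdout + proc.stderr).strip().splitlines()
    last_line = lines[-1] if lines else "no output"
    return CRITICAL, f"CRITICAL: {label} failed (exit {proc.returncode}): {last_line}"
```

A plugin script would call this with the chosen test command as a list of arguments, print the message, and exit with the returned code. Full test runs can take minutes, so such a check is probably better scheduled as a passive check or with a long interval than polled like a simple process check.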

External Resources

 * https://github.com/openstack/osops-tools-monitoring
 * Key nova-compute metrics to monitor: https://www.datadoghq.com/blog/openstack-monitoring-nova/?ref=wikipedia