Monitoring

This page summarizes the monitoring discussions held during the Operations/Meetups.

Monitoring Categories

Traditional service monitoring
Tenant health monitoring
Expose this monitoring to tenants
Provide monitoring as a service for tenants

Tools

Numerous monitoring tools exist. See the Operations/Tools page for an inventory. Also note Monasca, a project that aims to create a Monitoring as a Service solution.

Monitoring Pains

Operators are generally unhappy with monitoring OpenStack
There are too many tools
There are too many places in OpenStack that require monitoring
It's not clear how and what to monitor
The amount of monitoring will have a performance impact on the whole cloud
How do you learn, beforehand, the information about a host / service that needs to be monitored?
- Central authority for host information?
- How can this information be easily / automatically discovered?
- instance metadata for monitoring instances
  - Puppet, Chef, etc

Solutions?

Create and share best practices
Inventory all areas in OpenStack that need to be monitored and create examples of how to monitor them
- Lots of effort but potential high reward
- Have this list available for each major release

Current Known Practices

These are items that have been collected during discussion. The actual implementation will vary. This is possibly a great starting point for the inventory mentioned above.

Service Daemons

The following table lists the service daemons of projects which are part of the OpenStack integrated release.

Project	Component	Notes
Glance	glance-api	glance-api is a server daemon that provides an API for image storage, retrieval, and discovery.
Glance	glance-registry	glance-registry is a server daemon that provides storage, processing, and retrieval for metadata associated with images.
Keystone	keystone-server	keystone-server is a server daemon that provides identity management services.
Neutron	neutron-server	neutron-server is a server daemon that provides a webserver which exposes the Neutron API, and passes all webservice calls to the Neutron plugin for processing.
Nova	nova-api	nova-api is a server daemon that serves the nova EC2 and OpenStack APIs in separate greenthreads.
Nova	nova-cert	nova-cert is a server daemon that serves the Nova Cert service for X509 certificates. Used to generate certificates for euca-bundle-image. Only needed for EC2 API.
Nova	nova-conductor	nova-conductor is a server daemon that serves the Nova Conductor service, which provides coordination and database query support for Nova.
Nova	nova-consoleauth	nova-consoleauth is a server daemon that provides authentication for Nova consoles.
Nova	nova-novncproxy	nova-novncproxy is a server daemon that provides access to OpenStack Nova noVNC consoles.
Nova	nova-scheduler	nova-scheduler is a server daemon that is responsible for choosing a compute node to run a VM instance.

What To Monitor

Service daemons
- nova-compute, glance-registry, etc
- Ensure they are running
- Ensure they are functioning
  - Tempest, Rally
Instance reachability
- iptables / nat
Attached volumes
Left-open iSCSI sessions
RabbitMQ queue depth and health
OVS tunnels
- http://www.jaddog.org/2014/09/15/monitoring-ovs-tunnels/
Functionality of all actions which a tenant can perform
Canary instances
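
As one illustration of the "RabbitMQ queue depth" item above, here is a minimal sketch of a Nagios-style check that parses the output of `rabbitmqctl list_queues name messages`. The thresholds (100 / 1000) and the function names are assumptions chosen for the example, not values agreed on in the discussion; tune them to your own cloud.

```python
import subprocess

# Nagios-style plugin exit codes (a widely used convention).
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def parse_queue_depths(text):
    """Parse `rabbitmqctl list_queues name messages` output into {queue: depth}.

    Lines that are not "<name><TAB><integer>" (banners, column headers)
    are skipped.
    """
    depths = {}
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) == 2 and parts[1].strip().isdigit():
            depths[parts[0]] = int(parts[1])
    return depths


def evaluate(depths, warn=100, crit=1000):
    """Return (exit_code, message) based on the deepest queue.

    The warn/crit thresholds are placeholders, not recommendations.
    """
    if not depths:
        return UNKNOWN, "UNKNOWN: no queues found"
    name, depth = max(depths.items(), key=lambda item: item[1])
    if depth >= crit:
        return CRITICAL, f"CRITICAL: queue {name} has {depth} messages"
    if depth >= warn:
        return WARNING, f"WARNING: queue {name} has {depth} messages"
    return OK, f"OK: deepest queue {name} has {depth} messages"


def check_rabbitmq():
    """Run rabbitmqctl on the local broker host and evaluate the result."""
    result = subprocess.run(
        ["rabbitmqctl", "list_queues", "name", "messages"],
        capture_output=True, text=True, check=True,
    )
    return evaluate(parse_queue_depths(result.stdout))
```

A plugin script would call `check_rabbitmq()`, print the message, and exit with the returned code. Queue depth alone does not prove the broker is healthy, so this would complement a broker liveness check rather than replace one.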

Kernel logs

Roughly half of the bad things that happen on compute nodes are reported only in the kernel log:

OOM kills of qemu (the instance goes into the 'shutdown' state without the tenant asking for it)
disk IO errors
network flapping (link up/down for interfaces)
MCE (machine check errors, like memory and processor errors)
segmentation faults for ovs-vswitchd

netconsole is a nice way to gather those logs and react to them.
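
A minimal sketch of the "react to them" part: classifying kernel log lines into the categories listed above. The regular expressions are illustrative assumptions only; the exact kernel wording varies by kernel version and driver, so verify them against your own logs before alerting on them.

```python
import re

# Illustrative patterns only (assumptions): adjust to the messages your
# kernels and drivers actually emit.
PATTERNS = {
    "oom": re.compile(r"Out of memory: Kill(ed)? process \d+"),
    "disk_io": re.compile(r"I/O error"),
    "link_flap": re.compile(r"Link is (Down|Up)", re.IGNORECASE),
    "mce": re.compile(r"Machine check|\[Hardware Error\]", re.IGNORECASE),
    "ovs_segfault": re.compile(r"ovs-vswitchd\[\d+\]: segfault"),
}


def classify(line):
    """Return the first matching category name for a log line, or None."""
    for category, pattern in PATTERNS.items():
        if pattern.search(line):
            return category
    return None


def summarize(lines):
    """Count how many lines fall into each category; unmatched lines are ignored."""
    counts = {}
    for line in lines:
        category = classify(line)
        if category is not None:
            counts[category] = counts.get(category, 0) + 1
    return counts
```

The lines can come from wherever the netconsole stream is collected; `summarize()` gives per-category counts that an alerting system could threshold on.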

Monitoring Mediums

SNMP
NRPE
Consul
Ansible / SSH

Alert Mediums

Email
SMS
Email to SMS
Jabber / XMPP
IRC
NOC Consoles (details?)
Ticket systems
VoIP

Healthchecks

Audience definition: a way to query a service and have it report back if it’s “OK” or not

Is it trustworthy?
How is the “OK” being determined?
Would operators rely on it or still double-check everything that the service is checking itself?
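
To make the definition concrete, here is a minimal sketch of the client side of such a healthcheck, assuming the service exposes it over HTTP and signals "OK" with a 200 status. Both assumptions are ours; the discussion did not settle on a transport or on how "OK" is determined.

```python
import urllib.error
import urllib.request


def interpret(status, body):
    """Treat HTTP 200 as healthy; keep a short excerpt of the body as detail."""
    healthy = status == 200
    state = "OK" if healthy else "NOT OK"
    return healthy, f"{state} (HTTP {status}): {body.strip()[:80]}"


def query_healthcheck(url, timeout=5.0):
    """Query a healthcheck URL and return (healthy, message)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", "replace")
            return interpret(response.status, body)
    except urllib.error.HTTPError as exc:
        # The service answered, but with an error status.
        return interpret(exc.code, str(exc.reason or ""))
    except (urllib.error.URLError, OSError) as exc:
        # No answer at all: connection refused, DNS failure, timeout.
        return False, f"NOT OK: unreachable ({exc})"
```

For example, `query_healthcheck("http://controller.example.com/healthcheck")`, where the URL is hypothetical. A check like this only tells you what the service believes about itself, which is exactly the trust question raised above.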

statsd

Would operators want each OpenStack service to provide statsd metrics like Swift?

“Yes”, but the majority of the crowd wasn’t even using it for Swift. Why not?
Most users are already polling their own metrics
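
For readers who have not used statsd, a minimal sketch of what emitting a metric involves: a short text line sent over UDP. The metric name used in the example is hypothetical, and the host and port default to a local daemon on 8125 (the usual statsd default), which is an assumption about your setup.

```python
import socket


def format_metric(name, value, metric_type):
    """Build a statsd line: <name>:<value>|<type>.

    Supported types here: c (counter), g (gauge), ms (timer).
    """
    if metric_type not in ("c", "g", "ms"):
        raise ValueError(f"unsupported statsd metric type: {metric_type}")
    return f"{name}:{value}|{metric_type}"


def send_metric(name, value, metric_type, host="127.0.0.1", port=8125):
    """Send one metric to a statsd daemon over UDP (fire and forget)."""
    payload = format_metric(name, value, metric_type).encode("ascii")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))
```

For example, `send_metric("nova.api.requests", 1, "c")` would increment a counter. Because the transport is UDP, the sender gets no confirmation that anything was received.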

Action Items

Completion of these items won’t solve all above-mentioned Pains, but it would be a great start.

Start creating a list of “things” that need to be monitored in each OpenStack component
- Does not need to be a _complete_ list — that will always be an ongoing project
Create example alerts for each of those items
- Would Nagios / NRPE checks be the best form of an example? It’s probably the style that the majority of operators are familiar with and is compatible with non-Nagios monitoring systems.
If monitoring a specific service or function becomes too complex, determine if it can be implemented more easily from within the service and suggest or create a blueprint for that service.
A lot of this already exists, but is scattered in a lot of different places.
Can someone show an example of using Rally or Tempest wrapped in a Nagios plugin?
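
As a possible starting point for the last question, here is a minimal sketch of a generic Nagios-style wrapper: it runs any functional-test command and maps the result to Nagios exit codes. The function name, label and timeout are assumptions made for the example, and the sketch deliberately does not hard-code any Rally or Tempest arguments; pass in whatever invocation you already run by hand.

```python
import subprocess

# Nagios-style plugin exit codes (a widely used convention).
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def run_check(command, timeout=300, label="FUNCTIONAL"):
    """Run `command` (a list of argv strings) and return (exit_code, message).

    Exit status 0 maps to OK; a non-zero status or a timeout maps to
    CRITICAL; a command that cannot be started maps to UNKNOWN.
    """
    try:
        proc = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        return CRITICAL, f"{label} CRITICAL: timed out after {timeout}s"
    except OSError as exc:
        return UNKNOWN, f"{label} UNKNOWN: could not run {command[0]}: {exc}"
    if proc.returncode == 0:
        return OK, f"{label} OK"
    # Keep only the last line of output so the alert stays readable.
    lines = (proc.stdout + proc.stderr).strip().splitlines()
    detail = lines[-1] if lines else "no output"
    return CRITICAL, f"{label} CRITICAL: exit {proc.returncode}: {detail}"
```

A plugin script would call `run_check([...])` with the test command, print the message, and exit with the returned code. Functional test runs can be slow, so such a check is probably better scheduled at a long interval or run as a passive check than polled like a ping.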
