This page summarizes the discussions that have happened during the Operations/Meetups:
- Traditional service monitoring
- Tenant health monitoring
- Expose this monitoring to tenants
- Provide monitoring as a service for tenants
- Operators are generally unhappy with monitoring OpenStack
- There are too many tools
- There are too many places in OpenStack that require monitoring
- It's not clear how and what to monitor
- The amount of monitoring will have a performance impact on the whole cloud
- How do you learn, ahead of time, which hosts and services need to be monitored?
- Central authority for host information?
- How can this information be easily / automatically discovered?
- instance metadata for monitoring instances
- Puppet, Chef, etc
- Create and share best practices
- Inventory all areas in OpenStack that need to be monitored and create examples of how to monitor them
- Lots of effort but potential high reward
- Have this list available for each major release
Current Known Practices
These are items that have been collected during discussion. The actual implementation will vary. This is possibly a great starting point for the inventory mentioned above.
The following table lists the service daemons of projects which are part of the OpenStack integrated release.
|Project|Daemon|Description|
|---|---|---|
|Glance|glance-api|Server daemon that provides an API for image storage, retrieval, and discovery.|
|Glance|glance-registry|Server daemon that provides storage, processing, and retrieval for metadata associated with images.|
|Keystone|keystone-server|Server daemon that provides identity management services.|
|Neutron|neutron-server|Server daemon that provides a webserver exposing the Neutron API and passes all webservice calls to the Neutron plugin for processing.|
|Nova|nova-api|Server daemon that serves the Nova EC2 and OpenStack APIs in separate greenthreads.|
|Nova|nova-cert|Server daemon that serves the Nova Cert service for X509 certificates; used to generate certificates for euca-bundle-image. Only needed for the EC2 API.|
|Nova|nova-conductor|Server daemon that serves the Nova Conductor service, which provides coordination and database query support for Nova.|
|Nova|nova-consoleauth|Server daemon that provides authentication for Nova consoles.|
|Nova|nova-novncproxy|Server daemon that provides access to Nova noVNC consoles.|
|Nova|nova-scheduler|Server daemon responsible for choosing a compute node on which to run a VM instance.|
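As a minimal illustration of checking that one of these daemons is running, the sketch below greps a process listing for the daemon name and maps the result to the standard Nagios plugin exit codes. This is an assumption-laden starting point, not a recommended implementation: a real deployment would more likely query the init system or use an existing NRPE check.

```python
import subprocess

# Standard Nagios plugin exit codes
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def daemon_running(name: str, process_list: str) -> bool:
    """Return True if `name` appears in the process listing text."""
    return any(name in line for line in process_list.splitlines())

def check_daemon(name: str) -> int:
    """Nagios-style check: CRITICAL if the daemon is absent from `ps` output."""
    try:
        listing = subprocess.run(
            ["ps", "-eo", "comm"], capture_output=True, text=True, check=True
        ).stdout
    except OSError:
        print("UNKNOWN - could not list processes")
        return UNKNOWN
    if daemon_running(name, listing):
        print(f"OK - {name} is running")
        return OK
    print(f"CRITICAL - {name} is not running")
    return CRITICAL
```

Run as a plugin it would be invoked per daemon, e.g. `check_daemon("glance-api")`, with the return value used as the script's exit code; note this only confirms the process exists, not that it is functioning (see below).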
What To Monitor
- Service daemons
- nova-compute, glance-registry, etc
- Ensure they are running
- Ensure they are functioning
- Tempest, Rally
- Instance reachability
- iptables / nat
- attached volumes
- left-open iSCSI sessions
- RabbitMQ queue depth and health
- OVS tunnels
- Functionality of all actions which a tenant can perform
- Canary instances
About half of the failures that matter on compute nodes are reported only to the kernel log:
- the OOM killer terminating a qemu process (the instance enters the 'shutdown' state without the tenant intending it)
- disk I/O errors
- network flapping (link up/down events on interfaces)
- MCEs (machine check exceptions, e.g. memory and processor errors)
- segmentation faults in ovs-vswitchd
netconsole is a good way to gather these logs and react to them.
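One way to react to these kernel-log-only events: feed the messages collected via netconsole (or a syslog stream) through a classifier that matches the failure classes listed above. The regular expressions below are illustrative guesses at typical kernel message text, which varies by kernel version, so treat them as a starting point rather than a vetted set.

```python
import re

# Illustrative patterns for kernel messages of interest; real message
# text varies by kernel version, so tune these for your hosts.
KERNEL_PATTERNS = {
    "oom": re.compile(r"Out of memory: Kill process .*qemu", re.IGNORECASE),
    "io_error": re.compile(r"I/O error", re.IGNORECASE),
    "link_flap": re.compile(r"link (is )?(up|down)", re.IGNORECASE),
    "mce": re.compile(r"Machine check", re.IGNORECASE),
    "segfault": re.compile(r"ovs-vswitchd\[\d+\]: segfault"),
}

def classify(line: str) -> list:
    """Return the names of failure classes a kernel log line matches."""
    return [name for name, pat in KERNEL_PATTERNS.items() if pat.search(line)]
```

A log shipper would call `classify()` on each received line and raise an alert for any non-empty result.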
Alert delivery mechanisms that were mentioned:
- Ansible / SSH
- Email to SMS
- Jabber / XMPP
- NOC consoles (details?)
- Ticket systems
An idea raised with the audience: a standard way to query a service and have it report back whether it is "OK" or not.
- Is it trustworthy?
- How is "OK" being determined?
- Would operators rely on it, or still double-check everything that the service is checking itself?
Would operators want each OpenStack service to report its own metrics, the way Swift does?
- "Yes" — but the majority of the crowd wasn't even using this for Swift. Why not?
- Most operators are already polling their own metrics
Completing these items won't solve all of the pains mentioned above, but it would be a great start.
- Start creating a list of "things" that need to be monitored in each OpenStack component
- Does not need to be a _complete_ list — that will always be an ongoing project
- Create example alerts for each of those items
- Would Nagios / NRPE checks be the best form of an example? It’s probably the style that the majority of operators are familiar with and is compatible with non-Nagios monitoring systems.
- If monitoring a specific service or function becomes too complex, determine whether it could be implemented more easily from within the service, and suggest or create a blueprint for that service.
- A lot of this already exists, but is scattered in a lot of different places.
- Can someone show an example of using Rally or Tempest wrapped in a Nagios plugin?
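In the spirit of the last question, a hedged sketch of wrapping Tempest in a Nagios-plugin-style check: run a smoke subset via subprocess and translate the process exit status into Nagios codes. The `tempest run --smoke` invocation and the timeout value are assumptions about the local Tempest setup; the plugin contract itself is just the exit code plus one line of output.

```python
import shutil
import subprocess

# Standard Nagios plugin exit codes
OK, CRITICAL, UNKNOWN = 0, 2, 3

def tempest_to_nagios(returncode: int) -> int:
    """Map a Tempest process exit status to a Nagios exit code."""
    return OK if returncode == 0 else CRITICAL

def check_tempest_smoke(timeout: int = 900) -> int:
    """Run Tempest smoke tests and report in Nagios plugin form."""
    if shutil.which("tempest") is None:
        print("UNKNOWN - tempest not found on PATH")
        return UNKNOWN
    try:
        proc = subprocess.run(
            ["tempest", "run", "--smoke"],
            capture_output=True, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        print("CRITICAL - tempest smoke run timed out")
        return CRITICAL
    status = tempest_to_nagios(proc.returncode)
    print("OK - smoke tests passed" if status == OK
          else "CRITICAL - smoke tests failed")
    return status
```

The same wrapper shape would work for a Rally task by swapping the command line; scheduling and alerting then come for free from the existing monitoring system.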