Operations/Monitoring

Revision as of 17:27, 16 September 2016 by Vageli (talk | contribs) (added important metrics to monitor)

Monitoring

This page summarizes the monitoring discussions held at the Operations/Meetups.

Monitoring Categories

  • Traditional service monitoring
  • Tenant health monitoring
  • Expose this monitoring to tenants
  • Provide monitoring as a service for tenants

Tools

Numerous monitoring tools exist. See the Operations/Tools page for an inventory. Also take note of Monasca, a project that aims to create a Monitoring as a Service solution.

Monitoring Pains

  • Operators are generally unhappy with the state of monitoring OpenStack
  • There are too many tools
  • There are too many places in OpenStack that require monitoring
  • It's not clear how and what to monitor
  • The amount of monitoring will have a performance impact on the whole cloud
  • How do you learn, ahead of time, about a host / service that needs to be monitored?
    • Central authority for host information?
    • How can this information be easily / automatically discovered?
    • Instance metadata for monitoring instances
      • Puppet, Chef, etc

Solutions?

  • Create and share best practices
  • Inventory all areas in OpenStack that need to be monitored and create examples of how to monitor them
    • Lots of effort but potential high reward
    • Have this list available for each major release

Current Known Practices

These are items that have been collected during discussion. The actual implementation will vary. They could be a good starting point for the inventory mentioned above.

Service Daemons

The following table lists the service daemons of projects which are part of the OpenStack integrated release.

Project | Component | Notes
Glance | glance-api | glance-api is a server daemon that provides an API for image storage, retrieval, and discovery.
Glance | glance-registry | glance-registry is a server daemon that provides storage, processing, and retrieval for metadata associated with images.
Keystone | keystone-server | keystone-server is a server daemon that provides identity management services.
Neutron | neutron-server | neutron-server is a server daemon that provides a webserver which exposes the Neutron API, and passes all webservice calls to the Neutron plugin for processing.
Nova | nova-api | nova-api is a server daemon that serves the nova EC2 and OpenStack APIs in separate greenthreads.
Nova | nova-cert | nova-cert is a server daemon that serves the Nova Cert service for X509 certificates. Used to generate certificates for euca-bundle-image. Only needed for EC2 API.
Nova | nova-conductor | nova-conductor is a server daemon that serves the Nova Conductor service, which provides coordination and database query support for Nova.
Nova | nova-consoleauth | nova-consoleauth is a server daemon that provides authentication for Nova consoles.
Nova | nova-novncproxy | nova-novncproxy is a server daemon that provides access to OpenStack Nova noVNC consoles.
Nova | nova-scheduler | nova-scheduler is a server daemon that is responsible for choosing a compute node to run a VM instance.

What To Monitor

  • Service daemons
    • nova-compute, glance-registry, etc
    • Ensure they are running
    • Ensure they are functioning
      • Tempest, Rally
  • Instance reachability
    • iptables / nat
  • Attached volumes
  • Left-open iSCSI sessions
  • RabbitMQ queue depth and health
  • OVS tunnels
  • Functionality of all actions which a tenant can perform
  • Canary instances
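
As a starting point for the "Ensure they are running" item above, here is a minimal sketch of a Nagios-style check. It assumes that matching daemon names against process command lines is good enough for your deployment. The function names and the substring match are illustrative choices, not something agreed in the discussion, and the check says nothing about whether a daemon is actually functioning.

```python
import subprocess

# Conventional Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def list_command_lines():
    """Return the command line of every running process (POSIX ps)."""
    out = subprocess.run(
        ["ps", "-e", "-o", "args="],
        capture_output=True, text=True, check=True,
    ).stdout
    return [line.strip() for line in out.splitlines() if line.strip()]


def check_daemons(expected, command_lines):
    """Return (exit_code, message) for the daemon names in `expected`.

    A daemon counts as running if its name appears anywhere in some
    command line. This is a loose match (assumption): "nova-api" would
    also match a process named "nova-api-metadata".
    """
    missing = [
        name for name in expected
        if not any(name in cmd for cmd in command_lines)
    ]
    if missing:
        return CRITICAL, "CRITICAL - not running: " + ", ".join(missing)
    return OK, "OK - all %d daemons running" % len(expected)
```

On a controller node this might be called as `check_daemons(["nova-api", "nova-scheduler"], list_command_lines())`, printing the message and exiting with the returned code so that Nagios / NRPE can consume it.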

Kernel logs

About half of the bad things that happen on compute nodes are reported only in the kernel log:

  • OOM kills of qemu (the instance ends up in the 'shutdown' state without the tenant asking for it)
  • disk IO errors
  • network flapping (link up/down for interfaces)
  • MCE (machine check exceptions, such as memory and processor errors)
  • segmentation faults for ovs-vswitchd

netconsole is a nice way to gather those logs and react to them.
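
To illustrate reacting to the kernel messages listed above, here is a rough sketch that classifies log lines by pattern. The regular expressions are assumptions based on commonly seen message wording. The exact text varies between kernel versions and NIC drivers, so treat them as placeholders to adapt. The lines can come from any source, for example a netconsole receiver or `dmesg` output.

```python
import re

# Assumed patterns; adjust to the messages your kernels actually emit.
PATTERNS = {
    "oom_kill": re.compile(r"Out of memory: Kill(ed)? process"),
    "disk_io_error": re.compile(r"I/O error"),
    "link_flap": re.compile(r"Link is (up|down)", re.IGNORECASE),
    "mce": re.compile(r"Machine check|\[Hardware Error\]"),
    "ovs_segfault": re.compile(r"ovs-vswitchd\[\d+\]: segfault"),
}


def classify(line):
    """Return the names of all categories whose pattern matches `line`."""
    return [name for name, pattern in PATTERNS.items() if pattern.search(line)]


def summarize(lines):
    """Count matches per category over an iterable of kernel log lines."""
    counts = {name: 0 for name in PATTERNS}
    for line in lines:
        for name in classify(line):
            counts[name] += 1
    return counts
```

An alerting rule could then fire whenever any counter returned by `summarize` is non-zero for a host.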

Monitoring Mediums

  • SNMP
  • NRPE
  • Consul
  • Ansible / SSH

Alert Mediums

  • Email
  • SMS
  • Email to SMS
  • Jabber / XMPP
  • IRC
  • NOC Consoles (details?)
  • Ticket systems
  • VoIP

Healthchecks

Audience definition: a way to query a service and have it report back whether it is “OK” or not.

  • Is it trustworthy?
  • How is the “OK” being determined?
  • Would operators rely on it or still double-check everything that the service is checking itself?
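
To make the question of how “OK” is determined concrete, here is a small sketch of the operator side of such a query. It assumes a service answers plain HTTP at some URL. Whether a given OpenStack service offers a suitable endpoint is not something this page establishes, so the URL, the two-second threshold, and the function names are all hypothetical.

```python
import time
import urllib.error
import urllib.request


def interpret(status, elapsed, slow_after=2.0):
    """Map an HTTP status and response time to a Nagios-style result."""
    if status != 200:
        return 2, "CRITICAL - HTTP %d" % status
    if elapsed > slow_after:
        return 1, "WARNING - HTTP 200 but took %.1fs" % elapsed
    return 0, "OK - HTTP 200 in %.1fs" % elapsed


def probe(url, timeout=5.0):
    """Query `url` once and return (exit_code, message)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as exc:
        # The service answered, but with an error status.
        status = exc.code
    except OSError as exc:
        # Connection refused, DNS failure, timeout, and similar.
        return 2, "CRITICAL - %s" % exc
    return interpret(status, time.monotonic() - start)
```

Keeping `interpret` separate from `probe` means the operator decides what counts as “OK”, rather than trusting whatever the service reports about itself.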

statsd

Would operators want each OpenStack service to provide statsd metrics, as Swift does?

  • “Yes”, but the majority of the crowd wasn’t even using it for Swift. Why not?
  • Most users are already polling their own metrics
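
For operators who poll their own metrics and want to forward them to statsd, the sketch below shows how little is involved. It uses the plain-text statsd line format (`name:value|type`) sent over UDP. The metric names are made up for illustration, and port 8125 is only the usual default, so check your own statsd configuration.

```python
import socket

# statsd metric types: counter, gauge, timer (milliseconds).
VALID_KINDS = ("c", "g", "ms")


def format_metric(name, value, kind):
    """Build one statsd line, e.g. 'rabbitmq.queue.depth:42|g'."""
    if kind not in VALID_KINDS:
        raise ValueError("unsupported metric type: %r" % kind)
    return "%s:%s|%s" % (name, value, kind)


def send_metric(name, value, kind, host="127.0.0.1", port=8125):
    """Send a single metric to a statsd daemon over UDP (fire and forget)."""
    payload = format_metric(name, value, kind).encode("ascii")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))
```

For example, a poller that has just read a RabbitMQ queue depth could call `send_metric("rabbitmq.queue.depth", 42, "g")`. Because UDP is connectionless, the call succeeds even if no statsd daemon is listening.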

Action Items

Completing these items will not solve all of the pains listed above, but it would be a great start.

  • Start creating a list of “things” that need to be monitored in each OpenStack component
    • Does not need to be a _complete_ list; that will always be an ongoing project
  • Create example alerts for each of those items
    • Would Nagios / NRPE checks be the best form of an example? They are probably the style that the majority of operators are familiar with, and they are compatible with non-Nagios monitoring systems.
  • If monitoring a specific service or function becomes too complex, determine whether it can be implemented more easily from within the service, and suggest or create a blueprint for that service.
  • A lot of this already exists, but is scattered in a lot of different places.
  • Can someone show an example of using Rally or Tempest wrapped in a Nagios plugin?
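
As one possible answer to the last question, here is a sketch of a generic wrapper that turns any command into a Nagios-style result. It assumes the wrapped tool exits with zero on success and non-zero on failure. The actual Rally or Tempest command line is left to the operator, and the script path in the usage note is hypothetical.

```python
import subprocess

# Conventional Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def run_check(command, timeout=300):
    """Run `command` (a list of arguments) and return (exit_code, message)."""
    name = command[0]
    try:
        proc = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        return CRITICAL, "CRITICAL - %s timed out after %ss" % (name, timeout)
    except OSError as exc:
        # The command could not be started at all (missing, not executable).
        return UNKNOWN, "UNKNOWN - could not run %s: %s" % (name, exc)
    if proc.returncode == 0:
        return OK, "OK - %s passed" % name
    # Report the last line of output as a short hint for the operator.
    lines = (proc.stdout + proc.stderr).strip().splitlines()
    detail = lines[-1] if lines else "no output"
    return CRITICAL, "CRITICAL - %s exited with %d: %s" % (
        name, proc.returncode, detail)
```

A plugin script would call something like `code, message = run_check(["/usr/local/bin/smoke-test.sh"])`, print `message`, and exit with `code`. Functional test runs can be slow, so the timeout probably needs to be generous and the check interval long.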

External Resources