Revision as of 12:06, 26 December 2014

Monitoring

This page summarizes the discussions that have happened during the Operations/Meetups:

Monitoring Categories

  • Traditional service monitoring
  • Tenant health monitoring
  • Expose this monitoring to tenants
  • Provide monitoring as a service for tenants

Tools

Numerous monitoring tools exist. See the Operations/Tools page for an inventory. Also note Monasca, a project aimed at creating a Monitoring-as-a-Service solution.

Monitoring Pains

  • Operators are generally unhappy with the state of OpenStack monitoring
  • There are too many tools
  • There are too many places in OpenStack that require monitoring
  • It's not clear how and what to monitor
  • The amount of monitoring will have a performance impact on the whole cloud
  • How do you learn information about a host / service that needs to be monitored beforehand?
    • Central authority for host information?
    • How can this information be easily / automatically discovered?
    • instance metadata for monitoring instances
      • Puppet, Chef, etc

Solutions?

  • Create and share best practices
  • Inventory all areas in OpenStack that need to be monitored and create examples of how to monitor them
    • Lots of effort but potentially high reward
    • Have this list available for each major release

Current Known Practices

These are items that have been collected during discussion. The actual implementation will vary. This is possibly a great starting point for the inventory mentioned above.

What To Monitor

  • Service daemons
    • nova-compute, glance-registry, etc
    • Ensure they are running
    • Ensure they are functioning
      • Tempest, Rally
  • Instance reachability
    • iptables / nat
  • attached volumes
  • left-open iSCSI sessions
  • RabbitMQ queue depth and health
  • OVS tunnels
  • Functionality of all actions which a tenant can perform
  • Canary instances
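For the first item above, a minimal liveness probe can be sketched as a Nagios-style check. This is only a sketch under assumptions: the daemon name and the use of pgrep -f are illustrative, and a real deployment would also verify that the service is functioning (e.g. via Tempest or Rally), not merely present in the process table.

```python
import subprocess

# Nagios plugin exit codes: 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN
NAGIOS_OK, NAGIOS_WARNING, NAGIOS_CRITICAL, NAGIOS_UNKNOWN = 0, 1, 2, 3

def process_is_running(name):
    """True if at least one process whose command line matches `name`
    exists (pgrep excludes itself from its own matches)."""
    result = subprocess.run(["pgrep", "-f", name], capture_output=True)
    return result.returncode == 0

def check_service(service):
    """Return a (status line, exit code) pair in the Nagios plugin style."""
    if process_is_running(service):
        return ("OK - %s is running" % service, NAGIOS_OK)
    return ("CRITICAL - %s is not running" % service, NAGIOS_CRITICAL)
```

A wrapper script would print the status line and exit with the returned code, e.g. for a hypothetical target such as nova-compute or glance-registry.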

Kernel logs

Many of the problems that affect compute nodes are reported only to the kernel log:

  • OOM kills of qemu (an instance ends up in the 'shutdown' state without the tenant requesting it)
  • disk I/O errors
  • network flapping (link up/down events for interfaces)
  • MCE (machine check exceptions, such as memory and processor errors)
  • segmentation faults in ovs-vswitchd

netconsole is a nice way to gather these logs and react to them.
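One hedged sketch of reacting to such messages: a classifier that tags kernel log lines (for example, as streamed by netconsole or read from dmesg) against the categories above. The regex patterns here are illustrative assumptions; actual kernel message formats vary by kernel version and driver, so they would need tuning per environment.

```python
import re

# Hypothetical patterns for the kernel messages listed above; real
# message formats differ across kernel versions and drivers.
KERNEL_PATTERNS = {
    "oom": re.compile(r"Out of memory: Kill process .*qemu"),
    "io_error": re.compile(r"I/O error"),
    "link_flap": re.compile(r"link (is )?(up|down)", re.IGNORECASE),
    "mce": re.compile(r"Machine check events logged"),
    "segfault": re.compile(r"ovs-vswitchd\[\d+\]: segfault"),
}

def classify_kernel_line(line):
    """Return the first matching category for a kernel log line,
    or None if the line matches nothing of interest."""
    for category, pattern in KERNEL_PATTERNS.items():
        if pattern.search(line):
            return category
    return None
```

A monitoring agent could feed each incoming line through this classifier and raise an alert per category.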

Monitoring Mediums

  • SNMP
  • NRPE
  • Consul
  • Ansible / SSH

Alert Mediums

  • Email
  • SMS
  • Email to SMS
  • Jabber / XMPP
  • IRC
  • NOC Consoles (details?)
  • Ticket systems
  • VoIP

Healthchecks

Audience definition: a way to query a service and have it report back whether it's "OK" or not

  • Is it trustworthy?
  • How is the “OK” being determined?
  • Would operators rely on it or still double-check everything that the service is checking itself?
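As a concrete illustration of "query a service and have it report back", a minimal HTTP healthcheck poller might look like the following. The /healthcheck path is an assumption (endpoints vary per service); anything other than a timely HTTP 200 is treated as unhealthy, which sidesteps the trust question by reducing the check to reachability plus the service's own self-assessment.

```python
import urllib.request

def is_healthy(url, timeout=5):
    """Return True only if the endpoint answers HTTP 200 within the
    timeout; connection errors, timeouts, and 4xx/5xx responses all
    count as unhealthy."""
    try:
        response = urllib.request.urlopen(url, timeout=timeout)
        return response.getcode() == 200
    except Exception:
        return False
```

Operators who distrust the service's "OK" would pair a probe like this with independent functional checks rather than rely on it alone.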

statsd

Would operators want each OpenStack service to provide statsd metrics like Swift does?

  • “Yes”, but the majority of the crowd wasn’t even using it for Swift. Why not?
  • Most users are already polling their own metrics
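For context, emitting a statsd metric requires no client library: the statsd wire format is a plain-text name:value|type datagram, conventionally sent over UDP to port 8125. A minimal sketch (the metric name is hypothetical):

```python
import socket

def send_statsd_counter(name, value=1, host="127.0.0.1", port=8125):
    """Emit a statsd counter using the plain-text wire format,
    e.g. b'nova.api.requests:1|c', as a single UDP datagram."""
    payload = ("%s:%d|c" % (name, value)).encode("ascii")
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(payload, (host, port))
    finally:
        sock.close()
```

UDP is fire-and-forget, which is part of statsd's appeal: instrumentation cannot block or crash the instrumented service even when the collector is down.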

Action Items

Completion of these items won’t solve all of the Pains mentioned above, but it would be a great start.

  • Start creating a list of “things” that need to be monitored in each OpenStack component
    • Does not need to be a _complete_ list — that will always be an ongoing project
  • Create example alerts for each of those items
    • Would Nagios / NRPE checks be the best form of an example? It’s probably the style that the majority of operators are familiar with and is compatible with non-Nagios monitoring systems.
  • If monitoring a specific service or function becomes too complex, determine whether it could be implemented more easily from within the service itself and suggest or create a blueprint for that service.
  • A lot of this already exists, but is scattered in a lot of different places.
  • Can someone show an example of using Rally or Tempest wrapped in a Nagios plugin?
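A rough sketch of such a wrapper: run any external test command (a real Rally or Tempest invocation would be substituted for the placeholder argument) and translate its exit status into the Nagios plugin convention (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN). The timeout value is an assumption to be tuned per scenario.

```python
import subprocess

def wrap_as_nagios_plugin(command, timeout=600):
    """Run an external test command and map its outcome onto Nagios
    plugin exit codes: rc 0 -> OK, non-zero or timeout -> CRITICAL."""
    try:
        result = subprocess.run(command, capture_output=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return ("CRITICAL - %s timed out after %ds" % (command[0], timeout), 2)
    if result.returncode == 0:
        return ("OK - %s passed" % command[0], 0)
    return ("CRITICAL - %s failed (rc=%d)" % (command[0], result.returncode), 2)
```

A thin script around this function, printing the status line and exiting with the returned code, would be schedulable from Nagios or any NRPE-compatible system.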

External Resources