Monitoring
This page summarizes the discussions that have happened during the Operations/Meetups:
- https://etherpad.openstack.org/p/juno-summit-ops-monitoringlogging
- https://etherpad.openstack.org/p/kilo-summit-ops-monitoring
Monitoring Categories
- Traditional service monitoring
- Tenant health monitoring
- Expose this monitoring to tenants
- Provide monitoring as a service for tenants
Tools
Numerous monitoring tools exist. See the Operations/Tools page for an inventory. Also take note of Monasca, a project aimed at creating a Monitoring-as-a-Service solution.
Monitoring Pains
- Operators are generally unhappy with monitoring OpenStack
- There are too many tools
- There are too many places in OpenStack that require monitoring
- It's not clear how and what to monitor
- The amount of monitoring will have a performance impact on the whole cloud
- How do you learn, ahead of time, about a host / service that needs to be monitored?
- Central authority for host information?
- How can this information be easily / automatically discovered?
- instance metadata for monitoring instances
- Puppet, Chef, etc
Solutions?
- Create and share best practices
- Inventory all areas in OpenStack that need to be monitored and create examples of how to monitor them
- Lots of effort but potential high reward
- Have this list available for each major release
Current Known Practices
These are items that have been collected during discussion. The actual implementation will vary. This is possibly a great starting point for the inventory mentioned above.
What To Monitor
- Service daemons
- nova-compute, glance-registry, etc
- Ensure they are running (a minimal process-check sketch follows this list)
- Ensure they are functioning
- Tempest, Rally
- Instance reachability
- iptables / nat
- attached volumes
- left-open iSCSI sessions
- RabbitMQ queue depth and health
- OVS tunnels
- Functionality of all actions which a tenant can perform
- Canary instances
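For the "ensure they are running" case, a check can be as simple as verifying that the daemon's process exists. Below is a minimal sketch in the Nagios plugin style (exit 0 = OK, 2 = CRITICAL, 3 = UNKNOWN); the process name is passed as an argument, so nothing in it is specific to one service. It only covers "running"; pair it with Tempest/Rally-style checks for "functioning".

#!/usr/bin/env python
# Minimal Nagios-style check that a service daemon has a running process.
# Usage: check_daemon.py <process-name>, e.g. check_daemon.py nova-compute
import subprocess
import sys

def main():
    if len(sys.argv) != 2:
        print('UNKNOWN: usage: check_daemon.py <process-name>')
        sys.exit(3)
    name = sys.argv[1]
    # pgrep -f matches against the full command line
    rc = subprocess.call(['pgrep', '-f', name], stdout=subprocess.DEVNULL)
    if rc == 0:
        print('OK: %s is running' % name)
        sys.exit(0)
    print('CRITICAL: %s is not running' % name)
    sys.exit(2)

if __name__ == '__main__':
    main()

The same shape works for any of the daemons listed above (glance-registry, neutron agents, and so on).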
Kernel logs
Many failure conditions on compute nodes are reported only to the kernel log:
- OOM kills of qemu (an instance goes to the 'shutdown' state without any action by the tenant)
- disk I/O errors
- network flapping (link up/down events on interfaces)
- MCEs (machine check exceptions, such as memory and processor errors)
netconsole is a convenient way to gather these logs centrally and react to them.
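To illustrate, a netconsole receiver can be a small UDP listener that matches the patterns above and hands matches to whatever alerting is in place. In this minimal sketch, the port (6666, the usual netconsole target) and the regexes are assumptions to be tuned to your netconsole configuration and kernel versions.

# Minimal netconsole receiver: listen for kernel log lines sent over UDP
# and flag the failure patterns listed above.
import re
import socket

PATTERNS = {
    'oom':       re.compile(r'Out of memory: Kill process .*qemu'),
    'io_error':  re.compile(r'I/O error'),
    'link_flap': re.compile(r'Link is (Up|Down)'),
    'mce':       re.compile(r'Machine check'),
}

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(('0.0.0.0', 6666))  # netconsole target port (assumed)

while True:
    data, (host, _) = sock.recvfrom(4096)
    line = data.decode('utf-8', errors='replace')
    for name, pattern in PATTERNS.items():
        if pattern.search(line):
            # Hook your alerting in here (email, IRC bot, ticket, ...)
            print('%s: %s matched: %s' % (host, name, line.strip()))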
Monitoring Mediums
- SNMP
- NRPE
- Consul
- Ansible / SSH
Alert Mediums
- SMS
- Email to SMS
- Jabber / XMPP
- IRC
- NOC Consoles (details?)
- Ticket systems
- VoIP
Healthchecks
Audience definition: a way to query a service and have it report back whether it is “OK” or not
- Is it trustworthy?
- How is the “OK” being determined?
- Would operators rely on it or still double-check everything that the service is checking itself?
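To make the idea concrete, the sketch below polls a healthcheck URL and maps the answer to Nagios-style exit codes. Both the URL and the /healthcheck path are assumptions; not every OpenStack service exposes such an endpoint, and whether the "OK" it returns can be trusted is exactly the open question above.

# Minimal healthcheck poller with Nagios-style exit codes.
import sys
import urllib.error
import urllib.request

URL = 'http://controller:5000/healthcheck'  # hypothetical endpoint

try:
    with urllib.request.urlopen(URL, timeout=5) as response:
        body = response.read().decode('utf-8', errors='replace')
except urllib.error.HTTPError as exc:
    print('WARNING: %s returned HTTP %d' % (URL, exc.code))
    sys.exit(1)
except Exception as exc:
    print('CRITICAL: %s unreachable (%s)' % (URL, exc))
    sys.exit(2)

print('OK: %s reports healthy: %s' % (URL, body.strip()))
sys.exit(0)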
statsd
Would operators want each OpenStack service to provide statsd metrics the way Swift does?
- “Yes”, but the majority of the crowd wasn’t even using it for Swift. Why not?
- Most users are already polling their own metrics
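For anyone who wants to see what a service's statsd support actually emits, a throwaway sink is only a few lines. This sketch assumes the conventional StatsD UDP port (8125) and that the service (e.g. Swift via its log_statsd_host option) is pointed at the machine running it.

# Minimal StatsD sink: listen on UDP 8125 and print incoming metrics.
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(('0.0.0.0', 8125))

while True:
    data, _ = sock.recvfrom(4096)
    # StatsD lines look like "swift.object-server.PUT.timing:42|ms"
    for metric in data.decode('utf-8', errors='replace').splitlines():
        name, _, value = metric.partition(':')
        print('%-50s %s' % (name, value))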
Action Items
Completion of these items won’t solve all above-mentioned Pains, but it would be a great start.
- Start creating a list of “things” that need to be monitored in each OpenStack component
- Does not need to be a _complete_ list — that will always be an ongoing project
- Create example alerts for each of those items
- Would Nagios / NRPE checks be the best form of an example? It’s probably the style that the majority of operators are familiar with and is compatible with non-Nagios monitoring systems.
- If monitoring a specific service or function becomes too complex, determine whether it could be implemented more easily within the service itself, and suggest or create a blueprint for that service.
- A lot of this already exists, but is scattered in a lot of different places.
- Can someone show an example of using Rally or Tempest wrapped in a Nagios plugin?
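As a starting point, such a wrapper might look like the minimal sketch below. The command line ("tempest run --regex ...") and the scenario test chosen are assumptions; substitute whatever your deployment uses to invoke Tempest or Rally, and keep the mapping of results to Nagios exit codes.

#!/usr/bin/env python
# Minimal sketch of a Nagios plugin wrapping a Tempest smoke test.
import subprocess
import sys

CMD = ['tempest', 'run', '--regex',
       'tempest.scenario.test_server_basic_ops']  # assumed invocation

try:
    result = subprocess.run(CMD, capture_output=True, text=True, timeout=600)
except subprocess.TimeoutExpired:
    print('CRITICAL: tempest test timed out')
    sys.exit(2)
except OSError as exc:
    print('UNKNOWN: could not run tempest (%s)' % exc)
    sys.exit(3)

if result.returncode == 0:
    print('OK: tempest scenario passed')
    sys.exit(0)
print('CRITICAL: tempest scenario failed')
print(result.stdout[-500:])  # last chunk of output for context
sys.exit(2)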