Latest revision as of 13:50, 25 March 2019
Monitoring
This page summarizes the discussions that have happened during the Operations/Meetups:
- https://etherpad.openstack.org/p/juno-summit-ops-monitoringlogging
- https://etherpad.openstack.org/p/kilo-summit-ops-monitoring
Monitoring Categories
- Traditional service monitoring
- Tenant health monitoring
- Expose this monitoring to tenants
- Provide monitoring as a service for tenants
Tools
Numerous monitoring tools exist. See the Operations/Tools page for an inventory. Also take note of Monasca, a project that aims to provide a Monitoring as a Service solution.
Monitoring Pains
- Operators are generally unhappy with the state of monitoring OpenStack
- There are too many tools
- There are too many places in OpenStack that require monitoring
- It's not clear how and what to monitor
- The amount of monitoring will have a performance impact on the whole cloud
- How do you learn information about a host / service that needs to be monitored beforehand?
- Central authority for host information?
- How can this information be easily / automatically discovered?
- instance metadata for monitoring instances
- Puppet, Chef, etc
Solutions?
- Create and share best practices
- Inventory all areas in OpenStack that need to be monitored and create examples of how to monitor them
- Lots of effort but potential high reward
- Have this list available for each major release
Current Known Practices
These are items that have been collected during discussion. The actual implementation will vary. This is possibly a great starting point for the inventory mentioned above.
Service Daemons
The following table lists the service daemons of projects which are part of the OpenStack integrated release.
| Project | Component | Notes |
|---|---|---|
| Glance | glance-api | glance-api is a server daemon that provides an API for image storage, retrieval, and discovery. |
| Glance | glance-registry | glance-registry is a server daemon that provides storage, processing, and retrieval for metadata associated with images. |
| Keystone | keystone-server | keystone-server is a server daemon that provides identity management services. |
| Neutron | neutron-server | neutron-server is a server daemon that provides a webserver which exposes the Neutron API and passes all webservice calls to the Neutron plugin for processing. |
| Nova | nova-api | nova-api is a server daemon that serves the Nova EC2 and OpenStack APIs in separate greenthreads. |
| Nova | nova-cert | nova-cert is a server daemon that serves the Nova Cert service for X509 certificates. Used to generate certificates for euca-bundle-image. Only needed for the EC2 API. |
| Nova | nova-conductor | nova-conductor is a server daemon that serves the Nova Conductor service, which provides coordination and database query support for Nova. |
| Nova | nova-consoleauth | nova-consoleauth is a server daemon that provides authentication for Nova consoles. |
| Nova | nova-novncproxy | nova-novncproxy is a server daemon that provides access to OpenStack Nova noVNC consoles. |
| Nova | nova-scheduler | nova-scheduler is a server daemon that is responsible for choosing a compute node to run a VM instance. |
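Ensuring these daemons are running can start with a simple process check. Below is a minimal Nagios-style sketch; the pgrep approach and script shape are illustrative, and it only proves the process exists, not that the service is functioning:

```python
#!/usr/bin/env python
"""Minimal Nagios-style liveness check for an OpenStack service daemon."""
import subprocess

# Standard Nagios plugin exit codes
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def check_daemon(name):
    # pgrep -f matches the full command line, which is how Python-based
    # OpenStack daemons like nova-api appear in the process table
    result = subprocess.run(["pgrep", "-f", name], capture_output=True)
    if result.returncode == 0:
        print("OK: %s is running (pids: %s)"
              % (name, result.stdout.decode().split()))
        return OK
    print("CRITICAL: %s is not running" % name)
    return CRITICAL

# e.g. in a plugin: sys.exit(check_daemon("nova-api"))
```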
What To Monitor
- Service daemons
- nova-compute, glance-registry, etc
- Ensure they are running
- Ensure they are functioning
- Tempest, Rally
- Instance reachability
- iptables / nat
- attached volumes
- left-open iSCSI sessions
- RabbitMQ queue depth and health
- OVS tunnels
- Functionality of all actions which a tenant can perform
- Canary instances
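Several items above, such as RabbitMQ queue depth, can be checked against the RabbitMQ management HTTP API. The sketch below assumes the rabbitmq_management plugin is enabled on localhost:15672 with default guest credentials; the thresholds are illustrative:

```python
#!/usr/bin/env python
"""Sketch of a RabbitMQ queue-depth check via the management API."""
import base64
import json
import urllib.request

OK, WARNING, CRITICAL = 0, 1, 2

def evaluate_queues(queues, warn=100, crit=1000):
    """Return (exit_code, message) for a list of queue dicts as returned
    by GET /api/queues (each dict has 'name' and 'messages' keys)."""
    worst = OK
    offenders = []
    for q in queues:
        depth = q.get("messages", 0)
        if depth >= crit:
            worst = CRITICAL
            offenders.append("%s=%d" % (q["name"], depth))
        elif depth >= warn:
            worst = max(worst, WARNING)
            offenders.append("%s=%d" % (q["name"], depth))
    if worst == OK:
        return OK, "OK: all %d queues below threshold" % len(queues)
    label = "CRITICAL" if worst == CRITICAL else "WARNING"
    return worst, "%s: deep queues: %s" % (label, ", ".join(offenders))

def fetch_queues(host="localhost", port=15672, user="guest", pw="guest"):
    # Credentials and host are illustrative; adjust per deployment
    req = urllib.request.Request("http://%s:%d/api/queues" % (host, port))
    token = base64.b64encode(("%s:%s" % (user, pw)).encode()).decode()
    req.add_header("Authorization", "Basic " + token)
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# e.g. in a plugin: code, msg = evaluate_queues(fetch_queues())
```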
Kernel logs
Many compute-node problems are reported only in the kernel log:
- OOM killer terminating qemu (the instance ends up in a 'shutdown' state without the tenant intending it)
- disk IO errors
- network flapping (link up/down for interfaces)
- MCE (machine check errors, like memory and processor errors)
- segmentation faults for ovs-vswitchd
netconsole is a convenient way to gather these logs and react to them.
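As a sketch of reacting to such lines (whether read from /var/log/kern.log, dmesg output, or a netconsole collector), the patterns below are illustrative examples of the failure classes listed above, not an exhaustive set:

```python
#!/usr/bin/env python
"""Sketch: scan kernel log lines for common compute-node failure patterns."""
import re

# Illustrative patterns; real kernels vary in exact wording
PATTERNS = {
    "oom": re.compile(r"Out of memory: Kill(ed)? process"),
    "io_error": re.compile(r"(Buffer I/O error|I/O error)"),
    "link_flap": re.compile(r"Link is (Up|Down)"),
    "mce": re.compile(r"Machine check events logged|mce: "),
    "segfault": re.compile(r"segfault at"),
}

def scan_kernel_log(lines):
    """Return {category: [matching lines]} for the patterns above."""
    hits = {}
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.setdefault(name, []).append(line)
    return hits
```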
Monitoring Mediums
- SNMP
- NRPE
- Consul
- Ansible / SSH
Alert Mediums
- SMS
- Email to SMS
- Jabber / XMPP
- IRC
- NOC Consoles (details?)
- Ticket systems
- VoIP
Healthchecks
Audience definition: a way to query a service and have it report back whether it’s “OK” or not
- Is it trustworthy?
- How is the “OK” being determined?
- Would operators rely on it or still double-check everything that the service is checking itself?
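As an illustration of querying such a healthcheck: many OpenStack API services can expose a /healthcheck endpoint via oslo.middleware, which returns HTTP 200 when the service considers itself healthy. The URL below is an assumption; adjust for the actual service, endpoint, and port:

```python
#!/usr/bin/env python
"""Sketch: query a service healthcheck endpoint, map to a Nagios result."""
import urllib.error
import urllib.request

OK, CRITICAL = 0, 2

def check_healthcheck(url, timeout=5):
    """Return (exit_code, message); anything but HTTP 200 is CRITICAL."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            if resp.status == 200:
                return OK, "OK: %s returned 200" % url
            return CRITICAL, "CRITICAL: %s returned %d" % (url, resp.status)
    except (urllib.error.URLError, OSError) as exc:
        return CRITICAL, "CRITICAL: %s unreachable (%s)" % (url, exc)

# e.g. check_healthcheck("http://controller:8774/healthcheck")
#      (hypothetical nova-api endpoint)
```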
statsd
Would operators want each OpenStack service to provide statsd metrics the way Swift does?
- “Yes”, but the majority of the crowd wasn’t even using it for Swift. Why not?
- Most users are already polling their own metrics
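For context, the statsd wire format is a single UDP datagram of the form name:value|type, which is what Swift's statsd support emits. A minimal emitter sketch follows; the host, port, and metric names are illustrative:

```python
#!/usr/bin/env python
"""Sketch of emitting statsd metrics over UDP."""
import socket

def format_metric(name, value, metric_type):
    # statsd wire format: "name:value|type", where type is
    # c (counter), g (gauge), or ms (timer)
    return ("%s:%s|%s" % (name, value, metric_type)).encode()

def send_metric(name, value, metric_type="c",
                host="127.0.0.1", port=8125):
    # Fire-and-forget UDP datagram, so monitoring never blocks the service
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(format_metric(name, value, metric_type), (host, port))
    finally:
        sock.close()

# e.g. send_metric("nova.api.requests", 1, "c")   (hypothetical metric names)
#      send_metric("nova.scheduler.queue_depth", 42, "g")
```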
Action Items
Completion of these items won’t solve all above-mentioned Pains, but it would be a great start.
- Start creating a list of “things” that need to be monitored in each OpenStack component
- Does not need to be a _complete_ list — that will always be an ongoing project
- Create example alerts for each of those items
- Would Nagios / NRPE checks be the best form of an example? It’s probably the style that the majority of operators are familiar with and is compatible with non-Nagios monitoring systems.
- If monitoring a specific service or function becomes too complex, determine whether it can be implemented more easily within the service itself, and suggest or create a blueprint for that service.
- A lot of this already exists, but is scattered in a lot of different places.
- Can someone show an example of using Rally or Tempest wrapped in a Nagios plugin?
external resources
- https://github.com/openstack/osops-tools-monitoring
- Key nova-compute metrics to monitor: https://www.datadoghq.com/blog/openstack-monitoring-nova/
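As a hedged sketch of an answer to that question: a Nagios plugin can wrap the test runner as a subprocess and map its exit status to Nagios codes. The tempest run --smoke invocation assumes a configured Tempest workspace; a rally task start command could be substituted:

```python
#!/usr/bin/env python
"""Sketch: wrap a Tempest (or Rally) run in a Nagios plugin."""
import subprocess

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def run_suite(command, timeout=600):
    """Run the test command and map the result to a Nagios status."""
    try:
        result = subprocess.run(command, capture_output=True,
                                timeout=timeout)
    except subprocess.TimeoutExpired:
        return CRITICAL, "CRITICAL: %r timed out after %ds" % (command,
                                                               timeout)
    except FileNotFoundError:
        return UNKNOWN, "UNKNOWN: %s not installed" % command[0]
    if result.returncode == 0:
        return OK, "OK: %r passed" % command
    # Keep only the last line of output as the alert detail
    tail = result.stdout.decode(errors="replace").strip().splitlines()[-1:]
    return CRITICAL, "CRITICAL: %r failed: %s" % (command, "".join(tail))

# e.g. in a plugin: code, msg = run_suite(["tempest", "run", "--smoke"])
```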