Difference between revisions of "Operations/Monitoring"
Revision as of 20:33, 26 March 2015
Monitoring
This page summarizes the discussions that have happened during the Operations/Meetups:
- https://etherpad.openstack.org/p/juno-summit-ops-monitoringlogging
- https://etherpad.openstack.org/p/kilo-summit-ops-monitoring
Monitoring Categories
- Traditional service monitoring
- Tenant health monitoring
- Expose this monitoring to tenants
- Provide monitoring as a service for tenants
Tools
Numerous monitoring tools exist. See the Operations/Tools page for an inventory. Also note Monasca, a project that aims to create a Monitoring-as-a-Service solution.
Monitoring Pains
- Operators are generally unhappy with the state of OpenStack monitoring
- There are too many tools
- There are too many places in OpenStack that require monitoring
- It's not clear how and what to monitor
- The amount of monitoring will have a performance impact on the whole cloud
- How do you learn, ahead of time, about a host / service that needs to be monitored?
- Central authority for host information?
- How can this information be easily / automatically discovered?
- instance metadata for monitoring instances
- Puppet, Chef, etc
Solutions?
- Create and share best practices
- Inventory all areas in OpenStack that need to be monitored and create examples of how to monitor them
- Lots of effort but potential high reward
- Have this list available for each major release
Current Known Practices
These are items that have been collected during discussion. The actual implementation will vary. This is possibly a great starting point for the inventory mentioned above.
What To Monitor
Project | Component | Notes |
---|---|---|
Keystone | keystone-server | keystone-server is a server daemon that provides identity management services. |
Nova | nova-api | nova-api is a server daemon that serves the nova EC2 and OpenStack APIs in separate greenthreads. |
Nova | nova-cert | nova-cert is a server daemon that serves the Nova Cert service for X509 certificates. Used to generate certificates for euca-bundle-image. Only needed for EC2 API. |
Nova | nova-conductor | nova-conductor is a server daemon that serves the Nova Conductor service, which provides coordination and database query support for Nova. |
Nova | nova-consoleauth | nova-consoleauth is a server daemon that provides authentication for Nova consoles. |
Nova | nova-novncproxy | nova-novncproxy is a server daemon that provides access to OpenStack Nova noVNC consoles. |
Nova | nova-scheduler | nova-scheduler is a server daemon that is responsible for choosing a compute node to run a VM instance. |
- Service daemons
- nova-compute, glance-registry, etc
- Ensure they are running
- Ensure they are functioning
- Tempest, Rally
- Instance reachability
- iptables / nat
- attached volumes
- left-open iSCSI sessions
- RabbitMQ queue depth and health
- OVS tunnels
- Functionality of all actions which a tenant can perform
- Canary instances
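As one illustration of the "RabbitMQ queue depth and health" item above, the following is a minimal sketch of a Nagios-style check. It assumes `rabbitmqctl list_queues name messages` is available on the broker host and prints one `name<TAB>count` row per queue; the warning and critical thresholds (100 and 1000) are hypothetical and would need tuning per deployment.

```python
import subprocess

# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def parse_queue_depths(output):
    """Parse `rabbitmqctl list_queues name messages` output into {queue: depth}."""
    depths = {}
    for line in output.splitlines():
        parts = line.rsplit(None, 1)
        # Data rows end in an integer; banner and header lines do not.
        if len(parts) == 2 and parts[1].isdigit():
            depths[parts[0]] = int(parts[1])
    return depths


def evaluate(depths, warn=100, crit=1000):
    """Return (nagios_code, message) based on the deepest queue.

    The default thresholds are hypothetical placeholders.
    """
    if not depths:
        return UNKNOWN, "UNKNOWN: no queues reported"
    name, depth = max(depths.items(), key=lambda item: item[1])
    detail = "deepest queue %s has %d messages" % (name, depth)
    if depth >= crit:
        return CRITICAL, "CRITICAL: " + detail
    if depth >= warn:
        return WARNING, "WARNING: " + detail
    return OK, "OK: " + detail


def run_check(warn=100, crit=1000):
    """Query the local broker and return (nagios_code, message)."""
    try:
        output = subprocess.check_output(
            ["rabbitmqctl", "list_queues", "name", "messages"], text=True
        )
    except (OSError, subprocess.CalledProcessError) as exc:
        return UNKNOWN, "UNKNOWN: rabbitmqctl failed: %s" % exc
    return evaluate(parse_queue_depths(output), warn, crit)
```

A plugin script would print the message from `run_check()` and exit with the returned code. A single global threshold is the simplest choice; per-queue thresholds may be more useful in practice.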
Kernel logs
About half of the bad things that happen on compute nodes are reported only in the kernel log:
- OOM kills of qemu (the instance enters the 'shutdown' state without the tenant requesting it)
- disk IO errors
- network flapping (link up/down for interfaces)
- MCE (machine check errors, like memory and processor errors)
- segmentation faults for ovs-vswitchd
netconsole is a nice way to gather these logs and react to them.
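Reacting to the gathered kernel messages could look roughly like the sketch below, which sorts log lines into the categories listed above. The regular expressions are assumptions based on commonly seen kernel messages; the exact wording varies by kernel version and driver, so they should be checked against real logs before being relied on.

```python
import re
from collections import Counter

# Category name -> pattern. These patterns are illustrative guesses at
# typical kernel message wording, not an exhaustive or authoritative list.
PATTERNS = [
    ("oom", re.compile(r"out of memory|oom-killer", re.IGNORECASE)),
    ("disk_io", re.compile(r"I/O error", re.IGNORECASE)),
    ("link_flap", re.compile(r"link is (up|down)", re.IGNORECASE)),
    ("mce", re.compile(r"machine check|\bmce\b", re.IGNORECASE)),
    ("segfault", re.compile(r"segfault", re.IGNORECASE)),
]


def classify(line):
    """Return the first matching category for a kernel log line, or None."""
    for name, pattern in PATTERNS:
        if pattern.search(line):
            return name
    return None


def summarize(lines):
    """Count how many lines fall into each category; unmatched lines are ignored."""
    counts = Counter()
    for line in lines:
        category = classify(line)
        if category is not None:
            counts[category] += 1
    return counts
```

The resulting counts could feed an alert, for example by raising a critical alarm whenever the `oom` or `mce` count is non-zero.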
Monitoring Mediums
- SNMP
- NRPE
- Consul
- Ansible / SSH
Alert Mediums
- SMS
- Email to SMS
- Jabber / XMPP
- IRC
- NOC Consoles (details?)
- Ticket systems
- VoIP
Healthchecks
Audience definition: a way to query a service and have it report back if it’s “OK” or not
- Is it trustworthy?
- How is the “OK” being determined?
- Would operators rely on it or still double-check everything that the service is checking itself?
statsd
Would operators want each OpenStack service to provide statsd metrics like Swift does?
- “Yes”, but the majority of the crowd wasn’t even using it for Swift. Why not?
- Most users are already polling their own metrics
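For readers unfamiliar with statsd, the sketch below shows what emitting a metric involves: a short text datagram of the form `name:value|type` sent over UDP, by default to port 8125. The metric name used in the usage note is hypothetical and does not correspond to anything an OpenStack service actually emits.

```python
import socket

# statsd metric types covered here: counter, gauge, timer (milliseconds).
VALID_TYPES = {"c", "g", "ms"}


def format_metric(name, value, metric_type):
    """Build a statsd datagram payload such as 'requests:1|c'."""
    if metric_type not in VALID_TYPES:
        raise ValueError("unsupported statsd metric type: %r" % metric_type)
    return "%s:%s|%s" % (name, value, metric_type)


def send_metric(name, value, metric_type, host="127.0.0.1", port=8125):
    """Send one metric to a statsd daemon over UDP (fire and forget)."""
    payload = format_metric(name, value, metric_type).encode("ascii")
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(payload, (host, port))
    finally:
        sock.close()
```

For example, `send_metric("nova.api.requests", 1, "c")` would increment a counter on a local daemon. Because UDP is connectionless, the sender is not delayed or notified if no daemon is listening, which is part of why this push model is cheap for services to support.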
Action Items
Completion of these items won’t solve all above-mentioned Pains, but it would be a great start.
- Start creating a list of “things” that need to be monitored in each OpenStack component
- Does not need to be a _complete_ list — that will always be an ongoing project
- Create example alerts for each of those items
- Would Nagios / NRPE checks be the best form of an example? It’s probably the style that the majority of operators are familiar with and is compatible with non-Nagios monitoring systems.
- If monitoring a specific service or function becomes too complex, determine whether it can be implemented more easily from within the service, and suggest or create a blueprint for that service.
- A lot of this already exists, but is scattered in a lot of different places.
- Can someone show an example of using Rally or Tempest wrapped in a Nagios plugin?
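As a possible starting point for the last question, here is a minimal sketch of a Nagios plugin that wraps an arbitrary command. It does not assume any particular Rally or Tempest invocation; the operator supplies the full command line. It does assume that the wrapped tool exits with status 0 on success and non-zero on failure, and the 300-second default timeout is a hypothetical value.

```python
import subprocess

# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def run_as_nagios_check(command, timeout=300):
    """Run `command` (a list of arguments) and map the result to Nagios terms.

    Assumption: the wrapped tool signals failure with a non-zero exit status.
    """
    try:
        result = subprocess.run(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return CRITICAL, "CRITICAL: command timed out after %s seconds" % timeout
    except OSError as exc:
        # For example, the binary is missing or not executable.
        return UNKNOWN, "UNKNOWN: could not run command: %s" % exc

    lines = result.stdout.strip().splitlines()
    last_line = lines[-1] if lines else "no output"
    if result.returncode == 0:
        return OK, "OK: %s" % last_line
    return CRITICAL, "CRITICAL: exit code %d: %s" % (result.returncode, last_line)


def main(argv):
    """Print the one-line status Nagios expects and return the exit code."""
    if not argv:
        print("UNKNOWN: no command given")
        return UNKNOWN
    code, message = run_as_nagios_check(argv)
    print(message)
    return code
```

To use it as a plugin, the script would end with `sys.exit(main(sys.argv[1:]))` and be invoked with the test command as its arguments. Reporting only the last line of output keeps the status within what Nagios displays; full logs from a long functional test run are better kept elsewhere.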