Large Scale SIG/Monitor
The second stage in the Scaling Journey is Monitor.
Once you have configured your cluster to handle scale, you will need to monitor it for signs of load stress. Monitoring in OpenStack can be overwhelming, and it is sometimes hard to determine how to monitor your deployment in a way that gives advance warning when load is getting too high. This page aims to help answer those questions.
Once meaningful monitoring is in place, you are ready to proceed to the third stage of the Scaling Journey: Scale Up.
Q: How can I detect that RabbitMQ is a bottleneck?
A: oslo.metrics, currently under development, will introduce monitoring for RPC calls. RabbitMQ node CPU and RAM usage is also an indicator that your RabbitMQ cluster is overloaded: if you find CPU or RAM usage high, you should scale your RabbitMQ nodes up or out.
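As a rough illustration of the RAM check, the sketch below flags nodes under memory pressure. It assumes the node data was already fetched from the RabbitMQ management plugin's /api/nodes endpoint, and that each entry carries the `name`, `mem_used`, `mem_limit` and `mem_alarm` fields. The `memory_pressure` helper and the 0.8 threshold are hypothetical choices, not recommended values. CPU usage would have to come from host-level monitoring.

```python
def memory_pressure(nodes, ratio_threshold=0.8):
    """Return the names of nodes whose memory usage looks high.

    `nodes` is a list of dicts shaped like the entries of the RabbitMQ
    management API /api/nodes response (assumed fields: name, mem_used,
    mem_limit, mem_alarm). The threshold is an arbitrary example value.
    """
    flagged = []
    for node in nodes:
        limit = node.get("mem_limit") or 0
        used = node.get("mem_used") or 0
        ratio = used / limit if limit else 0.0
        # Flag the node if RabbitMQ raised its own memory alarm, or if
        # usage is close to the configured high watermark.
        if node.get("mem_alarm") or ratio >= ratio_threshold:
            flagged.append(node.get("name", "<unknown>"))
    return flagged


# Hand-written sample data, not real output from a cluster.
sample = [
    {"name": "rabbit@mq1", "mem_used": 900, "mem_limit": 1000, "mem_alarm": False},
    {"name": "rabbit@mq2", "mem_used": 200, "mem_limit": 1000, "mem_alarm": False},
]
print(memory_pressure(sample))
```

A check like this only shows that a node is close to its limit. It does not show whether RPC traffic is the cause, which is the gap oslo.metrics is meant to fill.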
Q: How can I detect that the database is a bottleneck?
A: oslo.metrics will also integrate oslo.db, as the next step after oslo.messaging.
Q: How can I track latency issues?
A: If you have a load balancer or proxy in front of your OpenStack API servers (e.g. haproxy, nginx) you can monitor API latencies based on the metrics provided by those services.
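Once per-request durations have been extracted from the proxy, percentiles are usually more telling than averages. The following is a minimal sketch, assuming the durations are already available as a list of milliseconds. How you extract them depends on your proxy and its log or metrics format, which is not covered here. The `percentile` helper is hypothetical and uses the simple nearest-rank method.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not values:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    # Nearest-rank: smallest value with at least pct% of samples at or below it.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Hand-written sample API latencies in milliseconds.
latencies_ms = [12, 15, 11, 240, 18, 14, 13, 16, 900, 17]
summary = {p: percentile(latencies_ms, p) for p in (50, 95, 99)}
print(summary)
```

In this sample the median stays low while the high percentiles are dominated by two slow requests, which is the kind of tail behavior that tends to appear first under load.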
Q: How can I track traffic issues?
Q: How do I track error rates?
- For HTTP request error rates, use the same method as for latency: rely on the metrics provided by the proxy.
- For backend error rates, log processing tools such as Logstash or Fluentd can track error-level output in the OpenStack log files.
Q: How do I track saturation issues?
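Both error-rate approaches above can be sketched in a few lines. The example assumes that HTTP status codes have already been collected from the proxy, and that services log in the default oslo.log line layout (date, time, process ID, level, logger name). Both helper names are hypothetical, and a deployment with a customized log format would need a different pattern.

```python
import re
from collections import Counter

# Assumed layout: "<date> <time> <pid> <LEVEL> <logger> ...".
LEVEL_RE = re.compile(r"^\S+ \S+ \d+ (DEBUG|INFO|WARNING|ERROR|CRITICAL) ")


def count_levels(lines):
    """Count log lines per level, ignoring lines that do not match."""
    counts = Counter()
    for line in lines:
        match = LEVEL_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def http_error_rate(status_codes):
    """Fraction of responses with a 5xx status code."""
    codes = list(status_codes)
    if not codes:
        return 0.0
    return sum(1 for code in codes if code >= 500) / len(codes)


# Hand-written sample data.
log_lines = [
    "2021-01-12 10:15:32.123 4242 INFO nova.api [req-1] GET /servers",
    "2021-01-12 10:15:33.456 4242 ERROR nova.compute.manager [req-2] Build failed",
    "2021-01-12 10:15:33.456 4242 ERROR nova.compute.manager Traceback (most recent call last):",
    "not a log line",
]
levels = count_levels(log_lines)
rate = http_error_rate([200, 200, 503, 404])
print(levels["ERROR"], rate)
```

Note that a traceback is typically written as several ERROR lines, so this counts lines rather than distinct failures. Tools such as Logstash or Fluentd can group them more accurately.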
- oslo.metrics code and documentation
- Learn about golden signals (latency, traffic, errors, saturation) in the Google SRE book
Other SIG work on that stage
- Measurement of MQ behavior through oslo.metrics
- Approved spec for oslo.metrics: https://review.opendev.org/#/c/704733/
- Code up at https://opendev.org/openstack/oslo.metrics/
- 0.1.0 initial release done
- Get to a 1.0 release
- oslo-messaging metrics code https://review.opendev.org/#/c/761848/ (genekuo)
- Enable bandit (issue to fix with predictable path for metrics socket?)
- Improve tests to get closer to 100% coverage