Related to this blueprint
The compute.manager has a large number of periodic tasks that collect data about the running host. In large deployments it would be unwise to emit this data for every instance on the host, as doing so would quickly saturate the system. Instead, we propose a hook through which an in-service plugin can be called to process the collected data locally (and in memory) and report only on exceptional cases.
The typical use case for this would be QoS and alarming. If a customer with a 5Gb pipe has been running at 20Gb, we'd like to catch that early. Likewise, if a customer is running at 100% CPU on a 4-core image, that should be reported. Alternatively, the plugin could be as simple as taking the collected data and emitting it to a reporting tool like statsd/graphite via UDP.
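As a rough illustration of the "simple" end of that range, the sketch below formats one sample as a statsd gauge line and sends it over UDP. The function names and the metric name are made up for this example, and the host and port are assumed to point at a local statsd daemon on its usual port, 8125.

```python
import socket


def format_gauge(name, value):
    """Return a statsd gauge line, e.g. 'nova.cpu:97.5|g'."""
    return "%s:%s|g" % (name, value)


def send_gauge(name, value, host="127.0.0.1", port=8125):
    """Fire-and-forget one gauge sample over UDP.

    host/port are assumptions (a local statsd daemon); UDP gives no
    delivery guarantee, which is acceptable for sampled metrics.
    """
    payload = format_gauge(name, value).encode("utf-8")
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(payload, (host, port))
    finally:
        sock.close()
    return payload


if __name__ == "__main__":
    # Hypothetical metric name, for illustration only.
    send_gauge("nova.instance.example.cpu_percent", 97.5)
```

Because UDP is connectionless, the call should not block the periodic task even when nothing is listening on the other end.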
The biggest challenge here is finding a way to update the plugin's configuration after the compute node has started. We don't want to restart the compute node every time a high-watermark is moved.
Periodic Tasks in compute.manager
(the notes about emitting notifications are just for future reference)
- update_available_resource() - yes
  - potentially also call with resource_tracker information that includes requested resources (not just what hypervisor is reporting)
  - actually called in resource_tracker.update_available_resource()
- _cleanup_running_deleted_instances - Should emit notification
- _run_image_cache_manager_pass - Should emit notification
- _run_pending_deletes - Should emit notification
- _check_instance_build_time - Should emit notification
- _heal_instance_info_cache - maybe?
- _poll_rebooting_instances - Might emit notification on anomalies
- _poll_rescued_instances - Might emit notification on anomalies
- _poll_unconfirmed_resizes - Might emit notification on anomalies
- _poll_shelved_instances - meh
- _instance_usage_audit - maybe, but unlikely
- _poll_bandwidth_usage - yes
- _poll_volume_usage - yes
- _sync_power_states - calls virt.get_info() which has mem/max_mem info ... dunno?
- _reclaim_queued_deletes - meh
You don't need to override all of these methods. They'll only get called if they exist in the plugin.
    class MetricPlugin(object):
        """Abstract base class for Metric Plugins.

        See bp: host-metric-hook for more information.
        """

        def on_cpu(self, instance, cpu_dict):
            """Called after hypervisor cpu stat polls."""
            pass

        def on_volume(self, instance, volume_dict):
            """Called after volume stat polls."""
            pass

        def on_bandwidth(self, instance, bandwidth_dict):
            """Called after network bandwidth stat polls."""
            pass

        def on_ram(self, instance, ram_info):
            """Called after hypervisor RAM polls."""
            pass
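A minimal sketch of how the "only called if they exist" rule and the 5Gb/20Gb alarm case might fit together. Everything here is illustrative: `BandwidthAlarmPlugin`, `call_hook`, and the `gbps` key are assumptions, since these notes do not define the layout of `bandwidth_dict` or how the manager dispatches hooks. The base class is replaced by an empty stand-in so that a missing hook is visible.

```python
class MetricPlugin(object):
    """Empty stand-in for the base class above, so the sketch runs alone."""


class BandwidthAlarmPlugin(MetricPlugin):
    """Hypothetical plugin: remember instances that exceed a bandwidth cap."""

    def __init__(self, limit_gbps):
        self.limit_gbps = limit_gbps
        self.alerts = []  # (instance, rate) pairs that went over the cap

    def on_bandwidth(self, instance, bandwidth_dict):
        # 'gbps' is an assumed key; the real dict layout is not specified.
        rate = bandwidth_dict.get("gbps", 0)
        if rate > self.limit_gbps:
            self.alerts.append((instance, rate))


def call_hook(plugin, hook_name, *args):
    """Call plugin.<hook_name>(*args) if defined; return True if called."""
    hook = getattr(plugin, hook_name, None)
    if not callable(hook):
        return False
    hook(*args)
    return True


if __name__ == "__main__":
    plugin = BandwidthAlarmPlugin(limit_gbps=5)
    call_hook(plugin, "on_bandwidth", "inst-1", {"gbps": 20})  # over the cap
    call_hook(plugin, "on_cpu", "inst-1", {})  # skipped: hook not defined
    print(plugin.alerts)
```

Keeping the alert list in memory matches the goal of processing data locally; a real plugin would presumably hand these off as notifications instead of storing them.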
What to do with measurements?
- Send samples to Ceilometer via UDP
- Emit over/under alerts via notifications.
Configuration / Restarts
- How to configure plugins?
- How to update configurations without restarting compute node?