HypervisorMonitoringPlugin

The compute.manager has a large number of periodic tasks that collect data about the running host. In large deployments it would be unwise to emit this data for every instance on the host, as it would quickly saturate the system. Instead we propose a mechanism whereby an in-service plugin is called to process the collected data locally (and in-memory) and only report on exceptional cases.

The typical use cases for this would be QoS and alarming. If a customer with a 5Gb pipe has been running at 20Gb, we'd like to catch that early. Likewise, if a customer is running at 100% CPU on a 4-core image, that should be reported. Alternatively, the plugin could be as simple as taking the collected data and emitting it to a reporting tool like statsd/graphite via UDP.

The biggest challenge here is finding a way to update the plugin's configuration after the compute node has started; we don't want to restart the compute node every time a high-watermark is moved.

Periodic Tasks in compute.manager

(the notes about emitting notifications are just for future reference)

  • update_available_resource() - yes
  • potentially also call with resource_tracker information that includes requested resources (not just what hypervisor is reporting)
  • actually called in resource_tracker.update_available_resource()
  • _cleanup_running_deleted_instances - Should emit notification
  • _run_image_cache_manager_pass - Should emit notification
  • _run_pending_deletes - Should emit notification
  • _check_instance_build_time - Should emit notification
  • _heal_instance_info_cache - maybe?
  • _poll_rebooting_instances - Might emit notification on anomalies
  • _poll_rescued_instances - Might emit notification on anomalies
  • _poll_unconfirmed_resizes - Might emit notification on anomalies
  • _poll_shelved_instances - meh
  • _instance_usage_audit - maybe, but unlikely
  • _poll_bandwidth_usage - yes (see the hook sketch after this list)
  • _poll_volume_usage - yes
  • _sync_power_states - calls virt.get_info() which has mem/max_mem info ... dunno?
  • _reclaim_queued_deletes - meh
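
As a rough sketch of how one of these tasks might hand its data to an in-service plugin implementing the interface under "Possible API" below: none of this wiring exists in nova today, and the dict keys and the monitoring_plugins registry are assumptions for illustration.

# Hypothetical wiring inside compute.manager; nothing here is existing nova code.
class ComputeManager(object):
    def __init__(self, monitoring_plugins=None):
        # Plugins would be loaded from configuration at service startup.
        self.monitoring_plugins = monitoring_plugins or []

    def _poll_bandwidth_usage(self, context):
        # ... the existing per-instance bandwidth collection would go here ...
        usages = []  # stand-in for whatever the task already collects
        for usage in usages:
            bandwidth_dict = {
                'instance_uuid': usage['instance_uuid'],
                'rx_bytes': usage['bw_in'],
                'tx_bytes': usage['bw_out'],
            }
            # Hand the sample to each in-service plugin; the plugin decides
            # locally (in-memory) whether anything is worth reporting.
            for plugin in self.monitoring_plugins:
                plugin.on_bandwidth(bandwidth_dict)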

Possible API

class MonitoringPlugin:
    def on_cpu(self, cpu_dict):
        pass
    def on_volume(self, volume_dict):
        pass
    def on_bandwidth(self, bandwidth_dict):
        pass
    def on_ram(self, ram_info):
        # Could be called from two places; need to ensure units and source are the same.
        pass
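
A plugin built on this interface could implement the QoS/alarming use case described above, reporting only when a watermark is crossed. A minimal sketch; the threshold values, dict keys, and the notifier.alert() helper are assumptions, not an existing API:

class ThresholdAlertPlugin(MonitoringPlugin):
    """Only reports when a sample crosses a locally configured watermark."""

    def __init__(self, notifier, cpu_high_watermark=0.95, bw_allowance_bps=5e9):
        self.notifier = notifier              # assumed notification helper
        self.cpu_high_watermark = cpu_high_watermark
        self.bw_allowance_bps = bw_allowance_bps

    def on_cpu(self, cpu_dict):
        # e.g. a customer pegging 100% CPU on a 4-core image
        if cpu_dict['util'] >= self.cpu_high_watermark:
            self.notifier.alert('cpu.high_watermark', cpu_dict)

    def on_bandwidth(self, bandwidth_dict):
        # e.g. a 5Gb pipe observed running at 20Gb
        if bandwidth_dict['rate_bps'] > self.bw_allowance_bps:
            self.notifier.alert('bandwidth.over_allowance', bandwidth_dict)

Everything below the watermark is dropped on the host, so nothing reaches the message bus unless it is exceptional.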

What to do with measurements?

  • Send samples to Ceilometer via UDP
  • Emit over/under alerts via notifications.
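
For the UDP path, the plugin could push raw samples straight to a statsd/graphite collector. A sketch building on the MonitoringPlugin class above, assuming a statsd daemon on the conventional port 8125 and a per-instance gauge naming scheme:

import socket

class StatsdPlugin(MonitoringPlugin):
    def __init__(self, host='127.0.0.1', port=8125):
        self.addr = (host, port)
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def _gauge(self, name, value):
        # statsd wire format for a gauge: "<name>:<value>|g"
        self.sock.sendto(('%s:%s|g' % (name, value)).encode('ascii'), self.addr)

    def on_cpu(self, cpu_dict):
        # Assumed keys; the real layout would come from the periodic task.
        self._gauge('compute.%s.cpu' % cpu_dict['instance_uuid'], cpu_dict['util'])

Sending to Ceilometer's UDP collector would follow the same fire-and-forget pattern, just with Ceilometer's sample payload instead of the statsd line format.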

Configuration / Restarts

  • How to configure plugins?
  • How to update configurations without restarting compute node?
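
One low-tech way to avoid restarting nova-compute when a high-watermark moves is for the plugin to re-read its own threshold file whenever the file changes on disk. A sketch, assuming thresholds live in a small JSON file whose path is a plugin option:

import json
import os

class ReloadableThresholds(object):
    """Re-reads a JSON threshold file whenever its mtime changes."""

    def __init__(self, path):
        self.path = path
        self._mtime = None
        self._data = {}

    def get(self, key, default=None):
        mtime = os.path.getmtime(self.path)
        if mtime != self._mtime:
            with open(self.path) as f:
                self._data = json.load(f)
            self._mtime = mtime
        return self._data.get(key, default)

A plugin would then look the watermark up on every callback, e.g. thresholds.get('cpu_high_watermark', 0.95), so editing the file takes effect without touching the compute service.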