HypervisorMonitoringPlugin

Latest revision as of 17:14, 9 October 2013

Related to the host-metric-hook blueprint: https://blueprints.launchpad.net/nova/+spec/host-metric-hook

The compute.manager has a large number of periodic tasks that collect data about the running host. In large deployments it would be unwise to emit this data for every instance on the host, as it would quickly saturate the system. Instead, we propose a hook through which an in-service plugin is called to process the collected data locally (and in-memory) and report only on exceptional cases.

The typical use case for this would be QoS and alarming. If we see that a customer with a 5Gb pipe has been running at 20Gb, we'd like to catch that early. Likewise, if a customer is running at 100% CPU on a 4-core image, that should be reported. Alternatively, the plugin could be as simple as taking the collected data and emitting it to a reporting tool like statsd/graphite via UDP.
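As a rough illustration of the "simple" end of that spectrum, here is a minimal sketch of a plugin that forwards bandwidth samples to statsd as gauges over UDP. The class name, the `nova.bandwidth.` metric prefix, and the assumed shape of `bandwidth_dict` (a flat mapping of metric name to number) are all hypothetical; only the statsd gauge line format (`name:value|g`) and its default port 8125 come from statsd itself.

```python
import socket


def format_gauge(name, value):
    """Render one sample as a statsd gauge line: '<name>:<value>|g'."""
    return "{0}:{1}|g".format(name, value)


class StatsdBandwidthPlugin(object):
    """Hypothetical plugin that forwards bandwidth samples to statsd."""

    def __init__(self, host="127.0.0.1", port=8125):
        self.address = (host, port)
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def on_bandwidth(self, bandwidth_dict):
        # Assumed shape: {metric_name: numeric_value}. The real payload
        # passed by compute.manager may be structured differently.
        for name, value in sorted(bandwidth_dict.items()):
            payload = format_gauge("nova.bandwidth." + name, value)
            # UDP is fire-and-forget: no connection and no reply expected.
            self.sock.sendto(payload.encode("ascii"), self.address)
```

Because each datagram is sent without waiting for a reply, a slow or absent collector should not stall the periodic task that calls the plugin.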

The biggest challenge here is finding a way to update the plugin's configuration after the compute node has started. We don't want to restart the compute node every time a high-watermark is moved.

Periodic Tasks in compute.manager

(the notes about emitting notifications are just for future reference)

  • update_available_resource() - yes
      • potentially also call with resource_tracker information that includes requested resources (not just what the hypervisor is reporting)
      • actually called in resource_tracker.update_available_resource()
  • _cleanup_running_deleted_instances - Should emit notification
  • _run_image_cache_manager_pass - Should emit notification
  • _run_pending_deletes - Should emit notification
  • _check_instance_build_time - Should emit notification
  • _heal_instance_info_cache - maybe?
  • _poll_rebooting_instances - Might emit notification on anomalies
  • _poll_rescued_instances - Might emit notification on anomalies
  • _poll_unconfirmed_resizes - Might emit notification on anomalies
  • _poll_shelved_instances - meh
  • _instance_usage_audit - maybe, but unlikely
  • _poll_bandwidth_usage - yes
  • _poll_volume_usage - yes
  • _sync_power_states - calls virt.get_info() which has mem/max_mem info ... dunno?
  • _reclaim_queued_deletes - meh

Plugin Interface

You don't need to override all of these methods. They'll only get called if they exist in the plugin.

class MetricPlugin(object):
    """Abstract base class for Metric Plugins.

    See bp: host-metric-hook for more information.
    """

    def on_host(self, host_dict):
        """Called after hypervisor host stats are polled."""
        pass

    def on_cpu(self, cpu_dict):
        """Called after guest CPU stats are polled.

        Unsupported currently.
        """
        pass

    def on_ram(self, host_dict):
        """Called after guest RAM stats are polled.

        Unsupported currently.
        """
        pass

    def on_volume(self, volume_dict):
        """Called after guest volume stats are polled."""
        pass

    def on_bandwidth(self, bandwidth_dict):
        """Called after guest network bandwidth stats are polled."""
        pass
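The "only called if they exist" rule could be implemented on the calling side roughly as follows. This is a sketch under assumptions, not the blueprint's actual code: `dispatch` and `VolumeOnlyPlugin` are hypothetical names, and swallowing plugin exceptions (so a faulty plugin cannot break the periodic task) is a design choice this page does not specify.

```python
import logging

LOG = logging.getLogger(__name__)


def dispatch(plugins, hook_name, payload):
    """Call hook_name(payload) on every plugin that defines it.

    Returns the number of plugins whose hook was invoked.
    """
    called = 0
    for plugin in plugins:
        # Plugins that do not define the hook are silently skipped.
        hook = getattr(plugin, hook_name, None)
        if not callable(hook):
            continue
        called += 1
        try:
            hook(payload)
        except Exception:
            # Assumed policy: log and carry on with the next plugin.
            LOG.exception("Plugin %r failed in %s", plugin, hook_name)
    return called


class VolumeOnlyPlugin(object):
    """Hypothetical plugin that only implements on_volume."""

    def __init__(self):
        self.seen = []

    def on_volume(self, volume_dict):
        self.seen.append(volume_dict)
```

With this approach a plugin does not have to inherit from MetricPlugin at all; any object exposing one or more on_* methods would be called.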

What to do with measurements?

  • Send samples to Ceilometer via UDP
  • Emit over/under alerts via notifications
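The over/under alert option might look something like the sketch below. Everything here is an assumption made for illustration: the `WatermarkPlugin` name, the `notify(event_type, payload)` callable, the `bandwidth.over` / `bandwidth.under` event names, and the shape of `bandwidth_dict` (instance id mapped to a single number).

```python
def classify(value, low, high):
    """Return 'under', 'over', or None when value lies within [low, high]."""
    if value < low:
        return "under"
    if value > high:
        return "over"
    return None


class WatermarkPlugin(object):
    """Hypothetical plugin that reports only exceptional bandwidth samples."""

    def __init__(self, low, high, notify):
        self.low = low
        self.high = high
        # notify is an assumed callable taking (event_type, payload).
        self.notify = notify

    def on_bandwidth(self, bandwidth_dict):
        # Assumed shape: {instance_id: bandwidth_value}.
        for instance, value in sorted(bandwidth_dict.items()):
            state = classify(value, self.low, self.high)
            if state is not None:
                self.notify("bandwidth." + state,
                            {"instance": instance, "value": value})
```

Samples inside the watermarks produce no traffic at all, which is the point of processing the data locally before anything is emitted.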

Configuration / Restarts

  • How should plugins be configured?
  • How can configurations be updated without restarting the compute node?
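One possible answer to the second question, offered only as a sketch: keep the plugin's thresholds in a small file and re-read it whenever its modification time changes. The `ReloadingConfig` class and the JSON file format are hypothetical and are not something this page or the blueprint specifies.

```python
import json
import os


class ReloadingConfig(object):
    """Re-read a JSON file whenever its modification time changes."""

    def __init__(self, path):
        self.path = path
        self._mtime = None
        self._values = {}

    def get(self, key, default=None):
        # One stat() call per lookup; the file is parsed only on change.
        mtime = os.stat(self.path).st_mtime
        if mtime != self._mtime:
            with open(self.path) as handle:
                self._values = json.load(handle)
            self._mtime = mtime
        return self._values.get(key, default)
```

A plugin could then call something like config.get("high") on each poll, so a moved high-watermark would take effect on the next periodic run without a restart. Writers should replace the file atomically (write to a temporary file, then rename) so a half-written file is never parsed.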