Zabbix-agent-adoption

  • Launchpad Entry: CeilometerSpec:Zabbix-agent-adoption
  • Created: Oct. 25, 2013
  • Contributors: Yu Zhang

Introduction

Currently, Ceilometer collects instance data via compute agents installed on every
OpenStack compute node. PollingTasks in a compute agent invoke multiple pollsters,
which in turn call hypervisor-dependent inspectors to meter various metrics. For
example, CPUPollster calls the inspect_cpus() method of a hypervisor-dependent
inspector object to get VCPU data. If the hypervisor is KVM, inspect_cpus() calls
the info() method of libvirt's virDomain class, which returns a list of five data
elements, two of which CPUPollster cares about: the VCPU count and the running time.
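
To make the KVM path concrete, the following sketch (not Ceilometer's actual
pollster or inspector code) calls the libvirt Python bindings directly; the domain
name is hypothetical. virDomain.info() returns five elements, of which the fourth
is the VCPU count and the fifth the cumulative CPU time in nanoseconds, i.e. the
two values CPUPollster uses.

# Sketch only: read VCPU data for one KVM guest the same way the libvirt-based
# inspector does, via virDomain.info().  The domain name below is hypothetical.
import libvirt

conn = libvirt.openReadOnly('qemu:///system')
dom = conn.lookupByName('instance-00000001')

# info() returns [state, max_mem_KiB, mem_KiB, num_vcpus, cpu_time_ns]
state, max_mem, mem, num_vcpus, cpu_time_ns = dom.info()
print('vcpus=%d, cpu_time=%d ns' % (num_vcpus, cpu_time_ns))

conn.close()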

Such pollsters work well for data that is easily available to a hypervisor (http://www.mirantis.com/blog/openstack-metering-using-ceilometer/),
but they ignore the detailed and precise guest system metrics that a hypervisor
does not provide. As a simple case study, we can compare what CPUPollster
provides with the CPU monitoring items supported by Zabbix, one of the most
popular system monitoring tools. A snapshot of the Zabbix web console is shown
in the following figure.

CPU monitoring items in Zabbix

In practice, the guest system metrics provided by Zabbix are highly valuable to
both OpenStack admins and tenants, which our own experience and feedback from
other companies using OpenStack confirm. Zabbix has therefore been deployed in
many production OpenStack clouds to achieve detailed and precise monitoring.
Other popular 3rd-party monitoring tools include Nagios, Ganglia, etc.

This work aims at leveraging existing monitoring assets and the expertise of system
administration teams to the best extent, instead of removing or replacing them at
some effort. An adoption mechanism between 3rd-party monitoring agents in instances
and Ceilometer compute agents on compute nodes is added, so that Ceilometer can
poll data from those agents directly and thereby enhance its capability of
monitoring instances.

Feasibility analysis

Most 3rd-party monitoring tools are essentially client-server systems. For
each monitored system, an agent (e.g. Zabbix agent, Nagios NRPE, Ganglia
gmond, etc.) is installed. Some monitoring tools can also leverage SNMP; in
such cases, we can consider the SNMP daemon in a monitored system as the agent.

To achieve cluster-wide monitoring, store monitoring data and provide UI
interfaces, each tool also has a server, which, directly or via some low-level
utilities (e.g. Nagios check_nrpe), periodically queries the agents in monitored
systems and polls data back. For Zabbix, Nagios and Ganglia alike, such querying
and polling are usually conducted over TCP connections between the agents and
the server.
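
To make such a TCP exchange concrete, the sketch below queries one item from a
Zabbix agent, assuming the plain-text passive-check request used by Zabbix 2.x
agents (the item key plus a newline is sent, and the agent answers with a
"ZBXD\x01" header, an 8-byte little-endian length and the value). The agent
address and item key in the example are illustrative.

# Sketch only: one passive-check query against a Zabbix agent over TCP.
import socket
import struct


def query_zabbix_agent(host, key, port=10050, timeout=5.0):
    """Send one item key to a Zabbix agent and return the value as a string."""
    sock = socket.create_connection((host, port), timeout=timeout)
    try:
        sock.sendall(key.encode('ascii') + b'\n')   # passive-check request
        reply = b''
        while True:                                 # read until the agent closes
            chunk = sock.recv(4096)
            if not chunk:
                break
            reply += chunk
    finally:
        sock.close()
    if reply.startswith(b'ZBXD\x01'):
        length = struct.unpack('<Q', reply[5:13])[0]     # 8-byte little-endian length
        return reply[13:13 + length].decode('utf-8', 'replace')
    return reply.decode('utf-8', 'replace')              # very old agents: bare value


# Example: CPU idle percentage averaged over one minute (illustrative guest IP).
# print(query_zabbix_agent('192.0.2.10', 'system.cpu.util[,idle,avg1]'))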

Therefore, it is reasonable to consider all VM instances on an OpenStack compute
node as one monitored cluster. A 3rd-party monitoring agent in each instance
listens on a specified port and, when queries are received, collects the required
data and sends it back. The only difference is that the queries might now come not
from a 3rd-party monitoring server, but from a local proxy implemented as a plugin
of the Ceilometer compute agent running on that compute node. Each time the proxy
is invoked, its working process can be summarized by the following steps (a code
sketch follows the list):

  • Receiving queries from Ceilometer,
  • Translating the queries into commands/queries meaningful to the 3rd-party agents,
  • Sending those commands/queries (via TCP connections) to an agent in an instance
    on the compute node where this proxy is located,
  • Getting data back from the agent,
  • Transforming the data into samples whose formats are legal for Ceilometer, and
  • Returning those samples to Ceilometer.
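
The sketch below ties these steps together in a hypothetical proxy class; it is not
the eventual plugin code and does not use Ceilometer's real plugin interfaces. It
reuses query_zabbix_agent() from the earlier sketch, translates an assumed meter
name into a Zabbix item key, and packages the reply into a plain dict standing in
for a Ceilometer sample.

import datetime

# Hypothetical translation table: meter name -> (Zabbix item key, unit).
METER_TO_ZABBIX = {
    'cpu_util.idle': ('system.cpu.util[,idle,avg1]', '%'),
    'memory.available': ('vm.memory.size[available]', 'B'),
}


class ZabbixAgentProxy(object):
    """Sketch of the proxy: receive, translate, query, transform, return."""

    def __init__(self, instance_ip, resource_id):
        self.instance_ip = instance_ip      # agent address inside the instance
        self.resource_id = resource_id      # instance id attached to the sample

    def poll(self, meter_name):
        key, unit = METER_TO_ZABBIX[meter_name]              # translate
        value = query_zabbix_agent(self.instance_ip, key)    # send + receive (see above)
        return {                                             # transform into a
            'name': meter_name,                              # sample-like dict
            'volume': float(value),
            'unit': unit,
            'resource_id': self.resource_id,
            'timestamp': datetime.datetime.utcnow().isoformat(),
        }


# Receiving the query from Ceilometer and returning the sample would be driven by
# the compute agent's polling task, conceptually:
#   proxy = ZabbixAgentProxy('192.0.2.10', 'instance-uuid')
#   sample = proxy.poll('cpu_util.idle')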

In such a case, a question is whether we still need the 3rd-party monitoring
server to be deployed. The answer depends on both the design of the 3rd-party tool
and the extra development effort we are willing to invest. Take Zabbix as an
example. Zabbix agents are first configured and initialized by the Zabbix server;
only then do they know which types of metrics to collect, how long the metering
intervals should be, and so on. In that case, a deployed Zabbix server can simply
help us manage all Zabbix agents in instances during the initial stages. After all
agents are set up, we can just use the proxy in Ceilometer to collect data, and the
server might be needed only rarely. Of course, we could develop agent management
functions in Ceilometer (if the protocol is open) to replace the monitoring server
entirely, but the extra development effort would not be negligible.

If the 3rd-party agents are only loosely coupled with their server and can be
controlled via simple protocols, then the server becomes unnecessary altogether.

Design and implementation

To be added