Zabbix-agent-adoption


 * Launchpad Entry: CeilometerSpec:Zabbix-agent-adoption
 * Created: Oct. 25, 2013
 * Contributors: Yu Zhang

Blueprint link: https://blueprints.launchpad.net/ceilometer/+spec/zabbix-agent-adoption

Note: The blueprint has been renamed to "3rd-party monitoring agent adoption
mechanism" to align with its current contents.

Introduction
Currently, Ceilometer collects instance data via compute agents installed on every
OpenStack compute node. PollingTasks in a compute agent invoke multiple pollsters,
which in turn call hypervisor-dependent inspectors to meter various metrics. For
example, the CPUPollster calls the inspect_cpus method of a hypervisor-dependent
inspector object to get VCPU data. If the hypervisor is KVM, inspect_cpus calls
the info method of libvirt's virDomain class, which returns a list of five data
elements, including the two that CPUPollster cares about: the number of VCPUs and
the CPU running time.
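
To make the selection step concrete: in the libvirt Python binding, the list
returned by virDomain.info() holds the domain state, the maximum memory, the
memory in use, the number of virtual CPUs and the cumulative CPU time in
nanoseconds. The sketch below only shows how the two relevant fields could be
picked out. The helper name, the CPUStats container and the sample values are
assumptions made for illustration, and no libvirt connection is opened.

```python
from collections import namedtuple

# Hypothetical container for the two values the CPUPollster uses.
CPUStats = namedtuple("CPUStats", ["number", "time"])


def cpu_stats_from_domain_info(info):
    """Pick VCPU count and CPU time out of a virDomain.info() result.

    The list is expected to look like:
    [state, max_memory_kib, memory_kib, vcpu_count, cpu_time_ns]
    """
    state, max_mem, mem, vcpu_count, cpu_time_ns = info
    return CPUStats(number=vcpu_count, time=cpu_time_ns)


# Made-up values shaped like a virDomain.info() result.
sample_info = [1, 2097152, 2097152, 2, 153000000000]
stats = cpu_stats_from_domain_info(sample_info)
```

Everything else a guest could report, such as per-process load or memory
pressure inside the instance, is simply absent from this list.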

Such pollsters work well for data that is easily available to a hypervisor
(http://www.mirantis.com/blog/openstack-metering-using-ceilometer/), but they miss
detailed and precise guest system metrics that a hypervisor does not provide. As a
simple case study, we can compare what CPUPollster provides with the CPU monitoring
items supported by Zabbix, one of the most popular system monitoring tools. A
snapshot of the Zabbix web console is shown in the following figure.



In practice, the guest system metrics provided by Zabbix are highly valuable to
both OpenStack administrators and tenants, as confirmed by our own experience and
by feedback from other companies using OpenStack. For this reason, Zabbix has been
deployed in many production-oriented OpenStack clouds to achieve detailed and
precise monitoring. Other popular 3rd-party monitoring tools include Nagios and
Ganglia.

This work aims to make the best use of the existing monitoring assets and
expertise of system administration teams, rather than spending effort on removing
or replacing them. It adds an adoption mechanism between the 3rd-party monitoring
agents inside instances and the Ceilometer compute agents on compute nodes, so
that Ceilometer can poll data from those agents directly and thereby enhance its
instance monitoring capability.

Feasibility analysis
Most 3rd-party monitoring tools are essentially client-server systems. On each
monitored system, an agent (e.g. Zabbix agent, Nagios NRPE, Ganglia gmond) is
installed. Some monitoring tools can also use SNMP; in such cases, we can treat
the SNMP daemon on a monitored system as the agent. To provide cluster-wide
monitoring, storage of monitoring data and a user interface, each tool also has a
server which, directly or via low-level utilities (e.g. Nagios check_nrpe),
periodically queries the agents on the monitored systems and pulls data back. For
Zabbix, Nagios and Ganglia alike, such querying and polling is usually conducted
over TCP connections between the agents and the server.

Therefore, it is reasonable to treat all VM instances on an OpenStack compute
node as a monitored cluster. The 3rd-party monitoring agent in each instance
listens on a specified port and, when a query is received, collects the requested
data and sends it back. The only difference is that the queries may come not from
the tool's own monitoring server, but from a local proxy that is a plugin of the
Ceilometer compute agent running on this compute node. Each time the proxy is
invoked, its work can be summarized in the following steps:
 * Receiving queries from Ceilometer,
 * Translating the queries into commands/queries meaningful to the 3rd-party
   agents,
 * Sending the commands/queries (via TCP connections) to the agent in an
   instance on the compute node where this proxy is located,
 * Getting data back from the agent,
 * Transforming the data into samples whose format is valid for Ceilometer, and
 * Finally, returning those samples to Ceilometer.
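
The steps above can be sketched as a single function. This is only an
illustration under stated assumptions: the function names, the meter name
"guest.cpu.load", the key mapping and the dictionary used as a sample are all
hypothetical, and the transport is passed in as a callable so that the sketch
runs without a real agent. The item key system.cpu.load[all,avg1] and port
10050 are the usual Zabbix agent key and default port.

```python
import time


def translate_query(meter_name, key_map):
    """Map a Ceilometer-side meter name to an agent-specific item key."""
    return key_map[meter_name]


def poll_instance(meter_name, instance_id, address, key_map, send_query):
    """Run the proxy steps for one meter on one instance.

    send_query(address, key) -> str is injected, so the transport (a TCP
    client or a command-line utility) stays replaceable.
    """
    # Translate the incoming query into the agent's vocabulary.
    key = translate_query(meter_name, key_map)
    # Send the query to the agent in the instance and get the raw reply.
    raw = send_query(address, key)
    # Transform the reply into a sample-like record and return it.
    return {
        "name": meter_name,
        "resource_id": instance_id,
        "volume": float(raw),
        "timestamp": time.time(),
    }


# Hypothetical mapping from a meter name to a Zabbix item key.
KEY_MAP = {"guest.cpu.load": "system.cpu.load[all,avg1]"}


def fake_send_query(address, key):
    # Stand-in for a real agent: always reports the same load value.
    return "0.42"


sample = poll_instance("guest.cpu.load", "instance-0001",
                       ("10.0.0.5", 10050), KEY_MAP, fake_send_query)
```

Keeping the translation table and the transport separate means that supporting
another agent type would, in this sketch, only require a different key map and
a different send_query callable.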

This raises the question of whether the 3rd-party monitoring server still needs
to be deployed. The answer depends on both the design of the 3rd-party tool and
the extra development effort we are willing to spend. Take Zabbix as an example.
All Zabbix agents should first be configured and initialized by the Zabbix server,
so that they know which types of metrics to collect, how long the metering
intervals should be, and so on. If this is the case, a deployed Zabbix server can
simply help us manage the Zabbix agents in instances during the initial stages.
Once all agents are set up, we can use the proxy in Ceilometer to collect data,
and the server may rarely be used. Of course, we could develop agent management
functions in Ceilometer (if the protocol is open) to replace the monitoring
server entirely, but the extra development effort would not be negligible.

If the 3rd-party agents are only loosely coupled with the server and can be
controlled by simple protocols, then the server is not needed at all.

The following figure outlines the logical structure of an OpenStack compute node
hosting both instances with 3rd-party agents inside and a Ceilometer compute
agent with a proxy.



Design and implementation
The internal mechanism of the Ceilometer compute agent is outlined in the following figure.



As shown in the figure, a list of PollingTasks, each with its own execution
interval, is invoked periodically inside the compute agent. When invoked, a
PollingTask triggers each of its pollsters for each instance on this compute node
to poll data. Thanks to Ceilometer's flexible design, all pollsters are in fact
plugins, which can easily be added to this framework. Therefore, to implement our
proxy for 3rd-party monitoring agents, we can simply add a ProxyPollster plugin
to the pollster list. The ProxyPollster will then be invoked periodically and
collect data from the 3rd-party agent in each instance.
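
A minimal sketch of this arrangement is given below. It does not use
Ceilometer's actual plugin base classes or loading mechanism; the class
shapes, the get_samples signature, the instance dictionaries and the canned
values are assumptions chosen to show only the "each pollster for each
instance" loop with a ProxyPollster in the list.

```python
class ProxyPollster:
    """Hypothetical pollster that asks a 3rd-party agent for one item."""

    def __init__(self, meter_name, item_key, query_agent):
        self.meter_name = meter_name
        self.item_key = item_key
        # query_agent(instance, item_key) -> str; the transport is injected.
        self.query_agent = query_agent

    def get_samples(self, instance):
        value = self.query_agent(instance, self.item_key)
        yield {"name": self.meter_name,
               "resource_id": instance["id"],
               "volume": float(value)}


class PollingTask:
    """Runs every pollster in its list against every instance."""

    def __init__(self, interval, pollsters):
        self.interval = interval  # seconds between two runs
        self.pollsters = list(pollsters)

    def poll(self, instances):
        samples = []
        for pollster in self.pollsters:
            for instance in instances:
                samples.extend(pollster.get_samples(instance))
        return samples


def fake_query_agent(instance, item_key):
    # Stand-in for a real agent query; returns a canned value per instance.
    return instance["canned_load"]


task = PollingTask(interval=60, pollsters=[
    ProxyPollster("guest.cpu.load", "system.cpu.load[all,avg1]",
                  fake_query_agent),
])
instances = [{"id": "vm-1", "canned_load": "0.10"},
             {"id": "vm-2", "canned_load": "1.25"}]
samples = task.poll(instances)
```

In this sketch the periodic scheduling itself is left out; a caller would
invoke task.poll once per interval.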

For the detailed implementation of the ProxyPollster, two methods can be used:


 * The first is simply an in-pollster client of the 3rd-party agent's
   communication protocol. This works in cases where the protocol is simple and
   easy to implement. For example, such a client for Zabbix is already available
   in Python, and a Nagios NRPE client in Perl is also available.
 * The second is to call a command-line utility provided by the 3rd-party
   monitoring tool itself. The ProxyPollster only calls this utility and waits
   for the returned data. This method removes most of the re-development effort,
   but introduces the requirement of installing the command-line utility on each
   OpenStack compute node. Zabbix, Nagios and Ganglia all provide such a utility.
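
Both methods can be illustrated for Zabbix. For the first, the sketch below
frames and unframes a message using the Zabbix header as we understand it (the
bytes "ZBXD", a flag byte 0x01, then an 8-byte little-endian data length);
whether a given agent version expects this header on requests or a plain
newline-terminated key should be checked against the Zabbix protocol
documentation. For the second, it only builds the argument list for the
zabbix_get utility with its -s, -p and -k options, without executing it. The
function names and the address are hypothetical.

```python
import struct

ZBX_HEADER = b"ZBXD\x01"


def pack_zabbix_request(key):
    """Method 1: frame an item key for an in-pollster TCP client."""
    data = key.encode("utf-8")
    return ZBX_HEADER + struct.pack("<Q", len(data)) + data


def unpack_zabbix_response(frame):
    """Method 1: extract the payload from a framed agent reply."""
    if not frame.startswith(ZBX_HEADER):
        raise ValueError("not a Zabbix frame")
    (length,) = struct.unpack("<Q", frame[5:13])
    return frame[13:13 + length].decode("utf-8")


def zabbix_get_command(host, key, port=10050):
    """Method 2: argument list for the zabbix_get command-line utility."""
    return ["zabbix_get", "-s", host, "-p", str(port), "-k", key]


request = pack_zabbix_request("agent.ping")
command = zabbix_get_command("10.0.0.5", "agent.ping")
```

A pollster using the second method could pass such an argument list to
subprocess.check_output and parse the printed value, at the cost of requiring
zabbix_get on every compute node.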

As mentioned above, if we do not want to involve the 3rd-party monitoring server
in configuring and initializing the monitoring agents, we need to rely on a local
configuration file that describes at least which types of metrics should be
collected.
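
One possible shape for such a file is sketched below as an INI document parsed
with Python's configparser. The section names, option names and meter names
are purely hypothetical; the blueprint does not define a file format. Only the
item keys and the port follow common Zabbix agent conventions.

```python
import configparser

# Hypothetical local config: one [agent] section plus one section per meter.
EXAMPLE_CONFIG = """
[agent]
type = zabbix
port = 10050

[meter:guest.cpu.load]
key = system.cpu.load[all,avg1]
interval = 60

[meter:guest.memory.available]
key = vm.memory.size[available]
interval = 300
"""


def load_meters(text):
    """Return (agent_port, {meter_name: {"key": ..., "interval": ...}})."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    meters = {}
    for section in parser.sections():
        if section.startswith("meter:"):
            name = section[len("meter:"):]
            meters[name] = {
                "key": parser.get(section, "key"),
                "interval": parser.getint(section, "interval"),
            }
    return parser.getint("agent", "port"), meters


port, meters = load_meters(EXAMPLE_CONFIG)
```

With a file like this, the set of collected metrics and their intervals could
be changed per compute node without touching the monitoring server.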

A more detailed internal structure design of the ProxyPollster is to be added here.