ScientificWGMonitoringAndTelemetry

Context
Telemetry and monitoring for use cases specific to research computing.

Monitoring for Research Computing
Discussions within the working group identified the following applications for monitoring infrastructure, along with the solutions that members currently apply to each:


 * Accounting for resource usage. Accounting data is typically aggregated per project or department, and is sometimes used to enforce quotas or for billing. Ceilometer, Gnocchi and CloudKitty gather data that is useful for accounting and billing.  WG members advised against using Ceilometer at scale due to long query times and high load on Keystone.  The combination of Gnocchi and InfluxDB was found to work well at NeCTAR.  Note that the free version of InfluxDB does not support HA clustering.  A new project on GitHub enables direct integration of collectd with Gnocchi.
 * Troubleshooting OpenStack control plane issues and failures. Ceilometer/Gnocchi is less helpful here because it uses the OpenStack RabbitMQ (or equivalent) message bus as the transport for metrics, so it is likely to be affected by any control plane failure involving RabbitMQ.  Monasca depends on Keystone, but is otherwise decoupled from the OpenStack control plane it is monitoring.  Naemon, a fork of Nagios, is used for stack health monitoring at Pittsburgh Supercomputing Center.  The Chameleon project uses Nagios plugins for OpenStack.


 * Monitoring hardware health and failures. Cambridge University use Dell OpenManage, integrated with a site-wide deployment of Icinga, a fork of Nagios.


 * Understanding network performance issues and network failures. Mellanox NEO provides an API for extracting telemetry data, but no WG members are currently using it.  Cambridge University use LibreNMS, a fork of Observium, for SNMP network monitoring.


 * Telemetry services for user applications. The OVIS project from Sandia provides HPC performance telemetry.


 * Correlating performance telemetry with research computing workloads. Answering the question: "Why did my job run slow?".


 * Monitoring system security events. Collecting ssh log messages and distilling them into monitoring events for tracking activity of ssh users.  Monasca's integrated support of logging, telemetry and events enables this.


 * Tracking user activity to identify live/dormant accounts.


 * Integrating with existing research computing monitoring infrastructure where possible. Los Alamos use Zenoss (although not currently for monitoring their private cloud infrastructure).  Indiana University described converting OpenStack "instance exists" events into HPC-style jobs, so that their existing resource accounting software can also report usage for cloud instances.
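
The security-event and account-activity items above both depend on distilling raw ssh log lines into structured events. The following is a minimal sketch of that step only, not Monasca's implementation. It assumes log lines in the common OpenSSH "Accepted/Failed ... for USER from HOST port N" form, which can vary between versions and configurations. The function names and the sample log are hypothetical.

```python
import re
from collections import Counter

# Assumed sshd message shape; adjust the pattern for the local log format.
SSH_EVENT = re.compile(
    r"sshd\[\d+\]: (?P<outcome>Accepted|Failed) (?P<method>\S+) for "
    r"(?:invalid user )?(?P<user>\S+) from (?P<source>\S+) port \d+"
)


def parse_ssh_event(line):
    """Return a small event dict for an sshd login line, or None if no match."""
    match = SSH_EVENT.search(line)
    if match is None:
        return None
    return {
        "user": match.group("user"),
        "source": match.group("source"),
        "method": match.group("method"),
        "success": match.group("outcome") == "Accepted",
    }


def logins_per_user(lines):
    """Count successful logins per user, e.g. to spot dormant accounts."""
    events = (parse_ssh_event(line) for line in lines)
    return Counter(e["user"] for e in events if e and e["success"])


# Hypothetical sample data for illustration only.
SAMPLE_LOG = [
    "Mar  3 10:15:01 login1 sshd[2311]: Accepted publickey for alice from 10.0.0.5 port 50022 ssh2",
    "Mar  3 10:15:09 login1 sshd[2315]: Failed password for invalid user admin from 203.0.113.7 port 41012 ssh2",
    "Mar  3 10:16:44 login1 sshd[2320]: Accepted password for bob from 10.0.0.9 port 50100 ssh2",
    "Mar  3 10:17:02 login1 sshd[2331]: Accepted publickey for alice from 10.0.0.5 port 50031 ssh2",
]
```

Users who never appear in the output of `logins_per_user` over a long enough window are candidates for dormant accounts. Failed events can be forwarded separately as security alerts.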

People
Driving the activity area:


 * Stig Telfer (oneswig)
 * Martial Michel (martial)
 * Blair Bethwaite (b1airo)
 * Pierre Riteau (priteau)

Resources

 * Etherpad gathering WG experiences
 * First discussion on IRC
 * Second discussion on IRC

NIST DMoni Project
NIST are developing an in-house monitoring solution that relies on 'psutil' for local host aggregation (following process filiation, i.e. parent-child relationships) and on Elasticsearch as the central store. In parallel, Ganglia aggregates global per-host resource usage. A single host aggregates the information. The advantage is that there are multiple sources of input (local nodes are time-synchronized), and the tool allows application messages to be recorded alongside the metrics ("tick" information, for example "processed 10,000 steps"). This makes it possible to compare per-process and per-host resource usage on a time scale, and to compare multiple runs at equivalent steps (for distributed programs that match that model).
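
Comparing runs at equivalent steps means re-keying each metric sample by progress rather than by wall-clock time. The sketch below shows one way to do this, assuming tick messages arrive as (timestamp, step) pairs and that progress between ticks is roughly linear. It is not DMoni's code; the function names and sample data are hypothetical.

```python
from bisect import bisect_right


def step_at(ticks, timestamp):
    """Estimate the step count at `timestamp` by linear interpolation.

    `ticks` is a list of (timestamp, step) pairs sorted by timestamp.
    Timestamps outside the recorded range are clamped to the first/last tick.
    """
    times = [t for t, _ in ticks]
    i = bisect_right(times, timestamp)
    if i == 0:
        return ticks[0][1]
    if i == len(ticks):
        return ticks[-1][1]
    (t0, s0), (t1, s1) = ticks[i - 1], ticks[i]
    return s0 + (s1 - s0) * (timestamp - t0) / (t1 - t0)


def index_by_step(ticks, samples):
    """Re-key (timestamp, value) samples by estimated step instead of time."""
    return [(step_at(ticks, t), value) for t, value in samples]


# Two hypothetical runs of the same program; run B takes twice as long.
run_a_ticks = [(0.0, 0), (10.0, 10000), (20.0, 20000)]
run_b_ticks = [(0.0, 0), (20.0, 10000), (40.0, 20000)]
run_a_cpu = [(5.0, 80.0), (15.0, 95.0)]   # (seconds, CPU %)
run_b_cpu = [(10.0, 40.0), (30.0, 50.0)]
```

After `index_by_step`, the samples from both runs sit at steps 5000 and 15000. Their CPU usage can then be compared directly even though the runs took different amounts of time.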

The NIST DMoni (Distributed Monitoring) tool collects both system and process resource usage (e.g. CPU, memory, IO) across a cluster. It consists of several components:
 * Agents: a daemon on each cluster node that collects system or process metrics for that node (currently using the Python library psutil) and pushes the collected data to a "central" database;
 * A manager: telling the agents what application or processes to monitor, e.g. a Hadoop application;
 * Elasticsearch (document-based database): a "central" database storing collected metrics;
 * Aggregator (in the future): aggregating collected metrics and generating more useful metrics;
 * Kibana (visualization tool): visualizing metrics.
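
To make the agent's role concrete, the sketch below shows how a per-node agent might sample one process with psutil and package the result for a document store. It is an illustration under assumptions, not DMoni's actual code. The record layout, the index name `dmoni-metrics` and all function names are hypothetical. The psutil calls used (`Process`, `oneshot`, `name`, `cpu_percent`, `memory_info`, `children`) are part of its public API, and the output layout follows the action/source line pairs of the Elasticsearch bulk API.

```python
import json
import socket
import time


def sample_process(pid):
    """Collect a few metrics for one process using psutil (third-party)."""
    import psutil  # imported here so the rest of the sketch runs without it

    proc = psutil.Process(pid)
    with proc.oneshot():  # batch the underlying system calls
        return {
            "pid": proc.pid,
            "name": proc.name(),
            # The first call with interval=None returns 0.0; later calls
            # report usage since the previous call.
            "cpu_percent": proc.cpu_percent(interval=None),
            "rss_bytes": proc.memory_info().rss,
            "children": [child.pid for child in proc.children(recursive=True)],
        }


def build_document(sample, application, host=None, timestamp=None):
    """Wrap a metrics sample in the record an agent would push centrally."""
    return {
        "host": host or socket.gethostname(),
        "timestamp": timestamp if timestamp is not None else time.time(),
        "application": application,
        "metrics": sample,
    }


def to_bulk_lines(documents, index="dmoni-metrics"):
    """Serialise documents as newline-delimited JSON action/source pairs."""
    lines = []
    for doc in documents:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc, sort_keys=True))
    return "\n".join(lines) + "\n"
```

An agent loop would call `sample_process` for each monitored PID at a fixed interval, wrap each sample with `build_document`, and send the output of `to_bulk_lines` to the central database. Kibana can then chart the stored records by host, application and timestamp.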

DMoni's advantages:


 * Target-oriented and less overhead: instead of collecting metrics for every process, DMoni monitors only the processes of interest (e.g. cluster-wide Hadoop processes) together with system metrics.
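
Target-oriented collection comes down to filtering the process table before sampling. The sketch below assumes the manager hands each agent a list of wildcard patterns to match against command lines. That mechanism, the function name and the sample process table are hypothetical and are not taken from DMoni.

```python
from fnmatch import fnmatchcase


def select_targets(processes, patterns):
    """Return the processes whose command line matches any wildcard pattern.

    `processes` is an iterable of dicts with "pid" and "cmdline" (list of str).
    """
    selected = []
    for proc in processes:
        command = " ".join(proc["cmdline"])
        if any(fnmatchcase(command, pattern) for pattern in patterns):
            selected.append(proc)
    return selected


# Hypothetical process table and manager-supplied patterns.
PROCESS_TABLE = [
    {"pid": 101, "cmdline": ["/usr/sbin/sshd", "-D"]},
    {"pid": 202, "cmdline": ["java", "-Dproc_datanode",
                             "org.apache.hadoop.hdfs.server.datanode.DataNode"]},
    {"pid": 303, "cmdline": ["java", "-Dproc_nodemanager",
                             "org.apache.hadoop.yarn.server.nodemanager.NodeManager"]},
    {"pid": 404, "cmdline": ["python3", "report.py"]},
]
HADOOP_PATTERNS = ["*org.apache.hadoop.*"]
```

With these inputs only PIDs 202 and 303 are selected, so the agent samples two processes rather than four. On a busy node the reduction in sampling cost would be proportionally larger.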