InstrumentationMetricsMonitoring

Launchpad Entry: NovaSpec:nova-instrumentation-metrics-monitoring
Created: 23 Oct 2012
Drafter: Tim Daly Jr, Joshua Harlow, Jeff Budzinski
Drafters Email: [AT yahoo-inc.com], [AT yahoo-inc.com],[AT yahoo-inc.com]

Summary

To effectively operate OpenStack at larger scale, we propose deeper instrumentation and timing of key activities *within* processing daemons at external I/O points and key bits of the processing flow. This will help in monitoring system performance and triaging issues at a deeper level. The proposal is to add a generalized mechanism for measuring processing and I/O events and other key metrics inside the daemons.

Release Note

This section should include a paragraph describing the end-user impact of this change. It is meant to be included in the release notes of the first release in which it is implemented. (Not all of these will actually be included in the release notes, at the release manager's discretion; but writing them is a useful exercise.)

It is mandatory.

Rationale

Our experience has shown there to be value in deeper levels of instrumentation and monitoring in order to aid in tracking scale and availability issues, monitoring intra-service issues, component errors, and for managing component health.

User stories

Examples of what could be done with this instrumentation data:

establish performance baselines and characteristics for acceptance testing
determine scalability characteristics of various system components
tune configuration parameters for installation-specific sizing, e.g. connection pools, greenpools, wsgi backlog, etc.
set alerts on certain types of metrics: connection pool utilization, connection errors, timeouts
perform offline, large-scale analysis of system usage patterns and performance using Hadoop, to support advanced scheduling, prediction, and resource balancing

Measurements

API endpoint

request receive time
request receive error
request receive timeout
request receive bytes
response send time
response send error
response send timeout
response send bytes

WSGI

backlog
waits
request processing time
response processing time
dispatch time

Event loop

function time (native/not native)
idle/blocking time

Database

function time
model query time
model write/update time
session establishment time
connection errors
connection count
ping listener errors

RPC

connection pool used
connection pool free
message reply time
message reply errors
pack context time
unpack context time
multicall wait time
cast time
fanout cast time
cast to server time
fanout cast to server time
notify send time
remote errors
rpc timeout

Eventlet

pool used
pool free
greenpool used
greenpool free
waiter count
timeouts

Requirements/Constraints

The solution should add a minimum of overhead whether activated or not.
The solution should *never* compete with command and control message priorities.
Emission and collection of data should be compatible with existing agents and pluggable where practical.
Aggregation/correlation should be separate from data emission. Different tastes in collection and analysis should be supported.
Data transmission is best effort and ok to be lossy in some scenarios.
It is desirable to have different levels of instrumentation, since some operators do not want to go as deep as others.
It is desirable to be able to aggregate stats on 'dimensions', e.g. region, zone, tenant, etc.
We do not want this data going via RPC since it should never interfere or compete for resources with RPC-driven operations.
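
As a rough illustration of the best-effort, non-RPC emission described in these requirements, here is a minimal sketch that formats a metric in the statsd plain-text datagram format and sends it over a non-blocking UDP socket, dropping the sample on any socket error. The function names, the address, and the convention of encoding dimensions as a dotted name prefix are assumptions made for this example, not part of the proposal.

```python
import socket


def statsd_line(name, value, kind, dimensions=()):
    """Build one statsd datagram payload.

    kind is "ms" (timer), "c" (counter) or "g" (gauge).
    Assumption: dimensions (region, zone, tenant, ...) are encoded
    as a dotted prefix of the metric name.
    """
    prefix = ".".join(dimensions)
    full_name = prefix + "." + name if prefix else name
    return ("%s:%s|%s" % (full_name, value, kind)).encode("ascii")


def make_socket():
    """Create a UDP socket that never blocks the calling daemon."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setblocking(False)
    return sock


def send_best_effort(sock, address, payload):
    """Send one datagram; on any socket error, drop the sample."""
    try:
        sock.sendto(payload, address)
        return True
    except OSError:
        # Lossy by design: metrics must never disturb request handling.
        return False
```

A caller might use it as `send_best_effort(make_socket(), ("127.0.0.1", 8125), statsd_line("rpc.cast.time", 12, "ms", ("region1",)))`, where the host and port are placeholders.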

Design

The current plan is to:

Create a set of decorators to wrap functions for the purpose of measuring execution time, emitting numeric counts, and raw events.
Extend the nova logger to create a distinct log level and log handler to divert metrics to a different data sink.
Use the decorators to create metrics for some subset of nova.
Create some examples of metric aggregation using statsd via datagram and also via batch log analysis using hadoop.

We will flesh this out with more details as we complete our prototype work. Here is a sketch for discussion:

(see attachments for graffle and visio xml versions)

Notes:

Eventlet backdoor: https://github.com/openstack/openstack-common/blob/7695f967/openstack/common/eventlet_backdoor.py
Grizzly Design Summit etherpad @ https://etherpad.openstack.org/grizzly-common-instrumentation

Implementation

Code Changes

add metrics gauges/decorators to nova/common
add METRIC log level, metric format, configurable metric handler to nova/log.py
instrument a couple of key modules to start

UI Changes

None at this time.

Migration

N/A

Test/Demo Plan

This need not be added or completed until the specification is nearing beta.

Unresolved issues

Leveraging ceilometer: we certainly don't want to carry this data via RPC but may want to leverage log agent and collector.
Compatibility with stacktach (see https://github.com/rackspace/stacktach and http://www.sandywalsh.com/2012/09/openstack-nova-internals-pt2-services.html)
Consideration/evolution of https://blueprints.launchpad.net/nova/+spec/nova-instrumentation-v1 and impacted code if it gets approved

Leveraging ceilometer

on the one hand, there are clear similarities between things being measured by ceilometer and monitoring data
BUT, ceilometer was not built for monitoring; it was built for metering and NOT losing critical billing messages
AND, putting a bunch of best effort delivery messages through ceilometer and the RPC fabric does not seem to make sense
is it possible to utilize the ceilometer agents and service, but with a lighter-weight transport?

Leveraging loggers

good news is code is already well-covered with logger objects and we are very likely to want to instrument at many levels: per-request, periodic, high-level, low-level
injecting instrumentation data into the logging stream would be relatively straightforward using a metrics log adapter
filter would be added to select only metrics events
an additional logging handler could be created to take stuff out of the stream and emit it over the net, e.g. using DatagramHandler
TBD: understand performance implications of utilizing log stream
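
The logger-based approach in the notes above could look roughly like the following sketch, built on the standard `logging` module. The numeric value chosen for the METRIC level, the class and function names, and the `name:value` message format are assumptions for illustration. The stock `DatagramHandler` sends pickled log records, so the subclass below overrides `makePickle` to send the formatted text instead.

```python
import logging
import logging.handlers

# Assumption: METRIC sits between INFO (20) and WARNING (30).
METRIC = 25
logging.addLevelName(METRIC, "METRIC")


class MetricsOnlyFilter(logging.Filter):
    """Pass only records logged at the METRIC level."""

    def filter(self, record):
        return record.levelno == METRIC


class PlainTextDatagramHandler(logging.handlers.DatagramHandler):
    """Send the formatted message as UTF-8 text instead of a pickle."""

    def makePickle(self, record):
        return self.format(record).encode("utf-8")


def attach_metric_handler(logger, handler):
    """Attach a handler that sees only METRIC records."""
    handler.addFilter(MetricsOnlyFilter())
    logger.addHandler(handler)
    return handler


def log_metric(logger, name, value):
    """Inject one metric into the normal logging stream."""
    logger.log(METRIC, "%s:%s", name, value)
```

A daemon might wire this up with `attach_metric_handler(log, PlainTextDatagramHandler("127.0.0.1", 8125))`, where the host and port are placeholders. The logger's own level must be at or below METRIC for these records to reach the handler.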

Making it low overhead

instrumented code must be cheap/free when inactive
could possibly be handled via macros or preprocessing. kind of a pain though.
technologies to consider:
- http://pypi.python.org/pypi/MetaPython
- http://code.google.com/p/pypreprocessor
deepest level of instrumentation could be guarded by if __debug__: and optimized away with -O
don't want to flood the network with datagrams
consolidate request metrics into single event
batch send
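
The ideas in this list can be sketched as follows: per-request measurements are consolidated into a single event, events are sent in batches, and the deepest instrumentation sits behind a guard on Python's built-in `__debug__` constant, which the compiler removes when run with -O. All class and function names here are hypothetical, and `send` stands for any callable that delivers a list of events.

```python
class RequestMetrics(object):
    """Accumulate all measurements for one request into one event."""

    def __init__(self, request_id):
        self.request_id = request_id
        self.values = {}

    def add(self, name, value):
        # Repeated measurements under the same name are summed.
        self.values[name] = self.values.get(name, 0) + value

    def as_event(self):
        event = {"request_id": self.request_id}
        event.update(self.values)
        return event


class BatchSender(object):
    """Buffer events and hand them to `send` in batches."""

    def __init__(self, send, max_batch=50):
        self.send = send
        self.max_batch = max_batch
        self.pending = []

    def submit(self, event):
        self.pending.append(event)
        if len(self.pending) >= self.max_batch:
            self.flush()

    def flush(self):
        if self.pending:
            batch, self.pending = self.pending, []
            self.send(batch)


def deep_trace(metrics, name, value):
    """Deepest-level instrumentation, stripped under python -O."""
    if __debug__:
        metrics.add(name, value)
```

A periodic task would also need to call `flush()` so that a partially filled batch does not wait indefinitely.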

How does this fit with stacktach

Stacktach starts with dequeuing from AMQP so that doesn’t fit with the desire to not put this stuff over queue-based RPC
BUT, there is clear overlap here since stacktach seems to be designed to collect timing for things of interest. Perhaps the answer is that for measuring RPC flow, stacktach and instrumentation are not mutually exclusive?
notes on stacktach:
Tach - monkey patching library
Used monkey patching to avoid ugly-ifying the code
Monkey patching the RPC code? via config of nova.compute.queue_receive, method-by-method
Only patch calls
Decorators to catch/emit on exception
configurable notifier (e.g. statsd)
Essentially wraps functions and sends UDP to statsd upon each RPC call
has another set of stacktach workers that listen to the queue and make REST calls up to StackTach for insertion into the db (this is the v1 implementation). multitenant for devs, but has perf issues.
v2 writes to the db directly but is having trouble with perf on this one
django app gives you a view of recent activity (notifications)
also has cli to interrogate the REST-based i/f to stacktach
suitable for production? seems to be used in rackspace prod envs
traceability by uuid (nice) and perhaps request id
can also get metrics: count, min, max, avg for the instrumented events (e.g. compute.instance.shutdown, compute.instance.delete, compute.instance.reboot, etc.)
looks at request start/end request id pairs to compute times
statsd choice: superfast, udp, no black holes
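
The notes above mention pairing start/end events by request id to compute times, and reporting count, min, max and avg per event. The following is a minimal sketch of that kind of aggregation. It is not StackTach's code: the event tuple layout and the function names are assumptions made for this example.

```python
from collections import defaultdict


def pair_durations(events):
    """Pair start/end events and return durations per event name.

    events: iterable of (request_id, event_name, phase, timestamp),
    where phase is "start" or "end". Unmatched events are ignored.
    """
    starts = {}
    durations = defaultdict(list)
    for request_id, name, phase, timestamp in events:
        key = (request_id, name)
        if phase == "start":
            starts[key] = timestamp
        elif phase == "end" and key in starts:
            durations[name].append(timestamp - starts.pop(key))
    return durations


def summarize(durations):
    """Compute count, min, max and avg for each event name."""
    return {
        name: {
            "count": len(values),
            "min": min(values),
            "max": max(values),
            "avg": sum(values) / len(values),
        }
        for name, values in durations.items()
    }
```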

Meeting Logs

InstrumentationMetricsMonitoring10292012 - IRC Meeting Log 10/29/2012

BoF agenda and discussion

Use this section to take notes during the BoF; if you keep it in the approved spec, use it for summarising what was discussed and note any options that were rejected.