- Launchpad Entry: NovaSpec:audit-logging
- Created: 2010-11-04
- Contributors: termie, toddwiley
The current state of logging within Nova is largely targeted at debugging and relies heavily on a human reading and interpreting the output, which limits the effectiveness of any automated analysis.
This proposal suggests a more formalized and verbose system of logging targeted at machine-readability, with the following sub-goals:
- performance profiling
- error tracking
- system usage statistics
- security auditing
Also suggested is the creation / usage of log analysis tools to provide useful information based on this new data, with the notable mention of the recently open-sourced Sawzall toolset as a good candidate.
As of this release, logging has become much more verbose to provide information for statistical analysis of an active cluster. There are additionally a variety of new flags available to disable the new types of logging, and a couple of helpers to assist in getting more human-readable output for debugging purposes.
The new (self-explanatory) flags are:
The new helpers are:
- bin/nova-log-format, to provide custom formatted log lines, defaults to summary output
- bin/nova-log-expand, to expand any complex objects in log lines, usually used after you have grepped the log for the lines you care about.
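As a rough sketch only (the actual helper may differ), the core of an expander like bin/nova-log-expand could be a few lines of Python: pretty-print any JSON object found on a line, and pass non-JSON lines through untouched. The function name here is an assumption for illustration.

```python
import json
import pprint

# Hypothetical core of a log-expand helper: pretty-print any JSON
# payload on each input line, passing other lines through unchanged.
def expand_line(line):
    try:
        record = json.loads(line)
    except ValueError:
        return line.rstrip()  # not JSON; pass through untouched
    return pprint.pformat(record)

print(expand_line('{"level": "ERROR", "context": {"user": "alice"}}'))
```

In practice this would sit behind a stdin/stdout loop so it can be chained after grep, as described above.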
Additionally, the following log analysis scripts have been provided to crunch stats on your cluster, for usage information and output view the scripts themselves:
To accurately gauge performance and usage, and to have an audit trail of actions taken by any given user, more verbose logging needs to begin taking place.
The current system relies more on intuition than hard data, and for a production hosting environment intuition is not enough: we need to know our actual scaling characteristics and be able to trace security breaches.
Logging is not the whole story, but it is an integral part of managing after-the-fact information and providing statistical data over time for sophisticated analysis.
- Load spike
- Scheduling bullies
- 7-day actives
- Largest users
- Hourly activity graph
- Downtime Post-mortem
- Unauthorized instance
- Crashing instances
- Exceptions per second before and after update
- Rolling upgrade distribution lag
- Log rotation and/or centralized storage/access is handled outside of the scope of this document.
- Logging overhead does not create a significant drop in performance.
At the highest level this is about adding complex data to logs, at a lower one it is a formalization of the process used by apps to log. Somewhere in between it is about wrapping methods with profiling code.
For the sake of simplicity and compatibility, JSON is a decent contender. Some anecdotal evidence ( http://blog.juma.me.uk/2009/03/25/json-serializationdeserialization-faster-than-protocol-buffers/ ) suggests JSON is on par with protobufs for serialization speed, and while I wouldn't trust those numbers too far, it seems reasonable to say that we can probably dump to JSON roughly as quickly as to any other format for complex data.
Common logging keys would be:
- Context (request context)
- User who initiated action
- Action user attempted to initiate
- Current service
- Current rpc (maybe this is part of context)
- Freeform Message
Taking the time to sort the keys alphabetically might make it easier for people to write quick, hacky tools to get the data they want for some ad-hoc query idea.
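To make the key list concrete, here is an illustrative record using the common keys above, dumped with sorted keys; the field names and values are assumptions, not a finalized schema.

```python
import json
import time

# Illustrative log record using the proposed common keys; the exact
# field names and nesting are assumptions, not a finalized schema.
record = {
    "context": {"request_id": "req-123", "user": "alice"},
    "action": "run_instance",
    "service": "nova-compute",
    "message": "instance launched",
    "timestamp": time.time(),
}

# sort_keys gives every line the same key order, which is what makes
# quick grep/cut-style ad-hoc tooling easier.
line = json.dumps(record, sort_keys=True)
print(line)
```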
The Logging Process
For most "human-generated" statements (log lines that are just messages from developers to themselves, written in day-to-day work on the code), the only real difference will be including 'context' in the statement, and even that could possibly be inferred by stack inspection.
A custom wrapper around the default logging utility that is service-aware and outputs in the desired log format, plus a line of boilerplate when importing it, should be enough for day-to-day use.
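A minimal sketch of such a wrapper, assuming the JSON line format discussed above; the class and function names, and the field names, are illustrative rather than the actual Nova implementation.

```python
import json
import logging

# Sketch of a service-aware wrapper around the stdlib logging module;
# field names are assumptions, not a finalized schema.
class JSONFormatter(logging.Formatter):
    def format(self, record):
        return json.dumps({
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "service": record.name,
            "context": getattr(record, "context", None),
            "message": record.getMessage(),
        }, sort_keys=True)

def get_logger(service):
    # The one line of boilerplate a module would run on import.
    logger = logging.getLogger(service)
    handler = logging.StreamHandler()
    handler.setFormatter(JSONFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger

log = get_logger("nova-compute")
log.info("instance launched", extra={"context": {"user": "alice"}})
```

Day-to-day use would then be a single `log = get_logger(__name__)` per module, with context passed via `extra`.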
The machine-generated logging would use the same system but would rarely need to be touched; it would wrap high-level entry points into the system to profile:
- overall call performance: how long from initiating an rpc until it returns
- db performance: how long each db layer call takes to return
- uncaught exceptions at the highest level
- requests handled
- number of active X where X is a VM, a network config, or the like
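One way the timing wrappers above could look is a decorator that emits a machine-readable record whenever the wrapped entry point returns; the record shape and the emit hook here are illustrative assumptions.

```python
import functools
import json
import time

# Hypothetical timing wrapper for high-level entry points; the record
# shape and emit hook are illustrative only.
def timed(name, emit=print):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.time()
            try:
                return func(*args, **kwargs)
            finally:
                emit(json.dumps({
                    "event": "timing",
                    "name": name,
                    "duration": time.time() - start,
                }, sort_keys=True))
        return wrapper
    return decorator

@timed("db.instance_get")
def instance_get(instance_id):
    # Stand-in for a real db layer call.
    return {"id": instance_id}

print(instance_get(42))
```

The `finally` clause means timing records are emitted even when the wrapped call raises, which also covers the uncaught-exception case.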
Implementing the log output is straightforward and is mostly a find-and-replace mission: make sure each module includes the boilerplate for the custom logging wrapper, and that each call to a logging statement includes the additional context information.
At this time it appears that the lp:~anso/nova/deploy branch has much of this part already handled.
From there the logging output needs to be configured to use the desired logging format, and timing wrappers need to be placed at the appropriate entry points.
The two helpers are easy enough to write: the format string input for nova-log-format can be simple Python format strings ( %(timestamp)s %(level)s %(message)s ), and the expander will basically be pretty-printing.
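The core of the formatter is correspondingly small, assuming one JSON object per log line; this is a sketch, not the shipped helper.

```python
import json

# Core of a hypothetical log-format helper: apply a user-supplied
# Python format string to each parsed JSON log record.
DEFAULT_FORMAT = "%(timestamp)s %(level)s %(message)s"

def format_line(line, fmt=DEFAULT_FORMAT):
    return fmt % json.loads(line)

raw = json.dumps({"timestamp": "2010-11-04T12:00:00",
                  "level": "INFO",
                  "message": "instance launched"})
print(format_line(raw))
# 2010-11-04T12:00:00 INFO instance launched
```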
This part is trickier and will probably wait until the logging itself is implemented, but again the suggestion of Sawzall may be worth the learning curve.
Testing and demoing will basically consist of running the tools against dumps of logs (possibly logs from testing, though real logs would be preferable) and documenting some example output.
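An ad-hoc analysis over such a dump can be very short once the lines are structured JSON; as an illustration (the record fields are assumptions), here is the kind of script the "exceptions per second" query would need.

```python
import collections
import json

# Illustrative ad-hoc analysis over a log dump: ERROR lines bucketed
# per second, as needed for an exceptions-per-second comparison.
def errors_per_second(lines):
    counts = collections.Counter()
    for line in lines:
        record = json.loads(line)
        if record.get("level") == "ERROR":
            counts[int(record["timestamp"])] += 1
    return counts

# Tiny synthetic dump standing in for real cluster logs.
dump = [
    json.dumps({"timestamp": 100.1, "level": "ERROR", "message": "boom"}),
    json.dumps({"timestamp": 100.7, "level": "ERROR", "message": "boom"}),
    json.dumps({"timestamp": 101.2, "level": "INFO", "message": "ok"}),
]
print(errors_per_second(dump))
```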
- Should we magically infer "context" from the stack?
BoF agenda and discussion