- Launchpad Entry: NovaSpec:audit-logging
- Created: 2010-11-04
- Contributors: termie, toddwiley
The current state of logging within Nova is largely targeted at debugging and relies heavily on a human reading and interpreting the output, which limits the effectiveness of any automated analysis.
This proposal suggests a more formalized and verbose system of logging targeted at machine-readability, with the following sub-goals:
- performance profiling
- error tracking
- system usage statistics
- security auditing
Also suggested is the creation / usage of log analysis tools to provide useful information based on this new data, with the notable mention of the recently open-sourced Sawzall toolset as a good candidate.
As of this release, logging has become much more verbose to provide information for statistical analysis of an active cluster. There are additionally a variety of new flags available to disable the new types of logging, and a couple of helpers to assist in getting more human-readable output for debugging purposes.
The new (self-explanatory) flags are:
The new helpers are:
- bin/nova-log-format, to provide custom formatted log lines, defaults to summary output
- bin/nova-log-expand, to expand any complex objects in log lines, usually used after you have grepped the log for the lines you care about.
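As a rough sketch only (the actual helper may differ), the core of an expander like bin/nova-log-expand could be a few lines of Python: pretty-print any JSON object found on a line, and pass non-JSON lines through untouched. The function name here is an assumption for illustration.

```python
import json
import pprint

# Hypothetical core of a log-expand helper: pretty-print any JSON
# payload on each input line, passing other lines through unchanged.
def expand_line(line):
    try:
        record = json.loads(line)
    except ValueError:
        return line.rstrip()  # not JSON; pass through untouched
    return pprint.pformat(record)

print(expand_line('{"level": "ERROR", "context": {"user": "alice"}}'))
```

In practice this would sit behind a stdin/stdout loop so it can be chained after grep, as described above.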
Additionally, the following log analysis scripts have been provided to crunch stats on your cluster, for usage information and output view the scripts themselves:
To accurately gauge performance and usage, and to have an audit trail of actions taken by any given user, more verbose logging needs to begin taking place.
The current system relies more on intuition than hard data, and for a production hosting environment intuition is not enough: we need to know our actual scaling characteristics and be able to trace security breaches.
Logging is not the whole story, but it is an integral part of managing after-the-fact information and providing statistical data over time for sophisticated analysis.
- Load spike
- Scheduling bullies
- 7-day actives
- Largest users
- Hourly activity graph
- Downtime Post-mortem
- Unauthorized instance
- Crashing instances
- Exceptions per second before and after update
- Rolling upgrade distribution lag
- Log rotation and/or centralized storage/access is handled outside of the scope of this document.
- Logging overhead does not create a significant drop in performance.
At the highest level this is about adding complex data to logs, at a lower one it is a formalization of the process used by apps to log. Somewhere in between it is about wrapping methods with profiling code.
For the sake of simplicity and compatibility, JSON is a decent contender. Some anecdotal evidence ( http://blog.juma.me.uk/2009/03/25/json-serializationdeserialization-faster-than-protocol-buffers/ ) suggests JSON is on par with protobufs for serialization speed, and while I wouldn't trust those numbers too far, it seems reasonable to say that we can probably dump to JSON roughly as quickly as to any other format for complex data.
Common logging keys would be:
- Context (request context)
- User who initiated action
- Action user attempted to initiate
- Current service
- Current rpc (maybe this is part of context)
- Freeform Message
Taking the time to sort the keys alphabetically might make it easier for people to write quick, hacky tools to get the data they want for some ad-hoc query idea.
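To make the key list concrete, here is an illustrative record using the common keys above, dumped with sorted keys; the field names and values are assumptions, not a finalized schema.

```python
import json
import time

# Illustrative log record using the proposed common keys; the exact
# field names and nesting are assumptions, not a finalized schema.
record = {
    "context": {"request_id": "req-123", "user": "alice"},
    "action": "run_instance",
    "service": "nova-compute",
    "message": "instance launched",
    "timestamp": time.time(),
}

# sort_keys gives every line the same key order, which is what makes
# quick grep/cut-style ad-hoc tooling easier.
line = json.dumps(record, sort_keys=True)
print(line)
```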
The Logging Process
For most "human-generated" statements (log lines that are just messages from developers to themselves, written in day-to-day work on the code), the only real difference will be including 'context' in the statement, and even that could possibly be inferred by stack inspection.
A custom wrapper around the default logging utility that is service-aware and outputs in the desired log format, plus a line of boilerplate when importing it, should be enough for day-to-day use.
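A minimal sketch of such a wrapper, assuming the JSON line format discussed above; the class and function names, and the field names, are illustrative rather than the actual Nova implementation.

```python
import json
import logging

# Sketch of a service-aware wrapper around the stdlib logging module;
# field names are assumptions, not a finalized schema.
class JSONFormatter(logging.Formatter):
    def format(self, record):
        return json.dumps({
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "service": record.name,
            "context": getattr(record, "context", None),
            "message": record.getMessage(),
        }, sort_keys=True)

def get_logger(service):
    # The one line of boilerplate a module would run on import.
    logger = logging.getLogger(service)
    handler = logging.StreamHandler()
    handler.setFormatter(JSONFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger

log = get_logger("nova-compute")
log.info("instance launched", extra={"context": {"user": "alice"}})
```

Day-to-day use would then be a single `log = get_logger(__name__)` per module, with context passed via `extra`.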
The machine-generated logging would use the same system but would rarely need to be touched; it would wrap high-level entry points into the system to profile:
- overall call performance: how long from initiating an rpc until it returns
- db performance: how long each db layer call takes to return
- uncaught exceptions at the highest level
- requests handled
- number of active X where X is a VM, a network config, or the like
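One way the timing wrappers above could look is a decorator that emits a machine-readable record whenever the wrapped entry point returns; the record shape and the emit hook here are illustrative assumptions.

```python
import functools
import json
import time

# Hypothetical timing wrapper for high-level entry points; the record
# shape and emit hook are illustrative only.
def timed(name, emit=print):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.time()
            try:
                return func(*args, **kwargs)
            finally:
                emit(json.dumps({
                    "event": "timing",
                    "name": name,
                    "duration": time.time() - start,
                }, sort_keys=True))
        return wrapper
    return decorator

@timed("db.instance_get")
def instance_get(instance_id):
    # Stand-in for a real db layer call.
    return {"id": instance_id}

print(instance_get(42))
```

The `finally` clause means timing records are emitted even when the wrapped call raises, which also covers the uncaught-exception case.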
Implementing the log output is straightforward and is mostly a find-and-replace mission: make sure each module includes the boilerplate for the custom logging wrapper, and that each call to a logging statement includes the additional context information.
At this time it appears that the lp:~anso/nova/deploy branch has much of this part already handled.
From there the logging output needs to be configured to use the desired logging format, and timing wrappers need to be placed at the appropriate entry points.
The two helpers are easy enough to write: the format string input for nova-log-format can be simple Python format strings ( %(timestamp)s %(level)s %(message)s ), and the expander will basically be pretty-printing.
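The core of the formatter is correspondingly small, assuming one JSON object per log line; this is a sketch, not the shipped helper.

```python
import json

# Core of a hypothetical log-format helper: apply a user-supplied
# Python format string to each parsed JSON log record.
DEFAULT_FORMAT = "%(timestamp)s %(level)s %(message)s"

def format_line(line, fmt=DEFAULT_FORMAT):
    return fmt % json.loads(line)

raw = json.dumps({"timestamp": "2010-11-04T12:00:00",
                  "level": "INFO",
                  "message": "instance launched"})
print(format_line(raw))
# 2010-11-04T12:00:00 INFO instance launched
```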
This part is trickier and will probably wait until the logging itself is implemented, but again the suggestion of Sawzall may be worth the learning curve.
Testing and demoing will basically consist of running the tools against dumps of logs (possibly logs from testing, though real logs would be preferable) and documenting some example output.
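An ad-hoc analysis over such a dump can be very short once the lines are structured JSON; as an illustration (the record fields are assumptions), here is the kind of script the "exceptions per second" query would need.

```python
import collections
import json

# Illustrative ad-hoc analysis over a log dump: ERROR lines bucketed
# per second, as needed for an exceptions-per-second comparison.
def errors_per_second(lines):
    counts = collections.Counter()
    for line in lines:
        record = json.loads(line)
        if record.get("level") == "ERROR":
            counts[int(record["timestamp"])] += 1
    return counts

# Tiny synthetic dump standing in for real cluster logs.
dump = [
    json.dumps({"timestamp": 100.1, "level": "ERROR", "message": "boom"}),
    json.dumps({"timestamp": 100.7, "level": "ERROR", "message": "boom"}),
    json.dumps({"timestamp": 101.2, "level": "INFO", "message": "ok"}),
]
print(errors_per_second(dump))
```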
- Should we magically infer "context" from the stack?
BoF agenda and discussion