Revision as of 17:59, 18 February 2013

Guru Meditation Reports

When things go wrong in (production) deployments of OpenStack, collecting debug data is a key first step in triaging and ultimately resolving the problem. Nova makes extensive use of logging, which produces a vast amount of data. This does not, however, give an admin an accurate view of the current live state of the system: for example, which threads are running, which config parameters are in effect, and more. The eventlet backdoor facility provides an interactive shell interface for any eventlet-based process, allowing an admin to telnet to a pre-defined port and execute a variety of commands. This can be used to collect the necessary state information, but it has a number of limitations.

Every service running on a host needs to have the backdoor running on a different TCP port, and the admin has to remember which process is listening where. Get this wrong and very bad things can happen.
The backdoor needs to have been enabled when the process was started. If this was not done before the problem arose, the admin is out of luck, because restarting the service to enable the backdoor will lose the very state that was needed.
The backdoor shell is too powerful. Presenting an interactive Python shell places too much burden on the admin to find the right data without causing problems.

Error Report Framework

To address the issues described above, this page outlines the design of a general-purpose error report generation framework, known as the "guru meditation report" (cf. http://en.wikipedia.org/wiki/Guru_Meditation). The framework is built from two kinds of classes:

Models: These classes define structured data for a variety of interesting pieces of state, for example stack traces, threads, config parameters, and package version info. They can be serialized to XML, JSON, or a plain text representation.
Generators: These classes populate the model classes with the current runtime state of the system.
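
The split between models and generators can be illustrated with a minimal Python sketch. It is not the framework's actual code: the attribute names, the to_json() / to_text() methods, and the thread_generator() function are assumptions made for illustration, and XML output is omitted for brevity.

```python
import json
import sys
import traceback


class ThreadModel:
    """Structured data about one thread (hypothetical layout)."""

    def __init__(self, thread_id, stack_lines):
        self.thread_id = thread_id
        self.stack_lines = stack_lines

    def to_dict(self):
        return {"thread_id": self.thread_id, "stack": self.stack_lines}

    def to_json(self):
        # Machine-readable form, e.g. for automated analysis tools.
        return json.dumps(self.to_dict())

    def to_text(self):
        # Human-readable form, e.g. for dumping to a terminal or log.
        return "\n".join(["Thread %s" % self.thread_id] + self.stack_lines)


def thread_generator():
    """Populate one ThreadModel per live thread from interpreter state."""
    models = []
    for thread_id, frame in sys._current_frames().items():
        lines = [line.rstrip() for line in traceback.format_stack(frame)]
        models.append(ThreadModel(thread_id, lines))
    return models
```

The model only holds and serializes data, while the generator is the only part that touches the running process, so the same model can be rendered in several formats without collecting the state again.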

A number of standard models and generators will be available for all OpenStack services:

StackTraceModel: a base class for any model which includes a stack trace
ThreadModel: a class for information about a thread
ExceptionModel: a class for information about a caught exception
ConfigModel: a class for information about configuration file settings
PackageModel: a class for vendor/product/version/package information

Each OpenStack project will be able to register further generator classes to provide custom, project-specific data.
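
One way such registration could look is sketched below. The TextReport class, its register_section() and run() methods, and the section layout are hypothetical names chosen for this example, and generators are simplified to plain callables returning a dict rather than full model classes.

```python
class TextReport:
    """Collects named generators and renders their output as plain text."""

    def __init__(self):
        # List of (title, generator) pairs, kept in registration order.
        self._sections = []

    def register_section(self, title, generator):
        """Add a generator; it is called only when the report is run."""
        if not callable(generator):
            raise TypeError("generator must be callable")
        self._sections.append((title, generator))

    def run(self):
        """Call every generator now and return the combined report text."""
        lines = []
        for title, generator in self._sections:
            lines.append("==== %s ====" % title)
            for key, value in sorted(generator().items()):
                lines.append("%s = %s" % (key, value))
            lines.append("")
        return "\n".join(lines)
```

A project would register its own sections once at service start-up, for example report.register_section("Package", package_generator), where package_generator is a project-supplied function. Because generators run only inside run(), the report always reflects the state at the moment it is requested.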

Integration with apps

Every long-running service process should install a signal handler which triggers the guru meditation framework upon receipt of SIGUSR1. The process then dumps a complete report of its current state to stderr.
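
A minimal sketch of such a handler, using only the Python standard library, is shown here. The function names format_report() and install_report_handler() are assumptions for illustration, the report is reduced to per-thread stack traces, and signal.SIGUSR1 is only available on Unix-like systems.

```python
import signal
import sys
import threading
import traceback


def format_report():
    """Return a text report with the current stack of every thread."""
    names = {t.ident: t.name for t in threading.enumerate()}
    parts = []
    for thread_id, frame in sys._current_frames().items():
        name = names.get(thread_id, "unknown")
        parts.append("Thread %s (%s)\n" % (thread_id, name))
        parts.append("".join(traceback.format_stack(frame)))
        parts.append("\n")
    return "".join(parts)


def install_report_handler(signum=signal.SIGUSR1):
    """Dump a report to stderr whenever signum is received."""

    def handler(received_signum, frame):
        sys.stderr.write(format_report())
        sys.stderr.flush()

    # signal.signal() must be called from the main thread.
    signal.signal(signum, handler)
```

With the handler installed, an admin can request a report from a running service with kill -USR1 <pid>, without restarting it and without any port having been opened in advance.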

For processes which handle RPC requests, it may also be desirable to install a hook in the RPC request dispatcher that saves a guru meditation report whenever the processing of a request results in an uncaught exception. It could save these reports to a well-known directory (/var/log/openstack/<project>/<service>/) for later analysis by the sysadmin or automated bug analysis tools.
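
The dispatcher hook could take the form of a decorator around each request handler, as in the sketch below. This is an illustration under stated assumptions, not an existing OpenStack API: save_report_on_error() and the report file naming are hypothetical, and the report contains only the exception traceback.

```python
import functools
import os
import time
import traceback


def save_report_on_error(report_dir):
    """Wrap a request handler so uncaught exceptions leave a report file."""

    def decorator(handler):
        @functools.wraps(handler)
        def wrapper(*args, **kwargs):
            try:
                return handler(*args, **kwargs)
            except Exception:
                os.makedirs(report_dir, exist_ok=True)
                # Hypothetical naming scheme: process id plus a timestamp.
                name = "report-%d-%d.txt" % (os.getpid(), time.time_ns())
                with open(os.path.join(report_dir, name), "w") as f:
                    f.write("Handler: %s\n" % handler.__name__)
                    f.write(traceback.format_exc())
                # Re-raise so normal RPC error handling still happens.
                raise

        return wrapper

    return decorator
```

The exception is re-raised after the report is written, so the hook adds diagnostics without changing how the caller sees the failure.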

Difference between revisions of "GuruMeditationReport"
