Libvirt can configure a watchdog device for KVM / QEMU guests. The guest OS can use this device to automatically trigger some action when the guest OS hangs or crashes. Libvirt / KVM support a variety of actions:
- 'reset' — default, forcefully reset the guest
- 'shutdown' — gracefully shutdown the guest
- 'poweroff' — forcefully power off the guest
- 'pause' — pause the guest
- 'none' — do nothing
- 'dump' — automatically dump core of the guest
The 'shutdown' action is not recommended: if the watchdog has triggered, it is exceedingly unlikely that the guest will actually be able to do a graceful shutdown. 'dump' is probably not relevant for OpenStack, since cloud admins won't care to analyse core dumps of users' OSes, and users have no way to access core dumps. The 'none' action is only useful if there is also some way to notify the user that a watchdog has triggered (e.g. an email alert).
Thus for OpenStack the actions that make sense are probably 'reset', 'poweroff', 'pause' and 'none'.
KVM has a choice of two watchdog devices, but in reality only the PCI i6300esb device makes sense, since the alternative is a legacy ISA bus device.
It is thought that initially the instance flavours should define whether a watchdog device is provided for a guest or not. A new attribute against the flavour object is imagined, of the form
watchdog=<setting>
where <setting> is one of:
- disabled - no watchdog device
- poweroff - watchdog device + power off when triggering
- reset - watchdog device + reset when triggering
- pause - watchdog device + pause when triggering
- none - watchdog device + no action when triggering
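As a rough illustration of how the settings above might translate into guest configuration, the following sketch maps a setting to a libvirt <watchdog> element using the i6300esb model. The helper name watchdog_device_xml and its signature are hypothetical and not part of any existing driver; only the element and its model/action attributes come from libvirt's domain XML format.

```python
import xml.etree.ElementTree as ET

# Settings proposed above; "disabled" means no watchdog device at all.
VALID_SETTINGS = {"disabled", "poweroff", "reset", "pause", "none"}


def watchdog_device_xml(setting, model="i6300esb"):
    """Return a libvirt <watchdog> element as a string, or None if disabled.

    Hypothetical helper for illustration only.
    """
    if setting not in VALID_SETTINGS:
        raise ValueError("unknown watchdog setting: %r" % (setting,))
    if setting == "disabled":
        return None
    # Every other setting is used directly as the libvirt action name.
    elem = ET.Element("watchdog", {"model": model, "action": setting})
    return ET.tostring(elem, encoding="unicode")


if __name__ == "__main__":
    for s in ("disabled", "reset", "none"):
        print(s, "->", watchdog_device_xml(s))
```

Rejecting unknown values up front keeps a typo in a flavour from silently producing a guest with an unintended action.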
Although flavours would define the default watchdog behaviour, it may be desirable to allow this to be overridden on a per-image basis using a property in glance. For simplicity this would take the same name and settings as the flavour attribute. For example:
    # glance image-update \
        --property watchdog=poweroff \
        f16-x86_64-openstack-sda
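The override rule described above could be sketched as follows. This is only an illustration: the function name, the use of plain dictionaries for flavour attributes and image properties, and the fallback to 'disabled' when neither source sets a value are all assumptions, not something this blueprint specifies.

```python
VALID_SETTINGS = {"disabled", "poweroff", "reset", "pause", "none"}


def effective_watchdog_setting(flavor_attrs, image_props):
    """Pick the watchdog setting for a guest.

    The image property, if present, overrides the flavour attribute.
    Falling back to "disabled" is an assumption made for this sketch.
    """
    if "watchdog" in image_props:
        setting = image_props["watchdog"]
    else:
        setting = flavor_attrs.get("watchdog", "disabled")
    if setting not in VALID_SETTINGS:
        raise ValueError("unknown watchdog setting: %r" % (setting,))
    return setting


if __name__ == "__main__":
    flavor = {"watchdog": "reset"}
    image = {"watchdog": "poweroff"}
    print(effective_watchdog_setting(flavor, image))
    print(effective_watchdog_setting(flavor, {}))
```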
In addition to having an action configured for the guest, it is desirable to have a way to notify the owner of the instance that the watchdog has fired. Libvirt emits an event whenever a watchdog device triggers. The compute driver can listen for these events and feed them back to the compute manager, which could then take some action to get an alert back to the user. How this would be done is outside the scope of this blueprint: it needs some kind of generic mechanism for getting notifications back to the end user. Once that exists, the watchdog can be made to use it.
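For reference, listening for the libvirt watchdog event with the libvirt-python bindings might look roughly like the sketch below. The function names, the notify callable and the alert text are hypothetical. The integer values in WATCHDOG_ACTIONS are assumed to match libvirt's virDomainEventWatchdogAction enum, and they are hard-coded only so that the formatting helper can be used without libvirt installed.

```python
# Assumed to mirror libvirt's virDomainEventWatchdogAction enum values.
WATCHDOG_ACTIONS = {
    0: "none",
    1: "pause",
    2: "reset",
    3: "poweroff",
    4: "shutdown",
    5: "debug",
}


def format_watchdog_alert(domain_name, action):
    """Build a human-readable message for a watchdog event."""
    name = WATCHDOG_ACTIONS.get(action, "unknown(%d)" % action)
    return "watchdog fired for %s (action: %s)" % (domain_name, name)


def listen_for_watchdog_events(uri="qemu:///system", notify=print):
    """Block forever, passing an alert to notify() for each watchdog event."""
    import libvirt  # imported lazily; requires the libvirt-python bindings

    # An event loop implementation must be registered before connecting.
    libvirt.virEventRegisterDefaultImpl()
    conn = libvirt.openReadOnly(uri)

    def on_watchdog(conn, dom, action, opaque):
        notify(format_watchdog_alert(dom.name(), action))

    # Passing None as the domain subscribes to events from all domains.
    conn.domainEventRegisterAny(
        None, libvirt.VIR_DOMAIN_EVENT_ID_WATCHDOG, on_watchdog, None)

    while True:
        libvirt.virEventRunDefaultImpl()
```

In a real deployment the notify callable would be whatever generic user-notification mechanism eventually exists; here it defaults to print purely as a placeholder.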