High Availability in Heat
This document describes possible approaches to implementing High Availability in Heat.
High Availability Overview
High Availability entails monitoring services on the deployed instances and making sure that the services stay running.
There are three layers that we are interested in:
1. Services running inside an instance
2. Individual instances (virtual machines)
3. A logical grouping of instances (stack)
Heat will monitor all three levels. If there is a problem, it will attempt to resolve it on that level. If that fails, the issue will be escalated to a higher level.
For instance, if Heat detects a failure in the database service, it will restart that service. Should the problem persist even after a few restarts, it will restart the instance hosting the database. If even that doesn't help, it will restart the entire stack.
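The escalation described above can be sketched as a small state machine. This is only an illustration, not Heat code: the `Level` and `Escalator` names and the `max_attempts` threshold are assumptions made for the example.

```python
from enum import IntEnum


class Level(IntEnum):
    """The three layers, ordered from narrowest to widest scope."""
    SERVICE = 1
    INSTANCE = 2
    STACK = 3


class Escalator:
    """Tracks failed recovery attempts and picks the layer to restart.

    max_attempts is a hypothetical tunable, not an existing Heat setting.
    """

    def __init__(self, max_attempts=3):
        self.max_attempts = max_attempts
        self.level = Level.SERVICE
        self.attempts = 0

    def next_action(self):
        """Return the layer to restart for the failure just observed."""
        if self.attempts >= self.max_attempts and self.level < Level.STACK:
            # Recovery at this layer keeps failing: move one layer up.
            self.level = Level(self.level + 1)
            self.attempts = 0
        self.attempts += 1
        return self.level

    def reset(self):
        """Call once the service is healthy again."""
        self.level = Level.SERVICE
        self.attempts = 0
```

Calling `next_action()` on every detected failure yields service restarts first, then an instance restart, then a stack restart. Calling `reset()` after a successful recovery returns to the service layer.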
Approach 1: Metadata Server + `cfn-hup`
The basic idea is to combine the CloudFormation instance metadata with `cfn-hup` to create a communication layer between the instance and Heat.
`cfn-hup` is a script/daemon that runs inside the instance. It monitors the instance metadata and executes hooks when the metadata changes.
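As a rough illustration of that behaviour (not the actual `cfn-hup` implementation), a watcher can compare a digest of the metadata between polls and run hooks when it differs. The `MetadataWatcher` class and its `fetch` callable are hypothetical names for this sketch. How the metadata is actually retrieved is left out.

```python
import hashlib
import json


class MetadataWatcher:
    """Runs registered hooks whenever the fetched metadata changes."""

    def __init__(self, fetch):
        # fetch is any callable returning the metadata as a dict;
        # how it reaches the metadata server is outside this sketch.
        self.fetch = fetch
        self.hooks = []
        self._last_digest = None

    def add_hook(self, hook):
        self.hooks.append(hook)

    @staticmethod
    def _digest(metadata):
        # Sorting keys makes the digest independent of dict ordering.
        raw = json.dumps(metadata, sort_keys=True).encode("utf-8")
        return hashlib.sha256(raw).hexdigest()

    def poll_once(self):
        """Fetch metadata and fire hooks on change. Returns True on change."""
        metadata = self.fetch()
        digest = self._digest(metadata)
        if self._last_digest is None:
            # The first poll only records a baseline.
            self._last_digest = digest
            return False
        if digest == self._last_digest:
            return False
        self._last_digest = digest
        for hook in self.hooks:
            hook(metadata)
        return True
```

A daemon would call `poll_once()` on a timer. Hashing the serialized metadata avoids keeping a full copy of the previous version around.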
Currently, `cfn-hup` provides hooks for three triggers: `post.add`, `post.update`, and `post.remove`.
We would extend it to monitor the services specified in the `AWS::CloudFormation::Init.services` metadata and add a custom trigger (`service.fail` or something similar).
Heat's rescue/notification script would then hook into that trigger and handle the failure.
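A minimal sketch of the proposed extension might look like the following. It assumes the `services` section follows the CloudFormation layout (service manager, then service name, then options such as `ensureRunning`). The `is_running` and `fire` callables and the `service.fail` trigger name are placeholders, since none of this exists yet.

```python
# Example metadata in the CloudFormation Init layout (values are illustrative).
EXAMPLE_METADATA = {
    "AWS::CloudFormation::Init": {
        "config": {
            "services": {
                "sysvinit": {
                    "mysqld": {"enabled": "true", "ensureRunning": "true"},
                    "httpd": {"enabled": "true", "ensureRunning": "true"},
                    "ntpd": {"enabled": "false", "ensureRunning": "false"},
                }
            }
        }
    }
}


def monitored_services(metadata):
    """Return the names of services the metadata asks to keep running."""
    init = metadata.get("AWS::CloudFormation::Init", {})
    services = init.get("config", {}).get("services", {})
    names = []
    for manager in sorted(services):  # e.g. "sysvinit"
        for name, opts in sorted(services[manager].items()):
            if str(opts.get("ensureRunning", "false")).lower() == "true":
                names.append(name)
    return names


def check_services(metadata, is_running, fire):
    """Fire the hypothetical `service.fail` trigger for each stopped service.

    is_running(name) -> bool and fire(trigger, name) are supplied by the
    caller; this sketch does not decide how a service is actually probed.
    """
    failed = [n for n in monitored_services(metadata) if not is_running(n)]
    for name in failed:
        fire("service.fail", name)
    return failed
```

The rescue/notification script would be registered as the `fire` callback, so it sees exactly one event per failed service on each check.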
The metadata server stores the instance metadata and makes it available for reading and writing both from outside the instance and from within.
Heat would connect to the metadata server and get notified about service failures (and possibly other events). Heat Engine would decide if there is a need to escalate and either restart the instance or the whole stack.
The notification would be done by polling at first. Later on we'd probably switch to push.
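The initial polling scheme could be as simple as the loop below. It is a sketch under assumptions: `fetch_since` stands in for whatever call the metadata server ends up exposing, and events are assumed to carry monotonically increasing ids.

```python
import time


def poll_events(fetch_since, handle, interval=5.0, max_polls=None,
                sleep=time.sleep):
    """Poll for new events and pass each one to handle.

    fetch_since(last_id) is a hypothetical callable returning a list of
    (event_id, event) pairs with event_id > last_id, in ascending order.
    max_polls bounds the loop (None means poll forever).
    """
    last_id = 0
    polls = 0
    while max_polls is None or polls < max_polls:
        for event_id, event in fetch_since(last_id):
            handle(event)
            # Remember the newest event so it is not delivered twice.
            last_id = event_id
        polls += 1
        if max_polls is None or polls < max_polls:
            sleep(interval)
    return last_id
```

Because the caller only supplies `handle`, switching to push later would replace this loop without changing the code that reacts to events.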
For sending the events from the instances, we could extend the `cfn-signal` script.
The huge benefit of this approach is that we'd base it on tools and features that we need to build anyway for CloudFormation compatibility (metadata server, cfn-signal, cfn-hup).
Our extensions to these tools would not break compatibility and would be useful for needs other than HA.
Approach 2: Resource Monitor Alerts and Notifications
This approach, suggested by _asalkeld_, would leverage the proposed _Resource Monitor Alerts and Notifications system_. <http://wiki.openstack.org/ResourceMonitorAlertsandNotifications>
The idea would be to slightly extend this system to allow the guest monitor to stop/start/restart the servers and send events on these actions. Then heat-engine could listen for these notifications and take any further actions (escalation).
The obvious benefit is the good integration with OpenStack infrastructure. The downside is that it does not exist yet.
We will choose option #1 for now and look to integrate option #2 as it becomes available. Having both options will be beneficial in the long run.
Items of work:
- [New] cfn-get-metadata
  - read the metadata from the server
  - <http://docs.amazonwebservices.com/AWSCloudFormation/latest/UserGuide/cfn-get-metadata.html>
- [Mod] cfn-init to use cfn-get-metadata
- [New] cfn-hup
  - run custom scripts on metadata update
  - <http://docs.amazonwebservices.com/AWSCloudFormation/latest/UserGuide/cfn-hup.html>
  - use cfn-get-metadata to get metadata
  - monitor/recover any services that need monitoring
  - send events back to metadata server via cfn-signal (or similar)
- [New] metadata server
  - can be written to by the guest
  - produce notifications as a result of changes
- [Mod] heat-engine
  - write metadata to metadata server
  - receive notifications of service state changes
  - manage escalations
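
To make the metadata server requirements above concrete, here is an in-memory stand-in that accepts writes and notifies subscribers when a value changes. It is only a sketch: the `MetadataStore` class, the path-keyed layout, and the callback signature are assumptions, and a real server would sit behind a network API.

```python
class MetadataStore:
    """In-memory stand-in for the metadata server."""

    def __init__(self):
        self._data = {}
        self._subscribers = []

    def subscribe(self, callback):
        """Register callback(path, old, new), called on every change."""
        self._subscribers.append(callback)

    def read(self, path):
        """Return the stored value, or None if nothing was written."""
        return self._data.get(path)

    def write(self, path, value):
        """Store value; notify subscribers only if it differs from the old one."""
        old = self._data.get(path)
        if old == value:
            return False
        self._data[path] = value
        for callback in self._subscribers:
            callback(path, old, value)
        return True
```

In this picture the guest calls `write()` to report a service state, and heat-engine registers a `subscribe()` callback to receive the change and decide whether to escalate.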