Leader Election in Nova using ZooKeeper
A common scenario in distributed systems is to 'elect' a single node, out of a larger group of nodes, to perform a certain role or function. When the leader fails, the failure is detected, which triggers the election of a new leader. In the more general case, instead of choosing a single leader, it might be desirable to have a certain number of leader (or 'active') nodes at any point in time, while a larger group of nodes (potentially all of them) is capable of performing the job. Such a capability is crucial for a large-scale system that requires built-in high availability, such as the OpenStack fabric.
A leader election service for Nova can be implemented using ZooKeeper. A general implementation scheme is described in the ZooKeeper recipes (http://zookeeper.apache.org/doc/trunk/recipes.html#sc_leaderElection). The general flow of the leader election process within a group is as follows:
- Nodes in the group nominate themselves to become a leader
- The service chooses and notifies the leader node
- The service monitors the liveness of the leader (e.g., using ZK-based heartbeat & membership service, described in a separate blueprint)
- When the service stops seeing the leader, election of a new leader is initiated
- At any point in time, nodes can query the service for the current leader, or subscribe for notifications to receive leadership changes
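The flow above can be illustrated with a minimal in-memory sketch. It does not talk to ZooKeeper; it only mimics the idea behind the ZooKeeper recipe, in which each nomination receives an increasing sequence number and the live candidate with the lowest number is the leader. The class name ElectionGroup and its methods are hypothetical names chosen for this illustration, and failure detection is simulated by an explicit fail() call.

```python
import itertools


class ElectionGroup:
    """In-memory stand-in for one leader election group (hypothetical helper)."""

    def __init__(self):
        self._seq = itertools.count()   # mimics ZooKeeper sequence numbers
        self._candidates = {}           # sequence number -> node name

    def nominate(self, node):
        """Register `node` as a candidate and return its sequence number."""
        seq = next(self._seq)
        self._candidates[seq] = node
        return seq

    def fail(self, node):
        """Simulate loss of liveness: drop every candidacy held by `node`."""
        self._candidates = {
            s: n for s, n in self._candidates.items() if n != node
        }

    def leader(self):
        """The live candidate with the lowest sequence number leads, if any."""
        if not self._candidates:
            return None
        return self._candidates[min(self._candidates)]


group = ElectionGroup()
for name in ("node-a", "node-b", "node-c"):
    group.nominate(name)

print(group.leader())   # node-a
group.fail("node-a")    # the service stops seeing the leader
print(group.leader())   # node-b
```

In a real deployment the fail() step would correspond to the heartbeat service noticing that the leader's session has gone away, rather than to an explicit call.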
The main methods of the proposed leader election API are:
- getLeader(groupID) -- The method can be called by any client, independent of participation in the leader election. We assume that several independent leader election processes can be maintained in parallel, identified by a groupID.
- nominate(groupID) -- A client nominates itself to be a leader. The method blocks, and returns only once the client has been chosen.
- optional: nominateAsync(groupID, leader_callback) -- A client nominates itself to be a leader. The method is non-blocking. When the client is chosen, the leader_callback is called.
- optional: unnominateAsync(groupID) -- Withdraws a nomination; complementary to the non-blocking nominateAsync method.
- optional: subscribeLeaderChanges(groupID, change_callback) -- A node can subscribe to receive leadership updates asynchronously, via change_callback.
- optional: unsubscribeLeaderChanges()
- optional: subscribeErrors(groupID, error_callback) -- see Notes below
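To make the relationship between these methods more concrete, here is a single-process sketch of part of the API. It is only an illustration under several assumptions: it is not backed by ZooKeeper, the explicit client argument is an addition (the proposed methods identify the caller implicitly), leadership goes to the earliest remaining nominee, and unsubscribeLeaderChanges and subscribeErrors are left out. The class name InMemoryLeaderElection is hypothetical.

```python
import threading
from collections import defaultdict


class InMemoryLeaderElection:
    """Single-process sketch of the proposed API (not backed by ZooKeeper)."""

    def __init__(self):
        self._lock = threading.RLock()
        self._queues = defaultdict(list)       # groupID -> [(client, callback)]
        self._subscribers = defaultdict(list)  # groupID -> [change_callback]

    def getLeader(self, groupID):
        with self._lock:
            queue = self._queues[groupID]
            return queue[0][0] if queue else None

    def nominateAsync(self, groupID, client, leader_callback):
        with self._lock:
            old = self.getLeader(groupID)
            self._queues[groupID].append((client, leader_callback))
            self._announce(groupID, old)

    def unnominateAsync(self, groupID, client):
        with self._lock:
            old = self.getLeader(groupID)
            self._queues[groupID] = [
                entry for entry in self._queues[groupID] if entry[0] != client
            ]
            self._announce(groupID, old)

    def nominate(self, groupID, client):
        """Blocking variant: returns only once `client` has been chosen."""
        chosen = threading.Event()
        self.nominateAsync(groupID, client, chosen.set)
        chosen.wait()

    def subscribeLeaderChanges(self, groupID, change_callback):
        with self._lock:
            self._subscribers[groupID].append(change_callback)

    def _announce(self, groupID, old):
        """Fire callbacks if the leader of `groupID` differs from `old`."""
        new = self.getLeader(groupID)
        if new == old:
            return
        if new is not None:
            self._queues[groupID][0][1]()      # leader_callback of new leader
        for callback in self._subscribers[groupID]:
            callback(new)


service = InMemoryLeaderElection()
events = []
service.subscribeLeaderChanges("scheduler", events.append)
service.nominateAsync("scheduler", "node-a", lambda: events.append("node-a elected"))
service.nominateAsync("scheduler", "node-b", lambda: events.append("node-b elected"))
service.unnominateAsync("scheduler", "node-a")
print(service.getLeader("scheduler"))   # node-b
print(events)
```

In this sketch the blocking nominate is built on top of nominateAsync by waiting on an event that the leader_callback sets, which suggests that a real implementation might only need to provide the asynchronous variant natively.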
Notes:
- Under certain conditions a node may lose the connection to the Leader Election/Heartbeat service (e.g., due to network problems -- potentially causing zero or several nodes to believe they are the leader), in which case the error_callback will be triggered.
- Due to the distributed nature of the system, processes like failure detection and leader election may take time, hence there might be situations in which "getLeader" returns a node that is no longer alive/active. The higher-level logic should be able to handle such situations (e.g., non-leader nodes should respond with an error if invoked to perform the function).
- It is assumed that a node that is alive and has nominated itself to be a leader can actually perform the function expected from the leader. In some cases an external rejuvenation mechanism may be considered, which would ensure that there are no nodes that are alive but cannot perform the expected function (e.g., due to an internal error condition).
- It is possible (and rather straightforward) to extend this mechanism to more than one leader (or 'active node') in a group.
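As an illustration of the last note, one possible way to support several active nodes is to treat the k live candidates with the lowest sequence numbers as active, instead of only the lowest one. This is a sketch under that assumption, not a description of an existing implementation; the function name active_nodes and the mapping of sequence numbers to node names are hypothetical.

```python
def active_nodes(candidates, k):
    """Return up to k nodes with the lowest sequence numbers.

    `candidates` maps sequence number -> node name, assuming each
    nomination received an increasing sequence number.
    """
    return [candidates[seq] for seq in sorted(candidates)[:k]]


candidates = {0: "node-a", 1: "node-b", 2: "node-c", 3: "node-d"}
print(active_nodes(candidates, 2))   # ['node-a', 'node-b']

del candidates[0]                    # node-a fails
print(active_nodes(candidates, 2))   # ['node-b', 'node-c']
```

With k = 1 this reduces to the single-leader case.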