Oslo/blueprints/service-sync

Introduction
Each OpenStack project is composed of several services that cooperate to carry out a given task. For instance, when the creation of an instance is requested in Nova, the scheduler service runs some checks against the compute services in order to choose which one will host it. We can observe the same behavior in Neutron when a router needs to be scheduled on one of the available l3-agents.

More generally, when there are many interactions and dependencies between services in a distributed system like OpenStack, we need an efficient way to coordinate and synchronize them. These coordination primitives have been implemented more or less independently in each project. One of the most common primitives that we encounter in OpenStack is the so-called group membership service, which aims at centralizing the state of services.

The purpose of this blueprint is to consolidate these efforts in a common “coordination” module, exposed through a generic API that any project could use if needed.

Group membership service
Before stating the goal of this service, we need to understand the concept of a group: a group is a set of related nodes. For instance, the set of nova-compute services or the set of neutron l3-agents are valid groups, since the members of each set provide the same interface.

The goal of a group membership service is to be able to answer, at any moment, the question “Which members of the group are online and ready to respond?”. The service must, of course, provide a way to join and leave a given group.

To achieve this, the service must keep track of the current state of the group by using a monitoring mechanism that targets each member, so that when a service joins or leaves the group the remaining members get notified. The group membership service also has to detect failures automatically so that the remaining nodes can react accordingly; for instance, they could redistribute the work among themselves.

Python API
A member is completely identified by the tuple (group_id, member_id). These IDs can be stored either in the database or in a configuration file.

This function creates the group named by the group_id parameter. If the group already exists, it has no effect.
 * create_group(group_id)

This function returns the list of all created groups.
 * get_all_group_ids()

This function lets the caller join the group. If the caller has already joined the group, it has no effect. If the group doesn't exist, it raises an exception.
 * join_group(group_id, capabilities)

This function lets the caller leave the group. If the caller has already left the group, it has no effect. If the group doesn't exist, it raises an exception.
 * leave_group(group_id)

This function returns the current members of the group identified by the parameter group_id. A member is online if it belongs to the group. The caller is not necessarily a member of the group. If the group doesn't exist then it raises an exception.
 * get_members(group_id)

This function returns the capabilities of the member identified by (group_id, member_id). The capabilities are an array of bytes that can contain information about the member, which is very useful for keeping track of a member's current configuration. If the group or the member doesn't exist, it raises an exception.
 * get_member_capabilities(group_id, member_id)

I suggest using a serialization format such as Google Protocol Buffers or MessagePack for the capabilities.
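As a rough illustration of how capabilities might round-trip through an array of bytes, here is a minimal sketch. It uses the standard library's json module purely as a stand-in so that the example runs without extra dependencies; Protocol Buffers or MessagePack would fill the same role. The function names and the capability fields are hypothetical and not part of the proposed API.

```python
import json


def encode_capabilities(capabilities):
    """Serialize a dict of capabilities into bytes (JSON as a stand-in format)."""
    # sort_keys makes the encoding deterministic: equal dicts give equal bytes.
    return json.dumps(capabilities, sort_keys=True).encode("utf-8")


def decode_capabilities(raw):
    """Rebuild the capabilities dict from the bytes kept by the service."""
    return json.loads(raw.decode("utf-8"))


# Hypothetical capabilities of a compute node; the field names are illustrative.
caps = {"host": "compute-1", "vcpus": 16, "hypervisor": "kvm"}
raw = encode_capabilities(caps)
restored = decode_capabilities(raw)
```

The bytes in raw are what a member would pass to join_group or update_capabilities, and what a reader would decode after calling get_member_capabilities.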

This function publishes the new capabilities of the caller.
 * update_capabilities(capabilities)

This function registers the function “notifier” as the function to call when the configuration of the group “group_id” changes, for instance when a member joins or leaves the group.
 * set_notifier_func(group_id, notifier)

This function returns the ID of the current leader of the group. A leader is a special member that all members have agreed on.
 * get_leader(group_id)
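To make the semantics of the functions listed above more concrete, here is a minimal in-memory sketch. It is not a proposed implementation: the class names, the exception names and the notifier signature (group_id, event, member_id) are assumptions made for illustration, update_capabilities is assumed to apply to every group the caller has joined, and get_leader simply picks the smallest member ID as a placeholder for a real election.

```python
class GroupNotFound(Exception):
    """Hypothetical exception raised when a group_id does not exist."""


class MemberNotFound(Exception):
    """Hypothetical exception raised when a member_id is not in the group."""


class InMemoryBackend:
    """Shared state standing in for a database, ZooKeeper, etc."""

    def __init__(self):
        self.groups = {}     # group_id -> {member_id: capabilities bytes}
        self.notifiers = {}  # group_id -> list of callbacks


class Coordinator:
    """One instance per service; member_id identifies the caller."""

    def __init__(self, backend, member_id):
        self._backend = backend
        self._member_id = member_id

    def _members(self, group_id):
        try:
            return self._backend.groups[group_id]
        except KeyError:
            raise GroupNotFound(group_id)

    def _notify(self, group_id, event, member_id):
        for func in self._backend.notifiers.get(group_id, []):
            func(group_id, event, member_id)

    def create_group(self, group_id):
        # No effect if the group already exists.
        self._backend.groups.setdefault(group_id, {})

    def get_all_group_ids(self):
        return sorted(self._backend.groups)

    def join_group(self, group_id, capabilities=b""):
        members = self._members(group_id)
        if self._member_id not in members:  # joining twice has no effect
            members[self._member_id] = capabilities
            self._notify(group_id, "join", self._member_id)

    def leave_group(self, group_id):
        members = self._members(group_id)
        if self._member_id in members:  # leaving twice has no effect
            del members[self._member_id]
            self._notify(group_id, "leave", self._member_id)

    def get_members(self, group_id):
        return sorted(self._members(group_id))

    def get_member_capabilities(self, group_id, member_id):
        members = self._members(group_id)
        if member_id not in members:
            raise MemberNotFound(member_id)
        return members[member_id]

    def update_capabilities(self, capabilities):
        # Assumption: the update applies to every group the caller has joined.
        for members in self._backend.groups.values():
            if self._member_id in members:
                members[self._member_id] = capabilities

    def set_notifier_func(self, group_id, notifier):
        self._members(group_id)  # raises GroupNotFound for an unknown group
        self._backend.notifiers.setdefault(group_id, []).append(notifier)

    def get_leader(self, group_id):
        # Placeholder only: a real back end would run an election.
        members = self.get_members(group_id)
        return members[0] if members else None


backend = InMemoryBackend()
compute1 = Coordinator(backend, "compute-1")
compute2 = Coordinator(backend, "compute-2")
scheduler = Coordinator(backend, "scheduler")

events = []
scheduler.create_group("nova-compute")
scheduler.set_notifier_func("nova-compute", lambda g, e, m: events.append((e, m)))
compute1.join_group("nova-compute", b"vcpus=16")
compute2.join_group("nova-compute", b"vcpus=8")
compute1.leave_group("nova-compute")
```

In this usage example the scheduler never joins the group itself; it only observes membership changes, which matches the statement that the caller of get_members is not necessarily a member.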

Nova
In Nova, the nova.servicegroup module implements group membership for every compute service. It provides an API and several implementations. The default implementation uses the database, in which all members are registered; Memcached and ZooKeeper implementations are also available.

This module could easily be replaced by the new Oslo API, since their functionality overlaps. In addition, the scheduler (or any other service, if needed) could leverage this library to retrieve the configuration of every nova-compute service.

Neutron
In Neutron, the report_state function periodically writes the configuration of the agent to the database. No API abstracts this task; it is entirely based on the database. The function could be replaced by the new Oslo API and would then benefit from several back ends.

Ceilometer
A single central agent polls the other services. This solution is very fragile, since the central agent is an obvious SPOF (single point of failure): when it goes down, polling can no longer be performed. In addition, a single agent cannot scale to a huge number of services to poll. To relieve the central agent, we need to deploy a cluster of agents and split the work between them. Deploying a cluster implies managing its nodes; in particular, we need the following features:

 * Group membership: an agent must be able to find out which agents are online or offline.
 * Task distribution: according to the current cluster state, an agent must be able to determine which services to poll.

The first point could be achieved by the Oslo API and the second could be implemented on top of the group membership service (it deserves a blueprint).

However, the task distribution service is a good candidate for inclusion in the Oslo coordination module; this point should be discussed.
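One possible way to build task distribution on top of group membership is sketched below, under stated assumptions: every agent sees the same list of online members, and each agent keeps the services whose hash falls on its own position in the sorted member list. The function names and the hash-modulo scheme are illustrative choices, not something this blueprint specifies.

```python
import hashlib


def _stable_hash(name):
    """Hash that is identical on every agent (unlike the built-in hash())."""
    return int(hashlib.sha256(name.encode("utf-8")).hexdigest(), 16)


def services_to_poll(my_id, online_members, services):
    """Return the subset of services that the agent my_id should poll."""
    members = sorted(online_members)
    if my_id not in members:
        return []  # an agent that is not in the group polls nothing
    my_index = members.index(my_id)
    # A service belongs to the agent whose index matches its hash bucket.
    return [s for s in services if _stable_hash(s) % len(members) == my_index]


# Hypothetical cluster of three agents polling ten services.
services = ["svc-%d" % i for i in range(10)]
agents = ["agent-a", "agent-b", "agent-c"]
assignment = {a: services_to_poll(a, agents, services) for a in agents}
```

Each service ends up with exactly one agent as long as all agents agree on the member list. A plain modulo reassigns many services whenever the membership changes; consistent hashing would reduce that churn at the cost of more code.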

ZooKeeper implementation
ZooKeeper is a tool that helps manage a distributed system. Basically, it contains a small database that is reachable through a ZooKeeper client, and all synchronization is built around this database. It provides several features:
 * It can be used as a distributed lock system.
 * It can be used to elect a master node.
 * It can be used to implement a group membership protocol by keeping an inventory of a set of nodes and detecting faulty nodes. It has an asynchronous callback mechanism that lets a client be notified (and take some action, such as rebalancing the work) when an event occurs, for instance when a node joins or leaves the cluster. This feature is really interesting, since no polling is needed to stay up to date when the cluster state changes.

This tool is a good candidate for these issues. The only drawback I've noticed is that it adds a new element to the architecture, which can make the deployment more complex. For the API, please take a look at the Kazoo project, which is the recommended Python client: https://kazoo.readthedocs.org
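A rough sketch of how group membership might map onto ZooKeeper with Kazoo is given below. It assumes a reachable ZooKeeper server and an installed kazoo package; the znode layout (/groups/group_id/member_id) and the function names are assumptions made for illustration. Each member is represented by an ephemeral znode, and a children watch plays the role of the notifier.

```python
def member_path(group_id, member_id=None):
    """Build the znode path of a group or of one member (layout is an assumption)."""
    path = "/groups/%s" % group_id
    return path if member_id is None else "%s/%s" % (path, member_id)


def join_and_watch(hosts, group_id, member_id, capabilities, on_change):
    """Join a group through an ephemeral znode and watch its membership.

    capabilities must be bytes; on_change receives the sorted member list.
    """
    # Imported here so that this module loads even when Kazoo is not installed.
    from kazoo.client import KazooClient

    client = KazooClient(hosts=hosts)
    client.start()

    # The group itself is a persistent znode.
    client.ensure_path(member_path(group_id))

    # An ephemeral znode is deleted when the session ends, so a crashed
    # member drops out of the group without calling any leave function.
    client.create(member_path(group_id, member_id), capabilities, ephemeral=True)

    # The decorated function is invoked when the list of children changes.
    @client.ChildrenWatch(member_path(group_id))
    def _membership_changed(children):
        on_change(sorted(children))

    return client
```

A caller would keep the returned client alive for as long as it wants to stay in the group and call client.stop() to leave; the session timeout then determines how quickly the other members learn about a crash.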

Serf
Serf provides a lightweight solution for service discovery, based on a gossip protocol. http://www.serfdom.io/

Database implementation
We can use the database, to which each agent sends timestamped heartbeats, to establish group membership. Each node periodically retrieves the list of members currently online; when an agent goes down, this is detected by a timeout. This principle is currently implemented in Nova and Neutron. The main advantage of this solution is that it relies on tools that are already deployed, but it is subject to race conditions and can quickly become (very) hard to debug.

Conclusion
Coordinating a distributed system like OpenStack becomes unavoidable as the system grows. I think these kinds of problems are sufficiently complex to deserve their own library. It would be more profitable for current and future projects to consolidate these efforts in a common place, so as not to reinvent the wheel and spend a lot of time debugging. The group membership service is a good starting point toward that goal.