Ceilometer/blueprints/tasks-distribution

= Task distribution service for Ceilometer central agent =

== Introduction ==
Currently there is a single central agent which polls the other services. This architecture is fragile since the agent is an obvious SPOF (single point of failure): when the central agent goes down, the polling process cannot be performed at all. In addition, if there is a huge number of services to poll, a single agent does not scale well.

We would like to have multiple central agents running at the same time, and to be able to divide the work amongst them. So, if we have n agents and p resources to poll from, we want to distribute these p tasks onto those n nodes.

In order to coordinate the cluster of agents, we split the solution into two main parts. The first is the group membership service, which creates groups of nodes and notifies the remaining members when one of them fails. The second is the task distribution service: once we have a group of agents, we want to distribute the tasks across that group.

The purpose of this blueprint is to propose an architecture to dynamically load balance the tasks onto a given group by leveraging the group membership service.

== Task distribution service ==
This service will be built on top of the group membership service: the idea is to create and monitor two groups. The first is the group of central agents, "/agents", and the second is the group of resources, "/resources".

Once we are able to monitor those two groups, we need a mechanism which assigns tasks to agents. To do this we use the notion of a leader in a group: a leader is a special member that all members agree on. In our use case, we will ask the group membership service to elect a leader in the central agents group; this leader will be in charge of assigning tasks to the other agents.

Here is an overview of the architecture: the leader listens for any state change in the agents group and the resources group. Upon the failure of an agent which is not the leader, the leader re-assigns that agent's tasks to the remaining agents according to the scheduling policy. We proceed in the same way for the resources group, so that when some resources fail the leader is notified and can re-assign tasks.

Like any other service, the agent which is currently the leader can fail, so in order to make it highly available we need to be able to dynamically re-elect a new leader upon failure. This is the responsibility of the group membership service.

We don't have real-time constraints for the polling task, so when an agent fails just before polling its resources, we simply resume those tasks on the remaining agents; delaying the polling of a resource a little is not an issue.

== Python API ==
 * resource_join_group(resources_group_id): joins the group of resources and sends heartbeats in order to establish group membership. The resource publishes its address as part of its capabilities.
 * agent_join_group(agents_group_id): joins the group of agents; after joining, the agent accepts resources to poll. This process also participates in the leader election and, if elected, creates a process which assumes the leader role.
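A minimal sketch of how these two calls might be built on top of the group membership service. Everything here is illustrative: the GroupMembershipClient class is a hypothetical in-memory stand-in for the real service, and the extra member-id and address parameters are assumptions made purely so the sketch is runnable.

```python
class GroupMembershipClient:
    """Hypothetical in-memory stand-in for the group membership service."""

    def __init__(self):
        self.groups = {}  # group_id -> {member_id: capabilities}

    def join(self, group_id, member_id, capabilities=None):
        # In the real service, joining would also start sending heartbeats.
        self.groups.setdefault(group_id, {})[member_id] = capabilities or {}

    def members(self, group_id):
        return dict(self.groups.get(group_id, {}))


def resource_join_group(client, resources_group_id, resource_id, address):
    # The resource publishes its address as part of its capabilities.
    client.join(resources_group_id, resource_id, {"address": address})


def agent_join_group(client, agents_group_id, agent_id):
    # After joining, the agent accepts resources to poll; the leader
    # election is handled by the membership service and not shown here.
    client.join(agents_group_id, agent_id)
```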

== Resource algorithm ==
A resource simply sends heartbeats; this is managed automatically by the group membership service when joining a group.

== Agent algorithm ==
The agent is more involved: it is composed of two processes. The first is the poller process, which polls resources and accepts new resources to poll; the second is the leader process, which assigns tasks to agents.

=== Poller process ===

variables:
 * list_resources = [ ]  # list of resource addresses to poll
 * is_leader = false

 * Upon reception of a new resource address: add the resource address to list_resources.
 * Upon leader failure: run a leader election; if the current agent is elected, set is_leader = true and notify the leader process.
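The poller-side event handlers above can be sketched as follows. This is illustrative only: the election primitive and the leader-process notification are assumed to be provided elsewhere (here they are injected as callables).

```python
class PollerProcess:
    """Illustrative sketch of the poller-side event handlers."""

    def __init__(self, elect_leader, notify_leader_process):
        self.list_resources = []          # resource addresses to poll
        self.is_leader = False
        self._elect_leader = elect_leader                  # returns True if this agent wins
        self._notify_leader_process = notify_leader_process

    def on_new_resource(self, address):
        # Upon reception of a new resource address.
        self.list_resources.append(address)

    def on_leader_failure(self):
        # Upon leader failure: run a new election; if the current
        # agent is elected, start acting as the leader.
        if self._elect_leader():
            self.is_leader = True
            self._notify_leader_process()
```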

=== Leader process ===

function schedule_resources_to_agents: this function assigns tasks to agents according to the scheduling policy.

 * Upon state change on the agents group: call schedule_resources_to_agents.
 * Upon state change on the resources group: call schedule_resources_to_agents.
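The leader-side logic above can be sketched as follows, again purely as an illustration: the scheduling policy is injected as a callable so that any of the policies from the scheduling engine can be plugged in.

```python
class LeaderProcess:
    """Illustrative sketch: every state change in either group
    triggers a full reschedule through the scheduling policy."""

    def __init__(self, policy):
        self._policy = policy      # callable: (resources, agents) -> assignments
        self.assignments = {}      # agent -> list of resources to poll

    def schedule_resources_to_agents(self, resources, agents):
        # Assign tasks to agents according to the scheduling policy.
        self.assignments = self._policy(resources, agents)

    def on_agents_group_change(self, resources, agents):
        self.schedule_resources_to_agents(resources, agents)

    def on_resources_group_change(self, resources, agents):
        self.schedule_resources_to_agents(resources, agents)
```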

== Scheduling engine ==
We need to define a scheduling policy: an algorithm that distributes p tasks from a given set of tasks onto n nodes from a given set of nodes. These policies can be very simple; here are a few examples.

 * Round-robin: the tasks are distributed onto the nodes using a "round-robin" policy.
 * Single task queue: the tasks are stored in a FIFO. When an agent needs to do some work, it picks up a task from this FIFO.
 * Work-stealing: each central agent has its own FIFO of tasks, from which it pops the next task to run. If an agent has no work left (its FIFO is empty), it can 'steal' a task from another agent.
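The round-robin policy, for instance, can be sketched in a few lines; the function name and signature below are illustrative, matching the policy shape assumed above (a mapping from each agent to its list of resources).

```python
def round_robin(resources, agents):
    """Distribute p resources onto n agents in round-robin order."""
    assignments = {agent: [] for agent in agents}
    for i, resource in enumerate(resources):
        # Resource i goes to agent i modulo the number of agents.
        assignments[agents[i % len(agents)]].append(resource)
    return assignments
```

With 3 resources and 2 agents, the first agent receives resources 0 and 2 and the second agent receives resource 1.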