Distributed Task Management With RPC


Why: This document proposes an architecture for distributed flows that can run tasks simultaneously on multiple workers, which improves scalability and reliability. The main goal is an architecture that lets the user replace a local engine (for example, one that previously executed tasks with threads) with a distributed engine without changing any code. Flow descriptions should be identical for distributed and non-distributed engines, so that distribution is as transparent as possible to users; the only difference should be the engine type. In general, a distributed engine should behave much the same as a single-threaded engine.

Definitions

 * Client
   * a machine (or program) that runs a distributed flow
 * Worker
   * a machine (or program) that executes distributed flows’ tasks by responding to execution requests
 * Distributed Task
   * a task execution type that performs a remote procedure call to a worker
 * Remote Task
   * a task that runs on the worker side and executes some code to make a flow progress

How
A distributed system consists of one or more clients and a set of workers. A client (the code that owns the engine) runs a flow. When the client wants to start a new task, it makes one or more RPC calls to workers, passing the client's endpoint and the task's arguments. One of the workers accepts the task and sends a confirmation to the client. The worker then starts executing the task and sends heartbeats to the client. During this period the client listens for the worker's responses (status updates and so on). When the task is done, the worker sends the result. The client considers a worker failed if it has not received a task status message from that worker within a timeout period.
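
The client-side bookkeeping described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the proposed implementation: the names `TaskRequest`, `StatusMessage`, `MessageType`, and `ClientTaskTracker` are hypothetical, the transport is left out entirely, and time is passed in explicitly so the timeout rule is easy to follow. The sketch also assumes that the timeout clock starts when the request is sent and that status messages from any worker other than the one that accepted the task are ignored; the proposal does not specify either point.

```python
import enum
from dataclasses import dataclass
from typing import Any, Optional


class MessageType(enum.Enum):
    """Kinds of status messages a worker sends back (hypothetical names)."""
    ACCEPTED = "accepted"    # confirmation that a worker took the task
    HEARTBEAT = "heartbeat"  # the worker is still executing the task
    RESULT = "result"        # the task finished; payload carries the result


@dataclass
class TaskRequest:
    """What the client passes in the RPC call."""
    task_id: str
    reply_to: str    # the client's endpoint
    arguments: dict  # the task's arguments


@dataclass
class StatusMessage:
    task_id: str
    worker_id: str
    kind: MessageType
    payload: Any = None


class ClientTaskTracker:
    """Client-side state for one distributed task."""

    def __init__(self, request: TaskRequest, timeout: float, now: float):
        self.request = request
        self.timeout = timeout
        # Assumption: the timeout clock starts when the request is sent.
        self.last_seen = now
        self.worker_id: Optional[str] = None
        self.result: Any = None
        self.done = False

    def on_message(self, message: StatusMessage, now: float) -> bool:
        """Process a status message; return True if it was taken into account."""
        if self.done or message.task_id != self.request.task_id:
            return False
        if self.worker_id is None:
            # The first confirmation decides which worker owns the task.
            if message.kind is not MessageType.ACCEPTED:
                return False
            self.worker_id = message.worker_id
        elif message.worker_id != self.worker_id:
            # Assumption: messages from other workers are ignored.
            return False
        self.last_seen = now  # any status message resets the timeout
        if message.kind is MessageType.RESULT:
            self.result = message.payload
            self.done = True
        return True

    def worker_failed(self, now: float) -> bool:
        """True if no status message arrived within the timeout period."""
        return not self.done and (now - self.last_seen) > self.timeout
```

A client would create one tracker per outstanding task, feed it every incoming status message, and poll `worker_failed` periodically to decide whether the task has to be rescheduled. What the client does after detecting a failure is outside the scope of this sketch.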

A high-level architecture can be seen in the following image:



Details
Please visit: https://etherpad.openstack.org/p/TaskFlowWorkerBasedEngine