Distributed Task Management With RPC

Revised on: 3/15/2014 by Harlowja

Why: This document proposes an architecture for distributed flows that can run tasks simultaneously on multiple workers, increasing scalability and reliability. The main goal is to let the user replace a local engine (for example, one that previously executed with threads) with a distributed engine without changing any code. Flow descriptions should be the same for distributed and non-distributed engines, keeping the distinction as transparent as possible to users; the only difference should be the engine type. In general, a distributed engine should work much the same as a single-threaded engine.

Architecture

Definitions

Client
a machine (or program) that runs a distributed flow
Worker
a machine (or program) that executes distributed flows’ tasks by responding to execution requests
Distributed Task
a task execution type that performs a remote procedure call to a worker
Remote Task
a task that runs on the worker side and executes some code to make a flow progress

How

A distributed system consists of one or more clients and a set of workers. A client (the code that owns the engine) runs a flow. When the client wants to start a new task, it makes one or more RPC calls to the workers, passing the client's endpoint and the task's arguments. One of the workers accepts the task and sends a confirmation to the client. It then starts executing the task and sends heartbeats to the client. During this period the client listens for the worker's responses (status updates and so on). When the task is done, the worker sends the result. The client considers a worker failed if it has not received a task status message from it within a timeout period.
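
The message exchange described above can be sketched in-process, without any real RPC transport. This is only an illustration under stated assumptions, not the proposed implementation: the names `Message`, `Client`, `run_on_worker` and the three message kinds are hypothetical, the `send` callback stands in for the client's endpoint, and timestamps are passed in explicitly so the timeout logic is easy to follow.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical message kinds a worker sends back to the client's endpoint.
ACCEPTED, HEARTBEAT, RESULT = "accepted", "heartbeat", "result"


@dataclass
class Message:
    kind: str
    task_id: str
    payload: Any = None


class Client:
    """Client side: tracks outstanding remote tasks and their last status time."""

    def __init__(self, timeout: float) -> None:
        self.timeout = timeout
        self.last_seen: Dict[str, float] = {}  # task_id -> time of last status message
        self.results: Dict[str, Any] = {}      # task_id -> result sent by the worker

    def request(self, task_id: str, now: float) -> None:
        # The timeout clock starts when the execution request is sent.
        self.last_seen[task_id] = now

    def on_message(self, msg: Message, now: float) -> None:
        if msg.kind == RESULT:
            # Task finished: store the result and stop tracking the task.
            self.results[msg.task_id] = msg.payload
            self.last_seen.pop(msg.task_id, None)
        elif msg.task_id in self.last_seen:
            # Confirmation or heartbeat: the worker is still alive.
            self.last_seen[msg.task_id] = now

    def failed_tasks(self, now: float) -> List[str]:
        # Tasks with no status message for longer than the timeout period.
        return [t for t, seen in self.last_seen.items() if now - seen > self.timeout]


def run_on_worker(task_id: str, func: Callable[..., Any], args: Tuple[Any, ...],
                  send: Callable[[Message], None]) -> None:
    """Worker side: confirm the task, heartbeat, execute it, send the result.

    Simplification: a single heartbeat is sent before execution; a real
    worker would keep sending heartbeats while the task runs.
    """
    send(Message(ACCEPTED, task_id))
    send(Message(HEARTBEAT, task_id))
    send(Message(RESULT, task_id, func(*args)))


client = Client(timeout=5.0)

# Task "t1": a worker accepts it and replies at time 1.0.
client.request("t1", now=0.0)
run_on_worker("t1", lambda a, b: a + b, (2, 3),
              send=lambda m: client.on_message(m, now=1.0))

# Task "t2": no worker ever answers, so it eventually times out.
client.request("t2", now=0.0)
```

After this runs, `client.results` holds the result for `t1`, while `client.failed_tasks(now=10.0)` reports `t2`, because no status message arrived for it within the timeout period. Passing `now` explicitly rather than reading a clock is purely a choice made for this sketch so that the behavior is deterministic.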

A high-level architecture can be seen in the following image:

[Image: Distributed flow with oslo.messaging.rpc.png]

Details

Please visit: https://etherpad.openstack.org/p/TaskFlowWorkerBasedEngine