Convection

NOTICE: Similar project -> Mistral
Leveraging some of the ideas of the proposal here for Convection, a project called Mistral was started at the Icehouse design summit in Hong Kong in the fall of 2013. A few OpenStack contributors have begun active work on this project. The proposal here should remain as a reference for ideas. For new ideas, it may be beneficial to collaborate with the Mistral project: https://wiki.openstack.org/wiki/Mistral

PROPOSAL ONLY: TaskSystem-as-a-Service (Convection)
Please note that this is a PROPOSAL ONLY. Please refer to the Mistral project, which started in October 2013 and aims to implement the ideas from this proposal and more. Nova's Requirements Etherpad:

What is Convection
Convection is a proposal for a new open source TaskSystem-as-a-Service project for cloud workloads. (NOTE: Some may consider this a Workflow-as-a-Service system when compared to similar offerings from other cloud vendors. However, the term Task System reflects the intentions of this service more accurately than Workflow, which is often thought of in terms of Business Process Management and may include both automated and manual complex flows across multiple organizations and systems within a business.) Convection could be a public facing API service that provides capabilities enabling OpenStack API consumers to build complex multi-step applications running on an OpenStack cloud, which could be a public cloud, private cloud, or hybrid cloud. Convection could also be a service that other OpenStack projects leverage to perform work. For example, one possible method for Heat to orchestrate standing up cloud stacks could be to leverage a Task Service for the steps of spinning up and connecting cloud resources. Conversely, customers wanting to run meta-task-flows could leverage Heat as one task, where orchestration of a stack is a single task in the larger meta-task-flow.

Why the name Convection?
Convection was a name proposed by Tim Simpson (Trove developer). The idea is that (1) Convection "conveys," implying organization of order; and (2) Convection is often thought of in the context of ovens, which produce heat, and the OpenStack project Heat could be one possible consumer of Task Flow, where task flows could be analogous to air flow in a convection oven.

What is a Task Flow (sometimes referred to as a Workflow)?
Definition Note: There are static workflows and dynamic workflows.

Isn't Workflow an overloaded term? YES! There are misconceptions about what the term Workflow actually means, and it is often used to mean things different from the definition above. This is one of the main reasons this service is now being referred to as a Task Flow Service, not a Workflow service. For Convection conversation purposes, let's define the following terminology:

Task Flow Terms

 * 1) Just-in-Sequence (Static) Task Flow: In an academic context, a workflow is sometimes described as a collection of ordered tasks that occur with a defined start, order, and end. Some tasks may be able to execute in parallel, but a pre-determined tree of workflow steps (and parallel branches) is known before runtime, and the flow of the tree is followed upon every execution of the workflow.
 * 2) Just-in-Time (Dynamic) Event Based Task Flow: A collection of tasks, some of which may or may not have a required order of execution, where task execution is coordinated through events carrying individual task start/stop/status notifications. In an event based flow system, there could be a central task execution coordinator that listens for task completion events and sends events for new tasks to start. Alternatively, the code that executes an individual task can encode its own logic to know when to execute, based on events sent directly from other tasks.

I do not wish to specify the ideal implementation in this proposal. I simply want to document some Task Flow concepts and leverage the community for collaborative design of a useful Task Flow system for OpenStack based workloads.
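Purely as an illustration of the two terms above, and not as a proposed design, the difference could be sketched in Python as follows. The names run_static_flow and EventCoordinator, and the "<task name>.done" event naming, are hypothetical and do not come from this proposal.

```python
from collections import defaultdict


def run_static_flow(steps):
    """Just-in-Sequence: the order of (name, task) pairs is fixed before runtime."""
    log = []
    for name, task in steps:
        task()
        log.append(name)
    return log


class EventCoordinator:
    """Just-in-Time: a central coordinator starts tasks when their event arrives."""

    def __init__(self):
        self._waiting = defaultdict(list)  # event name -> [(task name, task)]
        self.log = []

    def on(self, event, name, task):
        """Subscribe a task so that it runs when `event` is emitted."""
        self._waiting[event].append((name, task))

    def emit(self, event):
        # Run every task subscribed to this event. Each finished task
        # publishes its own "<name>.done" event (a hypothetical convention),
        # which may in turn start further tasks.
        for name, task in self._waiting.pop(event, []):
            task()
            self.log.append(name)
            self.emit(name + ".done")
```

In the static version the caller supplies the full order up front; in the event based version the order emerges from which tasks subscribed to which events.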

TaskFlow-as-a-Service is not Orchestration
Orchestration (the purpose of project Heat) is not the same as Task Flow management. A project such as Heat could leverage a Task Flow service or code library. A Task Flow service could leverage Heat in that one task of a meta-task-flow could be to call Heat to spin up a stack. Task Flow is concerned with "task state management" and storing the "rules and order" for task execution. The task system may or may not actually take responsibility for executing the tasks. Orchestration is concerned with intelligently creating, organizing, connecting, and coordinating cloud based resources, which may involve creating a task flow and/or executing tasks.

Use Cases for TaskFlow-as-a-Service
We see merit in a standalone Task Flow service that would allow a variety of functionality to be carried out by other services (e.g. Heat could be one service that makes use of Task Flow). While the OpenStack project Heat focuses on orchestration of resources and resource connections, Task Flow could be responsible for:


 * A sequence of tasks that have a start and end
 * Batch processes (multiple sets of sequences of tasks with starts and ends)
 * A persistent job/process (for example an Auto-Scale policy) that remains running until manually terminated
 * A job to run for a specified duration (such as run this automated stress test for 2 days, then exit).

At a high level, one can consider Task Flows as being "batch" (with start/end) and "long running" which execute for some duration or until some triggering event occurs.

Potential TaskSystem-as-a-Service Capabilities
The following is a list of proposed capabilities for Convection. These are not necessarily required for a minimum viable service and are just ideas of what a Task Flow service might entail:

Conceptual Components

 * A task flow engine could provide generic task and state management capabilities. A task flow engine could act as a central state coordinator, enabling task flow client applications to be distributed across public cloud and on-premise deployments.  Task Flow clients offload state management to the Task Flow service thereby allowing the Task Flow clients to be stateless, scalable, and tolerant of process and client failures. The Task Flow engine could support configurable constraints at both the flow and task level, e.g. timeouts, retry count, retry intervals, etc.


 * A task flow system does not need to execute task flow logic, but it could as a value added enhancement. For example, in a simplistic implementation of a Task Flow service, the service itself could maintain task state and leave it up to the clients of Task Flow to implement the business logic of task flow execution. An enhanced version of a Task Flow service could allow a client to provide task flow business logic to the service in a declarative DSL, and the Task Flow engine could enforce that business logic (e.g. notifying tasks when to run, stop, restart, etc.).


 * Since OpenStack is a cloud operating system, operating system style tools, such as an equivalent of top that lists running jobs in the cloud, could be very useful. Tools could provide a drill down of existing task flows, currently running task flows, and task flows in various states of execution (running, completed, failed, ready-to-run), and provide the ability to resubmit/retry failed task flow jobs. Task Flow tools could also provide analytics: metrics which could help identify performance bottlenecks or common areas of failure in a task flow that is repeated over and over. Some possible metrics could be: average execution time for a task flow, average execution time for individual flow tasks, task/workflow failure rates, etc.


 * A task flow repository could expose a set of pre-determined common task flows (e.g. spin up a server and add it to a load balancer). The Repository facilitates reuse and makes available a compelling set of pre-defined task flow sets.

One proposal for a Task Flow service could be that it not require clients to upload code to the Task Flow service. Clients would have full flexibility in the language/execution/deployment for the Task Flow tasks. The only requirement is that the task workers are able to access the REST APIs exposed by the service and/or receive notifications from the Task Flow system (e.g. via webhooks or some other mechanism).
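The configurable task level constraints mentioned in the engine item above (retry count and retry interval) could be enforced along the following lines. This is a minimal sketch under stated assumptions: TaskConstraints and run_with_constraints are hypothetical names, retry_count is assumed to mean additional attempts after the first failure, and timeout handling is left out for brevity.

```python
import time
from dataclasses import dataclass


@dataclass
class TaskConstraints:
    retry_count: int = 3          # assumed meaning: retries after the first failure
    retry_interval: float = 0.0   # seconds to wait between attempts


def run_with_constraints(task, constraints, sleep=time.sleep):
    """Run task() and retry on any exception, within the given constraints.

    Returns (result, attempts). Re-raises the last exception once the
    retry budget is exhausted.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            return task(), attempts
        except Exception:
            if attempts > constraints.retry_count:
                raise
            sleep(constraints.retry_interval)
```

The sleep parameter is injectable only so that the retry interval can be skipped in tests.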

Task Flow Engine
Conceptually, a Task Flow consists of a set of tasks that need to execute in a certain order. The order in which the tasks execute could be pre-determined; the ordering could also be determined dynamically based on execution results of a previous task.

Capabilities
A Task Flow Engine could provide the following features:
 * 1) Register a Task Flow and the tasks associated with the task flow via REST API calls
 * 2) Ability to specify configurable constraints at the flow and the task level, e.g. timeouts, retry count, retry interval, etc.
 * 3) Invoke Task Flow instances
 * 4) Query the state of a Task Flow instance
 * 5) Query for a list of all the running Task Flow instances for a given Task Flow definition
 * 6) Support versioning of Task Flow definitions
 * 7) Cancel a Task Flow instance
 * 8) Support multiple, parallel invocations of Task Flows
 * 9) A Task Flow instance could invoke another task flow instance [Master-child task flows]
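Several of the capabilities listed above (register, versioning, invoke, query state, list running instances, cancel) could be prototyped in memory as in the sketch below. Everything here is an assumption made for illustration: the class and method names are hypothetical, the REST layer is omitted, and the "cancelled" state is not part of the state list given in this proposal.

```python
import itertools


class TaskFlowEngine:
    """In-memory stand-in for a Task Flow engine (illustrative only)."""

    def __init__(self):
        self._definitions = {}            # (flow name, version) -> [task names]
        self._instances = {}              # instance id -> instance record
        self._ids = itertools.count(1)

    def register(self, name, tasks):
        """Register a flow definition; each call creates a new version."""
        latest = max((v for (n, v) in self._definitions if n == name), default=0)
        version = latest + 1
        self._definitions[(name, version)] = list(tasks)
        return version

    def invoke(self, name, version):
        """Start a new instance of a registered flow and return its id."""
        if (name, version) not in self._definitions:
            raise KeyError("unknown flow %s v%s" % (name, version))
        instance_id = next(self._ids)
        self._instances[instance_id] = {
            "flow": name, "version": version, "state": "running"}
        return instance_id

    def state(self, instance_id):
        """Query the state of one flow instance."""
        return self._instances[instance_id]["state"]

    def running(self, name):
        """List ids of running instances for a given flow definition."""
        return sorted(i for i, inst in self._instances.items()
                      if inst["flow"] == name and inst["state"] == "running")

    def cancel(self, instance_id):
        """Cancel a running instance; other states are left untouched."""
        inst = self._instances[instance_id]
        if inst["state"] == "running":
            inst["state"] = "cancelled"   # assumed state name
```

Because each instance has its own record, multiple invocations of the same flow can be tracked side by side.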

Datastore
The following information could be stored in the Task Flow service datastore:
 * 1) List of registered flows, tasks, and the associated constraints like timeouts, retries
 * 2) Execution state for the Task Flow instances (completed, running, error, ready to run)
 * 3) Scheduled Task Queues. The Task Flow engine could maintain a task queue for each of the registered task types. The Task Flow engine could publish task items to the task queues when a task needs to be scheduled for execution
 * 4) Task Flow Process Context containing the runtime information associated with a given task flow instance i.e. the input data that came from the application that invoked the task flow, the output data generated by the Task Flow tasks, and any other data needed for administering the task flow instance, like the start time, running duration, etc.
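One possible shape for the records listed above is sketched below with Python dataclasses. The class and field names are hypothetical, chosen only to mirror the four items in the list; a real service would presumably persist these in a database rather than in memory.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class TaskDefinition:
    name: str
    timeout_seconds: Optional[float] = None   # constraint: timeout
    retry_count: int = 0                      # constraint: retries


@dataclass
class FlowDefinition:
    name: str
    tasks: List[TaskDefinition] = field(default_factory=list)


@dataclass
class ProcessContext:
    """Runtime information for one flow instance."""
    input_data: Dict[str, Any] = field(default_factory=dict)
    output_data: Dict[str, Any] = field(default_factory=dict)
    start_time: Optional[float] = None


@dataclass
class FlowInstance:
    flow_name: str
    state: str = "ready to run"   # completed, running, error, ready to run
    context: ProcessContext = field(default_factory=ProcessContext)


@dataclass
class Datastore:
    flows: Dict[str, FlowDefinition] = field(default_factory=dict)
    instances: Dict[int, FlowInstance] = field(default_factory=dict)
    # One queue per registered task type, holding instance ids awaiting work.
    task_queues: Dict[str, List[int]] = field(default_factory=dict)


def schedule(store, task_type, instance_id):
    """Publish a work item to the queue for the given task type."""
    store.task_queues.setdefault(task_type, []).append(instance_id)
```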

Conceptual Diagram
The diagram below depicts a possible interaction between the Task Flow engine and a Task Flow client making use of the service. The green boxes are implemented by the Task Flow client. Note that while the diagram below shows an interaction where it is expected that the client will poll the engine for state (i.e. there are no notifications being sent from the engine), one could envision a system where poll, push, or a combination of methods are used to "notify" about state changes.
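The polling interaction described above could look roughly like the following on the client side. This is a sketch under assumptions: poll_until_done is a hypothetical helper, get_state stands in for whatever call the client would make against the service (for example a REST request), and "completed" and "error" are treated as the terminal states.

```python
import time


def poll_until_done(get_state, instance_id, interval=1.0, max_polls=60,
                    sleep=time.sleep):
    """Poll the engine for a flow instance's state until it is terminal.

    get_state is any callable that takes an instance id and returns the
    current state string. Raises TimeoutError if no terminal state is
    seen within max_polls polls.
    """
    for _ in range(max_polls):
        state = get_state(instance_id)
        if state in ("completed", "error"):
            return state
        sleep(interval)
    raise TimeoutError("instance %s did not finish in time" % instance_id)
```

A push based variant would replace this loop with a handler that the service calls (e.g. a webhook) when the state changes.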



Strategy for Implementation
April, 2013: At the Havana design summit, it was proposed (and generally agreed upon) that a Task Flow System is valuable to a number of projects and customer cloud use cases and should be developed. The following general approach is desired:
 * 1) Incubate a Task Flow/System library function within Heat
 * 2) Graduate and propose the library to Oslo upon ensured stability and reasonable maturity
 * 3) Develop a standalone TaskSystem service using the capabilities of the Task Flow library in Oslo

The following is the presentation that was given at the Havana summit that led to the outcomes noted above:

The etherpad notes collected during the un-conference presentation are as follows: https://etherpad.openstack.org/Convection