Revised on: 8/24/2013 by Harlowja
This wiki documents some of the core foundational pieces of a workflow primitive library.
A job is the initial (and any derivative) set of tasks & workflows required to fulfill an action. It has an identifier which can be used to track the progress of the job as its underlying tasks & workflows transition between states.
A task would form the underlying workflow component; it could be a simple object with apply() and revert() methods. apply() would perform some action, which revert() could attempt to undo (if applicable).
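As a minimal sketch of this idea (the class and method names beyond apply()/revert() are assumptions, not an established API), a task could look like:

```python
class Task:
    """Hypothetical task primitive: apply() performs an action and
    revert() attempts to undo it (if applicable)."""

    def apply(self, context):
        raise NotImplementedError

    def revert(self, context):
        pass  # default: nothing to undo


class CreateRecord(Task):
    """Example task that adds a record to a shared context dict."""

    def apply(self, context):
        context.setdefault("records", []).append("record-1")

    def revert(self, context):
        # Undo the apply() above by removing what it added.
        if "record-1" in context.get("records", []):
            context["records"].remove("record-1")
```

A workflow engine would call apply() going forward and revert() when rolling back.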
Workflows would be a set of common patterns that order tasks in various ways, but would be separated from the tasks themselves. You could imagine a workflow of [get up in the morning, take shower, go to work]. Each of those tasks could be applied independently, but that would likely not produce the correct result; a simple linear sequence is the correct ordering here. This is one example of a pattern a set of tasks can go through (aka a linear workflow), and one that should not require duplicated code to accomplish. A set of common patterns that perform said workflows (where tasks are attached to said workflow pattern) would be very useful to have, to avoid creating arbitrary and ad-hoc workflows.
Extras: a workflow as a whole could also have a revert() method, allowing for different workflow-level reconciliation strategies (one such strategy could be undoing each individual task in the workflow; another could be a simpler operation that undoes the workflow's results in one step).
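The linear pattern plus the undo-each-task reconciliation strategy can be sketched as follows (a hypothetical illustration, assuming tasks expose apply()/revert() as described above; the class names are made up):

```python
class LinearWorkflow:
    """Hypothetical linear pattern: run tasks in order; if one fails,
    revert the already-completed tasks in reverse order."""

    def __init__(self, tasks):
        self.tasks = tasks

    def apply(self, context):
        completed = []
        try:
            for task in self.tasks:
                task.apply(context)
                completed.append(task)
        except Exception:
            # Workflow-level reconciliation: undo each completed task.
            for task in reversed(completed):
                task.revert(context)
            raise


class Step:
    """Trivial task used only to illustrate ordering."""

    def __init__(self, name, fail=False):
        self.name, self.fail = name, fail

    def apply(self, ctx):
        if self.fail:
            raise RuntimeError(self.name + " failed")
        ctx.setdefault("log", []).append("did " + self.name)

    def revert(self, ctx):
        ctx.setdefault("log", []).append("undid " + self.name)
```

Running LinearWorkflow([Step("get up"), Step("take shower"), Step("go to work")]) applies the morning routine in order; if "take shower" fails, "get up" is reverted before the error propagates.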
To avoid re-creating common workflow patterns, frequently repeated sequences of tasks should be provided as primitives.
Some of the patterns could be:
- Linear independent tasks
- Tasks with input and output ordering dependencies (non-cyclical)
- Distributed tasks with input and output ordering dependencies (potentially-cyclical)
There needs to be a concept of ownership of a given job and its associated workflow (and later its resources), which can be used to guarantee that said workflow is worked on (and owned) by only one entity at a given time.
In order to ensure that jobs and associated resources are not used or worked on by multiple entities simultaneously, there needs to be a locking service that provides locks for the various workflow locking use-cases.
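An in-memory sketch of the single-owner guarantee such a service would provide (the class and method names are hypothetical; a real backend such as a database or zookeeper would provide the same semantics across processes):

```python
import threading


class LockService:
    """Hypothetical locking-service sketch: at most one entity can hold
    the lock for a given job at any time."""

    def __init__(self):
        self._mutex = threading.Lock()
        self._owners = {}  # job_id -> owning entity

    def try_acquire(self, job_id, owner):
        with self._mutex:
            current = self._owners.get(job_id)
            if current is None:
                self._owners[job_id] = owner
                return True
            return current == owner  # re-acquiring your own lock is fine

    def release(self, job_id, owner):
        with self._mutex:
            # Only the current owner may release the lock.
            if self._owners.get(job_id) == owner:
                del self._owners[job_id]
```

Any entity whose try_acquire() returns False must not touch the job, which is the workflow-level guarantee described above.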
See: StructuredWorkflowLocks for a more detailed discussion.
There needs to be a way to get a job to an entity that can work on said job. One could imagine a job ownership 'service' (similar to the physical concept of a job board), which would be used to post and atomically claim an actionable job; for example, one entity (such as the nova-api) posts a job that another (such as the conductor) claims. Currently the MQ (in combination with the DB) is used to post and claim actionable pieces of work, but if the concept is generalized there could be a MQ+DB backend, an in-memory backend, or a zookeeper backend as ways to post and atomically claim jobs.
Reclamation: the other concept that needs to exist (and which each backend can provide) is job reclamation and/or reposting. This is needed to detect when jobs have failed and, by some mechanism (which may or may not involve reposting to the job board), to reclaim jobs when the entity processing them fails.
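The post/claim/reclaim operations can be sketched with an in-memory backend (hypothetical names; a MQ+DB or zookeeper backend would expose the same operations, with claim() being the one that must be atomic):

```python
import threading


class JobBoard:
    """Hypothetical job-board sketch: post() makes a job visible,
    claim() atomically assigns it to exactly one entity, and abandon()
    returns it to the unclaimed state so another entity can reclaim it
    after a failure."""

    def __init__(self):
        self._mutex = threading.Lock()
        self._jobs = {}  # job_id -> {"details": ..., "owner": None or entity}

    def post(self, job_id, details):
        with self._mutex:
            self._jobs[job_id] = {"details": details, "owner": None}

    def claim(self, job_id, owner):
        with self._mutex:
            job = self._jobs.get(job_id)
            if job is None or job["owner"] is not None:
                return False  # unknown job, or already claimed
            job["owner"] = owner
            return True

    def abandon(self, job_id, owner):
        # Reclamation hook: the job becomes claimable again.
        with self._mutex:
            job = self._jobs.get(job_id)
            if job is not None and job["owner"] == owner:
                job["owner"] = None
```

Detecting that an owner has died (so that abandon() can be triggered) is backend-specific; zookeeper's ephemeral nodes could do it automatically, while a DB backend would need heartbeats or timeouts.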
Extras: possible other API extensions could also be added to determine the current entity processing a job and its status; depending on the backing ownership service there could be further extensions that allow manual transfer of failed workflows to other 'entities' (the ZK impl. could likely do this automatically, so it may not need said extensions).
In order to do resumption of tasks there needs to be enough associated history of what tasks/workflows have occurred for workflow ownership to be resumable, reclaimable & revertable. Likely there can be a database-backed (or zookeeper-backed) log, matching the concept of a ship's logbook, to track such information. It would store which tasks/workflows have occurred, and for each task/workflow a reference to a description of what occurred (the metadata part). Both the log and the associated metadata would be needed in order to do correct rollback.
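A minimal sketch of such a logbook (hypothetical class and field names; a real version would persist entries to the database or zookeeper rather than a list):

```python
import time


class LogBook:
    """Hypothetical ship's-logbook sketch: an append-only record of
    task/workflow state transitions plus per-entry metadata, which is
    what resumption and rollback would read back."""

    def __init__(self):
        self.entries = []

    def record(self, name, state, metadata=None):
        self.entries.append({
            "name": name,
            "state": state,  # e.g. "started", "completed", "reverted"
            "metadata": metadata or {},
            "at": time.time(),
        })

    def completed(self):
        # On resumption these tasks would be skipped; on rollback they
        # would be reverted in reverse order.
        return [e["name"] for e in self.entries if e["state"] == "completed"]
```

The metadata dict is where each task would stash whatever revert() later needs (resource ids, previous values, and so on).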
Reserve/configure/acquire (or release): in order to correctly undo resource allocations, each api/library or system that is integrated with needs semantics to first reserve the resource (but not power it on), then to configure said resource, and finally an acquire semantic (a synonym for powering it on). If any of those 3 stages fails then there must be a way to destroy said resource (either by a simple destroy() functionality, or via a [poweroff, unconfigure, unreserve] sequence) so that said resource can be released.
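That three-stage lifecycle with its reverse-order undo can be sketched as follows (all method names here are assumed for illustration; a real integration would call into the backing api/library at each stage):

```python
class Resource:
    """Hypothetical reserve/configure/acquire lifecycle. If any stage
    fails, the stages already done are undone in reverse via the
    [poweroff, unconfigure, unreserve] sequence."""

    STAGES = ("reserve", "configure", "acquire")
    UNDO = {"reserve": "unreserve", "configure": "unconfigure",
            "acquire": "poweroff"}

    def __init__(self):
        self.history = []  # records stage calls, for illustration only

    # Each stage just records itself; real implementations would talk
    # to the backing api/library here.
    def reserve(self):     self.history.append("reserve")
    def configure(self):   self.history.append("configure")
    def acquire(self):     self.history.append("acquire")
    def unreserve(self):   self.history.append("unreserve")
    def unconfigure(self): self.history.append("unconfigure")
    def poweroff(self):    self.history.append("poweroff")

    def provision(self):
        done = []
        try:
            for stage in self.STAGES:
                getattr(self, stage)()
                done.append(stage)
        except Exception:
            # Release the resource: undo completed stages in reverse.
            for stage in reversed(done):
                getattr(self, self.UNDO[stage])()
            raise
```

If configure() raises, only unreserve() runs; if acquire() raises, poweroff() is skipped and unconfigure() then unreserve() run, since only completed stages are undone.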