This page is an attempt to document the requirements for any task/workflow system to be used by Heat for orchestration.
Existing requirements
These requirements are met by the current task system in Heat, which uses coroutine-based tasks similar to the ones in Tulip.
Task requirements
Tasks are implemented in the TaskRunner class in scheduler.py. It should be possible to completely swap out the task system by replacing TaskRunner with another task implementation (though it will have to cope with coroutines).
Tasks run in parallel
Many resources are slow to start, so where no dependencies prevent it, tasks must run in parallel.
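To make the coroutine style concrete, here is a minimal sketch (the names are hypothetical and do not match Heat's actual TaskRunner API): a task is a generator that yields whenever it is waiting, and a runner steps all live tasks round-robin so that independent tasks make progress in parallel.

# Hypothetical sketch of coroutine-based tasks; the names are
# illustrative and do not match Heat's scheduler.py.

def poll_until_active(name, checks):
    # 'checks' stands in for polling a real status API
    for active in checks:
        if active:
            break
        yield  # not ready yet; let other tasks run
    print('%s is active' % name)

class SimpleRunner(object):
    # Stand-in for a TaskRunner-like wrapper; any replacement task
    # system would still have to be able to drive generators like this.
    def __init__(self, task):
        self._task = task

    def step(self):
        # Advance the task one step; return True once it is done.
        try:
            next(self._task)
            return False
        except StopIteration:
            return True

# Round-robin the runners so unrelated tasks progress in parallel.
runners = [SimpleRunner(poll_until_active('server-1', [False, True])),
           SimpleRunner(poll_until_active('server-2', [False, False, True]))]
while runners:
    runners = [r for r in runners if not r.step()]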
Tasks can spawn other tasks
For example, creating a Nova server may require attaching multiple volumes to it. These volume attachments need to happen in parallel with each other, and with other tasks that are running at the same time.
We also support resources that are themselves stacks, and operations performed on the parent stack are typically mirrored on the resource-stacks. Once again, these must happen in parallel with other resources in the parent stack.
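As a hedged illustration (hypothetical helpers, not Heat's API), a parent coroutine can spawn child coroutines and step them alongside its own waiting, so the children run in parallel with each other and, via the parent's yield, with sibling tasks:

def attach_volume(server, volume):
    print('attaching %s to %s' % (volume, server))
    yield  # waiting for the attachment to complete
    print('%s attached' % volume)

def create_server(server, volumes):
    children = [attach_volume(server, v) for v in volumes]
    while children:
        still_running = []
        for child in children:
            try:
                next(child)  # step each child once per turn
                still_running.append(child)
            except StopIteration:
                pass
        children = still_running
        yield  # also lets sibling tasks in the workflow run

parent = create_server('server-1', ['vol-a', 'vol-b'])
for _ in parent:  # a runner would drive this the same way
    pass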
Tasks propagate exceptions
Errors are raised as exceptions, not returned as values. Wrapping exceptions is OK, though.
Tasks can time out
A task can (optionally) be cancelled if it fails to complete within a given time. This should be (easily) configurable dynamically.
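A sketch of how a driver could enforce such a timeout (an assumed wrapper, not Heat's actual mechanism); note that the limit is picked by the caller at run time:

import time

def run_with_timeout(task, timeout_secs):
    # timeout_secs is chosen dynamically by the caller
    start = time.time()
    for _ in task:
        if time.time() - start > timeout_secs:
            task.close()  # raises GeneratorExit inside the task
            raise RuntimeError('task timed out after %ss' % timeout_secs)

def slow_task():
    while True:
        yield
        time.sleep(0.01)  # stand-in for slow work

try:
    run_with_timeout(slow_task(), timeout_secs=0.05)
except RuntimeError as exc:
    print(exc)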
Task cancellation should cancel any subtasks
If a task is cancelled, any tasks that it has spawned also need to be cancelled.
Tasks can clean up after themselves
If a task is cancelled or timed out, it should have the chance to clean up (e.g. by catching an exception) before exiting.
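With generator-based tasks, one way this can work is that cancelling a task calls close(), which raises GeneratorExit inside the coroutine, so a try/except (or finally) block gives the task a chance to clean up. A minimal sketch:

def create_volume():
    print('creating volume')
    try:
        while True:
            yield  # waiting for the volume to become available
    except GeneratorExit:
        print('cancelled: deleting the half-created volume')
        raise  # a closed generator must not keep running

task = create_volume()
next(task)    # start the task
task.close()  # cancel it; the except block runs first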
Tasks don't make debugging unnecessarily difficult
Given a debug log containing a stack trace, it should be easy to work out in which of several tasks running in parallel an error has occurred. Note that most tasks running in parallel are probably identical except for their arguments (e.g. they're all Resource.create(), but for different Resource objects).
Tasks can modify state
When a task modifies some piece of application state, it shouldn't be necessary to reload everything from the database in order for that to be reflected in other tasks and in the caller. (This implies that tasks run in the same process.)
Tasks can write to the database
Preferably without opening a new connection, or creating the risk of stale DB caches (for other tasks/the caller) in sqlalchemy.
Tasks should be inherently thread-safe
Eventlet-style hackery doesn't count.
Existing integration tests must work
Most "unit" tests in Heat (and almost all of the important ones) are integration tests that typically involve running a whole workflow (not just a single task), e.g. creating a whole stack. These tests need to run in such a way that:
- They are fairly representative of real use (e.g. tasks still run in parallel)
- Mock objects are preserved
- Calls to mock objects are recorded and ordered correctly, even across tasks
- Exceptions are propagated back to the test
Workflow requirements
There are two basic workflows implemented in Heat:
- PollingTaskGroup: starts a number of tasks in parallel and waits until they are all complete.
- DependencyTaskGroup: runs a parallelised workflow for an arbitrary dependency graph.
Both workflows are implemented in scheduler.py.
Workflows are runtime configurable
Dependency graphs are built at runtime based on the user's template, so it must be easy to dynamically put together a workflow.
Task dependencies can be arbitrarily complex
Parallel tasks should respect dependencies in the form of an arbitrary directed acyclic graph.
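As a simplified, hypothetical stand-in for DependencyTaskGroup (not its real interface), a driver can repeatedly start every task whose prerequisites are complete and step all running tasks together:

def run_dag(tasks, requires):
    # tasks: {name: generator}; requires: {name: set of prerequisite names}
    # Assumes the graph is acyclic, per the requirement above.
    done, running = set(), {}
    while tasks or running:
        for name in list(tasks):
            if requires.get(name, set()) <= done:
                running[name] = tasks.pop(name)  # all prerequisites met
        for name, task in list(running.items()):
            try:
                next(task)  # step every running task: parallel progress
            except StopIteration:
                del running[name]
                done.add(name)

def noop(name):
    print('start %s' % name)
    yield
    print('end %s' % name)

run_dag({n: noop(n) for n in ('net', 'server', 'volume')},
        {'server': {'net'}, 'volume': {'server'}})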
All workflow tasks are cancelled on failure
In the event of a task reporting a failure, all other tasks in the same workflow need to be cancelled.
Workflows can be cancelled
The user may decide to stop a workflow that is in progress; this should cancel all of the tasks within the workflow.
Future requirements
There are some features that we either don't have or don't use yet, but would like to use in the future.
Task requirements
Tasks have synchronisation points
Currently we add a dependency between any two resources where one gets data from the other (so the data reader can only start once the data source has finished). However, in most cases the data is actually available much earlier.
This is fairly easy to implement in the current coroutine system, since the explicit "yield" acts as a synchronisation point.
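As a sketch of the idea under an assumed convention (nothing like this exists in Heat yet), a task could yield a marker once its outputs are usable, so the scheduler can unblock readers before the producer fully finishes:

DATA_READY = object()  # assumed marker, purely illustrative

def create_server():
    print('server: boot requested, address already known')
    yield DATA_READY  # readers of our attributes can start now
    print('server: waiting for the OS to come up')
    yield
    print('server: active')

for event in create_server():
    if event is DATA_READY:
        print('scheduler: unblocking tasks that only need the address')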
Tasks can be retried
If a task fails, we want to be able to (optionally) retry it. This should be (easily) configurable dynamically.
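A hedged sketch of what an optional retry wrapper might look like (an assumed helper, not an existing Heat API): since a generator cannot be restarted, the wrapper takes a factory and re-creates the task for each attempt, with the attempt count chosen at run time.

def with_retries(task_factory, attempts):
    # 'attempts' is configurable dynamically by the caller
    for attempt in range(1, attempts + 1):
        try:
            for _ in task_factory():  # drive a fresh task to completion
                pass
            return
        except Exception as exc:
            print('attempt %d failed: %s' % (attempt, exc))
    raise RuntimeError('task failed after %d attempts' % attempts)

def flaky(outcomes=iter([RuntimeError('boom'), None])):
    # The mutable default persists across calls, so the first run
    # fails and the second run succeeds.
    yield
    outcome = next(outcomes)
    if outcome is not None:
        raise outcome

with_retries(flaky, attempts=3)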
Workflow requirements
Workflows can be rolled back
If a task in the workflow fails then, as well as cancelling all tasks, any tasks that have been started should be rolled back.
Workflows report their state
Before a workflow begins rolling back, the caller needs to have the opportunity to:
- record this; and
- optionally, cancel it
So if e.g. a resource fails, we want to put the stack into the rollback state before rolling back.
(This is the hardest part to implement nicely in the current system.)