
Heat/TaskSystemRequirements


This page is an attempt to document the requirements for any task/workflow system to be used by Heat for orchestration.

Existing requirements

These requirements are met by the current task system in Heat, which uses coroutine-based tasks similar to the ones in Tulip.

Task requirements

Tasks are implemented in the TaskRunner class in scheduler.py. It should be possible to completely swap out the task system by replacing TaskRunner with another task implementation (though it will have to cope with coroutines).

[jh]: that's heat specific code, but the concept of swapping out different backends will be provided
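For illustration, here is a minimal sketch of the coroutine style in use: plain Python generators stepped by a runner. The names SimpleRunner and create_resource are made up for this example; this is not Heat's actual TaskRunner API.

  def create_resource(name):
      # a task: each yield hands control back to the scheduler
      print('starting %s' % name)
      yield
      print('finished %s' % name)

  class SimpleRunner(object):
      def __init__(self, task_func, *args, **kwargs):
          self._runner = task_func(*args, **kwargs)
          self.done = False

      def step(self):
          # advance the task by one iteration; return True when complete
          try:
              next(self._runner)
          except StopIteration:
              self.done = True
          return self.done

  runner = SimpleRunner(create_resource, 'web_server')
  while not runner.step():
      pass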

Tasks run in parallel

A lot of resources are slow to start so, where there are no dependencies preventing it, tasks must run in parallel.

[jh]: sure
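A sketch of how parallelism falls out of the coroutine model: a loop steps each live task in turn, so slow operations overlap rather than serialise. Purely illustrative:

  def wait_for(name, polls):
      # stand-in for a slow resource needing several polling rounds
      for i in range(polls):
          print('%s: poll %d' % (name, i))
          yield

  tasks = [wait_for('server', 3), wait_for('volume', 2)]
  while tasks:
      for task in list(tasks):
          try:
              next(task)
          except StopIteration:
              tasks.remove(task)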

Tasks can spawn other tasks

For example, creating a Nova server may require attaching multiple volumes to it. These volume attachments need to happen in parallel with each other, and with other tasks that are running at the same time.

We also support resources that are themselves stacks, and operations performed on the parent stack are typically mirrored on the resource-stacks. Once again, these must happen in parallel with other resources in the parent stack.

[jh]: are the stacks modified/added while running, or are all stacks 'compiled' and a fixed set of stacks is run?
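To make the spawning requirement concrete, here is a sketch in which a parent task appends subtasks to the same scheduler list that the stepping loop drains, so the volume attachments run in parallel with everything else. The scheduler list and all names are assumptions of this example:

  def attach_volume(server, volume):
      print('attaching %s to %s' % (volume, server))
      yield
      print('attached %s' % volume)

  def create_server(scheduler, name, volumes):
      print('creating %s' % name)
      yield
      for vol in volumes:
          # spawn subtasks; they run alongside any other live tasks
          scheduler.append(attach_volume(name, vol))

  scheduler = []
  scheduler.append(create_server(scheduler, 'web1', ['vol1', 'vol2']))
  while scheduler:
      for task in list(scheduler):
          try:
              next(task)
          except StopIteration:
              scheduler.remove(task)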

Tasks propagate exceptions

Errors are returned in the form of exceptions (not return values). Wrapping exceptions is OK though.

[jh]: if they run in parallel, can't you have multiple exceptions being thrown at once (from different paths)? how is this handled? which exception finally exits the task system?
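A sketch of exception propagation in the coroutine model: an exception raised inside a task surfaces in whoever steps it, where it can be caught or wrapped. In a single-process stepping loop, tasks fail one next() call at a time, so the first exception to surface is the one the caller sees. ResourceFailure here is a made-up wrapper name:

  class ResourceFailure(Exception):
      # example wrapper exception (a made-up name)
      pass

  def failing_task():
      yield
      raise ValueError('quota exceeded')

  task = failing_task()
  try:
      while True:
          next(task)
  except StopIteration:
      pass
  except ValueError as exc:
      # the caller sees the original exception and may wrap it
      print('caught: %s' % ResourceFailure(exc))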

Tasks can time out

A task can (optionally) be cancelled if it fails to complete within a given time. This should be (easily) configurable dynamically.

[jh]: sure
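A sketch of a deadline enforced by the stepping loop; cancelling is just closing the generator. The interval and names are illustrative only:

  import time

  def slow_task():
      while True:
          yield
          time.sleep(0.05)   # stand-in for a slow polling operation

  task = slow_task()
  deadline = time.time() + 0.2   # the timeout would be configurable
  while time.time() <= deadline:
      next(task)
  task.close()   # cancel: raises GeneratorExit inside the coroutine
  print('task cancelled after timeout')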

Task cancellation should cancel any subtasks

If a task is cancelled, any tasks that it has spawned also need to be cancelled.

[jh]: depending on answer to spawning question this seems ok.

Tasks can clean up after themselves

If a task is cancelled or timed out, it should have the chance to clean up (e.g. by catching an exception) before exiting.

[jh]: so this raises the question of what a task actually is. if a task needs to clean up before it has run most of its code, why is it 1 task instead of N tasks to begin with? and if something can run and then be cancelled, shouldn't the part that ran before cancellation be its own task, and the part that runs afterwards be its own task as well?
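To illustrate how this works in the coroutine model (one possible answer to the question above): cancellation raises GeneratorExit at the task's current yield, so a try/finally (or an except block) lets a single task release what it has already acquired, without splitting it into N tasks. A minimal sketch:

  def create_with_cleanup():
      print('allocating resource')
      try:
          while True:
              yield   # polling for completion
      finally:
          print('releasing resource')   # runs on cancel/timeout too

  task = create_with_cleanup()
  next(task)      # task starts and acquires its resource
  task.close()    # cancellation triggers the finally block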

Tasks don't make debugging unnecessarily difficult

Given a debug log containing a stack trace, it should be easy to work out in which of several tasks running in parallel an error has occurred. Note that most tasks running in parallel are probably identical except for their arguments (e.g. they're all Resource.create(), but for different Resource objects).

[jh]: this seems like an application choice, not something a library should necessarily prescribe, if said library wants to run in a distributed manner then it may have to take hits with the ease of debugging. if it doesn't want to run in a distributed manner then it doesn't (or may not) have to take that hit.

Tasks can modify state

When a task modifies some piece of application state, it shouldn't be necessary to reload everything from the database in order for that to be reflected in other tasks and in the caller. (This implies that tasks run in the same process.)

[jh]: this seems like an application choice, not something a library should necessarily prescribe, if said library wants to run in a distributed manner then it may have to take hits with reloading data. if it doesn't want to run in a distributed manner then it doesn't (or may not) have to take that hit.

Tasks can write to the database

Preferably without opening a new connection, or creating the risk of stale DB caches (for other tasks/the caller) in sqlalchemy.

[jh]: sure

Tasks should be inherently thread-safe

Eventlet-style hackery doesn't count.

[jh]: unsure what this means; tasks are developer-provided code, and it's not up to a task system to enforce that tasks are thread-safe (how could it?).

Existing integration tests must work

Most "unit" tests in Heat (and almost all of the important ones) are integration tests that typically involve running a whole workflow (not just a single task), e.g. creating a whole stack. These tests need to run in such a way that:

  • They are fairly representative of real use (e.g. tasks still run in parallel)
  • Mock objects are preserved
  • Calls to mock objects are recorded and ordered correctly, even across tasks
  • Exceptions are propagated back to the test

[jh]: ok, this seems like heat integration requirements

Workflow requirements

There are two basic workflows implemented in Heat:

  1. PollingTaskGroup: starts a number of tasks in parallel and waits until they are all complete.
  2. DependencyTaskGroup: runs a parallelised workflow for an arbitrary dependency graph.

Both workflows are implemented in scheduler.py.

[jh]: ok, this is implemented in taskflow.
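For reference, a sketch of the DependencyTaskGroup idea: a task is launched only once all of its prerequisites have finished, and ready tasks are stepped in parallel. The graph format (name -> set of prerequisite names) is an assumption of this example, not the actual interface:

  def make_task(name):
      print('start %s' % name)
      yield
      print('done %s' % name)

  graph = {'db': set(), 'app': {'db'}, 'lb': {'app'}}
  done, running = set(), {}
  while len(done) < len(graph):
      # launch every task whose prerequisites are all satisfied
      for name, needs in graph.items():
          if name not in done and name not in running and needs <= done:
              running[name] = make_task(name)
      for name, task in list(running.items()):
          try:
              next(task)
          except StopIteration:
              done.add(name)
              del running[name]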

Workflows are runtime configurable

Dependency graphs are built at runtime based on the user's template, so it must be easy to dynamically put together a workflow.

[jh]: so this is not really runtime? this is arbitrary flow construction while the application is running, but not modification while the flow is running?

Task dependencies can be arbitrarily complex

Parallel tasks should respect dependencies in the form of an arbitrary directed acyclic graph.

[jh]: sure

All workflow tasks are cancelled on failure

In the event of a task reporting a failure, all other tasks in the same workflow need to be cancelled.

[jh]: so this is a cancellation policy that we can support (not all users want to cancel all other tasks in the flow when one task fails)
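A sketch of this cancellation policy: if any task raises, close every coroutine still in flight, then re-raise. As the comment above notes, this is just one policy among several:

  def run_group(tasks):
      tasks = list(tasks)
      try:
          while tasks:
              for task in list(tasks):
                  try:
                      next(task)
                  except StopIteration:
                      tasks.remove(task)
      except Exception:
          for task in tasks:
              task.close()   # cancel everything still running
          raise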

Workflows can be cancelled

The user may decide to stop a workflow that is in progress; this should cancel all of the tasks within the workflow.

[jh]: sure

Future requirements

There are some features that we either don't have or don't use yet, but would like to use in the future.

Task requirements

Tasks have synchronisation points

Currently we add a dependency between any two resources where one gets data from the other (so the data reader can only start once the data source has finished). However, in most cases the data is actually available much earlier.

This is fairly easy to implement in the current coroutine system, since the explicit "yield" acts as a synchronisation point.

[jh]: I think something like this can be accommodated (without needing yield concepts), although it needs more investigation.
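A sketch of yield as a synchronisation point: the source publishes its data mid-task, and the reader waits only for the data to appear, not for the source task to finish. The shared dict is a stand-in for however the data would actually be communicated:

  shared = {}

  def source():
      shared['ip'] = '10.0.0.5'   # data available early
      yield                       # sync point: readers may proceed
      print('source: finishing other work')
      yield

  def reader():
      while 'ip' not in shared:   # wait only for the data, not the task
          yield
      print('reader: got %s' % shared['ip'])

  tasks = [source(), reader()]
  while tasks:
      for task in list(tasks):
          try:
              next(task)
          except StopIteration:
              tasks.remove(task)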

Tasks can be retried

If a task fails, we want to be able to (optionally) retry it. This should be (easily) configurable dynamically.

[jh]: sure
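A sketch of a retry wrapper around a coroutine task; max_retries is a made-up parameter name suggesting the dynamic configurability:

  def with_retries(task_func, max_retries=3):
      for attempt in range(1, max_retries + 1):
          try:
              for step in task_func():
                  yield step
              return
          except Exception as exc:
              print('attempt %d failed: %s' % (attempt, exc))
      raise RuntimeError('gave up after %d attempts' % max_retries)

  def flaky():
      yield
      raise IOError('transient failure')

  try:
      for _ in with_retries(flaky, max_retries=2):
          pass
  except RuntimeError as exc:
      print(exc)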

Workflow requirements

Workflows can be rolled back

If a task in the workflow fails then, as well as cancelling all tasks, any tasks that have been started should be rolled back.

[jh]: sure, although this is only one of many rollback strategies (rolling back all started when any task fails)
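A sketch of that particular strategy, written sequentially for brevity: remember the undo action for everything started, and on failure run them in reverse order. The (do, undo) pairing is an assumption of this example, not Heat's actual interface:

  def run_with_rollback(steps):
      done = []
      try:
          for do, undo in steps:
              do()
              done.append(undo)
      except Exception:
          for undo in reversed(done):
              undo()   # roll back everything that was started
          raise

  def fail():
      raise RuntimeError('boom')

  steps = [
      (lambda: print('create network'), lambda: print('delete network')),
      (fail, lambda: print('delete server')),
  ]
  try:
      run_with_rollback(steps)
  except RuntimeError:
      print('rolled back')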

Workflows report their state

Before a workflow begins rolling back, the caller needs to have the opportunity to:

  1. record this; and
  2. optionally, cancel it

So if e.g. a resource fails, we want to put the stack into the rollback state before rolling back.

(This is the hardest part to implement nicely in the current system.)

[jh]: sure

(asalkeld) Workflow Logging

I guess this is only partly related to workflow, but... if you have multiple workflows running concurrently, the logging is really confusing and never available to the user. Suggestion:

  1. when a workflow starts create a log destination
  2. all tasks in the workflow log to that destination
  3. we can then return that / save it to db/swift

This is neat because the user can retrieve it as a single entity: a nice record of how their request was handled, or how it failed.

[jh]: seems possible, might not be that hard?
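A sketch of the suggestion using the stdlib logging module: each workflow gets its own logger with an in-memory handler, whose contents can later be returned to the user or saved to the db/swift. All names here are illustrative:

  import io
  import logging

  def start_workflow_log(workflow_id):
      logger = logging.getLogger('workflow.%s' % workflow_id)
      logger.setLevel(logging.DEBUG)
      buf = io.StringIO()
      logger.addHandler(logging.StreamHandler(buf))
      return logger, buf

  logger, buf = start_workflow_log('stack-1234')
  logger.info('creating resource web_server')
  logger.error('resource web_server failed: quota exceeded')
  print(buf.getvalue())   # the single record the user can retrieve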