
Heat/TaskSystemRequirements

Revision as of 14:48, 20 June 2013 by Zaneb

This page is an attempt to document the requirements for any task/workflow system to be used by Heat for orchestration.

Existing requirements

These requirements are met by the current task system in Heat, which uses coroutine-based tasks similar to the ones in Tulip.

Task requirements

Tasks are implemented in the TaskRunner class in scheduler.py. It should be possible to completely swap out the task system by replacing TaskRunner with another task implementation (though it will have to cope with coroutines).
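The general shape of such a coroutine task can be sketched as follows. This is a toy stand-in for illustration only, not Heat's actual TaskRunner API; the names `SimpleTaskRunner`, `step` and `count_to` are invented:

```python
class SimpleTaskRunner:
    """Illustrative stand-in for a TaskRunner-like wrapper: holds a
    generator and advances it one yield at a time."""

    def __init__(self, task_func, *args, **kwargs):
        self._runner = task_func(*args, **kwargs)
        self.done = False

    def step(self):
        """Advance the task by one step; returns True once finished."""
        if not self.done:
            try:
                next(self._runner)
            except StopIteration:
                self.done = True
        return self.done


def count_to(n, log):
    """A trivial coroutine-style task: do a little work, then yield
    control back to the scheduler."""
    for i in range(n):
        log.append(i)
        yield


log = []
task = SimpleTaskRunner(count_to, 3, log)
while not task.step():
    pass
# log is now [0, 1, 2] and task.done is True
```

The key property is that the caller, not the task, decides when each step runs, which is what makes the scheduling behaviour below possible.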

Tasks run in parallel

A lot of resources are slow to start so, where there are no dependencies preventing it, tasks must run in parallel.
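With coroutines, parallelism can be had by stepping several generators round-robin in a single thread. A minimal sketch, with invented names (`resource_create`, `run_in_parallel`):

```python
def resource_create(name, log):
    # Simulate a slow create that takes several polling steps.
    for phase in ("requested", "building", "active"):
        log.append((name, phase))
        yield


def run_in_parallel(generators):
    """Step every task once per pass until all are exhausted
    (a simple round-robin scheduler)."""
    pending = list(generators)
    while pending:
        for gen in list(pending):
            try:
                next(gen)
            except StopIteration:
                pending.remove(gen)


log = []
run_in_parallel([resource_create("server", log),
                 resource_create("volume", log)])
# the phases of the two resources are interleaved, not sequential
```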

Tasks can spawn other tasks

For example, creating a Nova server may require attaching multiple volumes to it. These volume attachments need to happen in parallel with each other, and with other tasks that are running at the same time.

We also support resources that are themselves stacks, and operations performed on the parent stack are typically mirrored on the resource-stacks. Once again, these must happen in parallel with other resources in the parent stack.
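One way a coroutine scheduler can support spawning is to let a task yield new generators, which the scheduler then adds to its pending set. A hypothetical sketch (not Heat's actual mechanism; all names are invented):

```python
def attach_volume(vol_id, log):
    log.append(("attach", vol_id))
    yield
    log.append(("attached", vol_id))


def create_server(log):
    log.append("server created")
    # Spawn one subtask per volume; they should run in parallel.
    yield [attach_volume(v, log) for v in ("vol-1", "vol-2")]


def run(tasks):
    """Round-robin scheduler where yielding a list of generators
    spawns them as parallel subtasks of the workflow."""
    pending = list(tasks)
    while pending:
        for gen in list(pending):
            try:
                result = next(gen)
            except StopIteration:
                pending.remove(gen)
            else:
                if isinstance(result, list):
                    pending.extend(result)


log = []
run([create_server(log)])
# both volume attachments proceed in parallel after the server exists
```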

Tasks propagate exceptions

Errors are returned in the form of exceptions (not return values). Wrapping exceptions is OK though.
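With generators this falls out naturally: an exception raised inside the task propagates out of the `next()` call that resumed it. A sketch of the acceptable wrapping behaviour (`TaskFailed` and the helper names are illustrative):

```python
class TaskFailed(Exception):
    """Wrapper exception; wrapping the original error is acceptable
    as long as the failure still surfaces as an exception."""


def flaky_task():
    yield
    raise RuntimeError("resource went into ERROR state")


def run_to_completion(gen):
    while True:
        try:
            next(gen)
        except StopIteration:
            return
        except Exception as exc:
            # Propagate as an exception, not a return value.
            raise TaskFailed(str(exc)) from exc


caught = None
try:
    run_to_completion(flaky_task())
except TaskFailed as exc:
    caught = exc  # the caller sees the failure as an exception
```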

Tasks can time out

A task can (optionally) be cancelled if it fails to complete within a given time. This should be (easily) configurable dynamically.
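Because the scheduler regains control at every yield, it can enforce a deadline between steps. A rough sketch (names invented; the `clock` parameter is only there to make the example deterministic):

```python
import time


class TaskTimeout(Exception):
    pass


def step_with_timeout(gen, timeout, clock=time.monotonic):
    """Drive a coroutine task, cancelling it if it runs longer than
    `timeout` seconds. The deadline is an ordinary argument, so it
    can be chosen dynamically per task."""
    deadline = clock() + timeout
    for _ in gen:
        if clock() > deadline:
            gen.close()  # cancel the task
            raise TaskTimeout("task exceeded %ss" % timeout)


def slow_task(steps):
    for _ in range(steps):
        yield


ticks = iter(range(100))  # deterministic fake clock: 0, 1, 2, ...
try:
    step_with_timeout(slow_task(10), 2, clock=lambda: next(ticks))
    timed_out = False
except TaskTimeout:
    timed_out = True
```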

Task cancellation should cancel any subtasks

If a task is cancelled, any tasks that it has spawned also need to be cancelled.
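If the scheduler tracks which tasks each task has spawned, cancellation can cascade depth-first. A hypothetical sketch (the `Task` class and names are invented, not Heat's data model):

```python
class Task:
    """Illustrative task record that remembers its spawned subtasks."""

    def __init__(self, gen):
        self.gen = gen
        self.children = []

    def cancel(self):
        # Cancel depth-first: subtasks go down with their parent.
        for child in self.children:
            child.cancel()
        self.gen.close()


log = []


def work(name):
    try:
        while True:
            yield
    finally:
        log.append("cancelled %s" % name)


parent = Task(work("parent"))
parent.children.append(Task(work("child")))
next(parent.gen)                # start the parent...
next(parent.children[0].gen)    # ...and its subtask
parent.cancel()
# log == ["cancelled child", "cancelled parent"]
```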

Tasks can clean up after themselves

If a task is cancelled or timed out, it should have the chance to clean up (e.g. by catching an exception) before exiting.
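Generators support this directly: closing a generator raises GeneratorExit at the suspended yield, so a `try`/`finally` (or an explicit `except`) inside the task runs before it exits. A minimal sketch:

```python
def create_with_cleanup(log):
    log.append("acquired resource")
    try:
        while True:
            yield  # waiting for the resource to become active
    finally:
        # Runs on normal exit AND on cancellation (gen.close()),
        # giving the task a chance to clean up before exiting.
        log.append("released resource")


log = []
task = create_with_cleanup(log)
next(task)     # start the task
task.close()   # cancel it; raises GeneratorExit inside the task
# log == ["acquired resource", "released resource"]
```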

Tasks don't make debugging unnecessarily difficult

Given a debug log containing a stack trace, it should be easy to work out in which of several tasks running in parallel an error has occurred. Note that most tasks running in parallel are probably identical except for their arguments (e.g. they're all Resource.create(), but for different Resource objects).

Tasks can modify state

When a task modifies some piece of application state, it shouldn't be necessary to reload everything from the database in order for that to be reflected in other tasks and in the caller. (This implies that tasks run in the same process.)

Tasks can write to the database

Preferably this should happen without opening a new connection, and without creating the risk of stale SQLAlchemy caches for other tasks or the caller.

Existing integration tests must work

Most "unit" tests in Heat (and almost all of the important ones) are integration tests that typically involve running a whole workflow (not just a single task), e.g. creating a whole stack. These tests need to run in such a way that:

  • They are fairly representative of real use (e.g. tasks still run in parallel)
  • Mock objects are preserved
  • Calls to mock objects are recorded and ordered correctly, even across tasks
  • Exceptions are propagated back to the test

Workflow requirements

There are two basic workflows implemented in Heat:

  1. PollingTaskGroup: starts a number of tasks in parallel and waits until they are all complete.
  2. DependencyTaskGroup: runs a parallelised workflow for an arbitrary dependency graph.

Both workflows are implemented in scheduler.py.

Workflows are runtime configurable

Dependency graphs are built at runtime based on the user's template, so it must be easy to dynamically put together a workflow.

Task dependencies can be arbitrarily complex

Parallel tasks should respect dependencies in the form of an arbitrary directed acyclic graph.
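A DependencyTaskGroup-style runner can be sketched as follows: a task is started only once everything it depends on has finished, and all started tasks are stepped in parallel. This is an illustrative toy (`run_dag`, `start` and the graph layout are invented), not Heat's implementation:

```python
def run_dag(graph, start):
    """Run tasks respecting an arbitrary DAG. `graph` maps each node
    to the set of nodes it depends on; `start(node)` returns a
    generator implementing that node's task."""
    done = set()
    running = {}
    while len(done) < len(graph):
        # Start every task whose dependencies are all satisfied.
        for node, deps in graph.items():
            if node not in done and node not in running and deps <= done:
                running[node] = start(node)
        # Step all running tasks once (they run in parallel).
        for node, gen in list(running.items()):
            try:
                next(gen)
            except StopIteration:
                del running[node]
                done.add(node)


log = []


def start(node):
    def task():
        log.append(node)
        yield
    return task()


# volume and port depend on nothing; server depends on both
graph = {"volume": set(), "port": set(), "server": {"volume", "port"}}
run_dag(graph, start)
# volume and port run first (in parallel), server runs last
```

Because the graph is just a dictionary built at runtime, this shape also satisfies the "workflows are runtime configurable" requirement above.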

All workflow tasks are cancelled on failure

In the event of a task reporting a failure, all other tasks in the same workflow need to be cancelled.

Workflows can be cancelled

The user may decide to stop a workflow that is in progress; this should cancel all of the tasks within the workflow.

Future requirements

There are some features that we either don't have or don't use yet, but would like to use in the future.

Task requirements

Tasks have synchronisation points

Currently we add a dependency between any two resources where one gets data from the other (so the data reader can only start once the data source has finished). However, in most cases the data is actually available much earlier.

This is fairly easy to implement in the current coroutine system, since the explicit "yield" acts as a synchronisation point.
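To illustrate: the source task can publish its data and then yield, and a reader can proceed as soon as the data appears, without waiting for the source to finish. A sketch under a simple round-robin scheduler (all names invented):

```python
def data_source(shared, log):
    shared["output"] = "10.0.0.5"  # data is available early...
    log.append("source: data published")
    yield                          # ...this yield is the sync point
    log.append("source: finishing slow remaining work")
    yield


def data_reader(shared, log):
    # Wait only until the data is available, not until the source
    # task has completely finished.
    while "output" not in shared:
        yield
    log.append("reader: got %s" % shared["output"])


shared, log = {}, []
pending = [data_source(shared, log), data_reader(shared, log)]
while pending:
    for gen in list(pending):
        try:
            next(gen)
        except StopIteration:
            pending.remove(gen)
# the reader finishes before the source does
```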

Tasks can be retried

If a task fails, we want to be able to (optionally) retry it. This should be (easily) configurable dynamically.
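A retry policy can be layered on as a wrapper coroutine, with the attempt count passed in as an ordinary argument so it stays dynamically configurable. A hypothetical sketch (`with_retries` is an invented name):

```python
def with_retries(task_func, attempts, *args):
    """Wrap a coroutine task so that a failed run is restarted from
    the beginning, up to `attempts` times."""
    for attempt in range(attempts):
        try:
            for step in task_func(*args):
                yield step
            return  # task completed successfully
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: propagate the failure


calls = []


def flaky():
    calls.append("try")
    yield
    if len(calls) < 3:
        raise RuntimeError("transient failure")


for _ in with_retries(flaky, 5):
    pass
# flaky ran three times; the third attempt succeeded
```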

Workflow requirements

Workflows can be rolled back

If a task in the workflow fails then, as well as cancelling all tasks, any tasks that have been started should be rolled back.

Workflows report their state

Before a workflow begins rolling back, the caller needs to have the opportunity to:

  1. record this; and
  2. optionally, cancel it

So if e.g. a resource fails, we want to put the stack into the rollback state before rolling back.

(This is the hardest part to implement nicely in the current system.)