
NovaOrchestration/WorkflowEngines/pyutilib workflow

Revision as of 23:30, 17 February 2013 by Ryan Lane (talk | contribs) (Text replace - "__NOTOC__" to "")

pyutilib.workflow notes

Summary

pyutilib.workflow is part of pyutilib, an ensemble of Python packages developed to support work at Sandia National Laboratories.

The library was developed primarily to support automation of scientific workflows at Sandia, and has been made available for public use. It's not clear how widely the library is used outside of Sandia, but it is actively maintained.

It is a pure Python library. Its only external dependency is argparse, plus dependencies on some other parts of the pyutilib library, which are automatically installed.

Documentation consists primarily of an overview paper; code-level documentation is light to nonexistent. The code is clean, and there are unit tests. I didn't verify coverage, but the tests seem reasonable.

Licensing 
BSD https://software.sandia.gov/trac/pyutilib/wiki/Licensing
Packaging 
Packages are listed in the Python Package Index, and
installable with easy_install:
http://pypi.python.org/pypi/pyutilib.workflow/2.2.4.
Python Versions 
Python 2.4, 2.5, or 2.6 (2.4 is deprecated). I used it with Python 2.7 with no issues.
Blog 
https://software.sandia.gov/trac/pyutilib/blog
Documentation 
https://software.sandia.gov/svn/public/pyutilib/pyutilib.workflow/trunk/doc/workflow/workflow.pdf
Dependencies 
Some other parts of pyutilib, handled by the installer. Also argparse (in the standard library as of Python 2.7 / 3.2).

Functionality

The library provides two core objects - a Workflow and a Task. A Workflow defines a sequence of steps, composed of one or more Tasks. A Task declares its inputs and outputs, and is responsible for mapping the input data to the output data. Tasks have an execute method, which does the work.

Workflows are defined by creating Tasks, defining the connections between the task inputs and outputs, and then adding the tasks to a workflow. The syntax is clean and straightforward. Behind the scenes, Connector objects are created to represent the connections.
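To make the pattern concrete, here is a minimal pure-Python mock-up of the idea described above - tasks with declared inputs and outputs, connections between ports, and a workflow called as a function. This is an illustration of the semantics only, not pyutilib.workflow's actual API (see attachment:workflow1.py and the overview paper for real usage).

```python
# Mock-up of the Task/Workflow pattern described above.
# NOTE: this is NOT pyutilib.workflow's API - just an illustration.

class Task:
    def __init__(self, name):
        self.name = name
        self.inputs = {}   # input port name -> value
        self.outputs = {}  # output port name -> value
        self.links = []    # (my output port, downstream task, its input port)

    def connect(self, out_port, task, in_port):
        # Analogous to the Connector objects created behind the scenes.
        self.links.append((out_port, task, in_port))

    def execute(self):
        # Subclasses map input data to output data here.
        raise NotImplementedError

    def run(self):
        self.execute()
        # Propagate outputs to connected downstream input ports.
        for out_port, task, in_port in self.links:
            task.inputs[in_port] = self.outputs[out_port]

class Double(Task):
    def execute(self):
        self.outputs['y'] = 2 * self.inputs['x']

class AddOne(Task):
    def execute(self):
        self.outputs['z'] = self.inputs['y'] + 1

class Workflow:
    def __init__(self, *tasks):
        self.tasks = list(tasks)

    def __call__(self, **initial):
        # Seed the first task's inputs, run tasks in order, and
        # return the final task's outputs as a plain dict.
        self.tasks[0].inputs.update(initial)
        for task in self.tasks:
            task.run()
        return dict(self.tasks[-1].outputs)

a, b = Double('double'), AddOne('addone')
a.connect('y', b, 'y')      # wire a's output port to b's input port
flow = Workflow(a, b)
print(flow(x=3))            # {'z': 7}
```

The real library's syntax is cleaner than this (connections are declared directly between port objects), but the flow of data from declared outputs to declared inputs is the same.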

Workflows are not restricted to simple sequences - they also support fan-out and fan-in. Fan-out allows an output port to be connected to multiple inputs; in this case, the data is replicated to each of the input ports. Fan-in allows an input port to be connected to multiple outputs; the values arriving from the multiple outputs can be appended to a list, or stored in a dictionary indexed by the output port id.
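As a small sketch of the fan-out and fan-in semantics just described (again an illustration, not the library's API): fan-out replicates one value to several input ports, and fan-in collects several values into either a list or a port-indexed dictionary.

```python
# Illustration of fan-out / fan-in semantics, not pyutilib.workflow's API.

def fan_out(value, n):
    # Fan-out: one output value is replicated to n input ports.
    return [value] * n

def fan_in_list(values):
    # Fan-in (list mode): values from several output ports are
    # appended to a single list on the input port.
    return list(values)

def fan_in_dict(port_values):
    # Fan-in (dict mode): values stored in a dictionary indexed
    # by the id of the output port they came from.
    return {port_id: v for port_id, v in port_values}

print(fan_out(42, 3))                           # [42, 42, 42]
print(fan_in_list([1, 2, 3]))                   # [1, 2, 3]
print(fan_in_dict([('p1', 'a'), ('p2', 'b')]))  # {'p1': 'a', 'p2': 'b'}
```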

Workflows have implicit inputs and outputs, based on the inputs and outputs of the first and last tasks. A Workflow is executed by calling it as a function (i.e., via its __call__ method), passing in the initial values. The result is a dictionary-like object mapping the output port names to values.

See attachment:workflow1.py for a code example. This is just an invented flow, loosely modeled after OpenStack operations.

General comments

The code is decent, and the library is fairly easy to use.

It's possible to nest workflows and tasks, to create larger flows.

Error handling is decent; most of my logic mistakes generated a reasonable error while the workflow was being assembled. Some errors were head-scratchers when the linking was badly mixed up.

The library is based on data flow; it's not really a Petri net or state machine. There are fan-out and fan-in capabilities, but no decision branches or iteration.

When a workflow is running, execution is more or less synchronous. It could probably be fired off in a thread, but there's no concept of querying its state while it runs.

Independent tasks don't run concurrently, and there isn't any internal threading support or anything like that.

The library supports a 'resource' concept, which is a simple way of coordinating access to shared resources, typically files. It's not true locking.
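The resource mechanism is only briefly described in the library's documentation. As a rough illustration of the concept (not pyutilib.workflow's API), coordination can be as simple as a dispatcher checking that every resource a task needs is available before running it - which, as noted above, is not true locking:

```python
# Rough sketch of the 'resource' idea: a task is only run when all
# the shared resources it declares (e.g. files) are available.
# Illustration of the concept only - NOT pyutilib.workflow's API,
# and not true locking.

class Resource:
    def __init__(self, name):
        self.name = name
        self.in_use = False

    def available(self):
        return not self.in_use

def run_with_resources(task_fn, resources):
    # Refuse to run unless every declared resource is free.
    if not all(r.available() for r in resources):
        return None  # caller may retry later
    for r in resources:
        r.in_use = True
    try:
        return task_fn()
    finally:
        for r in resources:
            r.in_use = False

log_file = Resource('logfile')
print(run_with_resources(lambda: 'wrote log', [log_file]))  # wrote log
```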

There's no persistence of state behind flows; everything is in memory.

There's no specific support for handling exceptions within user code, so an exception will simply propagate out of the workflow. Internally, the library catches errors and raises exceptions like any other code might.

There's no import/export or printing of workflows.