
Difference between revisions of "TaskFlow/Persistence"

'''Revised on:''' {{REVISIONMONTH1}}/{{REVISIONDAY}}/{{REVISIONYEAR}} by {{REVISIONUSER}}
 
  
== Overview ==
 
 
 
 
Taskflow provides a persistence API and base persistence types so that jobs, flows, and their associated tasks can be backed up in a database, in memory, or elsewhere. When configuring the persistence API, the user specifies which backend to use and can then store and retrieve the data associated with the jobs, flows, and tasks in use.
 
 
=== Why? ===
 
 
[[File:Machovka_harddisk.png|frame|right]]
 
 
* Allows for reconstruction and resumption of flows and their associated tasks.
 
* Allows for redundant checks that expected data is provided.
 
* Allows the user to view the history of jobs, flows, and their associated tasks.
 
* Facilitates debugging of taskflow usage and integration (and runtime/post-runtime analysis).
 
 
== Backends ==
 
 
=== Configuration ===
 
 
When configuring the backend to use, a [http://stevedore.readthedocs.org/en/latest/ stevedore] driver (which uses Python entry points) can be specified to locate the backend that your application wants to use. This makes the set of backends easy to extend: your application is not limited to the backends that are included by default.
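
The lookup-by-name idea can be sketched without stevedore. The following is only an illustration: the registry, the <code>register_backend</code> decorator, the <code>fetch_backend</code> function, and both backend classes are hypothetical names invented here, and a plain dictionary stands in for entry-point discovery. It is not taskflow's actual loading code.

```python
# Hypothetical registry standing in for entry-point discovery.
# None of these names come from taskflow or stevedore.
_BACKENDS = {}


def register_backend(name):
    """Class decorator that registers a backend class under a short name."""
    def decorator(cls):
        _BACKENDS[name] = cls
        return cls
    return decorator


@register_backend("memory")
class MemoryBackend:
    def __init__(self, conf):
        self.conf = conf


@register_backend("sqlite")
class SqliteBackend:
    def __init__(self, conf):
        self.conf = conf


def fetch_backend(conf):
    """Pick a backend class from the scheme of conf['connection']."""
    scheme = conf["connection"].split("://", 1)[0]
    try:
        backend_cls = _BACKENDS[scheme]
    except KeyError:
        raise LookupError("no backend registered for %r" % scheme)
    return backend_cls(conf)


backend = fetch_backend({"connection": "memory://"})
```

A third-party backend would only need to register itself under a new name; the calling code stays unchanged, which is the extensibility property described above.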
 
 
=== Defaults ===
 
* [http://www.sqlalchemy.org/ SQLAlchemy]:
 
** Makes use of the SQLAlchemy library to store all data in a SQLite (or PostgreSQL or MySQL) database.
 
** Data will be persisted in the event of a system failure.
 
* In-memory:
 
** Makes use of in-memory dictionaries to store data in a thread-safe manner.
 
** Data will '''NOT''' be persisted in the event of a system failure.
 
* More to come...
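
As a rough sketch of what "in-memory dictionaries used in a thread-safe manner" can mean, the class below guards a dictionary with a lock and copies records on the way in and out. The class name <code>InMemoryStore</code> and its methods are assumptions made for this example and are not the interface of taskflow's in-memory backend.

```python
import copy
import threading


class InMemoryStore:
    """Hypothetical dictionary-backed store; data is lost when the process exits."""

    def __init__(self):
        self._lock = threading.Lock()
        self._records = {}

    def save(self, uuid, record):
        # Copy on write so later changes by the caller do not leak in.
        with self._lock:
            self._records[uuid] = copy.deepcopy(record)

    def load(self, uuid):
        # Copy on read so callers cannot mutate the stored record.
        with self._lock:
            return copy.deepcopy(self._records[uuid])

    def __len__(self):
        with self._lock:
            return len(self._records)


store = InMemoryStore()
store.save("task-1", {"state": "PENDING"})
```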
 
 
== Types ==
 
Regardless of the backend chosen to persist taskflow data, the generic API (taskflow.persistence.backends.api) must always return one of the following types.
 
 
=== [https://en.wikipedia.org/wiki/Logbook Logbook] ===
 
* Stores a collection of flow details + any metadata about the logbook (last_updated, deleted, name...).
 
* Typically connected to a [[StructuredWorkflowPrimitives|job]], with which the logbook has a one-to-one relationship.
 
* Provides all of the data necessary to automatically reconstruct a job object.
 
 
{| class="wikitable"
 
|-
 
|'''Field'''
 
|'''Description'''
 
|-
 
|Name
 
|Name of the logbook
 
|-
 
|UUID
 
|Unique identifier for the logbook
 
|-
 
|Meta
 
|JSON blob of non-indexable associated logbook information
 
|}
 
 
=== Flow detail ===
 
* Stores a collection of task details, metadata about the flow and potentially any task relationships.
 
* Persistence representation of a specific run instance of a [[StructuredWorkflowPrimitives|flow]].
 
* Provides all of the details necessary for automatic reconstruction of a flow object.
 
 
{| class="wikitable"
 
|-
 
|'''Field'''
 
|'''Description'''
 
|-
 
|Name
 
|Name of the flow
 
|-
 
|Type
 
|Type of the flow (mod.cls format)
 
|-
 
|UUID
 
|Unique identifier for the flow
 
|-
 
|State
 
|State of the flow
 
|-
 
|Meta
 
|JSON blob of non-indexable associated flow information
 
|}
 
 
=== Task detail ===
 
 
Stores all of the information associated with one specific run instance of a [[StructuredWorkflowPrimitives|task]].
 
 
{| class="wikitable"
 
|-
 
|'''Field'''
 
|'''Description'''
 
|-
 
|Name
 
|Name of the task
 
|-
 
|Type
 
|Type of the task (mod.cls format)
 
|-
 
|UUID
 
|Unique identifier for the task
 
|-
 
|State
 
|State of the task
 
|-
 
|Results
 
|Results that the task may have produced
 
|-
 
|Exception
 
|Serialized exception that the task may have produced
 
|-
 
|Stack trace
 
|Stack trace of the exception that the task may have produced
 
|-
 
|Version
 
|Version of the task that was run
 
|-
 
|Meta
 
|JSON blob of non-indexable associated task information
 
|}
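
The three types and their fields can be pictured as nested records. The sketch below uses Python dataclasses to mirror the tables above. The attribute names, defaults, and the <code>task_details</code> and <code>flow_details</code> containers are illustrative assumptions and may not match the real taskflow classes.

```python
import uuid as uuidlib
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


def _new_uuid():
    return str(uuidlib.uuid4())


@dataclass
class TaskDetail:
    """One specific run instance of a task (fields follow the table above)."""
    name: str
    type: str                          # mod.cls format
    uuid: str = field(default_factory=_new_uuid)
    state: Optional[str] = None
    results: Any = None
    exception: Optional[str] = None    # serialized exception, if any
    stacktrace: Optional[str] = None
    version: Optional[str] = None
    meta: Dict[str, Any] = field(default_factory=dict)


@dataclass
class FlowDetail:
    """One specific run instance of a flow; holds its task details."""
    name: str
    type: str                          # mod.cls format
    uuid: str = field(default_factory=_new_uuid)
    state: Optional[str] = None
    meta: Dict[str, Any] = field(default_factory=dict)
    task_details: List[TaskDetail] = field(default_factory=list)


@dataclass
class LogBook:
    """Collection of flow details plus metadata about the logbook."""
    name: str
    uuid: str = field(default_factory=_new_uuid)
    meta: Dict[str, Any] = field(default_factory=dict)
    flow_details: List[FlowDetail] = field(default_factory=list)


# The module paths and names below are made up for the example.
book = LogBook(name="nightly-backup")
flow = FlowDetail(name="backup-flow", type="myapp.flows.BackupFlow")
flow.task_details.append(TaskDetail(name="dump-db", type="myapp.tasks.DumpDb"))
book.flow_details.append(flow)
```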
 
 
== Storage ==
 
 
Taskflow already has a storage mechanism: the logbook, which is itself saved to a given backend. The logbook is the authoritative, and preferably the '''only''', source of runtime state information. When a task returns a result, that result should be written directly to the logbook. When the state of a task or flow changes in any way, the logbook is the first to know. A flow should '''not''' store task results; that is what the logbook is for.
 
 
The logbook and a backend are responsible for storing the actual data. Together they specify the persistence mechanism (how and where data is saved: in memory, in a database, or elsewhere) and the persistence policy (when data is saved: every time it changes, at particular moments, or never). To make these components simpler to use, there is a storage API that lets engines call into the storage layer without dealing with the details of logbooks, flow details, task details, and backends.
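
One way to picture such a storage layer is a small facade that an engine calls, with the persistence policy passed in as a callable. Everything in this sketch is hypothetical: the <code>Storage</code> class, its method names, the <code>persist</code> callback, and the state strings are assumptions for illustration, not taskflow's storage API.

```python
class Storage:
    """Hypothetical facade: engines talk to this, not to logbooks or backends."""

    def __init__(self, persist):
        self._task_details = {}   # task name -> record dictionary
        self._persist = persist   # callable that saves one record

    def _detail(self, task_name):
        return self._task_details.setdefault(
            task_name, {"name": task_name, "state": None, "results": None})

    def set_task_state(self, task_name, state):
        detail = self._detail(task_name)
        detail["state"] = state
        self._persist(dict(detail))   # policy here: save on every change

    def save(self, task_name, results, state="SUCCESS"):
        # Results go straight to storage; the flow never holds them.
        detail = self._detail(task_name)
        detail["results"] = results
        detail["state"] = state
        self._persist(dict(detail))

    def get(self, task_name):
        return self._task_details[task_name]["results"]


saved = []                            # stands in for a real backend
storage = Storage(persist=saved.append)
storage.set_task_state("fetch", "RUNNING")
storage.save("fetch", {"rows": 3})
```

Swapping <code>saved.append</code> for a function that writes to a database would change where and when data is saved without touching the engine code that calls <code>Storage</code>.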
 
 
== Checkpointing ==
 
 
Checkpointing is a work-in-progress topic that is still under discussion.
 
 
'''See:''' [[TaskFlow/Checkpointing|Checkpointing]]
 
 
== Contributors ==
 
* Kevin Chen (Rackspace)
 
* Joshua Harlow (Yahoo!)
 
* Jessica Lucci (Rackspace)
 

Latest revision as of 05:32, 27 April 2014