Mistral

Mistral is a task management service, also known as Workflow as a Service. Most business processes consist of multiple distinct, interconnected steps that need to be executed in a particular order. One can describe such a process as a set of tasks and task relations and upload that description to Mistral, which then takes care of state management, correct execution order, task distribution and high availability. Mistral also provides flexible task scheduling, so that a process can run according to a specified schedule (e.g. every Sunday at 4:00 pm) instead of running immediately. We call such a set of tasks and the dependencies between them a task graph. Independent routes in this graph are called flows, and Mistral can execute them in parallel.

Use cases

Tasks Scheduling - Cloud Cron

Problem Statement

When administering a network of computers, there is often a need to set up periodic, scheduled execution of maintenance jobs that would otherwise have to be started manually by a system administrator. Such jobs range widely, from cleaning up unneeded log files to health monitoring and reporting. One of the best-known tools in the Unix world for setting up and managing periodic jobs is Cron, and it fits the use cases mentioned above perfectly. For example, using Cron we can easily schedule any system process to run on every even day of the week at 2:00 am. For a single machine, administering jobs with Cron is fairly straightforward, and the approach has been adopted by millions of IT folks all over the world. But what if we want to set up and manage scheduled jobs for multiple machines? It would be very convenient to have a single point of control over both their schedule and the jobs themselves (i.e. the “when” and the “what”). Furthermore, a cloud environment provides additional services (RESTful and otherwise) that we may also want to call on a schedule, along with local operating system processes.

Picture 1. Managing on-schedule local jobs manually.

Solution

The Mistral service for OpenStack addresses this demand naturally. Its capabilities allow any number of tasks to be configured to run according to a specified schedule at cloud scale. Here is a list of some typical jobs we can choose from:

Run a shell script on specified virtual instances (e.g. VM1, VM3 and VM27).
Run an arbitrary system process on specified instances.
Start, reboot or shut down instances.
Call accessible cloud services (e.g. Trove).
Add instances to a load balancer.
Deploy an application on specified instances.

This list is not exhaustive, and any other job meaningful to a user can be added. To make this possible, Mistral provides a plugin mechanism, so new functionality can be added fairly easily by supplying new Mistral plugins.

Picture 2. Mistral provides a single point of control over on-schedule cloud jobs.

Basically, Mistral acts as a mediator between a user, virtual instances and cloud services, in the sense that it adds capabilities on top of them such as task management (start, stop etc.), task state and execution monitoring (success, failure, in progress etc.) and task scheduling. Since Mistral is a distributed workflow engine, the types of jobs listed above can be combined into a single logical unit, a workflow. For example, we can tell Mistral to take care of the following workflow for us:

On every Monday at 1:00 am, start grepping the phrase “Hello, Mistral!” from the log files located at /var/log/myapp.log on instances VM1, VM30 and VM54, and put the results in Swift.
    On success: Generate the report based on the data in Swift.
        On success: Send the generated report to an email address.
        On failure: Send an SMS with error details to a system administrator.
    On failure: Send an SMS with error details to a system administrator.

A workflow like the one described above may be of any complexity but is still considered a single task from the user’s perspective. Mistral, however, is able to analyze the workflow and identify individual sequences that can be run in parallel, thereby taking advantage of distribution and load balancing under the hood. It is worth noting that Mistral is designed to scale nearly linearly and hence to be capable of scheduling and processing a very large number of tasks simultaneously.
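The example workflow above can be pictured as a small tree of steps with success and failure branches. The following Python sketch is only an illustration, not Mistral's DSL or API (neither is finalized). The `Step` class, the `run` and `build_workflow` functions and the step names are hypothetical, and the sketch assumes that the first "On failure" branch belongs to the report step and the second one to the grep step.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Step:
    """One workflow step with optional follow-up steps (hypothetical structure)."""
    name: str
    action: Callable[[], bool]  # returns True on success, False on failure
    on_success: Optional["Step"] = None
    on_failure: Optional["Step"] = None


def run(step: Optional[Step]) -> List[str]:
    """Execute a step, then follow its success or failure branch.

    Returns the names of the steps that were visited, in order.
    """
    visited = []
    while step is not None:
        visited.append(step.name)
        succeeded = step.action()
        step = step.on_success if succeeded else step.on_failure
    return visited


def build_workflow(grep_ok: bool, report_ok: bool) -> Step:
    """Build the example workflow; the booleans stand in for real action outcomes."""
    notify_admin = Step("send_sms_to_admin", lambda: True)
    send_report = Step("email_report", lambda: True)
    make_report = Step("generate_report", lambda: report_ok,
                       on_success=send_report, on_failure=notify_admin)
    return Step("grep_logs_to_swift", lambda: grep_ok,
                on_success=make_report, on_failure=notify_admin)
```

Calling `run(build_workflow(grep_ok=True, report_ok=False))` visits the grep step, then the report step, and finally the SMS notification, which mirrors the failure branch of the report step.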

Notes

In this use case description we tried to show how Mistral capabilities can be used for scheduling different user tasks at cloud scale. Semantically, it would be correct to call this use case Distributed Cron or Cloud Cron. One of the advantages of using a service like Mistral in a case like this is that, along with the base functionality to schedule and execute tasks, it provides additional capabilities such as navigating task execution status and history (using a web UI or REST API), replaying already finished tasks, on-demand task suspension and resumption, and many other things that are useful to both system administrators and application developers.

Live migration

A user specifies tasks for VM live migration that are triggered by an event from Ceilometer (e.g. CPU consumption reaching 100%).

Long-running business process

A user makes a request to run a complex multi-step business process and wants it to be fault-tolerant, so that if the execution crashes at some point on one node, another active node of the system can automatically take over and continue from the exact point where it stopped. In this use case the user splits the business process into a set of tasks and lets Mistral handle them, in the sense that it serves as a coordinator and decides which task should be started at what time. Mistral then calls back with "Execute action X, here is the data". If the application instance that executes action X dies, another instance takes over responsibility for continuing the work.
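The hand-over between nodes described above can be sketched as follows. This is only an illustration under stated assumptions: a plain dict stands in for the persistent store, tasks run in a fixed linear order, and `next_task`, `run_until` and the task and node names are hypothetical rather than part of Mistral.

```python
from typing import Dict, List, Optional


def next_task(order: List[str], store: Dict[str, str]) -> Optional[str]:
    """Return the first task that is not yet recorded as finished in the store."""
    for name in order:
        if store.get(name) != "SUCCESS":
            return name
    return None


def run_until(order: List[str], store: Dict[str, str],
              node: str, max_steps: int) -> List[str]:
    """Simulate one node executing at most max_steps tasks before it 'dies'.

    Every finished task is recorded in the shared store, so another node
    can later continue from the same point.
    """
    executed = []
    for _ in range(max_steps):
        task = next_task(order, store)
        if task is None:
            break  # the whole process is complete
        store[task] = "SUCCESS"  # stand-in for doing the work and persisting the result
        executed.append(f"{node}:{task}")
    return executed
```

If `node-a` finishes only the first of three tasks and then stops, a later call to `run_until` for `node-b` with the same store picks up at the second task, because progress lives in the store rather than in the node.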

BigData analysis & reporting

A data analyst can use Mistral as a tool for data crawling. For example, in order to prepare a financial report the whole set of steps for gathering and processing required report data can be represented as a graph of related Mistral tasks. As with other cases, Mistral makes sure to supply fault tolerance, high availability and scalability.

Rationale

The main idea behind this service includes the following points:

Ability to upload custom task graph definitions.
Graph definitions should be agnostic of any details of specific domains (like orchestration, deployment and so forth).
The actual task execution is not performed by the service itself. Rather, the service serves as a coordinator for other worker processes that do the actual work and report task execution results back. In other words, task execution should be asynchronous, which provides flexibility for plugging in any domain-specific handling and makes it possible for the service to be scalable and highly available.
The service must not contain a predefined set of actions that can be performed. All actions are specific to a particular task graph and are described along with the graph itself using a simple DSL. Basically, actions are generic instructions that the state machine can schedule for execution on a worker. The worker itself knows how to interpret the task graph actions and do the specific work.

Terminology

Task graph - Graph of all possible tasks and valid transitions between them.
Flow - Route in a task graph that reflects one possible set of actions performed in a linear fashion. At the same time, the service can logically run individual flows independently, which leaves room for various optimizations at the implementation level, such as using multiple parallel worker threads.
Session - A particular execution. That is, for the given task graph definition and chosen task, the service should perform all required actions (subtasks) in order to complete this task. All transitions must comply with the transitions allowed in the task graph definition. Identified by session_id.
Task - Defines a flow execution step. Each task is defined together with the tasks it depends on, from which the flow execution can jump in order to reach that task. Identified by session_id + task_name.
Target task - The task that a client needs to execute at some point in time. Any task in the task graph definition can be chosen as the target task. Once this task has been processed successfully, the session is considered completed.
Action - A particular instruction associated with a task that needs to be performed once the task dependencies are satisfied.
Task state - A task can be in one of a number of predefined states reflecting its current status:

INACTIVE - task dependencies are not satisfied.
PENDING - task dependencies are satisfied but the task hasn’t started yet.
RUNNING - task is currently being executed.
SUCCESS - task has finished successfully.
FAILURE - task has finished with an error.

All the actual task states belonging to the current session are persisted in the DB under the session_id key.

Trigger - A condition that causes a new session to be created when it is met; there are several types of such conditions. The actual condition can occur many times, and each time (with some limitations specified in the condition itself) a new session will be created.
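The task states listed above, and the move from INACTIVE to PENDING once dependencies are satisfied, can be sketched in Python. This is a minimal illustration, not Mistral code: `TaskState` mirrors the state names from this page, while `refresh_states` and the dict-based representation of dependencies are assumptions made for the example. It also assumes that a dependency counts as satisfied only when it has reached SUCCESS.

```python
from enum import Enum
from typing import Dict, List


class TaskState(Enum):
    INACTIVE = "INACTIVE"  # dependencies are not satisfied
    PENDING = "PENDING"    # dependencies are satisfied, not started yet
    RUNNING = "RUNNING"    # currently being executed
    SUCCESS = "SUCCESS"    # finished successfully
    FAILURE = "FAILURE"    # finished with an error


def refresh_states(dependencies: Dict[str, List[str]],
                   states: Dict[str, TaskState]) -> Dict[str, TaskState]:
    """Return a copy of states where INACTIVE tasks whose dependencies
    have all succeeded are promoted to PENDING."""
    updated = dict(states)
    for task, deps in dependencies.items():
        ready = all(states[dep] is TaskState.SUCCESS for dep in deps)
        if states[task] is TaskState.INACTIVE and ready:
            updated[task] = TaskState.PENDING
    return updated
```

A task without dependencies becomes PENDING on the first refresh, while a task that depends on it stays INACTIVE until the first one reaches SUCCESS.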

Design

There is no final decision on the service design yet. It is being actively discussed on the mailing lists and on IRC in #openstack-mistral.

Implementation

There is no implementation yet.

Links & IRC

Project at Launchpad: http://launchpad.net/mistral
Weekly IRC meeting is held on Mondays at 16:00 UTC on #openstack-meeting at Freenode.
Weekly IRC meeting agenda: https://wiki.openstack.org/wiki/Meetings/MistralAgenda

FAQ

Q: What is Mistral?
A: Mistral is a task management service, also known as Workflow as a Service. Most business processes consist of multiple distinct, interconnected steps that need to be executed in a particular order. One can describe such a process as a set of tasks and task relations and upload that description to Mistral, which then takes care of state management, correct execution order, task distribution and high availability. Mistral also provides flexible task scheduling, so that a process can run according to a specified schedule (e.g. every Sunday at 4:00 pm) instead of running immediately. We call such a set of tasks and the dependencies between them a task graph. Independent routes in this graph are called flows, and Mistral can execute them in parallel.

Q: Why offload business processes to a third-party service?
A: Reason 1: High Availability. A typical application’s workflow consists of many distinct tasks such as collecting data, processing, acquiring resources, obtaining user input, reporting, sending notifications, replicating data etc. All of these steps must happen at the appropriate time, as they depend on each other. Many such processes can run in parallel. Now, if your application crashes somewhere in the middle, or a power outage occurs, your business process terminates at an unknown stage and in an unknown state. So you need to track the state of every single flow in a task graph in some external persistent storage, such as a database, so that you can resume it (or roll it back) from the place where it crashed. You also need a health monitoring tool that watches your app and, if it crashes, schedules the unfinished flows on another instance. This is exactly what Mistral can do out of the box, without reinventing the wheel for each application time and time again.

Reason 2: Scalability. Most task graphs have steps that can be performed in parallel (i.e. different routes in a graph, flows). Mistral can distribute execution of such tasks across your application’s instances so that the whole execution would scale.

Reason 3: Observable state. Because flow state is tracked outside of the application, it becomes observable. At any given moment a system administrator can access information on what is currently going on, which tasks are in a pending state and what has already been executed. You can obtain metrics on your business processes and profile them.

Reason 4: Scheduling. Using Mistral you can schedule your process to run periodically or at a fixed moment in the future. You can also have an execution triggered by an alarm condition from an external health monitoring system or by a new email arriving in your mailbox.

Reason 5: Dependency management offloading. Because you offload task management to an external service, you don’t have to specify all the triggers and actions in advance. For example, you may say “here is the task that must be triggered if my domain is down for 1 minute” without specifying how exactly the event is obtained. A system administrator can set up Nagios to watch your domain and trigger the action, and later replace it with Ceilometer, without your application being affected by or even aware of the change. An administrator can even trigger the task manually using the CLI or a UI console. Another example is having a task that triggers each time a flow reaches some desired state and letting the administrator configure what exactly needs to happen there (such as sending a notification email, later replaced with an SMS).

Reason 6: Additional points for integration. As soon as your business process is converted to a Mistral task graph that can be accessed by others, other applications can set up their own workflows to be triggered when your application reaches a certain state. For example, suppose OpenStack Nova declared a workflow for spawning a new VM instance. An application (or a system administrator) could hook into the “finish” task so that a notification is received every time Nova spawns another instance. Or suppose you want your users to have flexible quotas on how many instances they can spawn, based on information in an external billing system. Normally you would have to patch Nova to access your billing system, but with Mistral you can simply alter Nova’s task graph so that it includes your custom tasks that do this instead.

Reason 7: Formalized graphs of tasks are just easier to manage and understand. They can be visualized, analyzed and optimized. They simplify program development and debugging. You can model program workflows, replace task actions with stubs, easily mock external dependencies, do task profiling.

Q: How do I make Mistral know about my task graphs?
A: Task graphs are described using the DSL. Currently YAML is considered the primary syntax for the Mistral DSL; however, alternatives such as JSON or XML could also be supported. There is a REST API that is used to upload task graphs, execute them and make run-time modifications to them. The DSL describes:

Tasks.
Dependencies between tasks (which tasks need to be run before a given task can be executed).
Triggers that start execution upon some conditions.

Q: What exactly are Mistral tasks?
A: Tasks are objects. Each such object has:

Name.
Optional tag names.
List of tasks it depends on. This can be either a fixed list or a YAQL expression (see https://pypi.python.org/pypi/yaql for what YAQL is). Basically, it’s just a selector specifying the tasks this task depends on. For example, it may be built using task tag names.
Optional YAQL expression that extracts data from the current data context to be used as the task execution input.
Optional task action (a signal to notify a worker to do some actual work).

Q: What are Mistral workflows?
A: Interdependent tasks form a structure known as a graph. A workflow just describes what exactly in this graph should be run to achieve the user’s goal (e.g. the whole graph may contain 20 interrelated tasks describing all possible steps of setting up a cluster, but the workflow for spawning a single VM may include only 3 steps, which form their own subgraph). When we start a workflow execution (open a new session), we say which node of the graph needs to be reached, and Mistral walks all possible paths to that node (task), executing independent flows in parallel and running all the tasks that lie on those paths.
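To make the idea of a workflow as a subgraph concrete, the following Python sketch selects only the tasks needed to reach a chosen target task and lists them in an order that respects dependencies. It is an illustration rather than Mistral's algorithm: `tasks_for_target`, `EXAMPLE_GRAPH` and the task names are hypothetical, the graph is assumed to be acyclic, and tasks are listed sequentially here even though independent flows could run in parallel.

```python
from typing import Dict, List, Set

# Hypothetical task graph: each task maps to the tasks it depends on.
EXAMPLE_GRAPH: Dict[str, List[str]] = {
    "create_network": [],
    "create_vm": ["create_network"],
    "attach_volume": ["create_vm"],
    "install_app": ["create_vm"],
    "configure_lb": ["install_app"],
}


def tasks_for_target(dependencies: Dict[str, List[str]], target: str) -> List[str]:
    """Return the target task and everything it transitively depends on.

    The result is ordered so that every task appears after its dependencies.
    Assumes the dependency graph has no cycles.
    """
    ordered: List[str] = []
    seen: Set[str] = set()

    def visit(task: str) -> None:
        if task in seen:
            return
        seen.add(task)
        for dep in dependencies.get(task, []):
            visit(dep)
        ordered.append(task)  # appended only after all of its dependencies

    visit(target)
    return ordered
```

For the target `attach_volume`, only three of the five tasks in `EXAMPLE_GRAPH` are selected; `install_app` and `configure_lb` are left out because the target does not depend on them.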

Q: What are Mistral actions and how does Mistral execute them?
A: An action is what to do when a particular task is triggered. Mistral cannot execute domain-specific actions, nor can a user upload their own code. Instead, Mistral defines a set of common generic actions that can be used to signal your application to do the real task action. Those are:

Call your app’s URI.
Send an AMQP (RabbitMQ) message to some queue.
Other types of signaling (email, UDP message, polling etc.).

Mistral can be extended to include other general-purpose actions such as:

Calling Puppet, Chef, Murano, SaltStack etc.
Executing generic REST API calls.
Remote script execution via SSH.

All Mistral actions must:

Be generic and universal. No domain-specific actions in Mistral.
Be secure to execute on shared servers.
Not block (at least not for a significant time). Ideally, be asynchronous.

Q: Is it possible to organize a data flow between different tasks in Mistral?
A: Yes. Tasks belonging to the same task graph can take some input as a JSON structure (other formats are also possible), query the subset of this structure relevant to the particular task using a YAQL expression (https://pypi.python.org/pypi/yaql) and pass it along with the corresponding action to a worker. Once the worker has done its processing, it returns the result in a similar JSON format. So in this case Mistral acts as a data flow hub, dispatching the results of one task to the inputs of other tasks.
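The data flow described above can be sketched in Python. This is only an illustration under stated assumptions: a simple dotted-path lookup (`select`) stands in for a YAQL expression, a plain function stands in for a remote worker, and `run_task` and the key names are hypothetical rather than part of Mistral.

```python
import json
from typing import Any, Callable, Dict


def select(context: Dict[str, Any], path: str) -> Any:
    """Pick a nested value by dotted path, e.g. 'vm.name'.

    This is a stand-in for a YAQL expression, not YAQL itself.
    """
    value: Any = context
    for key in path.split("."):
        value = value[key]
    return value


def run_task(context: Dict[str, Any], input_path: str, output_key: str,
             worker: Callable[[Any], Any]) -> Dict[str, Any]:
    """Extract the task input, hand it to a worker and store the result.

    Returns a new context; the original context is left unchanged.
    """
    # Round-trip through JSON to mimic sending the input over the wire.
    task_input = json.loads(json.dumps(select(context, input_path)))
    result = worker(task_input)
    new_context = dict(context)
    new_context[output_key] = result
    return new_context
```

The context returned by one `run_task` call can be passed to the next call, so the output stored under one key can be selected as the input of a later task.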

Q: Does Mistral provide a mechanism to run nested workflows?
A: Yes. Instead of performing a concrete action associated with a task, Mistral can start a nested workflow. That is, given the input that came into the task, Mistral takes a new task graph and starts a new workflow with that input; after completion, execution jumps back to the parent flow and continues from the same point. The closest analogy in programming would be calling one method from another, passing all required parameters and optionally getting back a result. It’s worth noting that the nested workflow runs in parallel with the rest of the activities belonging to the parent execution and has its own isolated execution context.

Q: What are some other potential Mistral capabilities?
A: The team is also considering some other capabilities that may be implemented in Mistral itself or on top of the base functionality in the form of toolsets and frameworks:

Managing task processing collocation within a cluster.
Task priorities.
Subscribing to Mistral events for arbitrary passive listeners.
Namespaces (domains) to logically isolate task graphs from each other.
Role-Based Access Control for managing and executing workflows.
Ability to start dedicated worker VMs able to perform a set of predefined (or configured) actions, such as executing a specified script or arbitrary code (in Python, Java etc.). This may be targeted at use cases where a user needs to do some sort of parallel execution on a temporarily created cluster. For example, we may want to process a set of objects residing in Swift using 100 temporary worker VMs, logically splitting this set of objects into 100 segments and letting the workers process them individually.
Plugin system that would allow additional constructs to be introduced into the DSL and REST API via custom plugins (say, a plugin for connecting to Mule ESB using the namespace “mule:” in the DSL).

Q: Who are Mistral users?
A: Potential Mistral users are:

Developers, both those who work on OpenStack services and those whose applications run in tenants’ VMs. Developers use the Mistral DSL/API to access it.
System integrators. They customize task graphs related to deployment, either with special scripts or manually using the Mistral CLI/UI.
System administrators. They can use Mistral via an additional toolset for common administrative tasks, such as distributed cron, mass deployment tasks, backups etc.

Q: How does Mistral relate to OpenStack?
A: Although Mistral is quite generic, it is built to become a natural part of the OpenStack ecosystem. We are going to write Heat HOT templates for its installation, add support for it in Murano and integrate it with Keystone. There might also be extensions (plugins) for Mistral that directly expose functionality provided by other OpenStack services such as Trove or Heat.

Q: Is Mistral going to be an OpenStack infrastructure-layer service (like Nova) or be deployed on user VMs inside OpenStack?
A: Both use cases are valid and we are going to support both scenarios.

Q: Why not just use TaskFlow?
A: Mistral and TaskFlow have many similarities but target different use cases. TaskFlow is a Python library that you can use inside your Python app to manage Python workflows. Mistral is an out-of-process service that is language-agnostic and, unlike TaskFlow, cannot execute arbitrary Python code directly. But as an external service it can provide distributed task execution, scalability and HA.

Under the hood, the TaskFlow library could be used for the Mistral implementation. We also plan to develop a TaskFlow engine that would help schedule TaskFlow tasks over Mistral.

Q: How does Mistral relate to Convection?
A: We believe that Mistral is a Convection implementation that goes far beyond the initial proposal to address additional use cases. We work closely with the TaskFlow team, who are also the people behind Convection. Convection as a project was never started, and Mistral was designed to take its place, although under a different name for trademark reasons.

Q: Why not use Celery?
A: While Celery is a distributed task engine, it was designed to execute custom Python code on preinstalled private workers. Again, this is a different use case from that of Mistral, which assumes that tasks can be executed on a shared service and do not require (or allow) custom code upload. In other words, Celery itself could be implemented on top of Mistral if it were started now.

Q: How does Mistral relate to Amazon SWF?
A: Amazon SWF shares many ideas with Mistral but is, in fact, designed to be language-oriented (Java, Ruby, Python). It is hard and mostly meaningless to use SWF without, for example, its Java SDK, which exposes its functionality as a set of Java annotations and interfaces. In this sense SWF is closer to Celery than to Mistral. Mistral, on the other hand, aims to be both simpler and more user-friendly. We want to have a service that is usable from any programming language without an SDK. At the same time, it’s always possible to implement additional convenient language-oriented bindings based on cool features like Python decorators, Java annotations and aspects.

At later stages Mistral may include an SWF API adapter so that SWF applications can be migrated to Mistral.
