= Mistral for Administration (aka Cloud Cron) =

== Prerequisites ==
The reader should be familiar with basic Mistral concepts such as workflow, task, action, cron trigger and the YAQL expression language. Please refer to the official Mistral documentation at http://docs.openstack.org/developer/mistral/ for more information.

== Background ==
When it comes to managing IT infrastructure such as a cloud or a data center, system administrators typically have to handle many recurring tasks. To name just a few:


 * Update Linux kernel or specific software on all or a subset of servers
 * Re-configure certain software on a subset of servers
 * Crawl data from a subset of servers and build a report based on this data
 * Check health of certain software on a subset of servers or health of servers themselves

It’s worth adding that any of the tasks listed above may need to be done periodically according to a specified schedule. Without special software to automate them, dealing with these tasks requires a lot of human attention.

In this article we’ll take an OpenStack cloud tenant as an example of IT infrastructure that a system administrator needs to manage. We’ll see how the Mistral workflow service helps with these cases and why a workflow technology in particular is worth using.

== Important aspects ==
So what does it take to solve any of the above problems? Let’s look at a fairly simple task: upgrading the Linux kernel on a single server. It requires the following steps:


 * Download the new Linux kernel packages
 * Install packages
 * Reboot the server

Looks pretty simple to do. However, things get more complicated when:


 * We want to do this for multiple servers
 * We need to clearly see which servers have been successfully updated and which haven’t after this sequence is completed on all the servers
 * We need to run this sequence automatically on a periodic basis

For example, if we automate this by writing a script (as administrators usually do), whether in shell or Python, we’ll quickly see that taking care of these aspects is challenging. To do it efficiently, the script should process all the servers in parallel and, once they have all been processed, send a notification saying whether everything went fine or which operations had issues. Additionally, if the script runs on a single machine and fails for whatever reason, the whole process of updating, say, a hundred servers will not complete and will end up in an unknown state.



This shows that we need to take care of at least the following:


 * Parallel execution
 * Persistent state giving info about what happened with every server (at minimum, success or failure)
 * High availability to make sure the whole thing will complete
 * Notification mechanism so that we don’t have to check the status of the process manually

And, as a matter of fact, all of this has to be repeated every time we need to do something similar. A notification mechanism is not a must if we always run the upgrade manually and it doesn’t take long. If a human doesn’t control when it starts, or if it takes a long time, notifications become very important. All this means that we most likely need an external tool that takes care of these concerns. A workflow technology like the Mistral workflow service is exactly the type of tool that can help with those problems.

== Mistral-based solution ==
Let’s now show how we can solve this kind of task with Mistral and explore in detail how Mistral addresses the aforementioned concerns.

=== Updating Linux kernel on all tenant VMs ===
As an example, let's see how we can upgrade the Linux kernel version on all cloud tenant servers (virtual machines, or just VMs), assuming they all have Ubuntu installed. We'll also make some assumptions about how we access the guest operating systems, which we'll mention separately. Those assumptions don't change much about the overall approach, so it remains applicable if we alter some details, such as using an operating system other than Ubuntu.

This use case is fairly simple but it demonstrates the essential advantages of using a workflow technology.

==== Initial workflow ====
The central Mistral concept is the workflow, so first of all we need to create a Mistral workflow that contains the logic for updating the Linux kernel on multiple tenant servers. Let’s create a text file named update_kernel.yaml in any convenient text editor:

---
version: '2.0'

upgrade_kernel:
  input:
    - username: ubuntu
    - private_key_filename
    - gateway_host

  tasks:
    get_hosts:
      action: nova.servers_list
      publish:
        hosts: <% task(get_hosts).result.select({ip => $.addresses.get($.addresses.keys.first).where($.get("OS-EXT-IPS:type") = fixed).first.addr}).ip %>
      keep-result: false
      on-success:
        - upgrade

    upgrade:
      with-items: host in <% $.hosts %>
      action: std.ssh_proxied
      input:
        host: <% $.host %>
        gateway_host: <% $.gateway_host %>
        username: <% $.username %>
        private_key_filename: <% $.private_key_filename %>
        cmd: "sudo apt-get update && sudo apt-get install linux-image-generic-lts-$(lsb_release -sc) -y && sudo reboot"

This is the simplest version of a Mistral workflow that does what we need. Let’s see what it consists of. It has two task definitions: “get_hosts” and “upgrade”.

“get_hosts” calls the Nova action “nova.servers_list”, which returns information about all servers in a tenant as a JSON list. What we really need is their IP addresses. To extract them we declare a “publish” clause that introduces a new variable called “hosts” in the workflow context; it will contain the list of IPs. The YAQL expression that extracts the IP addresses looks tricky only because of how Nova structures networking information.

NOTE: It’s easy to see in what form Nova returns info about a server by running:

$ mistral run-action nova.servers_get '{"server": ""}'
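To make the “hosts” expression easier to follow, here is a rough Python equivalent of what it computes. This is only a sketch: the sample server entries (names, network name “private”, addresses) are made up, and the helper name fixed_ip is ours, not part of Mistral or Nova.

```python
# Sample data shaped like two entries of Nova's server list (values are made up).
servers = [
    {
        "name": "vm-1",
        "addresses": {
            "private": [
                {"addr": "10.0.0.3", "OS-EXT-IPS:type": "fixed", "version": 4},
                {"addr": "172.16.74.8", "OS-EXT-IPS:type": "floating", "version": 4},
            ]
        },
    },
    {
        "name": "vm-2",
        "addresses": {
            "private": [
                {"addr": "10.0.0.4", "OS-EXT-IPS:type": "fixed", "version": 4},
            ]
        },
    },
]


def fixed_ip(server):
    """Return the first fixed IP found on the server's first network."""
    first_network = next(iter(server["addresses"]))   # $.addresses.keys.first
    ports = server["addresses"][first_network]        # $.addresses.get(...)
    # .where($.get("OS-EXT-IPS:type") = fixed)
    fixed = [p for p in ports if p.get("OS-EXT-IPS:type") == "fixed"]
    return fixed[0]["addr"]                           # .first.addr


# Equivalent of .select({ip => ...}).ip applied to the whole server list.
hosts = [fixed_ip(s) for s in servers]
print(hosts)  # ['10.0.0.3', '10.0.0.4']
```

Like the YAQL expression, the sketch looks only at the first network of each server and skips floating addresses.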

Note that in Mistral the result of a task is the result of its action (or workflow), so we set the special task property “keep-result” to “false” to keep that result out of the workflow context. We do this because we’re not interested in everything Nova returns; only the IPs are relevant. This is worthwhile because even for a tenant with 30 virtual servers, the information Nova returns about them takes about 100 KB of disk space.

Task “upgrade” is where the most interesting things happen. It uses the “with-items” functionality to iterate over the list of server IPs and ssh to each server in order to upgrade the kernel. The word “iterate” doesn’t mean that processing is sequential, though. On the contrary, this is where Mistral runs the kernel upgrade in parallel. Every action execution object for “std.ssh_proxied” is stored in the database and keeps the state and result of the upgrade operation on a particular virtual server.
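The behaviour described above, parallel processing with a recorded state and result per server, can be imitated in a few lines of Python. This is a toy model under our own assumptions, not how Mistral is implemented: upgrade() is a hypothetical stand-in for the ssh call, the failing host is hard-coded for illustration, and the records live in memory rather than in a database.

```python
from concurrent.futures import ThreadPoolExecutor


def upgrade(host):
    """Hypothetical stand-in for the ssh command; fails for one host on purpose."""
    if host == "10.0.0.5":
        raise RuntimeError("ssh connection refused")
    return "kernel upgraded"


def run_with_items(hosts):
    """Run upgrade() for all hosts concurrently and keep one record per host."""
    def run_one(host):
        try:
            return host, {"state": "SUCCESS", "result": upgrade(host)}
        except Exception as exc:
            # A failure on one host must not stop the others.
            return host, {"state": "ERROR", "result": str(exc)}

    with ThreadPoolExecutor(max_workers=max(1, len(hosts))) as pool:
        return dict(pool.map(run_one, hosts))


executions = run_with_items(["10.0.0.3", "10.0.0.4", "10.0.0.5"])
for host, record in executions.items():
    print(host, record["state"], record["result"])
```

A plain script like this still loses all its records if the machine running it dies, which is exactly the gap that a persistent, highly available workflow service fills.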

An attentive reader may have noticed the suffix "proxied" in the action name "std.ssh_proxied" and asked: "What does it mean? Why not just std.ssh, which Mistral also has in its standard action pack?" This brings us back to the assumption about how we access the guest operating systems. By default, Mistral can't get secure shell access to guest VMs because of how the cloud isolates the management network, where all OpenStack services reside, from guest networks. In fact, if a server doesn't have a floating IP, no service running in the management network can reach it: it is simply in a different network. In our example we assume that at least one VM in the tenant has a floating IP address, so it can be used as an ssh gateway through which we can ssh to the other VMs. That's why we use the special action "std.ssh_proxied", where "proxied" means that a proxy VM is used to access all tenant VMs.



Mistral is a distributed, highly available system designed not only to survive infrastructure failures but also to keep its workflows running through them. That’s why we can be sure that a process automated with a workflow service like Mistral will finish even if components of the control system fail, which in our case are the Mistral engine and executors.

==== Adding notifications ====
What our workflow is missing is the ability to notify a cloud operator when the kernel upgrade has completed on all servers. To do that we just need to add one more task; let’s call it “send_success_email”. The full workflow now looks like this:

---
version: '2.0'

upgrade_kernel:
  input:
    - username: ubuntu
    - private_key_filename
    - gateway_host
    - email_info: null # [to_email, from_email, smtp_server, smtp_password]

  tasks:
    get_hosts:
      action: nova.servers_list
      publish:
        hosts: <% task(get_hosts).result.select({ip => $.addresses.get($.addresses.keys.first).where($.get("OS-EXT-IPS:type") = fixed).first.addr}).ip %>
      keep-result: false
      on-success:
        - upgrade

    upgrade:
      with-items: host in <% $.hosts %>
      action: std.ssh_proxied
      input:
        host: <% $.host %>
        gateway_host: <% $.gateway_host %>
        username: <% $.username %>
        private_key_filename: <% $.private_key_filename %>
        cmd: "sudo apt-get update && sudo apt-get install linux-image-generic-lts-$(lsb_release -sc) -y && sudo reboot"
      on-success:
        - send_success_email: <% $.email_info != null %>

    send_success_email:
      action: std.email
      input:
        subject: Linux kernel on tenant VMs successfully updated
        body: |
          Number of updated VMs: <% $.hosts.len %>

          -- Thanks
        from_addr: <% $.email_info.from_email %>
        to_addrs: [<% $.email_info.to_email %>]
        smtp_server: <% $.email_info.smtp_server %>
        smtp_password: <% $.email_info.smtp_password %>

Note that along with the new task we’ve also added an “on-success” clause to the “upgrade” task; it defines a transition to “send_success_email” on successful completion of “upgrade”. This transition is conditional: it fires only if we passed the data needed to send an email as an input parameter. That’s why this version of the workflow has a new input parameter called “email_info”. It is expected to be a data structure with the fields “from_email”, “to_email”, “smtp_server” and “smtp_password”.

==== Uploading workflow to Mistral ====
Assuming we have installed the Mistral client, we can upload this workflow to Mistral with the command:

$ mistral workflow-create update_kernel.yaml

Normal output of this command (and most others) shows a table with a newly uploaded workflow. It may look like:

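+----------------+------+------------------------------+---------------------+------------+
| Name           | Tags | Input                        | Created at          | Updated at |
+----------------+------+------------------------------+---------------------+------------+
| upgrade_kernel |      | username=ubuntu, private_... | 2015-10-19 10:32:27 | None       |
+----------------+------+------------------------------+---------------------+------------+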

NOTE: To print all available workflows, run:

$ mistral workflow-list

==== Running the workflow ====
Now that Mistral knows about workflow “upgrade_kernel”, we can start it by running:

$ mistral execution-create upgrade_kernel input.json

The file input.json should contain the workflow input data in JSON, such as:

{
    "private_key_filename": "my_key.pem",
    "gateway_host": "172.16.74.8"
}
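If we also want the notification email, the input has to carry the optional “email_info” structure. The short Python sketch below builds such an input and prints it as JSON. All values in email_info (addresses, SMTP server, password) are placeholders of our own, and the exact form of the SMTP server value is an assumption.

```python
import json

workflow_input = {
    "private_key_filename": "my_key.pem",
    "gateway_host": "172.16.74.8",
    # Optional: leave this key out to skip the "send_success_email" task.
    "email_info": {
        "to_email": "admin@example.com",       # placeholder
        "from_email": "mistral@example.com",   # placeholder
        "smtp_server": "smtp.example.com",     # placeholder
        "smtp_password": "secret",             # placeholder
    },
}

# Serialize with straight double quotes, as JSON requires.
payload = json.dumps(workflow_input, indent=4)
print(payload)
```

Redirecting the printed output into input.json gives a file that can be passed to "mistral execution-create" as shown above.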

==== Configuring a Cron Trigger ====
To make this workflow run periodically we need to create a cron trigger:

$ mistral cron-trigger-create update_kernel_weekly upgrade_kernel --pattern "0 2 * * mon"

In order to print all active cron triggers run:

$ mistral cron-trigger-list

From now on the workflow we created will be started every Monday at 2:00 AM and will update the Linux kernel on all servers in the tenant we are logged in to.
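To double-check how the pattern "0 2 * * mon" (minute 0, hour 2, any day of month, any month, Monday) translates into concrete start times, here is a small Python sketch that computes the next few runs. It handles only this one pattern, ignores time zones, and the function name is ours; it is not how Mistral evaluates cron patterns.

```python
from datetime import datetime, timedelta


def next_monday_2am(now):
    """Return the first Monday 02:00 strictly after `now`."""
    candidate = now.replace(hour=2, minute=0, second=0, microsecond=0)
    # weekday() is 0 for Monday, so this is the number of days until Monday.
    candidate += timedelta(days=(0 - now.weekday()) % 7)
    if candidate <= now:
        # Today is Monday but 02:00 has already passed: wait a week.
        candidate += timedelta(days=7)
    return candidate


start = datetime(2015, 10, 19, 10, 32)  # a Monday, after 02:00
runs = [next_monday_2am(start)]
for _ in range(2):
    runs.append(next_monday_2am(runs[-1]))

for run in runs:
    print(run.strftime("%a %Y-%m-%d %H:%M"))
```

Starting from the upload time shown earlier, the first run falls on the following Monday, since 2:00 AM of the current Monday has already passed.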

What’s important about Mistral cron triggers is that they are also a distributed, fault-tolerant mechanism. If a number of Mistral engines crash, cron triggers keep working because there’s no single point of failure for them.

If we no longer need to upgrade the kernel periodically, we can just delete the trigger:

$ mistral cron-trigger-delete update_kernel_weekly