Trove/Blueprints/guest-agent-datastore-control-abstraction

Description

An important part of the Guest Agent's role in Trove is to start, stop, and manage the software processes or services that constitute the data store.

For example, in the case of MySQL, this includes starting, stopping, restarting, and configuring the automatic start of the mysql service.

Each data store implements this in its own unique way.

The current code attempts to provide a basic framework to do this. Some parts of it are in common/operating_system.py and others are in the data store specific implementation.

This blueprint is being submitted in reaction to recent changes (https://review.openstack.org/81914 and https://bugs.launchpad.net/trove/+bug/1295362).

Justification/Benefits

Different data stores have different recommendations for how their services should be started and stopped.

Service Starting

Different data stores implement their own mechanisms with varying levels of correctness. Refer for example to the case of http://bugs.mysql.com/bug.php?id=44300 where the MySQL server does not properly daemonize. https://review.openstack.org/81914 is an example of a similar situation.

While it is safe and recommended that some services be started with a simple program invocation such as

sudo /usr/sbin/<service>

this may not be the case for all services.

There are also cases where a data store requires multiple services and these need to be started in a specific order.

Service Stopping

The same is also the case with service stopping. While some services prefer the

service <name> stop

and provide the mechanism for that, others prefer that you do something like

<service> --shutdown

At issue is the fact that several people have already attempted to address this problem. The /etc/init.d mechanism in SysV and the upstart mechanism that came later are just two examples of this.

However, not all data stores implement these abstractions.

For example, while Cassandra provides a script to make this happen (http://wiki.apache.org/cassandra/RunningCassandra) and some distributions provide the wrappers that make the service command work (http://www.datastax.com/docs/1.0/references/start_stop_ref)

service cassandra stop

there is no such mechanism provided by MongoDB, which would rather the user do one of the following, in order of preference (http://docs.mongodb.org/manual/tutorial/manage-mongodb-processes/)

mongod --shutdown

kill <mongod process ID>

While the documentation is ambiguous on the issue of preference, messaging a service to perform an orderly shutdown is certainly more palatable than delivering it a SIGTERM and having it attempt to perform the cleanup in that signal handler. Properly implemented, the two could be equivalent, but there is no guarantee of that.

The same argument can be made in the case of querying service status. While abstractions like /etc/init.d and upstart provide a framework for doing this, not all data stores implement this.

The issue of process group leaders

The specific cases of https://review.openstack.org/81914 and potentially http://bugs.mysql.com/bug.php?id=44300 have to do with whether the script/program being invoked properly calls setsid() itself or relies on the framework provided by /etc/init.d or upstart to do that.
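To make the distinction concrete, here is a minimal Python sketch of a launcher that does not depend on the started program calling setsid() itself. It is an illustration only and is not taken from the Trove code base; it assumes Python 3 on a POSIX system, and the helper name launch_detached is hypothetical.

```python
import subprocess


def launch_detached(command, stdout=subprocess.DEVNULL):
    """Start command as the leader of a new session.

    start_new_session=True makes the child call setsid() before exec,
    which is the step an init framework would otherwise take care of.
    """
    return subprocess.Popen(
        command,
        stdin=subprocess.DEVNULL,
        stdout=stdout,
        stderr=subprocess.DEVNULL,
        start_new_session=True,
    )
```

A process started this way is its own session and process group leader, so it is not affected by signals sent to the guest agent's process group.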

The implementation in common/operating_system.py attempts to make a best guess based on what it finds on the machine running the guest agent. This does not always work, as the cases described above show.

For these reasons, this blueprint proposes that we implement a simple abstraction within the trove guest agent that mirrors the one in /etc/init.d or upstart whereby the person authoring the guest agent provides the proper commands that must be executed in order to perform a well defined set of activities on the service.

The form and format in which this information is provided has not yet been finalized. As agreed (in the mid-cycle meetup in Austin), I'm proposing this blueprint and describing the benefits and merits and not postulating a specific implementation. Once the blueprint is approved in principle, I will give the implementation more thought.

Use Case Requirements

The implementation should provide a mechanism in which each data store can provide to the guest agent the proper command or command sequence to execute in order to perform specific operations for that data store.
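Purely as an illustration of this requirement, and not as a proposed design (the form in which the information is provided is still open), the per-data-store information could be as simple as a mapping from operation names to commands. The class name ServiceCommands and the MySQL command strings below are assumptions made for the example.

```python
import shlex


class ServiceCommands:
    """Hypothetical holder for the commands one data store supplies."""

    def __init__(self, commands):
        # Maps an operation name (e.g. "start") to a shell-style command string.
        self._commands = dict(commands)

    def argv_for(self, operation):
        """Return the argument vector for an operation, ready for subprocess."""
        return shlex.split(self._commands[operation])


# Example values only; the real commands depend on the distribution and image.
MYSQL_COMMANDS = ServiceCommands({
    "start": "sudo service mysql start",
    "stop": "sudo service mysql stop",
    "status": "sudo service mysql status",
})
```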

Described Operations

service start
service stop
service status
service restart
service autostart enable
service autostart disable

Operations that have multiple commands

As in the case of service stop where documentation for the data store provides a series of hammers to use (larger and larger following the adage, when in doubt use a bigger hammer), the implementation should provide a mechanism for describing this series of hammers.
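The escalating sequence described above could be sketched as follows. This is only an illustration in Python, not a proposed implementation; the function name run_first_successful and the default timeout are assumptions. A zero exit status is treated as success, which a real guest agent would probably want to confirm with a status check.

```python
import subprocess


def run_first_successful(commands, timeout=60):
    """Try each command in order; return the index of the first that exits 0.

    Returns None when every command fails, times out, or cannot be launched.
    """
    for index, argv in enumerate(commands):
        try:
            result = subprocess.run(argv, timeout=timeout)
        except (OSError, subprocess.TimeoutExpired):
            continue  # Missing binary or hung command: reach for the next hammer.
        if result.returncode == 0:
            return index
    return None
```

For MongoDB, the list would hold the two commands quoted earlier, mongod --shutdown followed by kill <mongod process ID>, in that order.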

Operations that are required and those that are optional

While service stop and start are obviously required, restart is reasonably an optional command. status may not be easy to provide for each data store and may also be optional.

The implementation must provide a mechanism whereby some operations may not be implemented for some data stores.
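One way to picture this requirement is a check that insists on the required operations and reports a distinct error when an optional one is absent. The sketch below is hypothetical: the names validate, lookup, and OperationNotSupported are invented for the example, and treating the two autostart operations as optional is an assumption, since the text above classifies only start, stop, restart, and status.

```python
REQUIRED_OPERATIONS = frozenset({"start", "stop"})
# Assumption: the autostart operations are grouped with the optional ones.
OPTIONAL_OPERATIONS = frozenset(
    {"status", "restart", "autostart enable", "autostart disable"}
)


class OperationNotSupported(Exception):
    """Raised when a data store does not provide an optional operation."""


def validate(commands):
    """Reject a table that lacks a required operation or names an unknown one."""
    provided = set(commands)
    missing = REQUIRED_OPERATIONS - provided
    unknown = provided - REQUIRED_OPERATIONS - OPTIONAL_OPERATIONS
    if missing or unknown:
        raise ValueError(f"missing={sorted(missing)} unknown={sorted(unknown)}")


def lookup(commands, operation):
    """Return the command for an operation, or signal that it is unsupported."""
    if operation not in commands:
        raise OperationNotSupported(operation)
    return commands[operation]
```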

Services that require multiple processes, have specified command sequences

Some services require multiple processes and often have dependencies. The implementation must provide some mechanism in which these can be described.

This may be merely a matter of supporting the specification of a portable shell script that provides all the appropriate machinery; it isn't implied here that the implementation should include some mechanism to describe complex state machines of service status.
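If the sequence were expressed as a list of commands rather than a shell script, the simplest possible handling would run them in order and stop at the first failure. The following Python sketch shows only that; the function name run_sequence is hypothetical and no dependency tracking or state machine is implied.

```python
import subprocess


def run_sequence(commands):
    """Run commands in order; return how many completed before the first failure."""
    completed = 0
    for argv in commands:
        try:
            result = subprocess.run(argv)
        except OSError:
            break  # The command could not be launched at all.
        if result.returncode != 0:
            break
        completed += 1
    return completed
```

A caller can compare the return value with the length of the list to decide whether the whole service came up.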

Scope

This will affect all data stores. Therefore it will require recoding all existing data store implementations as well as providing best practices for future data stores.

Impacts

This is likely to be a substantial amount of code and something that should be well vetted. I have no delusions that this is an Icehouse thing.

Configuration

Yes, that's what this is all about.

Database

Does this impact any existing tables? If so, which ones?
Are the changes forward and backward compatible?
Be sure to include the expected migration process

I don't believe it will, but I'll leave that for later when we get to implementation.

Public API

Does this change any API that an end-user has access to?
Are there any exceptions in terms of consistency with other APIs?

I believe not.

Internal API

Does this change any internal messages between API and Task Manager or Task Manager to Guest