
Trove/event simulator

Revision as of 19:51, 24 September 2014 by Tim Simpson

The event simulator is a module used by the Trove integration tests to simulate time and run the tests as quickly as possible while also making sure they run deterministically.

Fake Mode Recap

To explain the event simulator, a recap of Trove's "fake mode" is in order. Fake mode is a special configuration of Trove in which certain major components, such as the Nova API client, are replaced by long-lived test doubles. These "fakes" are more complex than mocks in that they are long-lived and work to present a consistent facade of the behavior of the components they represent.

For example, when a "fake" Nova server is provisioned, it actually creates an object in a dictionary which is marked as being built. Requests to "show" this object using the Nova client's servers.get method will return a server in a building status. The creation of the fake server will also spawn an eventlet thread that will, after a few seconds, update the status of the fake server object to "active." Because of this, Trove can use this fake Nova client in place of the real thing. Similar fakes are created for Cinder, the guestagent, and other major components.
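The fake described above can be pictured with a minimal sketch. The class names, the build_seconds parameter, and the injected spawn_after callable are assumptions made for illustration and are not Trove's actual fake Nova client; the delayed callback stands in for the eventlet thread mentioned above.

```python
import itertools


class FakeServer(object):
    """Hypothetical stand-in for a Nova server record."""

    def __init__(self, server_id, name):
        self.id = server_id
        self.name = name
        self.status = "BUILD"  # a newly created server starts out building


class FakeServers(object):
    """Hypothetical fake of the Nova client's `servers` manager."""

    def __init__(self, spawn_after):
        # spawn_after(seconds, func) schedules func to run later. It is
        # injected so this sketch needs no third-party dependencies.
        self._spawn_after = spawn_after
        self._db = {}  # the dictionary holding every fake server
        self._ids = itertools.count(1)

    def create(self, name, build_seconds=5):
        server = FakeServer(str(next(self._ids)), name)
        self._db[server.id] = server

        def finish_build():
            server.status = "ACTIVE"

        # The status only changes once the scheduled callback runs.
        self._spawn_after(build_seconds, finish_build)
        return server

    def get(self, server_id):
        return self._db[server_id]


# Usage: collect scheduled callbacks in a list instead of really waiting.
scheduled = []
servers = FakeServers(lambda seconds, func: scheduled.append((seconds, func)))
server = servers.create("test-server")
status_before = servers.get(server.id).status  # still building
for _, func in scheduled:
    func()  # pretend the delay has elapsed
status_after = servers.get(server.id).status
```

Because the fake keeps its state in one dictionary, every later call to get sees the same server object, which is what lets the rest of the application treat it like the real client.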

Because the fake objects need to wait a bit before changing their state, fake mode takes a while to run even though nothing is really happening. This is exacerbated by the tests, which need to poll the status of Trove objects repeatedly, sleeping between each poll.

One solution would be to rewrite all of the tests to never sleep and change the fake objects accordingly. The cost would be that the same tests could no longer be used against a real deployment of Trove and, conversely, that the real tests written for Trove could only be checked by running them against a real deployment, which may be expensive in terms of time and resources.

This is where the event simulator comes in.

Event Simulator

The root of the problem is that time.sleep is called to wait for actions which are themselves delayed for no reason other than to make sure the tests can order themselves correctly.

Event Simulator fixes things (where "fix" is meant in the Vegas sense of the word) by monkey patching the time.sleep calls as well as calls to spawn eventlet threads.

When the app asks to launch a thread, the event simulator places the function that would be run into a "fake_thread", of which it keeps a collection. The initial fake thread ends up being the tests themselves. All of these fake threads are managed by a central loop (kind of like a reactor pattern). Each thread is run if and only if it doesn't need to sleep anymore. Fake threads signal that they wish to sleep by calling the fake_sleep method of event_simulator, which replaces time.sleep during monkey patching.
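The idea of simulated time can be sketched in a few lines. This is a deliberately simplified, single-threaded model and not the event_simulator module itself: the SimulatedClock class and its spawn_after method are hypothetical names, and delayed callbacks stand in for fake threads.

```python
class SimulatedClock(object):
    """Hypothetical, single-threaded model of simulated time."""

    def __init__(self):
        self.now = 0.0
        self._pending = []  # list of (wake_time, func) tuples

    def spawn_after(self, seconds, func):
        """Stand-in for spawning a thread that starts after a delay."""
        self._pending.append((self.now + seconds, func))

    def fake_sleep(self, seconds):
        """Stand-in for time.sleep: advance the clock and run what is due."""
        wake_at = self.now + seconds
        while True:
            due = [e for e in self._pending if e[0] <= wake_at]
            if not due:
                break
            # Always run the earliest event first so the order is fixed.
            event = min(due, key=lambda e: e[0])
            self._pending.remove(event)
            self.now = max(self.now, event[0])
            event[1]()
        self.now = max(self.now, wake_at)


clock = SimulatedClock()
log = []
clock.spawn_after(5, lambda: log.append(("server active", clock.now)))
clock.spawn_after(2, lambda: log.append(("volume available", clock.now)))

clock.fake_sleep(3)          # returns immediately in real time
after_first_sleep = list(log)
clock.fake_sleep(3)
```

No real time passes in either call to fake_sleep, yet the events still fire in the order their delays dictate, which is the property the tests rely on.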

This by itself would not work if the code running for these fake threads were associated with regular eventlet threads, which might run at any point.

That leads to the second problem being addressed: test runs would not be deterministic, because the speed of the test machine and other factors could lead to threads running in a different order. To fix both problems, event_simulator wraps all of the threads it runs in simulated coroutines. This means only one fake thread gets run at any given time.

The way this works is that every eventlet thread is wrapped in an event_simulator.Coroutine object, which waits on a semaphore to run. The code which runs the coroutine is in a different thread, and as soon as it asks the coroutine to run, it waits on its own semaphore. When the running coroutine finishes or wishes to sleep, it releases the calling thread's semaphore and goes back to waiting on its own. (Greenthreads couldn't be used, as eventlet itself is built on them, which leads to unpredictable results.)
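The semaphore handoff described above can be sketched with the standard threading module. This is an illustration under assumptions, not the real event_simulator.Coroutine: the method names run and sleep, the finished flag, and the semaphore attribute names are all hypothetical.

```python
import threading


class Coroutine(object):
    """Hypothetical sketch: runs func in a real thread, one turn at a time."""

    def __init__(self, func):
        self._my_sem = threading.Semaphore(0)      # the coroutine waits on this
        self._caller_sem = threading.Semaphore(0)  # the caller waits on this
        self._func = func
        self.finished = False
        self._thread = threading.Thread(target=self._body)
        self._thread.daemon = True
        self._thread.start()

    def _body(self):
        self._my_sem.acquire()  # block until first asked to run
        try:
            self._func(self)
        finally:
            self.finished = True
            self._caller_sem.release()  # hand control back for the last time

    def run(self):
        """Called by the scheduler: let the coroutine run until it yields."""
        self._my_sem.release()
        self._caller_sem.acquire()

    def sleep(self):
        """Called inside the coroutine: give control back to the caller."""
        self._caller_sem.release()
        self._my_sem.acquire()


trace = []


def worker(co):
    trace.append("worker step 1")
    co.sleep()
    trace.append("worker step 2")


co = Coroutine(worker)
trace.append("main before")
co.run()
trace.append("main between")
co.run()
trace.append("main after")
```

Although two real threads exist, each one blocks on its semaphore whenever the other is running, so the trace always comes out in the same interleaved order.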

The end result is that only one thread of execution ever runs at a time, and that time.sleep *must* be called for events to proceed in fake mode. For example, if you did a busy wait by simply polling for a resource without sleeping, nothing would ever happen and the tests would hang.
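A polling loop that works under these rules has to sleep on every iteration. The sketch below assumes a hypothetical poll_until helper with an injected sleep function; it is not taken from Trove, and the tiny fake clock only exists to make the example runnable on its own.

```python
def poll_until(get_status, wanted, sleep, interval=1, timeout=60):
    """Hypothetical helper: poll until get_status() returns wanted."""
    waited = 0
    while get_status() != wanted:
        if waited >= timeout:
            raise RuntimeError("timed out waiting for %r" % (wanted,))
        sleep(interval)  # this call is what lets simulated time advance
        waited += interval
    return waited


# A tiny fake: the status flips to ACTIVE once 5 simulated seconds pass.
state = {"now": 0}


def fake_sleep(seconds):
    state["now"] += seconds


def get_status():
    return "ACTIVE" if state["now"] >= 5 else "BUILD"


waited = poll_until(get_status, "ACTIVE", fake_sleep)
```

If the sleep call were removed from the loop, the fake clock would never advance and the loop would spin forever, which is the hang described above.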

Repl Demos

It can be easier to understand how this works if you run a repl to experiment.

Start by entering your Trove directory and running the tests with tox: "tox -epy27".

Once they finish, run the following:

   .tox/py27/bin/python run_tests.py --stop --group=dbaas.guest.initialize --nocapture --repl

This executes the run_tests.py script directly. The --repl flag tells the code to enter an interactive REPL as soon as the tests finish, which is useful if you're trying to debug test code. If you only want the REPL, pass a group name that does not exist, such as "nadda", so that no tests are executed.

When the repl starts, enter these lines:

   >>> from trove.tests.util import test_config
   >>> from trove.tests.util import create_dbaas_client
   >>> from trove.tests.util.users import Requirements
   >>> reqs = Requirements(is_admin=True)
   >>> user = test_config.users.find_user(reqs)
   >>> client = create_dbaas_client(user)
   >>> instance = client.instances.create("Test Instance", 1, {'size':1})
   >>> instance.status
   u'BUILD'

The last line prints the status of the instance, which will be BUILD.

Try this:

   >>> instance = client.instances.get(instance.id)
   >>> instance.status
   u'BUILD'

Notice that no matter how many times you run those last two lines, or how long you wait in real life, the status will always be BUILD.

Now try this:

   >>> import time
   >>> time.sleep(3)
   CREATING os_admin @ %
   >>> instance = client.instances.get(instance.id)
   >>> instance.status
   u'BUILD'
   >>> time.sleep(30)
   >>> instance = client.instances.get(instance.id)
   >>> time.sleep(30)
   >>> instance.status
   u'ACTIVE'

The Trove instance does eventually become active, but only after time.sleep has been called enough times, or with a large enough argument, for enough simulated time to pass. In this context, time.sleep is just a call into the event simulator.

You can use the repl like this in order to explore how Trove works internally and better understand it, which can help not only with testing but also when developing new features.

Here's another example you can paste into the REPL. It uses the Trove management API to show how the volume, the server, and finally the guest agent come online to turn a Trove instance's status to ACTIVE:

   import time
   from trove.tests.util import create_client

   # One regular client to create the instance, one admin client to inspect it.
   client = create_client(is_admin=False)
   admin_client = create_client(is_admin=True)
   instance = client.instances.create("Test Instance", 1, {'size':1})

   def print_status():
       # The management API exposes the underlying volume, server, and agent.
       mgmt_instance = admin_client.management.show(instance.id)
       vs = ("None" if mgmt_instance.volume is None
                    else mgmt_instance.volume['status'])
       ns = ("None" if mgmt_instance.server is None
                    else mgmt_instance.server['status'])
       print("Trove API Status = %s" % mgmt_instance.status)
       print("   volume status = %s" % vs)
       print("   server status = %s" % ns)
       print("    agent status = %s" % mgmt_instance.service_status)

   # Each time.sleep call advances simulated time, so the statuses progress.
   print_status()
   time.sleep(1)
   print_status()
   time.sleep(1)
   print_status()
   time.sleep(1)
   print_status()