Break and Crash

Title: Crash scenario generator for OpenStack operations.

Benefits: Generating crash scenarios at a specific state of an instance or image, while a particular operation is in progress, is essential for development and testing activities such as crash recovery testing.

Detailed Description: Given an operation and a state of an instance or image, the generator should crash exactly at that state while the operation is being performed. The state of an instance is defined by the state of the VM (vm_state) and the state of the executing task (task_state). Similarly, the state of an image is defined by the status of the image object. To crash at a specific state of an instance or image, the generator kills the process of the service performing that operation at that moment. Such a service can be the scheduler, compute, network (aurora or neutron), glance, or another participating service. To eliminate the possibility of stray messages remaining in the message bus or its queues, the message bus is also restarted. To break at the exact state of the instance or image, its state information is checked continuously in the corresponding database.
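The continuous state check amounts to a polling loop. The following is a minimal sketch under stated assumptions: wait_for_state, fetch_state, poll_interval and timeout are hypothetical names, and fetch_state stands for whatever query reads vm_state/task_state of an instance (or status of an image) from its database. For an image, desired could be {"status": "saving"}.

```python
import time
from typing import Callable, Mapping, Optional


def wait_for_state(
    fetch_state: Callable[[], Optional[Mapping[str, Optional[str]]]],
    desired: Mapping[str, Optional[str]],
    poll_interval: float = 0.05,
    timeout: float = 600.0,
) -> bool:
    """Poll fetch_state() until every key in `desired` has the desired value.

    Returns True as soon as the state matches and False if `timeout`
    seconds pass first. fetch_state may return None while the instance or
    image row does not exist yet (e.g. just after the request was sent).
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        current = fetch_state()
        if current is not None and all(
            current.get(key) == value for key, value in desired.items()
        ):
            return True
        # A short interval narrows the window in which a state can be missed.
        time.sleep(poll_interval)
    return False
```

The poll interval is the main tuning knob: a state such as task_state "spawning" may last only briefly, so the interval should be well below its expected duration.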

This is done in two threads. One thread waits to kill the process that will perform the required operation; the other thread triggers the operation, which in turn makes the concerned service take up the required task. Before starting the threads, the process ID of the concerned service must be determined so that the process can be killed immediately. The first thread continuously queries the nova or glance database (or an equivalent) for the state of the corresponding instance or image. Once the desired state is reached, it kills the process of the service performing that task.
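A minimal sketch of the two-thread arrangement, assuming every step is supplied as a callable (all function and parameter names here are hypothetical). In a real run, kill_pid could be os.kill(pid, signal.SIGKILL) on the local host or an equivalent remote command, and wait_until_target_state would poll the database as described above.

```python
import threading
from typing import Callable


def crash_during_operation(
    find_pid: Callable[[], int],
    wait_until_target_state: Callable[[], bool],
    kill_pid: Callable[[int], None],
    trigger_operation: Callable[[], None],
    restart_message_bus: Callable[[], None],
) -> bool:
    """Kill the service process when the target state is reached.

    Returns True if the watcher saw the target state and killed the process.
    """
    # Resolve the PID before starting any thread, so the kill is immediate.
    pid = find_pid()
    result = {"killed": False}

    def watcher() -> None:
        # Thread 1: wait for the target state, then kill the service and
        # restart the message bus to flush stray messages.
        if wait_until_target_state():
            kill_pid(pid)
            restart_message_bus()
            result["killed"] = True

    # Thread 2 only triggers the operation (e.g. an API call or CLI command).
    watcher_thread = threading.Thread(target=watcher, name="crash-watcher")
    trigger_thread = threading.Thread(target=trigger_operation, name="crash-trigger")

    # Start watching before triggering so no early state is missed.
    watcher_thread.start()
    trigger_thread.start()
    watcher_thread.join()
    trigger_thread.join()
    return result["killed"]
```

Passing the steps in as callables keeps the orchestration independent of the service being crashed, so the same function can target compute, scheduler or glance.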

For example, to crash an instance while it is spawning, first find the PID of the openstack-nova-compute service. Crashing while spawning means that vm_state must be "building" and task_state must be "spawning". Next, while the first thread polls "vm_state" and "task_state" of the new instance in the "instances" table of the "nova" database, the second thread triggers the corresponding "nova boot" command with the given instance name. Some time later, when control passes to the openstack-nova-compute service, it updates the instance's row with task_state "spawning". The first thread keeps checking that row in the "instances" table, waiting for the desired states. The moment it finds them, it kills the openstack-nova-compute process using the previously obtained process ID. It also restarts the RabbitMQ message service to clean up any related messages left in it.
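The database and PID parts of this example might look like the following sketch. It is written against the Python DB-API and tested with sqlite3 only so that it runs standalone; the real nova database is usually MySQL/MariaDB or PostgreSQL, whose drivers use "%s" placeholders instead of "?". The column names are assumed from the nova instances table, and the systemctl commands and unit names (openstack-nova-compute, rabbitmq-server) are assumptions for a systemd-based deployment.

```python
from typing import Optional, Tuple

# Only the columns used here are assumed; the real table has many more.
STATE_QUERY = (
    "SELECT vm_state, task_state FROM instances "
    "WHERE display_name = ? AND deleted = 0 "
    "ORDER BY created_at DESC LIMIT 1"
)

# Target state for "crash while spawning": (vm_state, task_state).
SPAWNING = ("building", "spawning")

# Hypothetical command lines; adjust unit names to your deployment.
FIND_COMPUTE_PID = ["systemctl", "show", "--property=MainPID", "openstack-nova-compute"]
RESTART_RABBITMQ = ["systemctl", "restart", "rabbitmq-server"]


def fetch_instance_state(conn, name: str) -> Optional[Tuple[str, Optional[str]]]:
    """Return (vm_state, task_state) of the newest live instance `name`, or None."""
    cur = conn.cursor()
    try:
        cur.execute(STATE_QUERY, (name,))
        row = cur.fetchone()
    finally:
        cur.close()
    # End the read transaction so the next poll sees fresh data
    # (matters under MySQL's default REPEATABLE READ isolation).
    conn.commit()
    return None if row is None else (row[0], row[1])


def parse_main_pid(systemctl_output: str) -> int:
    """Parse 'MainPID=1234' as printed by `systemctl show --property=MainPID`."""
    pid = int(systemctl_output.strip().split("=", 1)[1])
    if pid == 0:
        # systemd reports MainPID=0 when the service is not running.
        raise RuntimeError("service is not running")
    return pid
```

A watcher would call fetch_instance_state in a loop and compare the result with SPAWNING before issuing the kill.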

Since the services are spread across multiple VMs, the generator must log in to the remote VMs programmatically to perform the desired operations. This may require modifying iptables entries on the remote systems as well as database configuration files, such as pg_hba.conf for PostgreSQL. If SSL is enabled for database connections, the required certificates must be configured properly on all participating VMs. Because multiple VMs take part, they need fast, low-latency network connectivity between them, so that fetching records from the database and executing commands on remote systems complete quickly enough not to miss a short-lived state.
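The remote plumbing can be sketched as a few helpers that build commands rather than hard-coding them. This is a minimal sketch, assuming key-based OpenSSH login as root, a PostgreSQL nova database on port 5432, and a hypothetical management network 192.0.2.0/24 (a documentation-only range); all helper names are hypothetical.

```python
import shlex
import subprocess
from typing import List, Optional

# Hypothetical management network (192.0.2.0/24 is reserved for documentation).
MGMT_CIDR = "192.0.2.0/24"

# pg_hba.conf entry (TYPE DATABASE USER ADDRESS METHOD): "hostssl" admits only
# SSL-encrypted TCP connections from MGMT_CIDR to the nova database.
PG_HBA_LINE = f"hostssl  nova  nova  {MGMT_CIDR}  md5"


def ssh_argv(host: str, remote_cmd: List[str], user: str = "root") -> List[str]:
    """Build an argv that runs remote_cmd on host through OpenSSH.

    BatchMode=yes makes ssh fail instead of prompting for a password,
    so key-based login must already be configured on the target VM.
    """
    return ["ssh", "-o", "BatchMode=yes", f"{user}@{host}", shlex.join(remote_cmd)]


def allow_postgres_argv(cidr: str = MGMT_CIDR, port: int = 5432) -> List[str]:
    """iptables rule, inserted at the top of INPUT, admitting PostgreSQL from cidr."""
    return ["iptables", "-I", "INPUT", "-p", "tcp", "-s", cidr,
            "--dport", str(port), "-j", "ACCEPT"]


def postgres_dsn(host: str, dbname: str, user: str,
                 sslrootcert: Optional[str] = None) -> str:
    """libpq connection string; with a CA file, require a verified SSL server."""
    parts = [f"host={host}", "port=5432", f"dbname={dbname}", f"user={user}"]
    if sslrootcert:
        parts += ["sslmode=verify-full", f"sslrootcert={sslrootcert}"]
    return " ".join(parts)


def run_remote(host: str, remote_cmd: List[str], timeout: float = 30.0) -> str:
    """Run remote_cmd on host and return its stdout; raises on non-zero exit."""
    done = subprocess.run(ssh_argv(host, remote_cmd), check=True,
                          capture_output=True, text=True, timeout=timeout)
    return done.stdout
```

For example, run_remote(db_host, allow_postgres_argv()) would open the database port on the database VM, and the watcher could then connect with postgres_dsn(...) through a libpq-based driver.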