Cloudpulse

CloudPulse - OpenStack Health Service

Introduction
Cloud applications such as VNFs, VNFMs, and NSO have stringent SLAs: they need to be highly available, with an uptime of 99.9 (99)%. Application availability depends on the cloud infrastructure, so these applications need to be aware of the health of the OpenStack services. When an infrastructure failure is detected early, the applications can be moved to a different cloud and the cloud operators can be notified. The key takeaway is to catch and handle failures before the customer experiences an application failure.

When is OpenStack healthy?

 * 1)  All OpenStack services receive queries and reply with the expected result.
 * 2)  Packets can be sent and received on the tenant and external networks.

Requirements
Provide a tool that checks the health of the cloud.
 * 1) Should be lightweight, non-disruptive, and light on resources.
 * 2) Should provide configurable functional testing.
 * 3) Should verify resource states after an OpenStack upgrade.
 * 4) Should work on all OpenStack installs, i.e. it should be agnostic to the OpenStack distribution and the deployment model.
 * 5) Should work for both tenants and operators.
 * 6) Should provide both a CLI and an API.

Different types of health checks
1. Operator test: requires cloud-admin and operator access.
 * Check that all services are running and listening on their ports.
 * Check that the Docker containers are up on the nodes.
 * Check that the nodes are reachable.
 * Check the Galera cluster status.
 * Check that the Ceph cluster is in a healthy state.
 * Check the cluster status of the infra components RabbitMQ and Percona (mysql 'wsrep' and rabbitmqctl cluster_status).
 * If OpenStack is in HA mode, test the HAProxy and each of the services behind the HAProxy (run 'a' and 'b').
 * If Pacemaker is installed, use 'crm status' or 'pcs status'.
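
The first operator check (services listening on their ports) could be sketched in Python roughly as follows. This is only an illustration, not CloudPulse's own implementation: the function names `is_port_open` and `check_services` are made up here, and the host and port numbers in the example are assumed defaults that a real deployment may override.

```python
import socket


def is_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable host, and timeouts.
        return False


def check_services(services, timeout=2.0):
    """Map each service name to True (listening) or False (not reachable)."""
    return {
        name: is_port_open(host, port, timeout)
        for name, (host, port) in services.items()
    }


if __name__ == "__main__":
    # Assumed endpoints for illustration; adjust to the actual deployment.
    services = {
        "keystone": ("127.0.0.1", 5000),
        "glance": ("127.0.0.1", 9292),
        "nova": ("127.0.0.1", 8774),
        "neutron": ("127.0.0.1", 9696),
    }
    for name, up in sorted(check_services(services, timeout=1.0).items()):
        print(f"{name}: {'listening' if up else 'NOT listening'}")
```

A successful TCP connect only shows that something is listening; it does not prove the service answers queries correctly, which is what the endpoint tests are for.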

2. Endpoint test
 * keystone service-list
 * glance image-list
 * cinder list
 * nova list
 * neutron net-list
 * Log in to the Horizon page

3. Functional test
 * Create a tenant, create a network, upload an image, create two VMs, and run ping between the VMs.
 * Create a VM, create a volume, and attach the volume to the VM.
 * Detach the volume, delete the volume, and delete the VM.
 * Clean up all resources.

4. Comprehensive health test
 * Create a VM on each compute node and ping the gateway.
 * Determine the max MTU and check jumbo packets (optional).
 * Check security groups (ping, ssh, and http traffic).

5. Upgrade test
 * Create or snapshot the state of existing OpenStack resources such as tenants/routers/VMs/load balancers.
 * After the upgrade, check that the created/snapshotted resources are in an operational state.
 * Check security groups after the upgrade (ping, ssh, and http).
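
The upgrade test's snapshot-and-compare idea can be illustrated with a small Python sketch. It assumes each resource is represented as a dict with `id` and `status` keys and that `ACTIVE` means operational; the function names `snapshot` and `find_regressions` are hypothetical and do not come from CloudPulse.

```python
def snapshot(resources):
    """Record resource id -> status for a list of resource dicts."""
    return {r["id"]: r["status"] for r in resources}


def find_regressions(before, after, healthy=("ACTIVE",)):
    """Return (id, problem) pairs for resources that were healthy before
    the upgrade but are missing or not healthy afterwards."""
    problems = []
    for rid, old_status in sorted(before.items()):
        if old_status not in healthy:
            continue  # already unhealthy before the upgrade, not a regression
        new_status = after.get(rid)
        if new_status is None:
            problems.append((rid, "missing"))
        elif new_status not in healthy:
            problems.append((rid, new_status))
    return problems


if __name__ == "__main__":
    # Made-up resource data for illustration only.
    before = snapshot([
        {"id": "vm-1", "status": "ACTIVE"},
        {"id": "vm-2", "status": "ACTIVE"},
        {"id": "router-1", "status": "ACTIVE"},
    ])
    after = snapshot([
        {"id": "vm-1", "status": "ACTIVE"},
        {"id": "vm-2", "status": "ERROR"},
    ])
    print(find_regressions(before, after))
    # prints [('router-1', 'missing'), ('vm-2', 'ERROR')]
```

In practice the resource lists would come from the OpenStack APIs before and after the upgrade; the comparison logic stays the same.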

Application health tests
Applications can make use of the endpoint, comprehensive, functional, and upgrade checks. An application can snapshot its resources before an upgrade and then check their state after the upgrade. CloudPulse itself can be run as a tenant VM, which can then provide REST API access to other NFV, VNFM, and NFVO applications.

Operator health tests
Operators can install CloudPulse on one of the controllers, either directly or in a Docker container. They should be able to run all of the health checks listed above.

Extensions
CloudPulse is extensible: both operator tests and API tests can be added to CloudPulse as pluggable modules. Some of the extensions that are of interest at this time are Nagios/Ganglia for operators and NFVM-specific tests for applications.
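
To make the idea of pluggable tests concrete, here is a generic Python sketch of a check registry. It is not CloudPulse's actual extension mechanism (see the developer pages for that); `REGISTRY`, `health_check`, and `run_all` are hypothetical names chosen for this illustration.

```python
REGISTRY = {}


def health_check(name):
    """Decorator that registers a function as a named health check."""
    def decorator(func):
        REGISTRY[name] = func
        return func
    return decorator


@health_check("always_ok")
def always_ok():
    # A trivial placeholder check; a real one would query a service.
    return True, "ok"


def run_all():
    """Run every registered check and collect (passed, detail) per name."""
    results = {}
    for name, func in sorted(REGISTRY.items()):
        try:
            ok, detail = func()
        except Exception as exc:
            # A crashing check is reported as a failure, not propagated.
            ok, detail = False, f"raised {exc!r}"
        results[name] = (ok, detail)
    return results


if __name__ == "__main__":
    for name, (ok, detail) in run_all().items():
        print(f"{name}: {'PASS' if ok else 'FAIL'} ({detail})")
```

With this shape, a new operator or application test is added by defining one more decorated function, without touching the runner.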

CloudPulse command-line client
https://wiki.openstack.org/wiki/Cloudpulseclient

Developer Pages

 * https://wiki.openstack.org/wiki/Cloudpulse/DeveloperNotes
 * https://wiki.openstack.org/wiki/Cloudpulse/OperatorTests
 * https://wiki.openstack.org/wiki/Cloudpulse/APIDocs