QoS

Introduction
Quality of Service (QoS) describes the overall performance of a telephony or computer network, particularly the performance seen by the users of the network. As OpenStack has developed, many services have been split out and now work together to provide infrastructure as a service (IaaS). Because OpenStack is itself a service, it is time to think about the quality of that service, i.e., OpenStack QoS.

OpenStack provides IaaS to users, which raises the questions of how to measure the quality of the service, how to ensure the quality of each component, and so on. To measure quality of service quantitatively, the usual aspects are considered, such as error rates (or success rates), bandwidth, throughput, transmission delay, availability, jitter, and so on.

Problem Statement
In OpenStack, three major components provide the different services that make up IaaS: compute, network and storage. OpenStack also has two kinds of users: cloud admins and cloud users. Each kind of user has different expectations.

Cloud users expect their applications or services to run smoothly and safely. For instance, they expect to access a VM through a remote desktop tool such as VNC at an acceptable speed, which requires the VM to have good network bandwidth and a short transmission delay. If they expect the services in the VM to run at an acceptable speed, CPU resources must be allocated reasonably. If any of the applications or services runs on a database, reading from and writing to storage may need to be fast.

Cloud admins rarely consider the quality of a specific VM; they care about the cloud as a whole. For instance, they might consider how fast the Horizon UI is, the throughput of RabbitMQ, the success ratio of the nova scheduler, whether the number of VMs on a host is so large that it significantly affects other VMs, the overall power consumption of the cloud, and so on.

Compute
The following aspects fall under compute:
 * CPU
 ** The number of sockets, cores and threads per host
 ** The number in use per host (taking pinning into account)
 ** The frequency of each CPU
 ** The total number of vcpus and the number of vcpus per VM
 ** The CPU error rate or error count reported by RAS
 * CPU Cache
 ** The total CPU cache size per host
 ** The total cache size in use per host
 ** The CPU cache size in use per VM
 * CPU Utilization
 ** The overall CPU utilization per host
 ** (It is hard to tell how much utilization is attributable to a single VM. Can we?)
 * Power Consumption
 ** The power consumption of a host
 ** (It is hard to tell how much power a single VM consumes!)
 * Memory
 ** The memory error rate reported by RAS
 ** TBD

Network
For network, the following aspects should be considered:
 * Bandwidth
 ** The theoretical bandwidth per host
 ** The number of NICs per host
 ** The number of NICs in use per host
 ** The total number of VFs and the number of VFs per VM
 * Rx/Tx packets and bytes over a period of time
 * Transmission delay
 * Error rate
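
As an illustration of the "Rx/Tx packets and bytes over a period of time" and "error rate" items, the sketch below derives per-second rates from two snapshots of cumulative NIC counters. The `NicSample` type, its field names, and the definition of error rate (errored packets divided by all received packets, good plus errored) are assumptions made for this example; they are not taken from any OpenStack API.

```python
from dataclasses import dataclass


@dataclass
class NicSample:
    """Hypothetical snapshot of cumulative NIC counters."""
    timestamp: float   # seconds
    rx_bytes: int
    tx_bytes: int
    rx_packets: int    # assumed to count good packets only
    tx_packets: int
    rx_errors: int     # assumed to count errored packets only


def nic_rates(prev: NicSample, curr: NicSample) -> dict:
    """Turn two counter snapshots into per-second rates and an error rate."""
    elapsed = curr.timestamp - prev.timestamp
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")

    rx_packets = curr.rx_packets - prev.rx_packets
    rx_errors = curr.rx_errors - prev.rx_errors
    received = rx_packets + rx_errors

    return {
        "rx_bytes_per_s": (curr.rx_bytes - prev.rx_bytes) / elapsed,
        "tx_bytes_per_s": (curr.tx_bytes - prev.tx_bytes) / elapsed,
        "rx_packets_per_s": rx_packets / elapsed,
        "tx_packets_per_s": (curr.tx_packets - prev.tx_packets) / elapsed,
        # Assumed definition: share of received packets that were errored.
        "rx_error_rate": rx_errors / received if received else 0.0,
    }
```

Comparing `rx_bytes_per_s` with the theoretical bandwidth of the host would give a rough utilization figure, although counter wrap-around and resets are not handled here.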

Storage
For storage, the following aspects can be measured:
 * Success Ratio (%)
 ** Ratio of successful operations
 ** Successful requests / total number of requests
 * Throughput (Op/s)
 ** Operations completed in a second
 ** Successful requests / total run time
 * Bandwidth (MiB/s)
 ** Total data transferred in a second
 ** Total bytes transferred / total run time
 * Response time (ms)
 ** Duration between operation initiation and completion
 ** Average of response times for each successful request
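
The four storage formulas above can be sketched in a few lines of Python. The `Request` record and its field names are hypothetical. The sketch also assumes that "total bytes transferred" sums the bytes recorded for every request, failed ones included; restricting it to successful requests would be an equally plausible reading.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Request:
    """Hypothetical record of one storage operation."""
    success: bool
    bytes_transferred: int
    response_time_ms: float


def storage_metrics(requests: List[Request], run_time_s: float) -> dict:
    """Compute success ratio, throughput, bandwidth and response time."""
    if run_time_s <= 0:
        raise ValueError("run_time_s must be positive")

    successful = [r for r in requests if r.success]
    total = len(requests)
    total_bytes = sum(r.bytes_transferred for r in requests)

    avg_response_ms: Optional[float] = None
    if successful:
        avg_response_ms = (
            sum(r.response_time_ms for r in successful) / len(successful)
        )

    return {
        # Successful requests / total number of requests, as a percentage.
        "success_ratio_pct": 100.0 * len(successful) / total if total else 0.0,
        # Successful requests / total run time.
        "throughput_ops": len(successful) / run_time_s,
        # Total bytes transferred / total run time, expressed in MiB/s.
        "bandwidth_mib_s": total_bytes / (1024 * 1024) / run_time_s,
        # Average response time over successful requests only.
        "avg_response_ms": avg_response_ms,
    }
```

Returning `None` for the average when no request succeeded avoids reporting a misleading response time of zero.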

Resource Reservation
In order to ensure the service quality, we can add more resources.

Compute

 * CPU
 ** Pin some physical CPUs to specific critical VMs
 ** Allocate a number of vcpus, keep some of them in reserve, and assign them to critical VMs when needed
 ** CPU hotplug
 ** vcpu hotplug
 * Memory
 ** Memory hotplug
 ** Use the balloon driver
 * Generic
 ** Live migration (incl. resize): live-migrate VMs to other hosts that have more resources. (We need to think about the policy for finding the best host for the VM.)
 *** UBS: When CPU utilization on a host exceeds a threshold, consider live-migrating some VMs
 *** UBS: When power consumption on a host exceeds a threshold, consider live-migrating some VMs
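
One possible answer to the open question about destination policy is sketched below: hosts whose CPU utilization exceeds a threshold become migration sources, and the least-utilized host under the threshold is proposed as the destination. The function name, the input shape, and the "least-loaded host wins" policy are all assumptions for illustration; this is not how Nova selects hosts.

```python
from typing import Dict, List, Tuple


def plan_migrations(
    host_cpu_util: Dict[str, float], threshold: float
) -> List[Tuple[str, str]]:
    """Propose (source, destination) host pairs for live migration.

    host_cpu_util maps a host name to its CPU utilization in percent.
    """
    overloaded = sorted(
        host for host, util in host_cpu_util.items() if util > threshold
    )
    candidates = {
        host: util for host, util in host_cpu_util.items() if util <= threshold
    }

    plan = []
    for source in overloaded:
        if not candidates:
            break  # nowhere to migrate to
        # Assumed policy: the least-utilized eligible host is the best target.
        destination = min(candidates, key=candidates.get)
        plan.append((source, destination))
    return plan
```

The sketch does not update the destination's projected load after each proposal, so a real policy would need to account for the utilization that migrated VMs bring with them. The same structure would apply to the power-consumption trigger by passing power readings instead of utilization.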

Network

 * Increasing the bandwidth by ???
 * Live migration

Storage

 * Improving response time through "smarter resource placement" (http://summit.openstack.org/cfp/details/33)
 * Live migration (incl. resize)

Compute

 * CPU
 ** Limit the total number of vcpus allocated
 ** Limit the maximum number of VMs running on a host according to its physical sockets/cores/threads/frequency/cache size
 ** Limit the number of vcpus assigned to each VM
 ** Limit the CPU cache size for a VM
 ** UBS: When CPU utilization on a host exceeds a threshold, stop scheduling new VMs onto the host
 ** UBS: When power consumption on a host exceeds a threshold, stop scheduling new VMs onto the host
 * Memory
 ** Limit memory *** for each VM
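
The two UBS rules in this list amount to a host filter applied before placement. A minimal sketch, assuming per-host utilization and power readings are already available, is shown below. `HostState`, `schedulable_hosts`, and the threshold parameters are hypothetical names and do not correspond to the Nova scheduler filter interface.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HostState:
    """Hypothetical per-host readings used for the scheduling decision."""
    name: str
    cpu_util_pct: float
    power_watts: float


def schedulable_hosts(
    hosts: List[HostState], max_cpu_util_pct: float, max_power_watts: float
) -> List[str]:
    """Return the names of hosts that may still receive new VMs."""
    return [
        host.name
        for host in hosts
        # A host is excluded as soon as either reading exceeds its limit.
        if host.cpu_util_pct <= max_cpu_util_pct
        and host.power_watts <= max_power_watts
    ]
```

Keeping the thresholds as parameters rather than constants leaves room for the admin to tune them per deployment.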

Network

 * ??? How to limit network bandwidth for a VM?

Storage

 * Quota

Alarm
When anything goes wrong, e.g. the temperature of a chipset exceeds a threshold or the error ratio exceeds an acceptable value, the alarm system reuses Ceilometer to notify admins or users so that they can take action.

Contact
If you're interested in the OpenStack QoS project, please email shane.wang at intel.com