Difference between revisions of "QoS"

Latest revision as of 08:50, 17 January 2014

Introduction

Quality of Service (QoS) describes the overall performance of a telephony or computer network, particularly the performance seen by the users of the network. As OpenStack has developed, many services have been split out and now work together to provide infrastructure as a service (IaaS). Since OpenStack is delivered as a service, it is time to think about the quality of that service, i.e., OpenStack QoS.

OpenStack provides IaaS to users, which raises the questions of how to measure the quality of the service, how to ensure the quality of each component, and so on. As with network QoS, measuring quality of service quantitatively involves the usual aspects: error rates (or success rates), bandwidth, throughput, transmission delay, availability, jitter, and so on.

Problem Statement

In OpenStack, three major components provide the services that make up IaaS: compute, network and storage. OpenStack also has two kinds of users: cloud admins and cloud users. Each kind of user expects something different.

Cloud users expect their applications or services to run smoothly and safely. For instance, they expect to access a VM with a remote desktop tool such as VNC at an acceptable speed, which requires good network bandwidth to the VM and a short transmission delay. If they expect the services in the VM to run at an acceptable speed, CPU resources must be allocated reasonably. If any of the applications or services runs on a database, reads and writes to storage may need to be fast.

Cloud admins rarely consider the quality of a specific VM; they care about the cloud as a whole. For instance, they might consider how fast the Horizon UI is, the throughput of RabbitMQ, the success ratio of the nova scheduler, whether a host runs so many VMs that they significantly affect one another, the overall power consumption of the cloud, and so on.

The First Step: Monitor

Monitor

Compute

The following aspects fall under compute:

  1. CPU
    • The number of sockets, cores and threads per host
    • The number in use per host (taking CPU pinning into account)
    • The frequency of each CPU
    • The number of all vcpus and the vcpus per VM
    • The CPU error rate or error count reported by RAS
  2. CPU Cache
    • The total CPU cache size per host
    • The total cache size in use per host
    • The CPU cache size in use per VM
  3. CPU Utilization
    • The overall CPU utilization per host
    • (Per-VM utilization is hard to measure; can it be done?)
  4. Power Consumption
    • The power consumption on a host
    • (The power consumed by a single VM is hard to measure.)
  5. Memory
    • The memory error rate reported by RAS
    • TBD

Network

For network, the following aspects should be considered:

  1. Bandwidth
    • The theoretical bandwidth per host
    • The number of NICs per host
    • The number of NICs in use per host
    • The number of all VFs and the VFs per VM
  2. Rx/Tx packets and bytes in a period of time
  3. Transmission delay
  4. Error rate
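
As a rough illustration of items 2 and 4, the sketch below derives per-second Rx/Tx rates and an error rate from two readings of cumulative NIC counters. The `NicSample` structure and the definition of error rate as errors per transferred packet are assumptions made for this example, not something the proposal specifies.

```python
from dataclasses import dataclass


@dataclass
class NicSample:
    """One reading of cumulative NIC counters (hypothetical structure)."""
    timestamp: float  # seconds
    rx_packets: int
    tx_packets: int
    rx_bytes: int
    tx_bytes: int
    rx_errors: int
    tx_errors: int


def nic_rates(prev: NicSample, curr: NicSample) -> dict:
    """Derive per-second rates and an error rate from two counter samples."""
    elapsed = curr.timestamp - prev.timestamp
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")

    rx_packets = curr.rx_packets - prev.rx_packets
    tx_packets = curr.tx_packets - prev.tx_packets
    errors = (curr.rx_errors - prev.rx_errors) + (curr.tx_errors - prev.tx_errors)
    packets = rx_packets + tx_packets

    return {
        "rx_packets_per_s": rx_packets / elapsed,
        "tx_packets_per_s": tx_packets / elapsed,
        "rx_bytes_per_s": (curr.rx_bytes - prev.rx_bytes) / elapsed,
        "tx_bytes_per_s": (curr.tx_bytes - prev.tx_bytes) / elapsed,
        # Assumed definition: errors per transferred packet in the period.
        "error_rate": errors / packets if packets else 0.0,
    }


prev = NicSample(0.0, 100, 50, 1000, 500, 0, 0)
curr = NicSample(10.0, 1100, 550, 11000, 5500, 3, 0)
rates = nic_rates(prev, curr)
```

Working from counter differences rather than instantaneous values keeps the calculation independent of how often the counters are polled.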

Storage

For storage, the following aspects can be measured:

  1. Success Ratio (%)
    • Ratio of successful operations
    • Successful requests/total number of requests
  2. Throughput (Op/s)
    • Operations completed in a second
    • Successful requests/total run time
  3. Bandwidth (MiB/s)
    • Total data transferred in a second
    • Total bytes transferred/total run time
  4. Response time (ms)
    • Duration between operation initiation and completion
    • Average of response times for each successful request
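
The four storage metrics above can be computed from a list of request records. The following is a minimal sketch; the `Request` record and the choice to count bytes from all requests (not only successful ones) toward bandwidth are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Outcome of one storage request (hypothetical record)."""
    success: bool
    bytes_transferred: int
    response_ms: float


def storage_metrics(requests, run_time_s):
    """Compute success ratio, throughput, bandwidth and mean response time."""
    if run_time_s <= 0:
        raise ValueError("run_time_s must be positive")

    ok = [r for r in requests if r.success]
    total_bytes = sum(r.bytes_transferred for r in requests)

    return {
        # Successful requests / total number of requests, as a percentage
        "success_ratio_pct": 100.0 * len(ok) / len(requests) if requests else 0.0,
        # Successful requests / total run time
        "throughput_ops": len(ok) / run_time_s,
        # Total bytes transferred / total run time, in MiB/s
        "bandwidth_mib_s": total_bytes / (1024 * 1024) / run_time_s,
        # Average of response times over successful requests only
        "avg_response_ms": sum(r.response_ms for r in ok) / len(ok) if ok else 0.0,
    }


sample = [
    Request(True, 1048576, 10.0),
    Request(True, 1048576, 30.0),
    Request(False, 0, 0.0),
    Request(True, 2097152, 20.0),
]
metrics = storage_metrics(sample, run_time_s=2.0)
```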

The Second Step: Quality Assurance

Resource Reservation

To ensure service quality, we can add more resources.

Compute

  1. CPU
    • Pin some physical CPUs to some specific critical VMs
    • Allocate a number of vcpus, reserve some of them, and assign the reserved ones to critical VMs when needed
    • CPU hotplug
    • vcpu hotplug
  2. Memory
    • Memory hotplug
    • Use balloon driver
  3. Generic
    • Live migration (incl. resize): live-migrate the VMs to other hosts that have more resources. (The policy for choosing the best host for the VM still needs to be worked out.)
      • UBS: When CPU utilization on a host exceeds a threshold, consider live-migrating some VMs
      • UBS: When power consumption on a host exceeds a threshold, consider live-migrating some VMs
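
The threshold-driven migration idea could look roughly like the sketch below. The function names, the default thresholds, and the "least-utilized host" target policy are all placeholders chosen for this example; the proposal leaves the host selection policy open.

```python
def hosts_needing_migration(cpu_util_pct, power_watts,
                            cpu_threshold=80.0, power_threshold=400.0):
    """Return hosts whose CPU utilization or power draw exceeds its threshold."""
    flagged = []
    for host in sorted(cpu_util_pct):
        over_cpu = cpu_util_pct[host] > cpu_threshold
        over_power = power_watts.get(host, 0.0) > power_threshold
        if over_cpu or over_power:
            flagged.append(host)
    return flagged


def pick_target(cpu_util_pct, source, cpu_threshold=80.0):
    """Pick the least-utilized host other than `source` that is below the
    threshold, or None if there is no such host (placeholder policy)."""
    candidates = {
        host: util
        for host, util in cpu_util_pct.items()
        if host != source and util < cpu_threshold
    }
    if not candidates:
        return None
    return min(candidates, key=candidates.get)


util = {"a": 95.0, "b": 30.0, "c": 60.0}
power = {"a": 300.0, "b": 450.0, "c": 200.0}
flagged = hosts_needing_migration(util, power)
target = pick_target(util, "a")
```

In this toy data, host `b` is flagged for power yet also chosen as a CPU target, which shows why a real policy would need to weigh several metrics together.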

Network

    • Increasing the bandwidth by ???
    • Live migration

Storage

    • Reducing response time by "smarter resource placement" (http://summit.openstack.org/cfp/details/33)
    • Live migration (incl. resize)

Resource Limitation

Compute

  1. CPU
    • Limit the total number of vcpus allocated
    • Limit the maximum number of VMs running on a host according to its physical sockets/cores/threads/frequency/cache size
    • Limit the number of vcpus assigned to each VM
    • Limit the CPU cache size for a VM
    • UBS - When CPU utilization on a host exceeds a threshold, stop scheduling new VMs onto the host
    • UBS - When power consumption on a host exceeds a threshold, stop scheduling new VMs onto the host
  2. Memory
    • Limit memory *** for each VM
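
One way to picture the compute limits is as an admission check run before a VM is placed on a host. This is only a sketch: `HostState`, `HostLimits` and `can_schedule` are hypothetical names, not part of nova, and the memory limit is omitted because the proposal does not yet define it.

```python
from dataclasses import dataclass


@dataclass
class HostState:
    """Current usage on one host (hypothetical structure)."""
    vcpus_allocated: int
    running_vms: int
    cpu_util_pct: float
    power_watts: float


@dataclass
class HostLimits:
    """Per-host limits an admin might configure (hypothetical structure)."""
    max_vcpus_allocated: int
    max_vms: int
    max_vcpus_per_vm: int
    max_cpu_util_pct: float
    max_power_watts: float


def can_schedule(host, limits, requested_vcpus):
    """Return (allowed, reason) for placing a new VM on the host."""
    if requested_vcpus > limits.max_vcpus_per_vm:
        return False, "too many vcpus requested for one VM"
    if host.vcpus_allocated + requested_vcpus > limits.max_vcpus_allocated:
        return False, "total vcpu limit reached"
    if host.running_vms + 1 > limits.max_vms:
        return False, "VM count limit reached"
    # UBS-style checks: refuse new VMs on hosts that are already too busy.
    if host.cpu_util_pct > limits.max_cpu_util_pct:
        return False, "CPU utilization above threshold"
    if host.power_watts > limits.max_power_watts:
        return False, "power consumption above threshold"
    return True, "ok"


limits = HostLimits(max_vcpus_allocated=64, max_vms=20, max_vcpus_per_vm=8,
                    max_cpu_util_pct=80.0, max_power_watts=400.0)
decision = can_schedule(HostState(32, 10, 50.0, 300.0), limits, requested_vcpus=4)
```

Returning a reason alongside the decision makes it easier to report why a host was rejected.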

Network

    •  ??? How to limit network bandwidth for a VM?

Storage

    • Quota

Alarm

When anything goes wrong, e.g. the temperature on a chipset exceeds a threshold or the error ratio exceeds an acceptable value, the alarm system reuses Ceilometer to notify admins or users so that they can take action.
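
The threshold check behind such an alarm might be sketched as follows. This example does not call Ceilometer; the metric names and the `evaluate_alarms` function are made up for illustration, and the notification step is left out.

```python
def evaluate_alarms(readings, thresholds):
    """Return one message per metric whose reading exceeds its threshold."""
    alarms = []
    for metric, limit in sorted(thresholds.items()):
        value = readings.get(metric)
        # Metrics with no reading are skipped rather than treated as failures.
        if value is not None and value > limit:
            alarms.append(f"{metric}: {value} exceeds threshold {limit}")
    return alarms


readings = {"chipset_temp_c": 92.0, "error_ratio": 0.001}
thresholds = {"chipset_temp_c": 85.0, "error_ratio": 0.01}
alarms = evaluate_alarms(readings, thresholds)
```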

Contact

If you're interested in the OpenStack QoS project, please email shane.wang <at> intel.com