Documentation/HypervisorTuningGuide
Contents
- 1 About the Hypervisor Tuning Guide
- 2 Understanding Your Workload
- 3 CPU
- 4 Memory
- 5 Network
- 6 Disk
- 7 References
About the Hypervisor Tuning Guide
The goal of the Hypervisor Tuning Guide (HTG) is to provide cloud operators with detailed instructions and settings to get the best performance out of their hypervisors.
This guide is broken into four major sections:
- CPU
- Memory
- Network
- Disk
Each section has tuning information for the following areas:
- Symptoms of being (CPU, Memory, Network, Disk) bound
- General hardware recommendations
- Operating System configuration
- Hypervisor configuration
- OpenStack configuration
- Instance and Image configuration
- Validation, benchmarking, and reporting
How to Contribute
Simply add your knowledge to this wiki page! The HTG does not yet have a formal documentation repository; it's still very much in its initial stages.
Understanding Your Workload
I imagine this section will be the most theoretical / high-level part of the entire guide.
References
CPU
Introduction about CPU.
Symptoms of Being CPU Bound
Compute Nodes that are CPU bound will generally see CPU usage at 80% or higher and idle usage less than 20%.
CPU Usage
On Linux-based systems, you can see CPU usage using the vmstat
tool. As an example:
# vmstat
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0  35576 312096  95752 1104936    0    0     0     1    0    0  0  0 100  0  0
For CPU, there are five key areas to look at:
- r under procs: The number of running processes. If this number is consistently higher than the number of cores or CPUs on your compute node, then there are consistently more jobs being run than your node can handle.
- us under cpu: This is the amount of CPU spent running non-kernel code.
- sy under cpu: This is the amount of CPU spent running kernel code.
- id under cpu: This is the amount of idle CPU.
- st under cpu: This is the amount of stolen CPU. If an instance consistently sees a high st value, then the compute node hosting it might be under a lot of stress.
Load
Linux-based systems have an abstract concept of "Load". A high load means the system is running "hot", while a low load means it's relatively idle. But what numbers constitute hot and cold? That varies from system to system. A general rule of thumb is that the system load will equal 1 when a core or CPU is consistently processing a job. Therefore, a normal load is equal to or less than the number of cores / CPUs.
However, extremely high loads (100+) are usually an indication of IO problems rather than CPU problems.
Load should not be used as the sole metric when diagnosing potential Compute Node problems. It's best to use Load as an indication to check further areas.
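As a rough first check, the 1-minute load average can be compared against the number of online cores. This is only a sketch of the rule of thumb above; sensible thresholds vary by workload:

```shell
#!/bin/sh
# Compare the 1-minute load average against the number of online CPUs.
cores=$(nproc)
load1=$(cut -d ' ' -f 1 /proc/loadavg)
echo "1-min load: ${load1}, cores: ${cores}"
# A load persistently above the core count suggests CPU (or IO) pressure.
if awk -v l="$load1" -v c="$cores" 'BEGIN { exit !(l > c) }'; then
    echo "load exceeds core count - investigate further"
else
    echo "load is within core count"
fi
```

Remember that a single sample proves nothing; it's sustained load above the core count that warrants a closer look.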
General Hardware Recommendations
Simultaneous Multithreading
Simultaneous multithreading (SMT), commonly known as Hyper-threading in Intel CPUs, is a technology that enables the Operating System to see a single core / CPU as two cores / CPUs. This feature can usually be enabled or disabled in the Compute Node's BIOS.
It's important to understand that SMT will not make individual jobs run faster. Rather, it allows two jobs to run simultaneously where only one job would have run before. Thus, in some cases, SMT can increase the number of jobs completed in a given time span compared to having it turned off. CERN has seen a throughput increase of 20% with SMT enabled ([1]).
The following guidelines are known for specific use cases:
- Enable it for general purpose workloads
- Disable it for virtual router applications
Notable CPU Flags
The following CPU flags have special relevance to virtualization. On Linux-based systems, you can see what flags are enabled and functional by doing:
$ cat /proc/cpuinfo
- vmx (Intel) and svm (AMD): Hardware virtualization support
- Add something here about nested virtualization support. I think there are requirements from both the CPU and hypervisor?
- avx: Advanced Vector Extensions
- sse4: Streaming SIMD Extensions 4
- aes-ni: Advanced Encryption Standard New Instructions
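As a quick check, the flags line can be filtered for the entries above. A sketch; note that /proc/cpuinfo spells these as vmx/svm, sse4_1/sse4_2, and aes:

```shell
#!/bin/sh
# Show which virtualization-relevant flags this host's CPUs expose.
# /proc/cpuinfo lists flags as: vmx or svm, avx, sse4_1/sse4_2, aes.
grep -m1 '^flags' /proc/cpuinfo \
    | tr ' ' '\n' \
    | grep -E '^(vmx|svm|avx|sse4_1|sse4_2|aes)$' \
    | sort -u
```

Empty output means none of these flags are exposed (or that virtualization extensions are disabled in the BIOS).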
TBD
- thread policies can also be important (prefer/avoid) - hopefully a Mitaka enhancement
- NUMA?
- CPU pinning
Operating System Configuration
Linux
- exclude cores, dedicate cores / cpus specifically for certain OS tasks
- iso cpu
- see rh blog post below
- reasonable increase in performance by compiling own kernels
- turn off cpu scaling - run at full frequency
- TSC
- x86 Specific - different architectures have different timekeeping mechanisms
- can be virtualised or not
- can run one core slower than another
- clock source
- avoiding jitter - eg asterisk, telephony
- time stamp counter, is it a tec vs hpet thing?
Windows
- virtio drivers
Hypervisor Configuration
KVM / libvirt
Xen
VMWare
Hyper-V
- NUMA spanning is enabled by default and should be disabled for performance; caveat: changing it requires restarting instances
OpenStack Configuration
CPU Mode and Model
The two most notable CPU-related configuration options in Nova are:
- cpu_mode
- cpu_model
Both of these options are described in detail in the config reference. Additionally, CERN's experience with benchmarking cpu_mode can be found here.
Overcommitting
You can configure Nova to report that the Compute Node has more CPUs than it really does by altering the cpu_allocation_ratio setting on each Compute Node. This setting can take either a whole or a fractional number. For example:
- cpu_allocation_ratio=16.0: Configures Nova to report 16 times the number of CPUs that the Compute Node really has. This is the default.
- cpu_allocation_ratio=1.5: Configures Nova to report 1.5 times the number of CPUs.
- cpu_allocation_ratio=1.0: Effectively disables CPU overcommitting.
Generally, it's safe to overcommit CPUs. It has been reported that the main reason not to overcommit CPU is to match a policy of not overcommitting memory (which is explained in the Memory section of this guide).
Note: You must also make sure scheduler_default_filters contains CoreFilter in order for cpu_allocation_ratio to take effect.
- RAM overcommitting, particularly with KSM, has a CPU cost as well
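A minimal nova.conf sketch combining the settings above. The ratio and the filter list are illustrative values, not recommendations:

```ini
[DEFAULT]
# Report 1.5x the physical CPU count to the scheduler (illustrative value).
cpu_allocation_ratio = 1.5
# CoreFilter must be present for cpu_allocation_ratio to be enforced.
scheduler_default_filters = RetryFilter,AvailabilityZoneFilter,RamFilter,ComputeFilter,CoreFilter
```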
Instance and Image Configuration
- Describe scenarios where the instance sees a CPU flag but cannot use it.
- CPU quotas and shares
- Reported use-case: a default of 80% on all flavors; don't do this if workloads are very CPU heavy.
- guest kernel I/O scheduler set to "noop" (elevator=noop on the kernel command line)
- benefit: the host and guest I/O schedulers don't fight each other
- Hyper-v enlightenment features
- Hyper-V gen 2 VMs are reported to be faster than gen 1; the reason is unclear
Validation, Benchmarking, and Reporting
General Tools
- top
- vmstat
- htop
Benchmarking Tools
- phoronix
- Benchmark suites like HEP-SPEC06, used in High Energy Physics for HTC worker nodes, give a mark in HS06 (http://w3.hepix.org/benchmarks/doku.php/)
- Depends on your workload. Test using the dominant workload that is going to be run on your cloud.
- For Java-related workloads, DaCapo is a pretty good benchmark - http://www.dacapobench.org
- What about full system simulations? ie: Deploy an entire Hadoop cluster and have it create large simulated loads?
- This has been done for Hadoop with 10+ nodes
- This is a good idea, TeraSort type benchmarks are really useful.
- https://github.com/ibmcb/cbtool
- http://www.phoronix-test-suite.com/
- https://github.com/GoogleCloudPlatform/PerfKitBenchmarker
Metrics
System
- CPU: user, system, iowait, irq, soft irq
Instance
- nova diagnostics
- Do not record per-process stats - explain why
- overlaying cputime vs allocated cpu
Memory
Symptoms of Being Memory Bound
In general, the free command can be used to determine the amount of memory used and available. Linux usually reports much more memory being used than is actually the case. This site offers good information about reading how much memory is really available.
Another symptom of being memory bound is running out of swap space. The free command also reports swap usage.
OOM Killer
The Out of Memory Killer is a kernel feature that will reap processes when the system is truly out of memory. You can determine if processes are being reaped by looking for the following in your logs:
Out of memory: Kill process
More information about the OOM Killer can be found here.
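A quick way to check for reaped processes is to search the kernel ring buffer. A sketch; log locations and retention vary by distribution, and older events may only survive in the persistent logs:

```shell
#!/bin/sh
# Look for OOM killer activity in the kernel ring buffer.
# The fallback message (no matches) is the healthy case.
dmesg 2>/dev/null | grep -iE 'out of memory|oom-killer' \
    || echo "no OOM events found in the ring buffer"
```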
General Hardware Recommendations
NUMA Balancing
It's recommended to ensure that each NUMA node has the same amount of memory. If you plan to upgrade the amount of memory in a compute node, ensure the amount is balanced on each node. For example, do not upgrade one node by 16GB and another by 8GB.
Memory Speeds
Memory speeds are known to vary by chip. If possible, ensure all memory in a system is the same brand and type.
Operating System Configuration
Linux
Kernel Tunables
- Transparent Hugepage Support: can go either way depending on workload
- Memory overcommit
- KSM enables identical memory pages to be combined. This is a form of memory deduplication.
Windows
Hypervisor Configuration
KVM / libvirt
libvirt/KVM has memory ballooning support, though Nova does not take advantage of it.
libvirt/KVM also has support for Extended Page Table. Consider enabling or disabling it depending on your workload. For example, having EPT enabled has been seen to impact performance on High Energy Physics applications.
Xen
VMWare
Hyper-V
OpenStack Configuration
You can configure the amount of memory reserved for the compute node (meaning, instances will not have access to it) by setting the reserved_host_memory_mb option in nova.conf. The default is 512MB, which has been reported to be too low for real-world use.
Overcommitting
You can configure Nova to overcommit the available amount of memory with the ram_allocation_ratio setting in nova.conf. By default, this is set to 1.5, meaning Nova will report 1.5x more memory than the node really has.
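A minimal nova.conf sketch for the two memory settings above. The reservation value is illustrative, not a recommendation:

```ini
[DEFAULT]
# Keep 4GB of host memory away from instances (the 512MB default is
# reportedly too low for real-world use; 4096 is an illustrative value).
reserved_host_memory_mb = 4096
# Report 1.5x the physical RAM to the scheduler (the default ratio).
ram_allocation_ratio = 1.5
```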
Instance and Image Configuration
At this time, guests cannot see the speed of the memory.
Validation, Benchmarking, and Reporting
General Tools
Benchmarking
Metrics
System
sar can provide the following metrics:
- page in
- page out
- page scans
- page faults
free can provide the amount of available memory over time.
vmstat can also provide general memory information.
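The same numbers can also be read directly from /proc/meminfo, which is what free itself parses. MemAvailable is the kernel's own estimate of memory allocatable without swapping:

```shell
#!/bin/sh
# Print total and available memory in MB from /proc/meminfo.
awk '/^MemTotal:/     { printf "total:     %d MB\n", $2/1024 }
     /^MemAvailable:/ { printf "available: %d MB\n", $2/1024 }' /proc/meminfo
```

Sampling this over time (e.g. from cron) gives the same trend data as free without needing procps installed in minimal environments.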
Instance
The nova diagnostics command can be used to display memory usage of individual instances. Keep in mind, though, that since OpenStack cannot "deflate" the virtio memory balloon in libvirt/KVM environments, reported memory will always appear to increase until max capacity is reached.
virsh dominfo can also be used to view memory usage in libvirt/KVM environments.
Network
Symptoms of Being Network Bound
Network-bound compute nodes will see symptoms like the following:
- On the guest, the softirq metric will be high. softirq can be seen in the 7th column of the cat /proc/stat output.
- If your instances' ephemeral disks are stored on a network storage device, you will see a high amount of "IO Wait" time.
- You might see discards on your network switches.
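The softirq share mentioned above can be computed from the aggregate cpu line of /proc/stat (softirq is the 7th value after the "cpu" label, i.e. awk field $8):

```shell
#!/bin/sh
# Compute the softirq share of total CPU time from the aggregate cpu line.
# Fields after the "cpu" label: user nice system idle iowait irq softirq ...
awk '/^cpu / {
    total = 0
    for (i = 2; i <= NF; i++) total += $i
    printf "softirq: %.2f%% of CPU time\n", 100 * $8 / total
}' /proc/stat
```

Note this is a since-boot average; sample it twice and diff the counters to get a current rate.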
General Hardware Recommendations
10Gb NICs are recommended over 1Gb NICs.
It's generally recommended to use some type of NIC bonding on your compute nodes. LACP is the most common form of bonding, though be aware that it requires configuration on both the Linux side and the upstream network side.
(todo: balance-tlb and balance-alb?)
Modern NICs have features such as VXLAN offloading which should decrease the amount of work required on the compute node itself.
Operating System Configuration
Linux
CloudFlare has an article on network tuning within Linux. (todo: vet the article, add more references).
Disabling GRO might help increase performance. (todo: add reference articles)
Check the NUMA locality of SR-IOV (and PCI passthrough) devices. (You largely get this for free if you are using NUMATopologyFilter and have a chipset that reports locality.)
Jumbo frames (9000 MTU) might also provide a performance benefit. It might also be required depending on your network topology and configuration.
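Current MTUs can be inspected from sysfs before and after enabling jumbo frames. A read-only sketch; interface names vary per host:

```shell
#!/bin/sh
# List the MTU of every network interface; 9000 indicates jumbo frames.
for dev in /sys/class/net/*; do
    printf '%-12s MTU %s\n' "$(basename "$dev")" "$(cat "$dev/mtu")"
done
```

Keep in mind that jumbo frames only help if every hop along the path (switches, bonds, VXLAN overlays) agrees on the larger MTU.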
Kernel Tunables
- net.ipv4.tcp_keepalive_time: the time of connection inactivity after which the first keepalive request is sent.
- net.core.somaxconn: Limits the socket() listen backlog. A higher value can support a higher number of simultaneous requests.
- net.nf_conntrack_max: Increases the connection tracking limit. Hitting this limit will cause packet loss and other odd behavior (such as random ping loss). Common values are anywhere between 64k and 512k.
  - You should definitely increase this value if you use nova-network. See here.
- /sys/module/nf_conntrack/parameters/hashsize: In addition to net.nf_conntrack_max, also increase the size of the hash table where connection tracking entries are stored. Common values are anywhere between 16k and 128k.
- net.netfilter.nf_conntrack_udp_timeout: For UDP request/response traffic which doesn't reuse the UDP port (DNS traffic, for example), lower this value to something like "5".
- (todo) Different queue algos: FQ_CODEL,
Windows
Hypervisor Configuration
KVM / libvirt
vhost-net might provide better performance than the plain virtio driver. If you aren't able to use vhost-net, make sure to at least use the virtio driver. virtio multiqueue can also increase performance.
If you're using an Open vSwitch-based environment, look into OVS acceleration such as DPDK.
Xen
VMWare
Hyper-V
OpenStack Configuration
Instance and Image Configuration
- PCI passthrough can be used to give an instance direct access to a NIC.
- SR-IOV might also provide benefits.
- Network IO quotas and shares
  - not advanced enough; instead, use libvirt hooks (todo: elaborate on this)
Validation, Benchmarking, and Reporting
General Tools
iftop is a top-like tool for network traffic.
Benchmarking
iperf can be considered the standard for network benchmarking.
Metrics
System
The collection of /sys/class/net/*/statistics/* files can provide a wealth of network-based metrics. Additionally, /proc/net/protocols can provide further information and metrics.
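A quick way to survey those sysfs counters across all interfaces:

```shell
#!/bin/sh
# Dump basic per-interface counters from sysfs.
for dev in /sys/class/net/*; do
    name=$(basename "$dev")
    rx=$(cat "$dev/statistics/rx_bytes")
    tx=$(cat "$dev/statistics/tx_bytes")
    drop=$(cat "$dev/statistics/rx_dropped")
    printf '%-12s rx_bytes=%s tx_bytes=%s rx_dropped=%s\n' "$name" "$rx" "$tx" "$drop"
done
```

These are since-boot counters; sample twice and diff to turn them into rates, and watch rx_dropped in particular for signs of an overloaded node.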
Instance
nova diagnostics
can provide network statistics of the instance.
virsh domiflist
and virsh domifstat
can also be used to obtain network statistics on KVM/libvirt-based hypervisors.
Disk
Symptoms of Being Disk Bound
- Artificially high load with high CPU idle
- iowait
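The iowait share can be read from the aggregate cpu line of /proc/stat (iowait is the 5th value after the "cpu" label, i.e. awk field $6):

```shell
#!/bin/sh
# Compute the iowait share of total CPU time; a high value alongside
# otherwise-idle CPUs points at the disks rather than the processors.
awk '/^cpu / {
    total = 0
    for (i = 2; i <= NF; i++) total += $i
    printf "iowait: %.2f%% of CPU time\n", 100 * $6 / total
}' /proc/stat
```

As with load, this is a since-boot average; use iostat or repeated sampling for a live view.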
General Hardware Recommendations
Spindle vs SSD
- separate SSD for logs
- too many faulty SSD disks have been reported
- SSD: TRIM requests from the guest aren't passed to the hypervisor
- bcache backed by an SSD works well
- dm-cache performed worse
Hardware RAID, Software RAID, no RAID?
- RAID 0 individual disks, passed through
- durability: hardware RAID 5
- ensure writes match the stripe size of the RAID, at the filesystem level
- RAID 1 for the OS, JBOD for ephemeral storage
- battery backup: when the battery fails, the controller switches from write-back to write-through, causing a performance hit at that time
Operating System Configuration
Linux
- XFS barriers: turn off for performance, but not for databases
- XFS on RAID: tunables
- XFS or ext4
- LVM?
- cfq instead of deadline - workload specific
- tuned
- File system recommendations and benefits
- Caching and in-memory file systems?
- bcache, see notes above
- Turn off block I/O barriers, and set the tuned profile to 'virtual-guest'
- Potential data loss if power is lost while data is in the guest cache and has not been written to disk
- Guest root mount with barrier=0 + host cache=unsafe, if your workload can tolerate a very high risk of data loss
Kernel Tunables
Windows
Hypervisor Configuration
KVM / libvirt
- ignore sync calls from the guest - dangerous, but fast
- write-through vs. write-back caching
- the defaults are usually safe
Xen
VMWare
Hyper-V
OpenStack Configuration
- Base images, copy on write
Image Formats
- qcow2: smaller, copy-on-write
- qcow2 tunables:
  - preallocation: full (fallocate), metadata, or none
  - format/version
- If you are using Red Hat, prelinking is turned on by default. After the VM boots, the prelinker rewrites all the libraries and the qcow2 image grows rapidly even with no disk activity.
  - We disabled the prelinker to solve this.
Overcommit
- for ephemeral
- for migration
Instance and Image Configuration
- tuned
- IDE vs SCSI: SCSI didn't show a performance increase
- SCSI for TRIM support, virtio-blk for IO performance
- IO scheduler: noop
- Disk IO quotas and shares
- yes on cinder
- question on how to effectively use
- turn off mlocate, prelinking
Validation, Benchmarking, and Reporting
Benchmarking
- fio (extensive)
- bonnie++ (quick)
Metrics
System
- iowait
- iops
- iostats
- vmstat
- sysstat (sar metrics)
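The raw per-device counters behind iostat can also be read from /proc/diskstats. A sketch using the field positions documented in the kernel's iostats documentation:

```shell
#!/bin/sh
# Per-device IO counters: reads/writes completed and total ms spent on IO.
# /proc/diskstats fields: major minor name reads_completed ... (field 4),
# writes_completed (field 8), ms spent doing IO (field 13).
awk '$3 !~ /^(loop|ram)/ {
    printf "%-10s reads=%s writes=%s io_ms=%s\n", $3, $4, $8, $13
}' /proc/diskstats
```

As elsewhere, these are since-boot counters; diff two samples to get current IOPS.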
Instance
- nova diagnostics
- virsh
References
- RedHat guides from Steve Gordon
- Docs from distributions
- CERN Tuning for high throughput computing
- http://openstack-in-production.blogspot.fr/2015/09/ept-huge-pages-and-benchmarking.html
- http://openstack-in-production.blogspot.fr/2015/08/numa-and-cpu-pinning-in-high-throughput.html
- http://openstack-in-production.blogspot.fr/2015/08/ept-and-ksm-for-high-throughput.html
- http://openstack-in-production.blogspot.fr/2015/08/cpu-model-selection-for-high-throughput.html
- http://openstack-in-production.blogspot.fr/2015/08/openstack-cpu-topology-for-high.html