Difference between revisions of "Documentation/HypervisorTuningGuide"

Revision as of 04:31, 12 November 2015

About the Hypervisor Tuning Guide

The goal of the Hypervisor Tuning Guide (HTG) is to provide cloud operators with detailed instructions and settings to get the best performance out of their hypervisors.

This guide is broken into four major sections:

CPU
Memory
Network
Disk

Each section has tuning information for the following areas:

Symptoms of being (CPU, Memory, Network, Disk) bound
General hardware recommendations
Operating System configuration
Hypervisor configuration
OpenStack configuration
Instance and Image configuration
Validation, benchmarking, and reporting

How to Contribute

Simply add your knowledge to this wiki page! The HTG does not yet have a formal documentation repository. It's still very much in initial stages.

Understanding Your Workload

I imagine this section to be the most theoretical / high level out of the entire guide.

References

https://docs.mirantis.com/openstack/fuel/fuel-6.1/planning-guide.html#hardware-calculation

CPU

Introduction about CPU.

Symptoms of Being CPU Bound

Raw CPU utilization above 80%
Idle percentage below 20%
When load is very high, the cause is usually disk I/O and not CPU
Load average can be tricky to interpret
Steal time: when high in the guest, it indicates that the hypervisor is busy
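
One way to put numbers on the idle and steal symptoms above is to sample the aggregate cpu line of /proc/stat twice and compare the two samples. This is a minimal sketch, assuming a Linux host or guest; the function name is made up for illustration, and the 80% / 20% figures come from the list above, not from any tool.

```python
# Field order of the aggregate "cpu" line in /proc/stat (Linux).
# The trailing guest fields are skipped because guest time is already
# counted inside user/nice.
FIELDS = ["user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal"]


def cpu_percentages(before: str, after: str) -> dict:
    """Return the share of time spent in each state between two samples.

    `before` and `after` are the first line of /proc/stat, read at two
    different times.
    """
    def parse(line):
        parts = line.split()
        if not parts or parts[0] != "cpu":
            raise ValueError("expected the aggregate 'cpu' line")
        return [int(v) for v in parts[1:1 + len(FIELDS)]]

    deltas = [b - a for a, b in zip(parse(before), parse(after))]
    total = sum(deltas)
    if total <= 0:
        raise ValueError("samples must be taken at different times")
    return {name: 100.0 * d / total for name, d in zip(FIELDS, deltas)}
```

To use it, read the first line of /proc/stat, wait a few seconds, read it again, and pass both lines in. An idle value below 20 matches the symptom above; a high steal value inside a guest points at a busy hypervisor.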

General Hardware Recommendations

Hyperthreading

Virtual router application is better with HT turned off (network-specific workloads?)
thread policies can also be important (prefer/avoid) - hopefully a mitaka enhancement
NUMA?
- http://docs.openstack.org/developer/nova/testing/libvirt-numa.html
CPU pinning

Notable CPU flags

nested cpu for virtualization within a guest
- may have issues with older kernel version: nested vms would lock up
AVX, SSE4, AES-NI, etc

Operating System Configuration

Linux

exclude cores, dedicate cores / cpus specifically for certain OS tasks
- isolcpus kernel parameter
- see the Red Hat blog posts under References
reasonable increase in performance by compiling own kernels
turn off cpu scaling - run at full frequency
TSC
- x86 Specific - different architectures have different timekeeping mechanisms
- can be virtualised or not
- can run one core slower than another
- clock source
- avoiding jitter - eg asterisk, telephony
- TSC is the time stamp counter; is this a TSC vs HPET clock source question?
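
To see which clock source a Linux host or guest is actually using, the kernel exposes it under sysfs. The sketch below is only an illustration: the function name is hypothetical, and the sysfs directory is the usual location on Linux but is passed in as a parameter so it can be pointed elsewhere.

```python
from pathlib import Path

# Usual sysfs location on Linux; treat it as an assumption for your distro.
CLOCKSOURCE_DIR = "/sys/devices/system/clocksource/clocksource0"


def clocksource_info(base=CLOCKSOURCE_DIR):
    """Return (current clock source, list of available clock sources)."""
    base = Path(base)
    current = (base / "current_clocksource").read_text().strip()
    available = (base / "available_clocksource").read_text().split()
    return current, available
```

Comparing the current value against the available list (for example tsc vs hpet) is a quick first check when chasing timekeeping jitter.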

Windows

virtio drivers

Hypervisor Configuration

KVM / libvirt

Xen

VMWare

Hyper-V

NUMA spanning is enabled by default and should be disabled for performance; there is a caveat when restarting instances

OpenStack Configuration

host-passthrough is always faster than host-model or custom
- This needs a warning: migrations will be impossible if non-identical compute nodes are added later
- Passthrough has caused multiple issues with existing instances during upgrades, e.g., AppArmor bugs and being unable to resume from QEMU save files due to an unknown CPU model (requires cpu_map.xml edits)

Overcommitting

Generally, it's safe to overcommit CPUs. It has been reported that the main reason not to overcommit CPU is that memory is not being overcommitted.
RAM overcommit, particularly with KSM, has a CPU cost as well
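
A rough way to see how CPU and memory overcommit interact is to work out which resource runs out first for a given flavor. The sketch below is a simplified model, not Nova's actual scheduler accounting: cpu_allocation_ratio, ram_allocation_ratio and reserved_host_memory_mb are real Nova option names, but the function, the host size and the flavor are made-up examples.

```python
def instances_that_fit(host_cpus, host_ram_mb, flavor_vcpus, flavor_ram_mb,
                       cpu_allocation_ratio, ram_allocation_ratio,
                       reserved_host_memory_mb):
    """Return how many instances of one flavor fit on one host.

    Simplified model: capacity is (physical amount minus reservation)
    multiplied by the allocation ratio, and the scarcer resource wins.
    """
    vcpu_capacity = int(host_cpus * cpu_allocation_ratio)
    ram_capacity = int((host_ram_mb - reserved_host_memory_mb) * ram_allocation_ratio)
    by_cpu = vcpu_capacity // flavor_vcpus
    by_ram = ram_capacity // flavor_ram_mb
    return min(by_cpu, by_ram)


# Hypothetical host: 32 CPUs, 256 GB RAM, flavor of 2 vCPUs / 4 GB.
# A CPU ratio of 16 would allow 256 instances, but with a RAM ratio of 1.0
# memory runs out far earlier.
fit = instances_that_fit(32, 262144, 2, 4096, 16.0, 1.0, 512)
```

In this example memory is the limit, so the host ends up far less CPU-overcommitted than the CPU ratio alone would allow.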

Instance and Image Configuration

Describe scenarios where the instance sees a CPU flag but cannot use it.
CPU quotas and shares
- Reported use case: a default quota of 80% on all flavors. Don't do this if workloads are very CPU heavy.
Guest kernel I/O scheduler set to "none" (elevator=noop on the kernel command line)
- What are the benefits of this? The host and guest schedulers don't fight
Hyper-V enlightenment features
Hyper-V gen 2 VMs are seen to be faster than gen 1; reason?

Validation, Benchmarking, and Reporting

General Tools

top
vmstat
htop

Benchmarking Tools

phoronix
Benchmark suites such as HEP-SPEC06, used in high energy physics to rate HTC worker nodes, give a score in HS06 (http://w3.hepix.org/benchmarks/doku.php/)
Depends on your workload. Test using the dominant workload that is going to be run on your cloud.
For Java-related workloads, DaCapo is a pretty good benchmark - http://www.dacapobench.org
What about full system simulations? E.g., deploy an entire Hadoop cluster and have it create large simulated loads?
- Done for Hadoop with 10+ nodes
- This is a good idea; TeraSort-type benchmarks are really useful.
https://github.com/ibmcb/cbtool
http://www.phoronix-test-suite.com/
https://github.com/GoogleCloudPlatform/PerfKitBenchmarker

Metrics

System

CPU: user, system, iowait, irq, soft irq

Instance

nova diagnostics
Do not record per-process stats - explain why
overlaying cputime vs allocated cpu

Memory

Symptoms of Being Memory Bound

OOM Killer
Out of swap

General Hardware Recommendations

ensure numa distribution is balanced
memory speeds, vary by chip

Operating System Configuration

Linux

Kernel Tunables

Transparent Hugepages can go either way depending on workload
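
Before experimenting with Transparent Hugepages it helps to know which mode is active. On Linux the file /sys/kernel/mm/transparent_hugepage/enabled lists the modes on one line and marks the active one with square brackets. A minimal sketch for reading that format follows; the function name is made up for illustration.

```python
import re


def active_thp_mode(text: str) -> str:
    """Return the bracketed (active) mode from the THP 'enabled' file.

    Example file content: "always [madvise] never"
    """
    match = re.search(r"\[(\w+)\]", text)
    if match is None:
        raise ValueError("no active mode marked with [brackets]")
    return match.group(1)
```

Pass it the content of the file, for example the result of reading /sys/kernel/mm/transparent_hugepage/enabled, and record the mode next to any benchmark result so runs can be compared.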

KSM

Often causes CPU performance problems; it may be better to turn it off

Windows

Hypervisor Configuration

KVM / libvirt

nova enables ballooning but doesn't actually use it
- nova would need something doing the equivalent of MOM in oVirt to "exercise" the balloon:
- http://www.ovirt.org/MoM
reserved_host_memory_mb (default: 512 MB, which is too low for real-world use)
Turn on/off EPT (see blog post)

Xen

VMWare

Hyper-V

OpenStack Configuration

Overcommitting

Memory Overcommit & the cost of swapping

Instance and Image Configuration

ensure ballooning is enabled / available
guests cannot see memory speed - not exposed like cpu flags are

Validation, Benchmarking, and Reporting

General Tools

free

Benchmarking

stream

Metrics

System

page in, page out, page scans per second, `free`
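
The paging metrics above can be derived from two samples of /proc/vmstat on Linux. This is a sketch under assumptions: counter names vary between kernel versions, so page scans are summed over the pgscan_kswapd* and pgscan_direct* counters (leaving out pgscan_direct_throttle, which counts something else), and the function names are hypothetical.

```python
def parse_vmstat(text: str) -> dict:
    """Parse 'name value' lines from /proc/vmstat into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def paging_rates(before: str, after: str, interval_s: float) -> dict:
    """Return per-second paging rates between two /proc/vmstat samples."""
    a, b = parse_vmstat(before), parse_vmstat(after)

    def scans(counters):
        # kswapd and direct reclaim scans; per-zone suffixes on older
        # kernels are covered by the prefix match.
        return sum(v for k, v in counters.items()
                   if k.startswith(("pgscan_kswapd", "pgscan_direct"))
                   and k != "pgscan_direct_throttle")

    def rate(new, old):
        return (new - old) / interval_s

    return {
        "page_in_per_s": rate(b.get("pgpgin", 0), a.get("pgpgin", 0)),
        "page_out_per_s": rate(b.get("pgpgout", 0), a.get("pgpgout", 0)),
        "page_scans_per_s": rate(scans(b), scans(a)),
    }
```

A sustained non-zero scan rate suggests the kernel is working to reclaim memory, which is usually worth graphing alongside `free`.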

Instance

nova diagnostics
virsh

Network

Symptoms of Being Network Bound

from guest: soft irq will be high
high io wait for network-based instance disk
discards on switch

General Hardware Recommendations

Bonding
- LACP vs balance-tlb vs balance-alb
VXLAN offload

Operating System Configuration

Linux

pin send/recv to specific cores
ip forwarding: disable GRO on kernel module (nic driver)
PCI Passthrough
SR-IOV?
- NUMA locality of SR-IOV (and passthrough) devices (pretty much get this for free if you are using NUMATopologyFilter and have a chipset that has locality)
Jumbo frames? 9000 MTU https://paste.fedoraproject.org/284011/14459359/ - for VLANs - source https://access.redhat.com/solutions/1417133

Kernel Tunables

net.ipv4.tcp_keepalive_time, net.core.somaxconn, net.nf_conntrack_max
Different queue algos: FQ_CODEL, etc
What is your conntrack_max?
- 512k
- 256k with hash table size of 16k
- 256k with hash table size of 64k (overkill?)+1
- Recently moved from 64k to 128k because we were hitting the 64k default limit on Ubuntu
- https://bugs.launchpad.net/ubuntu/+source/nova/+bug/1167073
How big of a hashsize?
- 64k (conntrack_max/8)
What is your gc_thresh set to?
- based on memory https://github.com/stackforge/os-ansible-deployment/blob/master/playbooks/roles/openstack_hosts/defaults/main.yml#L77-L83 (and below in that file)
Lower net.netfilter.nf_conntrack_udp_timeout to 5
- Use this only for UDP request/response traffic that doesn't reuse the UDP port. DNS traffic is one candidate for this.
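
The conntrack answers above can be tied together with a little arithmetic. The sketch below derives the hash size from conntrack_max using the conntrack_max/8 rule reported above and carries the 5 second UDP timeout along; the function name is made up, and note that hashsize is a parameter of the nf_conntrack kernel module, not a sysctl.

```python
def conntrack_settings(conntrack_max: int, entries_per_bucket: int = 8,
                       udp_timeout_s: int = 5) -> dict:
    """Return suggested conntrack values for a given table size.

    The divisor of 8 and the 5 second UDP timeout are the values reported
    in the notes above, not universal recommendations.
    """
    return {
        "net.netfilter.nf_conntrack_max": conntrack_max,
        # Module parameter of nf_conntrack, not a sysctl key.
        "hashsize": conntrack_max // entries_per_bucket,
        "net.netfilter.nf_conntrack_udp_timeout": udp_timeout_s,
    }


# 512k entries gives a 64k hash table, matching the figures reported above.
settings = conntrack_settings(512 * 1024)
```

The other reported combinations (256k with a 16k or 64k hash table) correspond to divisors of 16 and 4.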

Windows

Hypervisor Configuration

KVM / libvirt

vhost-net (on by default on most modern distros?)
virtio
- virtio multiqueue
ovs acceleration (dpdk)
Out of order frames?

Xen

VMWare

Hyper-V

OpenStack Configuration

Instance and Image Configuration

PCI pass-through
Network IO quotas and shares
- not advanced enough
- instead, using libvirt hooks
1500 MTU
Make sure the instance is actually using vhost-net (load the kernel module)

Validation, Benchmarking, and Reporting

General Tools

iftop

Benchmarking

iperf

Metrics

System

bytes in/out, packets in/out, irqs, pps
/proc/net/protocols
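
Bytes, packets and drops per interface can be read from /proc/net/dev on Linux. The following is a minimal parsing sketch, assuming the usual layout of two header lines followed by one line per interface with eight receive and eight transmit counters; the function name is hypothetical.

```python
def parse_net_dev(text: str) -> dict:
    """Parse the content of /proc/net/dev into per-interface counters."""
    stats = {}
    for line in text.splitlines()[2:]:  # skip the two header lines
        if ":" not in line:
            continue
        name, data = line.split(":", 1)
        values = [int(v) for v in data.split()]
        stats[name.strip()] = {
            "rx_bytes": values[0],
            "rx_packets": values[1],
            "rx_drop": values[3],
            "tx_bytes": values[8],
            "tx_packets": values[9],
            "tx_drop": values[11],
        }
    return stats
```

Sampling twice and dividing the differences by the interval gives bytes per second and packets per second (pps).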

Instance

nova diagnostics
virsh
virtual nic stats

Disk

Symptoms of Being Disk Bound

Artificially high load with high CPU idle
iowait
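
The two symptoms above can be combined into a rough check: load is high, yet the CPUs are mostly not doing work, and a visible share of time is spent in iowait. This is only a heuristic sketch; the function name and every threshold in it are assumptions chosen for illustration, not measured values.

```python
def looks_disk_bound(load1: float, cpu_count: int, idle_pct: float,
                     iowait_pct: float, max_busy_pct: float = 50.0,
                     min_iowait_pct: float = 10.0) -> bool:
    """Return True when high load coincides with idle CPUs and iowait.

    load1 is the 1 minute load average, idle_pct and iowait_pct are CPU
    time percentages as reported by tools such as top or vmstat.
    """
    high_load = load1 > cpu_count
    busy_pct = 100.0 - idle_pct - iowait_pct  # time spent doing actual work
    return high_load and busy_pct <= max_busy_pct and iowait_pct >= min_iowait_pct
```

On the hypervisor itself, the 1 minute load average and the CPU count are available in Python via os.getloadavg() and os.cpu_count().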

General Hardware Recommendations

Spindle vs SSD

separate ssd for logs
too many faulty ssd disks
SSD: TRIM, trim requests from guest aren't passed to hypervisor
bcache with ssd
- dm-cache performed worse

Hardware RAID, Software RAID, no RAID?

raid0 individual disks, pass through
durability: hardware raid5
ensure writes match stripe size of raid, on the filesystem level
raid1 for OS, JBOD for ephemeral
Battery backup: when the battery is unavailable, the controller switches from write-back to write-through, with a performance hit during that time
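
For the point about matching writes to the RAID stripe size at the filesystem level, XFS takes the stripe unit and stripe width through the su and sw data options of mkfs.xfs. The sketch below computes that option string from the RAID layout; the function name and the example chunk sizes are assumptions, and the array's real chunk size must be read from the controller or from mdadm.

```python
# Number of data-bearing disks for common RAID levels.
DATA_DISKS = {
    "raid0": lambda n: n,
    "raid5": lambda n: n - 1,   # one disk's worth of parity
    "raid6": lambda n: n - 2,   # two disks' worth of parity
    "raid10": lambda n: n // 2,  # mirrored pairs
}


def xfs_stripe_options(level: str, disks: int, chunk_kib: int) -> str:
    """Return the value for 'mkfs.xfs -d' matching the RAID geometry.

    su is the per-disk chunk size, sw is the number of data disks.
    """
    data_disks = DATA_DISKS[level](disks)
    if data_disks < 1:
        raise ValueError("not enough disks for " + level)
    return f"su={chunk_kib}k,sw={data_disks}"
```

For a 4-disk RAID 5 with 64 KiB chunks this yields a string to pass as `mkfs.xfs -d su=64k,sw=3`.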

Operating System Configuration

Linux

XFS barriers: turn off for performance, but not for databases
xfs on raid, tunables
xfs or ext4
LVM?
cfq instead of deadline - workload specific
tuned
File system recommendations and benefits
Caching and in-memory file systems?
bcache, see notes above
Turn off block I/O barrier, set tuned profile to 'virtual-guest'
- Potential data loss if power is lost and data is in guest cache, has not been written to disk, etc
- Guest root mount barrier=0 + Host cache=unsafe if your workload can handle very unsafe data loss potential
  - https://bugs.launchpad.net/openstack-manuals/+bug/1106423

Kernel Tunables

Windows

Hypervisor Configuration

KVM / libvirt

ignore sync calls from guest - dangerous, but fast
write-through, write-back
defaults are usually safe

Xen

VMWare

Hyper-V

OpenStack Configuration

Base images, copy on write

Image Formats

qcow: smaller, cow
qcow tunables
- preallocation: full fallocate, metadata, no
- format/version
preallocation: full fallocate, metadata, no
If you are using Red Hat, prelinking is turned on by default. After the VM boots, the prelinker rewrites all the libraries and the qcow image grows quickly even with no disk activity.
- we disabled the prelinker to solve this
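
The qcow tunables above are set when the image is created. A minimal sketch that builds (but does not run) a qemu-img command line follows. It assumes the qcow2 format, whose preallocation modes are off, metadata, falloc and full, which appears to be what "full fallocate, metadata, no" refers to; the function name, path and size are made up for illustration.

```python
PREALLOCATION_MODES = ("off", "metadata", "falloc", "full")


def qemu_img_create_argv(path: str, size: str,
                         preallocation: str = "off",
                         compat: str = "1.1") -> list:
    """Return the argv for creating a qcow2 image with the given tunables.

    compat selects the qcow2 format version ("0.10" or "1.1").
    """
    if preallocation not in PREALLOCATION_MODES:
        raise ValueError(f"unknown preallocation mode: {preallocation}")
    options = f"preallocation={preallocation},compat={compat}"
    return ["qemu-img", "create", "-f", "qcow2", "-o", options, path, size]
```

The returned list can be handed to subprocess.run on a machine with qemu-img installed. Metadata preallocation keeps the image small while avoiding some allocation overhead on first write; falloc and full trade disk space for more predictable write performance.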

Overcommit

for ephemeral
for migration

Instance and Image Configuration

tuned
IDE vs SCSI: SCSI didn't show a performance increase
- virtio-scsi for TRIM support, virtio-blk for I/O performance
I/O scheduler: noop
Disk IO quotas and shares
- yes on cinder
- question on how to effectively use
turn off mlocate, prelinking

Validation, Benchmarking, and Reporting

Benchmarking

fio (extensive)
bonnie++ (quick)

Metrics

System

iowait
iops
iostats
vmstat
sysstat (sar metrics)

Instance

nova diagnostics
virsh

References

Red Hat guides from Steve Gordon
- http://redhatstackblog.redhat.com/2015/05/05/cpu-pinning-and-numa-topology-awareness-in-openstack-compute/
- http://redhatstackblog.redhat.com/2015/09/15/driving-in-the-fast-lane-huge-page-support-in-openstack-compute/

Docs from distributions
- KVM
  - https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html-single/Virtualization_Tuning_and_Optimization_Guide/index.html
- Xen
  - http://wiki.xenproject.org/wiki/Tuning_Xen_for_Performance

CERN Tuning for high throughput computing

Previous Etherpads
- https://etherpad.openstack.org/p/YVR-ops-hypervisor-tuning
- https://etherpad.openstack.org/p/PAO-ops-hypervisor-tuning

@@ Line 56: / Line 56: @@
 * nested cpu for virtualization within a guest
 ** may have issues with older kernel version: nested vms would lock up
+* AVX, SSE4, AES-NI, etc
 === Operating System Configuration ===
@@ Line 66: / Line 67: @@
 * reasonable increase in performance by compiling own kernels
 * turn off cpu scaling - run at full frequency
+* TSC
+** x86 Specific - different architectures have different timekeeping mechanisms
+** can be virtualised or not
+** can run one core slower than another
+** clock source
+** avoiding jitter - eg asterisk, telephony
+** time stamp counter,  is it a tec vs hpet thing?
 ==== Windows ====
 * virtio drivers
@@ Line 82: / Line 89: @@
 * host-passthrough is always faster than host-model or custom
 ** This needs to have a warning that migrations will be impossible if non-identical compute nodes are added later
+** passthrough has caused multiple issues with existing instances during upgrades etc. e.g., apparmor bugs, unable to resume from qemu save files due to unknown model (requires cpu_map.xml edits)
 ==== Overcommitting ====
 * Generally, it's safe to overcommit CPUs. It has been reported that the main reason not to overcommit CPU is because of not overcommitting memory.
@@ Line 87: / Line 96: @@
 === Instance and Image Configuration ===
+* Describe scenarios where the instance sees a CPU flag but cannot use it.
 * CPU quotas and shares
 ** Reported use-case: default of 80% on all flavors, if workloads are very cpu heavy, don't do.
+* guest kernel scheduler set to "none" (elevator=noop on kernel command line)
+** what are the benefits of this? host and guest schedulers don't fight
 * Hyper-v enlightenment features
 * Hyper-v gen 2 vms are seen to be faster than gen 1, reason?
 === Validation, Benchmarking, and Reporting ===
@@ Line 102: / Line 114: @@
 ==== Benchmarking Tools ====
 * phoronix
+* Benchmark suite like HEP-Spec2006 used in High energy physics for HTC workers node give mark in HS06 (http://w3.hepix.org/benchmarks/doku.php/)
+* Depends on your workload. Test using the dominant workload that is going to be run on your cloud.
+* For Java-related workloads, DaCapo is a pretty good benchmark - http://www.dacapobench.org
+* What about full system simulations? ie: Deploy an entire Hadoop cluster and have it create large simulated loads?
+** Did for hadoop with 10+ nodes
+** This is a good idea, TeraSort type benchmarks are really useful.
+* https://github.com/ibmcb/cbtool
+* http://www.phoronix-test-suite.com/
+* https://github.com/GoogleCloudPlatform/PerfKitBenchmarker
 ==== Metrics ====
@@ Line 192: / Line 213: @@
 * net.ipv4.tcp_keepalive_time, net.core.somaxconn, net.nf_conntrack_max
 * Different queue algos: FQ_CODEL, etc
+* What is your conntrack_max?
+** 512k
+** 256k with hash table size of 16k
+** 256k with hash table size of 64k (overkill?)+1
+** Recently moved from 64k to 128k because we were hitting the 64k default limit on Ubuntu
+** https://bugs.launchpad.net/ubuntu/+source/nova/+bug/1167073
+* How big of a hashsize?
+** 64k (conntrack_max/8)
+* What is your gc_thresh set to?
+** based on memory https://github.com/stackforge/os-ansible-deployment/blob/master/playbooks/roles/openstack_hosts/defaults/main.yml#L77-L83 (and below in that file)
+* drop down net.netfilter.nf_conntrack_udp_timeout=5
+** Use this only for udp request response type traffic which doesn't reuse the udp port. DNS traffic is one candidate for this.
 ==== Windows ====
@@ Line 201: / Line 233: @@
 ** virtio multiqueue
 * ovs acceleration (dpdk)
+* Out of order frames?
 ==== Xen ====
 ==== VMWare ====
@@ Line 264: / Line 297: @@
 * Caching and in-memory file systems?
 * bcache, see notes above
+* Turn off block I/O barrier, set tuned profile to 'virtual-guest'
+** Potential data loss if power is lost and data is in guest cache, has not been written to disk, etc
+** Guest root mount barrier=0 + Host cache=unsafe if your workload can handle very unsafe data loss potential
+*** https://bugs.launchpad.net/openstack-manuals/+bug/1106423
 ===== Kernel Tunables =====
@@ Line 283: / Line 320: @@
 ==== Image Formats ====
 * qcow: smaller, cow
+* qcow tunables
+** preallocation: full fallocate, metadata, no
+** format/version
+* preallocation: full fallocate, metadata, no
+* If you are using Redhat, redhat has pre-linking turned on by default. So after the VM boots, prelinker re-writes all the libraries and the qcow grows like crazy even with no disk activity.
+** we disbaled pre-linker to solve this
 ==== Overcommit ====
@@ Line 291: / Line 335: @@
 * tuned
 * ide vs scsi, scsi didn't have a performance increase
+** scsi for supporting Trim, blk for IO performance
 * ioschedule: noop
 * Disk IO quotas and shares
