Documentation/HypervisorTuningGuide
Contents
- 1 About the Hypervisor Tuning Guide
- 2 Understanding Your Workload
- 3 CPU
- 4 Memory
- 5 Network
- 6 Disk
- 7 References
About the Hypervisor Tuning Guide
The goal of the Hypervisor Tuning Guide (HTG) is to provide cloud operators with detailed instructions and settings to get the best performance out of their hypervisors.
This guide is broken into four major sections:
- CPU
- Memory
- Network
- Disk
Each section has tuning information for the following areas:
- Symptoms of being (CPU, Memory, Network, Disk) bound
- General hardware recommendations
- Operating System configuration
- Hypervisor configuration
- OpenStack configuration
- Instance and Image configuration
- Validation, benchmarking, and reporting
How to Contribute
Simply add your knowledge to this wiki page! The HTG does not yet have a formal documentation repository; it's still very much in its initial stages.
Understanding Your Workload
This section is intended to be the most theoretical / high-level part of the entire guide.
References
CPU
An introduction to CPU tuning.
Symptoms of Being CPU Bound
Compute Nodes that are CPU bound will generally see CPU usage at 80% or higher and idle usage less than 20%.
CPU Usage
On Linux-based systems, you can see CPU usage using the vmstat tool. As an example:

```
# vmstat
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0  35576 312096  95752 1104936    0    0     0     1    0    0  0  0 100  0  0
```
For CPU, there are five key areas to look at:
- r under procs: The number of runnable processes. If this number is consistently higher than the number of cores or CPUs on your compute node, then there are consistently more jobs to run than your node can handle.
- us under cpu: The percentage of CPU time spent running non-kernel (user) code.
- sy under cpu: The percentage of CPU time spent running kernel (system) code.
- id under cpu: The percentage of CPU time spent idle.
- st under cpu: The percentage of CPU time stolen from the virtual machine. If an instance consistently sees a high st value, then the compute node hosting it might be under a lot of stress.
Load
Linux-based systems have an abstract concept of "Load". A high load means the system is running "hot", while a low load means it's relatively idle. But what numbers constitute hot and cold? It varies from system to system. A general rule of thumb is that the load increases by 1 for each core or CPU that is consistently processing a job; therefore, a normal load is at most the number of cores / CPUs.
However, extremely high loads (100+) are usually an indication of IO problems rather than CPU problems.
Load should not be used as the sole metric when diagnosing potential Compute Node problems. It's best to use Load as an indication to check further areas.
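A quick way to put load in context is to compare the load averages against the core count; a minimal sketch:

```
# Number of online cores, e.g. 16:
nproc

# 1/5/15-minute load averages are the first three fields:
cat /proc/loadavg

# Sustained load well above the nproc value suggests oversubscription;
# loads of 100+ usually point at IO rather than CPU.
```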
General Hardware Recommendations
Hyperthreading
- Virtual router applications perform better with HT turned off (does this generalize to network-specific workloads?)
- thread policies can also be important (prefer/avoid) - hopefully a Mitaka enhancement
- NUMA?
- CPU pinning
Notable CPU flags
- nested CPU for virtualization within a guest
- may have issues on older kernel versions: nested VMs would lock up
- AVX, SSE4, AES-NI, etc
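As a rough sketch, you can check which of these flags the compute node exposes, and whether nested virtualization is enabled, from the shell:

```
# Look for virtualization and acceleration flags on the host:
egrep -o 'vmx|svm|aes|avx|sse4_1|sse4_2' /proc/cpuinfo | sort -u

# Check whether nested virtualization is enabled (Intel; use kvm_amd on AMD):
cat /sys/module/kvm_intel/parameters/nested
```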
Operating System Configuration
Linux
- exclude cores: dedicate cores / CPUs specifically to certain OS tasks (see the sketch after this list)
- isolcpus
- see the Red Hat blog post in the References below
- a reasonable increase in performance can be gained by compiling your own kernels
- turn off CPU frequency scaling - run at full frequency
- TSC
- x86 Specific - different architectures have different timekeeping mechanisms
- can be virtualised or not
- can run one core slower than another
- clock source
- avoiding jitter - eg asterisk, telephony
- time stamp counter: is it a tsc vs hpet thing?
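A minimal sketch of the isolation, frequency scaling, and clock source items above (the core range is illustrative):

```
# Kernel command line (e.g. GRUB_CMDLINE_LINUX): isolate cores 2-15
# from the general scheduler so they can be dedicated to instances:
#   isolcpus=2-15

# Turn off CPU frequency scaling - run every core at full frequency:
for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    echo performance > "$g"
done

# Check which clock source is in use (tsc vs hpet matters for jitter):
cat /sys/devices/system/cpu/clocksource/clocksource0/current_clocksource
```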
Windows
- virtio drivers
Hypervisor Configuration
KVM / libvirt
Xen
VMWare
Hyper-V
- NUMA spanning is enabled by default; it should be disabled for performance, with the caveat that instances must be restarted for the change to take effect
OpenStack Configuration
- host-passthrough is always faster than host-model or custom (see the sketch after this list)
- this needs a warning that live migration will be impossible if non-identical compute nodes are added later
- passthrough has caused multiple issues with existing instances during upgrades, e.g. AppArmor bugs, and instances unable to resume from qemu save files due to an unknown CPU model (requires cpu_map.xml edits)
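As a sketch, the CPU mode lives in the [libvirt] section of nova.conf; crudini is just one way to edit it, and the SandyBridge model is illustrative:

```
# Fastest, but blocks live migration to non-identical compute nodes:
crudini --set /etc/nova/nova.conf libvirt cpu_mode host-passthrough

# A safer alternative if heterogeneous hardware may be added later:
# crudini --set /etc/nova/nova.conf libvirt cpu_mode custom
# crudini --set /etc/nova/nova.conf libvirt cpu_model SandyBridge
```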
Overcommitting
- Generally, it's safe to overcommit CPUs. It has been reported that the main reason not to overcommit CPU is to avoid overcommitting memory along with it.
- RAM overcommit, particularly with KSM, has a CPU hit as well
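CPU overcommit is controlled by the scheduler's allocation ratio; a sketch (the 4.0 value is illustrative):

```
# The default cpu_allocation_ratio is 16.0; lower it to overcommit less:
crudini --set /etc/nova/nova.conf DEFAULT cpu_allocation_ratio 4.0
```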
Instance and Image Configuration
- Describe scenarios where the instance sees a CPU flag but cannot use it.
- CPU quotas and shares (see the sketch after this list)
- reported use case: a default CPU quota of 80% on all flavors; don't do this if workloads are very CPU heavy
- guest kernel IO scheduler set to "none" (elevator=noop on the kernel command line)
- what are the benefits? the host and guest schedulers don't fight each other
- Hyper-v enlightenment features
- Hyper-V gen 2 VMs are observed to be faster than gen 1 - what's the reason?
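The CPU quotas and shares mentioned above map onto libvirt cgroup settings via flavor extra specs; a sketch (the flavor name and values are illustrative):

```
# Hard-cap a flavor at 80% of one pCPU (80 ms of runtime per 100 ms period):
nova flavor-key m1.medium set quota:cpu_quota=80000 quota:cpu_period=100000

# Or use relative weighting between instances instead of a hard cap:
nova flavor-key m1.medium set quota:cpu_shares=2048
```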
Validation, Benchmarking, and Reporting
General Tools
- top
- vmstat
- htop
Benchmarking Tools
- phoronix
- Benchmark suites like HEP-Spec2006, used in High Energy Physics for HTC worker nodes, give a mark in HS06 (http://w3.hepix.org/benchmarks/doku.php/)
- Depends on your workload. Test using the dominant workload that is going to be run on your cloud.
- For Java-related workloads, DaCapo is a pretty good benchmark - http://www.dacapobench.org
- What about full system simulations? i.e., deploy an entire Hadoop cluster and have it create large simulated loads?
- this has been done for Hadoop with 10+ nodes
- this is a good idea; TeraSort-type benchmarks are really useful
- https://github.com/ibmcb/cbtool
- http://www.phoronix-test-suite.com/
- https://github.com/GoogleCloudPlatform/PerfKitBenchmarker
Metrics
System
- CPU: user, system, iowait, irq, soft irq
Instance
- nova diagnostics
- Do not record per-process stats - explain why
- overlaying cputime vs allocated cpu
Memory
Symptoms of Being Memory Bound
- OOM Killer
- Out of swap
General Hardware Recommendations
- ensure numa distribution is balanced
- memory speeds, vary by chip
Operating System Configuration
Linux
Kernel Tunables
- Transparent Hugepages can go either way depending on workload
KSM
- KSM can often cause performance (CPU) problems; it may be better to turn it off (see the sketch below)
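A sketch of turning KSM off at runtime:

```
# 1 = KSM running, 0 = stopped, 2 = stop and un-share all merged pages:
cat /sys/kernel/mm/ksm/run
echo 0 > /sys/kernel/mm/ksm/run   # stop scanning; existing shared pages remain
echo 2 > /sys/kernel/mm/ksm/run   # also break up pages already merged
```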
Windows
Hypervisor Configuration
KVM / libvirt
- nova enables ballooning but doesn't actually use it
- nova would need something doing the equivalent of MOM in oVirt to "exercise" the balloon:
- http://www.ovirt.org/MoM
- reserved_host_memory_mb (default: 512 MB, which is too low for the real world; see the sketch after this list)
- Turn on/off EPT (see blog post)
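A sketch of raising the host memory reservation (the 4096 value is illustrative):

```
# Reserve 4 GB for the host itself instead of the 512 MB default:
crudini --set /etc/nova/nova.conf DEFAULT reserved_host_memory_mb 4096
```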
Xen
VMWare
Hyper-V
OpenStack Configuration
Overcommitting
- Memory Overcommit & the cost of swapping
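Memory overcommit is controlled by ram_allocation_ratio; a sketch:

```
# The default ram_allocation_ratio is 1.5; set to 1.0 to disable overcommit
# and avoid paying the cost of swapping:
crudini --set /etc/nova/nova.conf DEFAULT ram_allocation_ratio 1.0
```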
Instance and Image Configuration
- ensure ballooning is enabled / available
- guests cannot see memory speed - not exposed like cpu flags are
Validation, Benchmarking, and Reporting
General Tools
- free
Benchmarking
- stream
Metrics
System
- page in, page out, page scans per second, `free`
Instance
- nova diagnostics
- virsh
Network
Symptoms of Being Network Bound
- from guest: soft irq will be high
- high io wait for network-based instance disk
- discards on switch
General Hardware Recommendations
- Bonding
- LACP vs balance-tlb vs balance-alb
- VXLAN offload
Operating System Configuration
Linux
- pin send/recv to specific cores
- ip forwarding: disable GRO on kernel module (nic driver)
- PCI Passthrough
- SR-IOV?
- NUMA locality of SR-IOV (and passthrough) devices (pretty much get this for free if you are using NUMATopologyFilter and have a chipset that has locality)
- Jumbo frames? 9000 MTU for VLANs - https://paste.fedoraproject.org/284011/14459359/ - source: https://access.redhat.com/solutions/1417133
Kernel Tunables
- net.ipv4.tcp_keepalive_time, net.core.somaxconn, net.nf_conntrack_max
- Different queue algos: FQ_CODEL, etc
- What is your conntrack_max?
- 512k
- 256k with hash table size of 16k
- 256k with hash table size of 64k (overkill?)+1
- Recently moved from 64k to 128k because we were hitting the 64k default limit on Ubuntu
- https://bugs.launchpad.net/ubuntu/+source/nova/+bug/1167073
- How big of a hashsize?
- 64k (conntrack_max/8)
- What is your gc_thresh set to?
- based on memory https://github.com/stackforge/os-ansible-deployment/blob/master/playbooks/roles/openstack_hosts/defaults/main.yml#L77-L83 (and below in that file)
- drop net.netfilter.nf_conntrack_udp_timeout down to 5
- use this only for UDP request/response traffic that doesn't reuse the UDP port; DNS traffic is one candidate (see the sketch below)
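A sketch tying the conntrack numbers above together; the values are examples reported by operators, not recommendations:

```
# Raise the conntrack table limit (256k here):
sysctl -w net.netfilter.nf_conntrack_max=262144

# Resize the hash table (conntrack_max/4 here):
echo 65536 > /sys/module/nf_conntrack/parameters/hashsize

# Short UDP timeout - only for request/response traffic such as DNS:
sysctl -w net.netfilter.nf_conntrack_udp_timeout=5

# Watch current usage against the limit:
sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max
```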
Windows
Hypervisor Configuration
KVM / libvirt
- vhost-net (on by default in most modern distros?) - see the sketch after this list
- virtio
- virtio multiqueue
- ovs acceleration (dpdk)
- Out of order frames?
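A sketch of checking vhost-net and requesting multiqueue virtio-net; the queue count is illustrative:

```
# Make sure the vhost-net kernel module is loaded on the compute node:
lsmod | grep vhost_net || modprobe vhost-net

# Multiqueue is requested per-interface in the libvirt domain XML:
#   <interface type='bridge'>
#     <model type='virtio'/>
#     <driver name='vhost' queues='4'/>
#   </interface>
```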
Xen
VMWare
Hyper-V
OpenStack Configuration
Instance and Image Configuration
- PCI pass-through
- Network IO quotas and shares (see the sketch after this list)
- reported as not advanced enough
- some operators use libvirt hooks instead
- 1500 MTU
- Make sure the instance is actually using vhost-net (load the kernel module)
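For the network IO quotas and shares item above, the built-in knobs are flavor extra specs; a sketch (the flavor name and values are illustrative, units are kB/s):

```
# Cap a flavor's vNIC at roughly 10 MB/s in each direction:
nova flavor-key m1.medium set quota:vif_inbound_average=10240
nova flavor-key m1.medium set quota:vif_outbound_average=10240
```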
Validation, Benchmarking, and Reporting
General Tools
- iftop
Benchmarking
- iperf
Metrics
System
- bytes in/out, packets in/out, irqs, pps
- /proc/net/protocols
Instance
- nova diagnostics
- virsh
- virtual nic stats
Disk
Symptoms of Being Disk Bound
- Artificially high load with high CPU idle
- iowait
General Hardware Recommendations
Spindle vs SSD
- separate SSD for logs
- reports of too many faulty SSD disks
- SSD: TRIM - trim requests from the guest aren't passed to the hypervisor
- bcache with SSD
- dm-cache was less good
Hardware RAID, Software RAID, no RAID?
- raid0 individual disks, passed through
- durability: hardware raid5
- ensure writes match the stripe size of the raid, at the filesystem level
- raid1 for the OS, JBOD for ephemeral storage
- battery-backed write cache: when the battery fails, the controller switches from write-back to write-through, with a performance hit at that time
Operating System Configuration
Linux
- xfs barriers: turn them off for performance, but not for database workloads
- xfs on raid, tunables
- xfs or ext4
- LVM?
- cfq instead of deadline - workload specific
- tuned
- File system recommendations and benefits
- Caching and in-memory file systems?
- bcache, see notes above
- turn off block IO barriers and set the tuned profile to 'virtual-guest' (see the sketch after this list)
- potential data loss if power is lost while data is still in the guest cache and has not been written to disk
- guest root mount with barrier=0 + host cache=unsafe, if your workload can tolerate a very real potential for data loss
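A sketch of the tuned and barrier settings above; the mount point is illustrative, and the data-loss trade-off already mentioned applies:

```
# On the compute node and inside the guest respectively:
tuned-adm profile virtual-host    # hypervisor
tuned-adm profile virtual-guest   # instance

# Remount a filesystem without write barriers (faster, unsafe on power loss):
mount -o remount,nobarrier /var/lib/nova/instances
```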
Kernel Tunables
Windows
Hypervisor Configuration
KVM / libvirt
- ignore sync calls from guest - dangerous, but fast
- write-through, write-back
- defaults are usually safe
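Cache behaviour is set per-disk in the libvirt domain XML; a sketch:

```
# Per-disk cache mode lives in the domain XML (virsh edit <domain>):
#   <disk type='file' device='disk'>
#     <driver name='qemu' type='qcow2' cache='writeback'/>
#   </disk>
# cache='unsafe' additionally ignores sync calls from the guest:
# fast, but dangerous, as noted above.
```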
Xen
VMWare
Hyper-V
OpenStack Configuration
- Base images, copy on write
Image Formats
- qcow2: smaller images, copy-on-write
- qcow tunables
- preallocation: full fallocate, metadata, no
- format/version
- If you are using Red Hat, prelinking is turned on by default. After the VM boots, the prelinker rewrites all the libraries, and the qcow2 grows like crazy even with no disk activity.
- we disabled the prelinker to solve this (see the sketch below)
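A sketch of the preallocation options and the prelink fix mentioned above; the image name and size are illustrative:

```
# Create a qcow2 image with metadata preallocation (a middle ground
# between no preallocation and a full fallocate):
qemu-img create -f qcow2 -o preallocation=metadata disk.qcow2 20G

# Disable prelinking inside Red Hat guests so the image stops growing:
# set PRELINKING=no in /etc/sysconfig/prelink, then undo existing prelinking:
prelink -ua
```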
Overcommit
- for ephemeral
- for migration
Instance and Image Configuration
- tuned
- ide vs scsi: scsi didn't show a performance increase
- scsi for TRIM support, blk for IO performance
- io scheduler: noop (see the sketch after this list)
- Disk IO quotas and shares
- yes, on cinder
- open question: how to use them effectively
- turn off mlocate and prelinking
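A sketch of switching the guest's IO scheduler to noop (the device name is illustrative):

```
# Per-device, at runtime, inside the guest:
echo noop > /sys/block/vda/queue/scheduler
cat /sys/block/vda/queue/scheduler   # e.g. "[noop] deadline cfq"

# Or for all devices, via the kernel command line: elevator=noop
```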
Validation, Benchmarking, and Reporting
Benchmarking
- fio (extensive)
- bonnie++ (quick)
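A sketch of a quick fio run; the size, block size, and job count are illustrative:

```
# 4k random writes, direct IO, four jobs, one aggregated report:
fio --name=randwrite --rw=randwrite --bs=4k --size=1G \
    --direct=1 --numjobs=4 --group_reporting
```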
Metrics
System
- iowait
- iops
- iostats
- vmstat
- sysstat (sar metrics)
Instance
- nova diagnostics
- virsh
References
- Red Hat guides from Steve Gordon
- Docs from distributions
- CERN Tuning for high throughput computing
- http://openstack-in-production.blogspot.fr/2015/09/ept-huge-pages-and-benchmarking.html
- http://openstack-in-production.blogspot.fr/2015/08/numa-and-cpu-pinning-in-high-throughput.html
- http://openstack-in-production.blogspot.fr/2015/08/ept-and-ksm-for-high-throughput.html
- http://openstack-in-production.blogspot.fr/2015/08/cpu-model-selection-for-high-throughput.html
- http://openstack-in-production.blogspot.fr/2015/08/openstack-cpu-topology-for-high.html