Documentation/HypervisorTuningGuide

About the Hypervisor Tuning Guide

The goal of the Hypervisor Tuning Guide (HTG) is to provide cloud operators with detailed instructions and settings to get the best performance out of their hypervisors.

This guide is broken into four major sections:

  • CPU
  • Memory
  • Network
  • Disk

Each section has tuning information for the following areas:

  • Symptoms of being (CPU, Memory, Network, Disk) bound
  • General hardware recommendations
  • Operating System configuration
  • Hypervisor configuration
  • OpenStack configuration
  • Instance and Image configuration
  • Validation, benchmarking, and reporting

How to Contribute

The HTG does not yet have a formal documentation repository since it's still very much in initial stages.

If you'd like to contribute, simply edit this wiki page! If you're not a fan of wikis, you can email Joe Topjian (joe@topjian.net) with any information that you feel is relevant. Maybe that's still too much typing, though, so if you're subscribed to the openstack-operators mailing list and your email client will auto-complete the address after a few characters, you can always send the information there.

If you'd much prefer sharing any information you have in person, please do so at the Ops mid-cycles and Summit events.

Please do not worry about grammar, spelling, or formatting. However, please try to add more than just short-hand notes.

What's Needed?

Right now, here are the most wanted areas:

  • Information about Hypervisors other than libvirt/KVM.
  • Information about operating systems other than Linux.
  • Real options, settings, and values that you have found to be successful in production.
  • Continue to expand and elaborate on existing areas.

Understanding Your Workload

I imagine this section to be the most theoretical / high level out of the entire guide.

References

CPU

Introduction about CPU.

Symptoms of Being CPU Bound

Compute Nodes that are CPU bound will generally see CPU usage at 80% or higher and idle usage less than 20%.

CPU Usage

On Linux-based systems, you can see CPU usage using the vmstat tool. As an example:

# vmstat
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0  35576 312096  95752 1104936    0    0     0     1    0    0  0  0 100  0  0

For CPU, there are five key areas to look at:

  1. r under procs: The number of running processes. If this number is consistently higher than the number of cores or CPUs on your compute node, then there are consistently more jobs being run than your node can handle.
  2. us under cpu: This is the amount of CPU spent running non-kernel code.
  3. sy under cpu: This is the amount of CPU spent running kernel code.
  4. id under cpu: This is the amount of idle CPU.
  5. st under cpu: This is the amount of stolen CPU. If an instance consistently sees a high st value, then the compute node hosting it might be under a lot of stress.

Load

Linux-based systems have an abstract concept of "Load". A high load means the system is running "hot" while a low load means it's relatively idle. But what number constitutes hot and cold? It varies from system to system. A general rule of thumb is that the system load will equal 1 when a core or CPU is consistently processing a job. Therefore, a normal load is equal to the number of cores / CPUs or less.

However, exaggeratedly high loads (100+) are usually an indication of IO problems and not CPU problems.

Load should not be used as the sole metric when diagnosing potential Compute Node problems. It's best to use Load as an indication to check further areas.
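
As a quick check, uptime prints the 1, 5, and 15 minute load averages, which can be compared against the core count reported by nproc (the output below is illustrative):

$ uptime
 16:32:01 up 41 days,  3:12,  1 user,  load average: 3.15, 2.80, 2.43
$ nproc
8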

General Hardware Recommendations

Simultaneous Multithreading

Simultaneous multithreading (SMT), commonly known as Hyper-threading in Intel CPUs, is a technology that enables the Operating System to see a single core / CPU as two cores / CPUs. This feature can usually be enabled or disabled in the Compute Node's BIOS.

It's important to understand that SMT will not make jobs run faster. Rather, it will allow two jobs to run simultaneously where only one job would have run before. Thus, in some cases, SMT can increase the number of jobs completed within the same time span compared to having it turned off. CERN has seen a throughput increase of 20% with SMT enabled ([1]).
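
To check whether SMT is currently active on a Linux compute node, lscpu reports the number of threads per core (a value of 2 usually means SMT / Hyper-threading is enabled):

$ lscpu | grep -i 'thread(s) per core'
Thread(s) per core:    2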

The following guidelines are known for specific use cases:

  • Enable it for general purpose workloads
  • Disable it for virtual router applications

Notable CPU Flags

The following CPU flags have special relevance to virtualization. On Linux-based systems, you can see what flags are enabled and functional by doing:

$ cat /proc/cpuinfo

  • sse4: Streaming SIMD Extensions 4
  • aes-ni: Advanced Encryption Standard New Instructions
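
For example, to check that a specific flag (such as aes) is present on every core, count the matching lines; the count should equal the number of cores:

$ grep -cw aes /proc/cpuinfo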

TBD

  • thread policies can also be important (prefer/avoid) - hopefully a Mitaka enhancement
  • NUMA?
    • http://docs.openstack.org/developer/nova/testing/libvirt-numa.html
  • CPU pinning

Operating System Configuration

Linux

  • Exclude cores: dedicate cores / CPUs specifically for certain OS tasks
    • isolcpus
    • see the Red Hat blog post below
  • A reasonable increase in performance can be gained by compiling your own kernels
  • Turn off CPU frequency scaling so cores run at full frequency (see the example after this list)
  • TSC
    • x86 specific - different architectures have different timekeeping mechanisms
    • can be virtualised or not
    • can run one core slower than another
    • clock source
    • avoiding jitter - e.g. Asterisk, telephony
    • time stamp counter - is it a TSC vs HPET thing?
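
As an example of the "run at full frequency" note above, the cpupower utility can select the performance governor (package names and available governors vary by distribution; treat this as a sketch):

$ sudo cpupower frequency-set -g performance
$ cpupower frequency-info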

Windows

  • virtio drivers

Hypervisor Configuration

KVM / libvirt

Xen

VMWare

Hyper-V

  • Hyper-V has NUMA spanning enabled by default; it should be disabled for better performance, with the caveat that instances may need to be restarted

OpenStack Configuration

CPU Mode and Model

The two most notable CPU-related configuration options in Nova are:

  1. cpu_mode
  2. cpu_model

Both of these items can be read about in detail in the config reference. Additionally, CERN's experience with benchmarking cpu_mode can be found here.

Overcommitting

You can configure Nova to report that the Compute Node has more CPUs than it really does by altering the cpu_allocation_ratio setting on each Compute Node. This setting accepts either a whole number or a fraction. For example:

  • cpu_allocation_ratio=16.0: Configures Nova to report 16 times the number of CPUs that the Compute Node really has. This is the default.
  • cpu_allocation_ratio=1.5: Configures Nova to report it has 1.5 times the number of CPUs.
  • cpu_allocation_ratio=1.0: Effectively disables CPU overcommitting.

Generally, it's safe to overcommit CPUs. It has been reported that the main reason not to overcommit CPU is to avoid overcommitting memory (which will be explained in the Memory section of this guide).

Note: You must also make sure scheduler_default_filters contains CoreFilter in order to use cpu_allocation_ratio.
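
A minimal nova.conf sketch (the ratio value and the filter list are illustrative and abbreviated, not recommendations):

cpu_allocation_ratio = 4.0
scheduler_default_filters = RetryFilter,AvailabilityZoneFilter,RamFilter,CoreFilter,ComputeFilter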

  • RAM overcommit, particularly with KSM, has a CPU hit as well

Instance and Image Configuration

  • Describe scenarios where the instance sees a CPU flag but cannot use it.
  • CPU quotas and shares
    • Reported use-case: a default of 80% on all flavors; don't do this if workloads are very CPU heavy.
  • Guest kernel scheduler set to "none" (elevator=noop on the kernel command line)
    • What are the benefits of this? The host and guest schedulers don't fight each other.
  • Hyper-V enlightenment features
  • Hyper-V gen 2 VMs are seen to be faster than gen 1 - reason?


Validation, Benchmarking, and Reporting

General Tools

  • top
  • vmstat
  • htop

Benchmarking Tools

Metrics

System
  • CPU: user, system, iowait, irq, soft irq
Instance
  • nova diagnostics
  • Do not record per-process stats - explain why
  • overlaying cputime vs allocated cpu

Memory

Symptoms of Being Memory Bound

In general, the free command can be used to determine the amount of memory used and available. Linux usually reports much more memory in use than is actually the case, because buffers and caches are counted as "used". This site (linuxatemyram.com) offers good information about reading how much memory is really available.

Another symptom of being memory bound is running out of swap space. The free command also reports swap usage.

OOM Killer

The Out of Memory Killer is a kernel feature that will reap processes when the system is truly out of memory. You can determine if processes are being reaped by looking for the following in your logs:

Out of memory: Kill process

More information about the OOM Killer can be found here.

General Hardware Recommendations

NUMA Balancing

It's recommended to ensure that each NUMA node has the same amount of memory. If you plan to upgrade the amount of memory in a compute node, ensure the amount is balanced on each node. For example, do not upgrade one node by 16GB and another by 8GB.
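
To verify the per-node distribution, numactl (if installed) reports the memory size of each NUMA node; the "node X size" values should be similar across nodes:

$ numactl --hardware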

Memory Speeds

Memory speeds are known to vary by chip. If possible, ensure all memory in a system is the same brand and type.

More on this?

Operating System Configuration

Linux

Go into depth about NUMA, huge pages, and other Linux/memory areas. Pull from the following articles:

  • NFVI Deployment Guide - Huge Pages (Mirantis)
  • Examining Huge Pages or Transparent Huge Pages performance (Red Hat Developer Blog)
  • RHEL Performance Guide - CPU
  • RHEL Performance Guide - Memory
  • Mysteries of NUMA Memory Management Revealed (Red Hat Enterprise Linux Blog)
  • Optimizing Linux Memory Management for Low-latency / High-throughput Databases (LinkedIn Engineering)

Kernel Tunables

  • Transparent Hugepage Support: can go either way depending on workload
  • Memory overcommit (see the kernel's overcommit-accounting documentation)
  • KSM enables identical memory pages to be combined. This is a form of memory deduplication.

Windows

Hypervisor Configuration

KVM / libvirt

libvirt/KVM has memory ballooning support, though Nova does not take advantage of it.

libvirt/KVM also has support for Extended Page Table. Consider enabling or disabling it depending on your workload. For example, having EPT enabled has been seen to impact performance on High Energy Physics applications.

Xen

VMWare

Hyper-V

OpenStack Configuration

You can configure the amount of memory reserved for the compute node itself (meaning, instances will not have access to it) by setting the reserved_host_memory_mb option in nova.conf. The default is 512MB, which has been reported to be too low for real-world use.
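
For example, to reserve 4GB for the host (the value is illustrative; size it to the host OS and services actually running on the node):

reserved_host_memory_mb = 4096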

Overcommitting

You can configure Nova to overcommit the available amount of memory with the ram_allocation_ratio setting in nova.conf. By default, this is set to 1.5:1, meaning Nova will report 1.5 times the amount of memory the Compute Node really has.

Instance and Image Configuration

Flavor Extra Specs

  • hw:mem_page_size: Specify the page size to the guest.
nova flavor-key m1.small set hw:mem_page_size=2048

Guest Notes

At this time, guests cannot see the speed of the memory.

Validation, Benchmarking, and Reporting

General Tools

  • free
  • sar

Benchmarking

  • stream

Metrics

System

sar can provide the following metrics:

  • page in
  • page out
  • page scans
  • page faults
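
For example, sar -B reports paging activity (pages paged in/out, faults, and scans per second) at a given interval and count:

$ sar -B 1 5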

free can provide the amount of available memory over time.

vmstat can also provide general memory information.

Instance

The nova diagnostics command can be used to display memory usage of individual instances. Keep in mind, though, that since OpenStack cannot "deflate" the virtio memory balloon in libvirt/KVM environments, memory will always be seen to increase until max capacity is reached.

virsh dominfo can also be used to view memory usage in libvirt/KVM environments.

Network

Symptoms of Being Network Bound

Network-bound compute nodes will see symptoms like the following:

  • On the guest, the softirq metric will be high. softirq can be seen in the 7th column of the cat /proc/stat output.
  • If your instances' ephemeral disks are stored on a network storage device, you will see a high amount of "IO Wait" time.
  • You might see discards on your network switches
  • You might see many dropped packets on the hypervisor

General Hardware Recommendations

10Gb NICs are recommended over 1Gb NICs.

It's generally recommended to use some type of NIC bonding on your compute nodes. LACP is the most common form of bonding, though be aware that it requires configuration on both the Linux side and the upstream network side.

(todo: balance-tlb and balance-alb?)

Modern NICs have features such as VXLAN offloading which should decrease the amount of work required on the compute node itself.

Operating System Configuration

Linux

CloudFlare has an article on network tuning within Linux. (todo: vet the article, add more references).

Disabling GRO might help increase performance. See the following articles for reference:

  • https://lwn.net/Articles/358910/
  • https://access.redhat.com/solutions/20278
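
To check and disable GRO on an interface (the interface name is illustrative):

$ ethtool -k eth0 | grep generic-receive-offload
$ sudo ethtool -K eth0 gro off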

Check the NUMA locality of SR-IOV (and passthrough) devices. You get this pretty much for free if you are using the NUMATopologyFilter and have a chipset that reports locality.

Jumbo frames (9000 MTU) might also provide a performance benefit. It might also be required depending on your network topology and configuration.
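
Jumbo frames can be tested on an interface as follows (interface and host names are illustrative; every switch port and hop in the path must also allow the larger MTU):

$ sudo ip link set dev eth0 mtu 9000
$ ping -M do -s 8972 <remote-host>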

Kernel Tunables
  • net.ipv4.tcp_keepalive_time: The time of connection inactivity after which the first keep-alive request is sent.
  • net.core.somaxconn: Limits the socket() listen backlog. A higher value can support a higher number of simultaneous requests.
  • net.nf_conntrack_max: Increase the connection tracking limit. Hitting this limit will cause packet loss and other odd behavior (such as random ping loss). Common values are anywhere between 64k and 512k.
    • You should definitely increase this value if you use nova-network. See here.
  • /sys/module/nf_conntrack/parameters/hashsize: In addition to net.nf_conntrack_max, also increase the size of the hash table where the connection tracking is stored. Common values are anywhere between 16k and 128k.
  • net.netfilter.nf_conntrack_udp_timeout: For UDP request/response type traffic which doesn't reuse the UDP port (DNS traffic, for example), lower this value to something like "5".
  • (todo) Different queueing algorithms: FQ_CODEL, etc.
  • txqueuelen should be increased on the interface if you are seeing dropped packets.
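
A minimal sketch of applying these with sysctl (the values are illustrative starting points, not recommendations):

net.ipv4.tcp_keepalive_time = 600
net.core.somaxconn = 4096
net.nf_conntrack_max = 262144
net.netfilter.nf_conntrack_udp_timeout = 5

These lines can be placed in /etc/sysctl.conf (or a file under /etc/sysctl.d/) and loaded with sysctl -p. The conntrack hash size is a module parameter rather than a sysctl, e.g. echo 65536 > /sys/module/nf_conntrack/parameters/hashsize.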

Windows

Hypervisor Configuration

KVM / libvirt

vhost-net usually provides better performance than just the virtio driver (vhost-net can be thought of as a complementary enhancement to virtio). To enable vhost-net, do:

$ sudo modprobe vhost_net

If you aren't able to use vhost-net, make sure to at least use the virtio driver regardless.

virtio-multiqueue can also increase performance (todo: elaborate).

If you're using an Open vSwitch-based environment, look into OVS acceleration such as dpdk (todo: elaborate. relevant? more info?)

Xen

VMWare

Hyper-V

OpenStack Configuration

Instance and Image Configuration

  • PCI passthrough can be used to give an instance direct access to a NIC.
  • SR-IOV might also provide benefits.
  • Network IO quotas and shares
    • not advanced enough
    • instead, using libvirt hooks
    • todo: elaborate on this

Validation, Benchmarking, and Reporting

General Tools

iftop is a top-like tool for network traffic.

Benchmarking

iperf can be considered the standard tool for network benchmarking.
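
A basic test runs iperf in server mode on one host and connects to it from another (the server hostname is a placeholder):

On the server:
$ iperf -s
On the client:
$ iperf -c <server>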

Metrics

System

The collection of /sys/class/net/*/statistics/* files can provide a wealth of network-based metrics. Additionally, /proc/net/protocols can provide further information and metrics.
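
For example, to check the drop counters on an interface (the interface name is illustrative):

$ cat /sys/class/net/eth0/statistics/rx_dropped
$ cat /sys/class/net/eth0/statistics/tx_dropped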

Instance

nova diagnostics can provide network statistics of the instance.

virsh domiflist and virsh domifstat can also be used to obtain network statistics on KVM/libvirt-based hypervisors.

Disk

Symptoms of Being Disk Bound

Compute nodes that are disk bound might see extremely high load values in the range of 50+. They will also see a large iowait value (which can be seen using the iostat utility).
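
iostat (from the sysstat package) shows both iowait and per-device utilization; a common invocation is extended statistics at a one-second interval:

$ iostat -x 1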

General Hardware Recommendations

Spindle vs SSD

Some find using SSD-based disks for logging useful. CERN has tested SSDs with bcache and has found the combination successful.

Be aware that some operators have run into enough faulty SSDs that they no longer consider them worthwhile. This should not scare you away from using SSDs; it's just something to keep in mind.

  • SSD and TRIM: TRIM requests from the guest aren't passed to the hypervisor

Hardware RAID, Software RAID, no RAID?

Some people either don't use a hardware RAID card or create individual RAID0 drives and pass them through to the compute node. They then use mdadm to provide drive resiliency.

Hardware RAID5 has been mentioned to provide the best durability. Others use RAID1 for the operating system and JBOD for ephemeral. This configuration does not provide resiliency for ephemeral disks, though.

For best performance, ensure the filesystem write size matches the RAID stripe size.
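
As a hedged example, XFS accepts the RAID geometry at creation time through the su (stripe unit) and sw (stripe width) options; the values and device below are purely illustrative and must match your actual RAID layout:

$ sudo mkfs.xfs -d su=256k,sw=4 /dev/sdb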

If you use a hardware RAID card with a battery backup, be aware that if the battery dies and writes switch from "write back" to "write through", you will incur a performance hit.

Operating System Configuration

Linux

XFS and EXT4 are the most common filesystems to use.

For XFS, write barriers can be turned off for better performance, but do not do this for database-related activity.

The decision to use either the CFQ or deadline kernel I/O scheduler is highly workload specific. (todo: elaborate)

Caching, of course, can offer great performance benefits, but be aware of the data loss that can occur if the cache is ever lost during an event such as a power failure.

If your workload can tolerate data loss, setting cache=unsafe on the host and mounting the guest root with barrier=0 can increase performance. See here for more information.

Kernel Tunables

Windows

Hypervisor Configuration

KVM / libvirt

This article contains useful information about KVM/libvirt and caching.

VirtIO SCSI (virtio-scsi) is a para-virtualized SCSI controller and is a successor to virtio-blk. It enables SCSI passthrough and, in certain cases, enables the guest to better detect volume disconnects. As well, it sets the instance's device names to the more standard /dev/sdX. To enable virtio-scsi in the guest, see Instance and Image Configuration.

Xen

VMWare

Hyper-V

OpenStack Configuration

This article is a great reference for the many ways that backing disks can be configured in OpenStack. The many configuration combinations all have advantages and disadvantages depending on your overall storage environment (todo: elaborate).

Image Formats

qcow files are smaller than raw files due to thin-provisioning. qcow also has the advantage of being able to do "copy-on-write" with a backing file.

See the following blueprint for ways in which qcow can be configured for performance.

If you are using Red Hat, prelinking is turned on by default. After the VM boots, the prelinker rewrites all the libraries and the qcow file grows like crazy even with no disk activity. It's recommended to disable the prelinker to avoid this.

Images Type

Positive I/O performance has been reported when using images_type=lvm. This involves setting up a volume group on the hypervisor and then allocating dedicated LVM block devices from it to each VM.
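
A minimal nova.conf sketch (the volume group name is illustrative and must already exist on the hypervisor):

[libvirt]
images_type = lvm
images_volume_group = nova-vg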

Overcommit

Disk overcommitting is generally a safe thing to do. Thin-provisioning can increase the amount of available storage by not allocating empty storage.

Instance and Image Configuration

Image Metadata

To take advantage of virtio-scsi, add the following key/value pairs to an image:

hw_disk_bus_model=virtio-scsi
hw_scsi_model=virtio-scsi
hw_disk_bus=scsi
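
As an example, these properties could be applied to an existing image with the glance client (the image ID is a placeholder):

$ glance image-update --property hw_scsi_model=virtio-scsi --property hw_disk_bus=scsi <image-id>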

Guest Notes

tuned is a utility that can adaptively configure a system. The "virtual-guest" profile has been known to work well for guests.
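
For example, on a guest with tuned installed:

$ sudo tuned-adm profile virtual-guest
$ tuned-adm active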

A performance increase was not seen when switching between IDE and SCSI block drivers. SCSI has support for TRIM, but guest-originated TRIM requests are currently ignored (verify?).

Disk IO limits can be enforced on both ephemeral disks and volumes. We need to determine how to effectively apply these limits, though (help!)

On guest images, it's recommended to use the noop scheduler as well as to turn off mlocate and prelinking.
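
A hedged example of switching a guest disk to the noop scheduler at runtime (the device name is illustrative; add elevator=noop to the kernel command line to make the change persistent):

$ cat /sys/block/vda/queue/scheduler
noop deadline [cfq]
$ echo noop | sudo tee /sys/block/vda/queue/scheduler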

Validation, Benchmarking, and Reporting

Benchmarking

  • fio is a great tool for extensive benchmarks.
  • bonnie++ is great for quick benchmarks.

Metrics

System
  • iowait
  • iops
  • iostats
  • vmstat
  • sysstat (sar metrics)
Instance

nova diagnostics can pull the IO activity from an instance's disks.

virsh domblklist and virsh domblkstat can also be used on KVM/libvirt-based hypervisors to pull disk statistics.

References