Documentation/HypervisorTuningGuide

Revision as of 05:01, 16 November 2015 by Jtopjian (talk | contribs) (OpenStack Configuration)


About the Hypervisor Tuning Guide

The goal of the Hypervisor Tuning Guide (HTG) is to provide cloud operators with detailed instructions and settings to get the best performance out of their hypervisors.

This guide is broken into four major sections:

  • CPU
  • Memory
  • Network
  • Disk

Each section has tuning information for the following areas:

  • Symptoms of being (CPU, Memory, Network, Disk) bound
  • General hardware recommendations
  • Operating System configuration
  • Hypervisor configuration
  • OpenStack configuration
  • Instance and Image configuration
  • Validation, benchmarking, and reporting

How to Contribute

Simply add your knowledge to this wiki page! The HTG does not yet have a formal documentation repository. It's still very much in its initial stages.

Understanding Your Workload

This section is intended to be the most theoretical / high-level part of the entire guide.

References

CPU

Introduction about CPU.

Symptoms of Being CPU Bound

Compute Nodes that are CPU bound will generally show total CPU usage at 80% or higher, with idle time below 20%.

CPU Usage

On Linux-based systems, you can see CPU usage using the vmstat tool. As an example:

# vmstat
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0  35576 312096  95752 1104936    0    0     0     1    0    0  0  0 100  0  0

For CPU, there are five key areas to look at:

  1. r under procs: The number of runnable processes. If this number is consistently higher than the number of cores or CPUs on your compute node, then there are consistently more jobs to run than your node can handle.
  2. us under cpu: The percentage of CPU time spent running non-kernel (user) code.
  3. sy under cpu: The percentage of CPU time spent running kernel code.
  4. id under cpu: The percentage of CPU time spent idle.
  5. st under cpu: The percentage of CPU time stolen from the virtual machine. If an instance consistently sees a high st value, then the compute node hosting it might be under a lot of stress.
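
The same counters that vmstat reports can be read directly from the kernel. The following is a minimal sketch, assuming a Linux host with the standard /proc/stat format; note that these are cumulative percentages since boot, whereas vmstat samples deltas over an interval:

```shell
# Sketch: rough CPU-time percentages from the aggregate "cpu" line of
# /proc/stat. Field order: user nice system idle iowait irq softirq steal.
# These are cumulative since boot; for real-time numbers, sample twice
# and diff (which is what vmstat does).
read -r _ user nice system idle iowait irq softirq steal _ < /proc/stat
total=$((user + nice + system + idle + iowait + irq + softirq + steal))
echo "us=$((100 * (user + nice) / total))% sy=$((100 * system / total))% id=$((100 * idle / total))% st=$((100 * steal / total))%"
```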

Load

Linux-based systems have an abstract concept of "Load". A high load means the system is running "hot" while a low load means it's relatively idle. But what number constitutes hot and what constitutes cold? It varies from system to system. A general rule of thumb is that the system load will equal 1 when a core or CPU is consistently busy processing a job. Therefore, a normal load is less than or equal to the number of cores / CPUs.

However, exaggeratedly high loads (100+) are usually an indication of IO problems and not CPU problems.

Load should not be used as the sole metric when diagnosing potential Compute Node problems. It's best to treat a high Load as a prompt to investigate further.

General Hardware Recommendations

Simultaneous Multithreading

Simultaneous multithreading (SMT), commonly known as Hyper-threading in Intel CPUs, is a technology that enables the Operating System to see a single core / CPU as two cores / CPUs. This feature can usually be enabled or disabled in the Compute Node's BIOS.

It's important to understand that SMT will not make individual jobs run faster. Rather, it allows two jobs to run simultaneously where only one job would have run before. Thus, in some cases, SMT can increase the number of jobs completed within the same time span compared to having it turned off. CERN has seen a throughput increase of 20% with SMT enabled ([1]).

The following guidelines are known for specific use cases:

  • Enable it for general purpose workloads
  • Disable it for virtual router applications

Notable CPU Flags

The following CPU flags have special relevance to virtualization. On Linux-based systems, you can see what flags are enabled and functional by doing:

$ cat /proc/cpuinfo

TBD
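
As a starting point while this section is filled in, a sketch for checking the hardware-virtualization flags on an x86 host (vmx = Intel VT-x, svm = AMD-V):

```shell
# Sketch: look for the hardware-virtualization CPU flags in /proc/cpuinfo.
# If neither flag appears, virtualization extensions may be disabled in the
# BIOS, or the host may not be x86.
if grep -q -E '(^| )(vmx|svm)( |$)' /proc/cpuinfo; then
    msg="hardware virtualization flag present (vmx/svm)"
else
    msg="no vmx/svm flag found (check BIOS settings, or non-x86 host)"
fi
echo "$msg"
```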

Operating System Configuration

Linux

  • exclude cores, dedicate cores / cpus specifically for certain OS tasks
    • isolcpus kernel parameter
    • see rh blog post below
  • reasonable increase in performance by compiling own kernels
  • turn off cpu scaling - run at full frequency
  • TSC
    • x86 Specific - different architectures have different timekeeping mechanisms
    • can be virtualised or not
    • can run one core slower than another
    • clock source
    • avoiding jitter - eg asterisk, telephony
    • time stamp counter, is it a tsc vs hpet thing?

Windows

  • virtio drivers

Hypervisor Configuration

KVM / libvirt

Xen

VMWare

Hyper-V

  • NUMA spanning is enabled by default; it should be disabled for performance, with the caveat that instances may need to be restarted

OpenStack Configuration

CPU Mode and Model

The two most notable CPU-related configuration options in Nova are:

  1. cpu_mode
  2. cpu_model

Both of these items can be read about in detail in the config reference. Additionally, CERN's experience with benchmarking cpu_mode can be found here.

Overcommitting

You can configure Nova to report that the Compute Node has more CPUs than it really does by altering the cpu_allocation_ratio setting on each Compute Node. This setting takes a whole or fractional number. For example:

  • cpu_allocation_ratio=16.0: Configures Nova to report 16 times the number of CPUs that the Compute Node really has. This is the default.
  • cpu_allocation_ratio=1.5: Configures Nova to report it has 1.5 times the number of CPUs.
  • cpu_allocation_ratio=1.0: Effectively disables CPU overcommitting.

Generally, it's safe to overcommit CPUs. It has been reported that the main reason not to overcommit CPU is to avoid overcommitting memory (which is explained in the Memory section of this guide).

Note: You must also make sure scheduler_default_filters contains CoreFilter in order to use cpu_allocation_ratio.

  • RAM overcommit, particularly with KSM, has a CPU hit as well
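
Putting the above together, a hypothetical nova.conf fragment might look like the following. The 4.0 ratio is only an example value, and the filter list is illustrative; check the config reference for your release's actual defaults:

```ini
# /etc/nova/nova.conf (illustrative fragment)
[DEFAULT]
# Report 4x the physical CPU count to the scheduler (example value)
cpu_allocation_ratio = 4.0
# CoreFilter must appear in the filter list for the ratio to be enforced
scheduler_default_filters = RetryFilter,AvailabilityZoneFilter,RamFilter,ComputeFilter,CoreFilter
```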

Instance and Image Configuration

  • Describe scenarios where the instance sees a CPU flag but cannot use it.
  • CPU quotas and shares
    • Reported use-case: a default CPU quota of 80% on all flavors; if workloads are very CPU heavy, don't do this
  • guest I/O scheduler set to noop (elevator=noop on the kernel command line)
    • benefit: the host and guest I/O schedulers don't fight each other
  • Hyper-v enlightenment features
  • Hyper-v gen 2 vms are seen to be faster than gen 1, reason?


Validation, Benchmarking, and Reporting

General Tools

  • top
  • vmstat
  • htop

Benchmarking Tools

Metrics

System
  • CPU: user, system, iowait, irq, soft irq
Instance
  • nova diagnostics
  • Do not record per-process stats - explain why
  • overlaying cputime vs allocated cpu

Memory

Symptoms of Being Memory Bound

  • OOM Killer
  • Out of swap
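
A quick way to spot both symptoms is to watch available memory and free swap together; OOM-killer activity itself shows up in the kernel log (dmesg, which may need privileges). A minimal sketch, assuming a Linux host with /proc/meminfo:

```shell
# Sketch: memory-pressure snapshot. Low MemAvailable combined with
# shrinking SwapFree suggests the node is memory bound and the OOM
# killer may be next.
mem_summary=$(awk '/^(MemTotal|MemAvailable|SwapTotal|SwapFree):/ { printf "%s %d MB\n", $1, $2 / 1024 }' /proc/meminfo)
echo "$mem_summary"
```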

General Hardware Recommendations

  • ensure numa distribution is balanced
  • memory speeds, vary by chip

Operating System Configuration

Linux

Kernel Tunables
  • Transparent Hugepages can go either way depending on workload
KSM
  • Might often cause performance (CPU) problems, better to turn it off
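
The KSM state can be inspected and changed through sysfs. A minimal sketch, assuming a Linux host where KSM is compiled into the kernel (the knob is absent otherwise):

```shell
# Sketch: inspect KSM on a KVM host. 0 = off, 1 = running, 2 = unmerge.
ksm_state=$(cat /sys/kernel/mm/ksm/run 2>/dev/null || echo "unavailable")
echo "KSM run state: $ksm_state"
# To turn it off (requires root):
#   echo 0 > /sys/kernel/mm/ksm/run
```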

Windows

Hypervisor Configuration

KVM / libvirt

  • nova enables ballooning but doesn't actually use it
  • reserved_host_memory_mb (default: 512 MB, which is too low for real-world use)
  • Turn on/off EPT (see blog post)

Xen

VMWare

Hyper-V

OpenStack Configuration

Overcommitting

  • Memory Overcommit & the cost of swapping

Instance and Image Configuration

  • ensure ballooning is enabled / available
  • guests cannot see memory speed - not exposed like cpu flags are

Validation, Benchmarking, and Reporting

General Tools

  • free

Benchmarking

  • stream

Metrics

System
  • page in, page out, page scans per second, `free`
Instance
  • nova diagnostics
  • virsh

Network

Symptoms of Being Network Bound

  • from guest: soft irq will be high
  • high io wait for network-based instance disk
  • discards on switch
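
Host-side counterparts of the switch discards can be read from sysfs. A minimal sketch, assuming a Linux host; rising drop counters alongside saturated byte counters point at a network bottleneck:

```shell
# Sketch: per-interface byte and drop counters from sysfs.
for dev in /sys/class/net/*; do
    [ -d "$dev/statistics" ] || continue   # skip non-interface entries (e.g. bonding_masters)
    name=$(basename "$dev")
    rx=$(cat "$dev/statistics/rx_bytes")
    tx=$(cat "$dev/statistics/tx_bytes")
    drop=$(cat "$dev/statistics/rx_dropped")
    echo "$name rx_bytes=$rx tx_bytes=$tx rx_dropped=$drop"
done
```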

General Hardware Recommendations

  • Bonding
    • LACP vs balance-tlb vs balance-alb
  • VXLAN offload

Operating System Configuration

Linux

Kernel Tunables

Windows

Hypervisor Configuration

KVM / libvirt

  • vhost-net (on by default on most modern distros?)
  • virtio
    • virtio multiqueue
  • ovs acceleration (dpdk)
  • Out of order frames?

Xen

VMWare

Hyper-V

OpenStack Configuration

Instance and Image Configuration

  • PCI pass-through
  • Network IO quotas and shares
    • not advanced enough
    • instead, using libvirt hooks
  • 1500 MTU
  • Make sure the instance is actually using vhost-net (load the kernel module)
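
The last point can be verified on the compute node. A minimal sketch, assuming a Linux host with /proc/modules:

```shell
# Sketch: confirm the vhost_net kernel module is loaded; without it,
# qemu falls back to the slower userspace virtio network backend.
if grep -q '^vhost_net' /proc/modules 2>/dev/null; then
    vhost_status="vhost_net loaded"
else
    vhost_status="vhost_net not loaded (try: modprobe vhost_net)"
fi
echo "$vhost_status"
```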

Validation, Benchmarking, and Reporting

General Tools

  • iftop

Benchmarking

  • iperf

Metrics

System
  • bytes in/out, packets in/out, irqs, pps
  • /proc/net/protocols
Instance
  • nova diagnostics
  • virsh
  • virtual nic stats

Disk

Symptoms of Being Disk Bound

  • Artificially high load with high CPU idle
  • iowait
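
Beyond the iowait column in vmstat, per-device counters give a first hint of which disk is busy. A minimal sketch, assuming a Linux host with the standard /proc/diskstats layout:

```shell
# Sketch: per-device I/O counters from /proc/diskstats.
# Field 4 = reads completed, field 8 = writes completed,
# field 13 = milliseconds spent doing I/O (cumulative since boot).
disk_summary=$(awk '{ printf "%s reads=%s writes=%s ms_io=%s\n", $3, $4, $8, $13 }' /proc/diskstats)
echo "$disk_summary"
```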

General Hardware Recommendations

Spindle vs SSD

  • separate ssd for logs
  • too many faulty ssd disks
  • SSD: TRIM, trim requests from guest aren't passed to hypervisor
  • bcache with ssd
    • dmcache was less good

Hardware RAID, Software RAID, no RAID?

  • raid0 individual disks, pass through
  • durability: hardware raid5
  • ensure writes match stripe size of raid, on the filesystem level
  • raid1 for OS, JBOD for ephemeral
  • battery-backed cache: when the battery fails, the controller switches from write-back to write-through, causing a performance hit at that time

Operating System Configuration

Linux

  • xfs barriers, turn off for performance, not for database
  • xfs on raid, tunables
  • xfs or ext4
  • LVM?
  • cfq instead of deadline - workload specific
  • tuned
  • File system recommendations and benefits
  • Caching and in-memory file systems?
  • bcache, see notes above
  • Turn off block I/O barrier, set tuned profile to 'virtual-guest'
Kernel Tunables

Windows

Hypervisor Configuration

KVM / libvirt

  • ignore sync calls from guest - dangerous, but fast
  • write-through, write-back
  • defaults are usually safe

Xen

VMWare

Hyper-V

OpenStack Configuration

  • Base images, copy on write

Image Formats

  • qcow2: smaller, copy-on-write
  • qcow2 tunables
    • preallocation: off, metadata, falloc, or full
    • format/version
  • If you are using Red Hat, prelinking is turned on by default. After the VM boots, the prelinker rewrites all the libraries and the qcow2 image grows rapidly even with no apparent disk activity.
    • we disabled prelinking to solve this


Overcommit

  • for ephemeral
  • for migration

Instance and Image Configuration

  • tuned
  • ide vs scsi: scsi didn't show a performance increase
    • virtio-scsi for TRIM support, virtio-blk for IO performance
  • io scheduler: noop
  • Disk IO quotas and shares
    • yes on cinder
    • question on how to effectively use
  • turn off mlocate, prelinking

Validation, Benchmarking, and Reporting

Benchmarking

  • fio (extensive)
  • bonnie++ (quick)

Metrics

System
  • iowait
  • iops
  • iostats
  • vmstat
  • sysstat (sar metrics)
Instance
  • nova diagnostics
  • virsh

References