Virtualization Driver Guest CPU / Memory Placement

Scope

This page considers all the issues relating to CPU / memory resources when doing guest placement between and within virtualization hosts.

Background

vCPU topology

Each virtualization driver in OpenStack has its own approach to defining the CPU topology seen by guest virtual machines. The libvirt driver will expose all vCPUs as individual sockets, with 1 core and no hyper-threads. While operating systems are generally capable of using any vCPU count / topology combination, there are a few important caveats:

  • Licensing - OS vendors' licensing rules may restrict the number of sockets an OS will use. This can force a preference for cores instead of sockets.
  • Performance - not all topologies are equal in their performance characteristics. For example, 2 host threads on the same core will offer less performance than 2 host threads on different cores. Similarly, some sockets have an internal NUMA topology where half of their cores are in one NUMA node and the other half in another node.

The performance implications mean that the cloud administrator may wish to retain some level of control over how guests use the host topology, or over what topology is exposed to guests.

The licensing implications mean that a user uploading images to glance may need to indicate some topology constraints for execution of their image. The cloud administrator may also wish to change the hypervisor defaults so that users don't hit common license restrictions, e.g. limit guests to a maximum of 2 sockets by default so that Windows images are handled correctly "out of the box".
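
As an illustration of the licensing constraint, the following hedged sketch (not part of any existing OpenStack API; the function and constraint names are hypothetical) enumerates which socket/core splits remain viable for a given vCPU count once a maximum socket count is imposed:

    # Hypothetical helper: enumerate socket/core splits for a vCPU count,
    # honouring a "max sockets" constraint such as the 2-socket limit
    # mentioned above. Threads per core are left at 1 for simplicity.
    def viable_topologies(vcpus, max_sockets):
        topologies = []
        for sockets in range(1, min(vcpus, max_sockets) + 1):
            if vcpus % sockets == 0:
                topologies.append((sockets, vcpus // sockets, 1))
        return topologies  # list of (sockets, cores, threads)

    # A 4 vCPU guest limited to 2 sockets:
    print(viable_topologies(4, 2))   # [(1, 4, 1), (2, 2, 1)]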

NUMA topology

Host NUMA

Any virtualization host in the modern era will have a NUMA topology, with RAM and sockets spread across 2 or more NUMA nodes. Some virtualization hosts have multiple levels of NUMA, for example sockets forming the first NUMA level and cores within a socket forming the second. The relationship between RAM and nodes may vary between these levels. For example, at the first level each node may have an associated RAM block, while at the second level a RAM block may be shared between both nodes, with only the cache being separate. There are even machines where RAM and sockets are in completely separate nodes.

The key factor driving usage of NUMA nodes is memory bandwidth. For example, each NUMA node may have a dedicated bus to its local RAM, while access to remote RAM goes across a bus shared with all nodes. Consider a 4 NUMA node system with 1 GB/s of memory bandwidth per node and 1 GB/s for the shared bus. If all processes always use local RAM, there is 4 GB/s of potential memory bandwidth; if all processes always use remote RAM, there is only 1 GB/s of potential memory bandwidth.
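
Spelled out as a small illustrative snippet (using the assumed figures from the example above, not measurements):

    # Illustrative arithmetic for the 4-node example: each node has a
    # dedicated 1 GB/s bus to its local RAM, while remote access shares
    # a single 1 GB/s bus between all nodes.
    nodes = 4
    local_bw_per_node_gbs = 1.0    # dedicated, per node
    shared_remote_bw_gbs = 1.0     # shared by every node

    all_local_gbs = nodes * local_bw_per_node_gbs   # 4.0 GB/s potential
    all_remote_gbs = shared_remote_bw_gbs           # 1.0 GB/s potential
    print(all_local_gbs, all_remote_gbs)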

Thus, incorrect placement of guests on NUMA nodes leads to a serious waste of virtualization host resources. The impact of this dwarfs any benefit from other memory/CPU setup decisions, such as vCPU topology, pCPU/vCPU pinning or use of large pages. The standard policy will therefore be to place guests such that they are entirely confined within a single NUMA node.
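
As a hedged sketch of that policy (not the actual Nova scheduler code; the per-node host data structure is assumed for illustration), the check amounts to asking whether any single node can hold the whole guest:

    # Illustrative only: does a guest fit entirely inside one NUMA node
    # of a host? host_nodes is an assumed list of per-node free capacity,
    # e.g. [{"cpus": 8, "mem_mb": 16384}, ...].
    def fits_in_single_node(guest_vcpus, guest_mem_mb, host_nodes):
        return any(node["cpus"] >= guest_vcpus and node["mem_mb"] >= guest_mem_mb
                   for node in host_nodes)

    host_a = [{"cpus": 4, "mem_mb": 8192}, {"cpus": 4, "mem_mb": 8192}]
    print(fits_in_single_node(2, 4096, host_a))   # True
    print(fits_in_single_node(6, 4096, host_a))   # False - would have to span nodes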

Guest NUMA

If the guest vCPU/RAM allocation is too large to fit inside a single NUMA node, a policy decision must be made: either reject execution of the guest on the host in question in favour of a host with larger NUMA nodes, or allow it to span multiple NUMA nodes. This decision may be revisited if the guest is later relocated; for example, when evacuating a host for maintenance it may be desirable to push a guest to a host with sub-optimal NUMA placement and accept the temporary performance impact.

If it is decided to let a guest span multiple NUMA nodes, then NUMA topology must be exposed to the guest so that the guest OS can maximise utilization of the resources it has been allocated. The guest NUMA nodes should then be directly mapped to host NUMA nodes. This entails mapping guest RAM chunks to host RAM nodes and setting vCPU affinity to pCPUs. For example, if a guest has 4 vCPUs and will be placed across 2 host NUMA nodes, then vCPUs 0+1 will be tied to the first host NUMA node and vCPUs 2+3 to the second. vCPUs 0+1 do not have to be tied to specific pCPUs within the host NUMA node - they can float freely within the node at the will of the host OS scheduler.
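
A minimal sketch of that mapping (an assumed data layout, not libvirt's API) which splits a guest's vCPUs and RAM evenly across the chosen host nodes:

    # Illustrative only: split a guest's vCPUs and RAM across N host NUMA
    # nodes, as in the 4 vCPU / 2 node example above. Each entry maps a
    # guest NUMA node to a host node, a group of vCPUs and a RAM chunk;
    # the vCPUs stay free-floating among the pCPUs of their host node.
    def span_guest(vcpus, mem_mb, host_node_ids):
        n = len(host_node_ids)
        cpus_per_node = vcpus // n
        mem_per_node = mem_mb // n
        mapping = []
        for i, host_node in enumerate(host_node_ids):
            mapping.append({
                "guest_node": i,
                "host_node": host_node,
                "vcpus": list(range(i * cpus_per_node, (i + 1) * cpus_per_node)),
                "mem_mb": mem_per_node,
            })
        return mapping

    # 4 vCPUs and 4096 MB spanning host NUMA nodes 0 and 1:
    for entry in span_guest(4, 4096, [0, 1]):
        print(entry)
    # guest node 0 -> host node 0 with vCPUs [0, 1]; guest node 1 -> host node 1 with vCPUs [2, 3]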

Guest NUMA topology can be configured with little regard to the guest OS being run. If a guest OS does not support NUMA, it will simply ignore the data tables exposed in the virtual BIOS. That said, the guest vCPU topology constraints will influence which NUMA topologies are viable to expose, e.g. if the guest supports at most 2 sockets, there is little point in setting up 4 NUMA nodes with 1 socket and 1 core each; it would have to be 2 NUMA nodes with 1 socket and 2 cores each.

Large pages

Most CPUs in the modern era support multiple memory page sizes, ranging from 4 KB through 2 MB/4 MB up to as large as 1 GB. Typically the smallest page size is used by default for all processes. If a non-negligible amount of RAM can be set up as large pages, however, the size of the CPU page tables can be significantly reduced, which improves the hit rate of the page table caches and thus overall memory access latency. With the operating system using small pages by default, physical RAM can become fragmented over time, making it harder to find the contiguous blocks of RAM required to allocate large pages. This problem becomes worse as the size of large pages increases.

Linux kernels have support for a feature called "transparent huge pages", which attempts to proactively allocate huge pages to back application RAM allocations where it is practical to do so. A problem with relying on this feature is that the owner of the VM has no guarantee which of their guests will be allocated large pages and which will be allocated small pages. Typically only a subset of guests will really care about the performance improvements to be had from large pages, so it is desirable to have explicit control over large page usage rather than relying on unpredictable automatic allocation.

Since RAM blocks are directly associated with specific NUMA nodes, by implication, large pages are also directly associated with NUMA nodes. Thus when placing guests on NUMA nodes, the compute service may need to take their large page needs into account when picking amongst possible hosts or NUMA nodes, e.g. two hosts may have NUMA nodes able to hold the guest, but only one may have sufficient large pages free in those nodes.
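
A hedged sketch of that large-page-aware check (the per-node free page counts are assumed inputs, for example as read from /sys/devices/system/node/node*/hugepages/ on Linux):

    # Illustrative only: keep the hosts that have at least one NUMA node
    # with enough free 2 MB pages to back the whole guest. free_2m_pages
    # is an assumed per-node count of free 2 MB huge pages.
    def hosts_with_large_page_room(guest_mem_mb, hosts):
        needed_pages = guest_mem_mb * 1024 // 2048   # guest RAM in 2 MB pages
        return [name for name, nodes in hosts.items()
                if any(node["free_2m_pages"] >= needed_pages for node in nodes)]

    hosts = {
        "host1": [{"free_2m_pages": 3000}, {"free_2m_pages": 512}],
        "host2": [{"free_2m_pages": 1024}, {"free_2m_pages": 1024}],
    }
    print(hosts_with_large_page_room(4096, hosts))   # ['host1'] - 4 GB needs 2048 pages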

Large pages can be configured without any regard to the guest OS. From the guest OS POV, it will merely see the default (usually smallest) page size, regardless of whether its RAM is backed by large or small pages on the host.

Over commit / dedicated resource

Compute nodes typically have defined overcommit ratios for host CPUs and RAM, e.g. 16 pCPUs may allow execution of a total of 256 vCPUs, and 16 GB of RAM may allow execution of guests totalling 24 GB of RAM. The concept of overcommit extends to basic NUMA placement; however, when large pages are added to the mix, overcommit ceases to be an option for RAM. There must be a 1:1 mapping between guest RAM and host RAM for large page usage, precluding any RAM overcommit.

Any use of the large pages feature will thus necessarily imply support for the concept of "dedicated resource" flavours, at least for RAM, though at that point it would make sense to extend it to vCPUs too.

Even when doing dedicated resource allocation per guest, with no overcommit of RAM or CPUs, there will be a need for CPU and RAM reservations to run host OS services. If using large pages, an explicit decision must be made as to how much RAM to reserve for host OS usage. With CPUs there is more flexibility, since host OS services can always steal time from guests even if the guests have been allocated dedicated pCPUs to execute on. It may nonetheless be desirable to reserve certain pCPUs exclusively for host OS services.
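
A hedged sketch combining the ideas in this section - overcommit ratios, host OS reservations and the 1:1 RAM mapping forced by large pages (the ratios and reservation figures are illustrative, not Nova defaults):

    # Illustrative only: effective schedulable capacity of a host given
    # overcommit ratios and reservations for host OS services. If a guest
    # uses large pages its RAM cannot be overcommitted, so the RAM ratio
    # is forced back to 1.0 for that calculation.
    def effective_capacity(pcpus, ram_mb, cpu_ratio, ram_ratio,
                           reserved_pcpus=0, reserved_ram_mb=0,
                           large_pages=False):
        if large_pages:
            ram_ratio = 1.0                  # 1:1 guest RAM to host RAM
        vcpu_capacity = (pcpus - reserved_pcpus) * cpu_ratio
        ram_capacity = (ram_mb - reserved_ram_mb) * ram_ratio
        return vcpu_capacity, ram_capacity

    # 16 pCPU / 16 GB host, 16.0 CPU ratio, 1.5 RAM ratio, 2 GB reserved:
    print(effective_capacity(16, 16384, 16.0, 1.5, reserved_ram_mb=2048))
    # (256.0, 21504.0)
    print(effective_capacity(16, 16384, 16.0, 1.5, reserved_ram_mb=2048,
                             large_pages=True))
    # (256.0, 14336.0)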

Related resources

Decisions made with respect to placement of guests on host pCPUs/RAM may in turn affect decisions about allocation of other host resources related to the guest VM. For example, they may impact the decision about which PCI device virtual functions to assign to a guest.
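
For example, a hedged sketch of preferring PCI virtual functions attached to the guest's NUMA node (the device-to-node data is an assumed input, for example taken from /sys/bus/pci/devices/*/numa_node):

    # Illustrative only: order candidate PCI virtual functions so that
    # those on the guest's NUMA node come first. devices is an assumed
    # list of (pci_address, numa_node) tuples.
    def prefer_local_vfs(guest_numa_node, devices):
        return sorted(devices, key=lambda dev: dev[1] != guest_numa_node)

    devices = [("0000:81:10.1", 1), ("0000:05:10.0", 0), ("0000:05:10.2", 0)]
    print(prefer_local_vfs(0, devices))
    # node-0 VFs first: [('0000:05:10.0', 0), ('0000:05:10.2', 0), ('0000:81:10.1', 1)]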