Virtualization Driver Guest CPU / Memory Placement

Scope

This page considers the issues relating to CPU / memory resources when doing guest placement between and within virtualization hosts. NOTE: this document presents the results of initial investigation / discussion of the area. Any details of the Nova design mentioned here have been superseded by the following formal design specifications:

Terminology

vCPU - aka virtual CPU - a logical CPU assigned to a guest. A single vCPU may be either a socket, core or thread, according to the guest topology.
pCPU - aka physical CPU - a logical CPU present on a virtualization host. A single pCPU may be either a socket, core or thread, according to the host topology.
NUMA - Non Uniform Memory Access architecture - access time depends on the locality between the memory page and processor core
node - a unit within a NUMA system holding either CPUs or memory or both
cell - a synonym for node, used by libvirt APIs
socket - a discrete CPU chip / package within a NUMA node.
core - a processing core within a CPU socket
thread - aka hyper thread - a processing pipeline within a CPU core.
KSM - Kernel Shared Memory - Linux memory page sharing technology
THP - Transparent Huge Pages - Linux technology for proactively using huge pages for process RAM allocations

Background

vCPU topology

Each virtualization driver in OpenStack has its own approach to defining the CPU topology seen by guest virtual machines. The libvirt driver will expose all vCPUs as individual sockets, with 1 core and no hyper-threads. While operating systems are generally technically capable of using any/all vCPU count / topology combinations, there can be a few important caveats:

Licensing - OS vendors' licensing rules may restrict the number of sockets an OS will use. This can force a preference for cores instead of sockets.
Performance - not all topologies are equal in their performance characteristics. For example, 2 host threads on the same core will offer less performance than 2 host threads on different cores. When seeing that a core has multiple threads, OS schedulers will make special placement decisions for threads.

The performance implications of threads mean that it is not desirable to tell a guest it has multiple threads, unless those threads are being pinned 1-1 to host threads. If the guest vCPUs are free floating across host pCPUs, then the guest should just use cores/sockets and leave threads==1. It follows that there is no compelling reason to expose the ability to configure thread count to the user. At most the user may wish to indicate that their image does not want to execute on sibling threads, if the workload in the image is sensitive to such scenarios.

The licensing implications mean that a user uploading images to glance may need to indicate a topology constraint/preference for cores vs sockets for execution of their image. The cloud administrator may also wish to change the hypervisor defaults so that users don't hit common license restrictions. ie for a 4+ vCPU guest, limit to 2 sockets max by default, with remaining vCPUs set to be cores, so Windows images will be handled correctly "out of the box".

NUMA topology

Host NUMA

Any virtualization host in the modern era will have NUMA topology with RAM and pCPU sockets spread across 2 or more NUMA nodes. While some CPU models introduce a second level of NUMA, this will not be considered further in this document, since it has only minimal cache effects on performance.

The key factors driving usage of NUMA are memory bandwidth, efficient cache usage and locality of PCIe I/O devices. For example, each NUMA node would have a dedicated bus to memory that is local to the node, but access to remote RAM may be across a bus that is shared with all nodes. Consider a 4 NUMA node system with 1 GB/s memory bandwidth per node, and 1 GB/s for the shared bus. If all processes always use local RAM then there is 4 GB/s potential memory bandwidth. If all processes always use remote RAM, there is only 1 GB/s potential memory bandwidth. Usage of that shared bus might trigger unintended cache synchronization among the NUMA nodes, leading to a significant performance impact for memory-intensive workloads. When I/O performance is critical, the assignment of devices attached to remote PCIe buses (i.e. attached to a different NUMA node) can severely degrade performance, adding wasted capacity on the shared inter-node bus to the cache inefficiency.
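
The bandwidth arithmetic in this example can be written out as a small sketch. It only models the simplified picture used above (one dedicated bus per node, one shared bus for all remote access); the function name and parameters are hypothetical and not part of Nova.

```python
def potential_bandwidth_gbs(nodes, local_bus_gbs, shared_bus_gbs, all_local):
    """Potential aggregate memory bandwidth in the simplified model."""
    if all_local:
        # Every node drives its own dedicated bus in parallel.
        return nodes * local_bus_gbs
    # Every access is funnelled through the single shared bus.
    return shared_bus_gbs


# The 4 node system from the text: 1 GB/s per node, 1 GB/s shared bus.
local_case = potential_bandwidth_gbs(4, 1.0, 1.0, all_local=True)
remote_case = potential_bandwidth_gbs(4, 1.0, 1.0, all_local=False)
print(local_case, remote_case)
```

Real workloads sit somewhere between the two extremes, so the figures are bounds rather than predictions.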

Thus, incorrect placement of guests on NUMA nodes, or incorrect choice of PCI devices to assign, leads to a serious waste of virtualization host resources. The impact of this will dwarf any benefits from other memory/cpu setup decisions made such as vCPU topology, pCPU/vCPU pinning or use of large pages. Thus a standard policy will be to place guests such that they are entirely confined within a single NUMA node.

Guest NUMA

If the guest vCPU/RAM allocation is too large to fit inside a single NUMA node, or insufficient PCIe devices are available to assign from a NUMA node, a policy decision must be made. Either execution of the guest on the host in question would be rejected in favour of a host with larger NUMA nodes, or the guest may be allowed to span across multiple NUMA nodes. This decision may be changed if the guest is later relocated, for example, if evacuating a host for maintenance it may be desirable to push a guest off to a host with sub-optimal NUMA placement and accept the temporary performance impact. The answer to the placement question will depend on the use cases involved in the guest deployment. For example, NFV deployments will favour strict NUMA placement, where execution is rejected if the guest cannot fit in the desired NUMA node.

If it is decided to have a guest span multiple NUMA nodes, then to enable the guest OS to maximise utilization of the resources it has been allocated, NUMA topology must also be exposed to the guest. The guest NUMA nodes should then be directly mapped to host NUMA nodes. This will entail mapping guest RAM chunks to host RAM nodes, and setting vCPU affinity to pCPUs. For example if a guest has 4 vCPUs and will be placed across 2 host NUMA nodes, then vCPUs 0+1 will be tied to the first host NUMA node, and vCPUs 2+3 will be tied to a second host NUMA node. It is not mandatory for vCPUs 0+1 to be tied to specific pCPUs within the host NUMA node - they can be free floating within the node at the will of the host OS scheduler. If the host has hyperthreading enabled, however, then it is desirable to expose hyperthreading to the guest and at the same time strictly set vCPU<->pCPU affinity even within the node - ie do not allow any free-floating of vCPUs.

Guest NUMA topology can be configured with little regard to the guest OS being run. If a guest OS does not support NUMA, then it would simply ignore the data tables exposed in the virtual BIOS. That said, the guest vCPU topology constraints will influence what particular NUMA topologies are viable to expose. ie if the guest only supports max 2 sockets, then there is little point in setting up 4 NUMA nodes with 1 socket, 1 core in each. It would have to have 2 NUMA nodes with 1 socket, 2 cores in each.

Large pages

Most CPUs in the modern era have support for multiple memory page sizes, ranging from 4k through 2MB/4MB up to as large as 1 GB. Typically the smallest page size will be used by default for all processes. If a non-negligible amount of RAM can be set up as large pages, however, the size of the CPU page tables can be significantly reduced, which improves the hit rate of the page table caches and thus overall memory access latency. With the operating system using small pages by default, over time the physical RAM can become fragmented, making it harder to find the contiguous blocks of RAM required to allocate large pages. This problem becomes worse as the size of large pages increases. Thus if there is a desire to use large pages it is preferable to instruct the host kernel to reserve them at initial boot time. Current Linux kernels do not allow this reservation to be made against specific NUMA nodes, but this limitation will be lifted in the near future. A further restriction is that the first 1 GB of host RAM cannot be used for 1GB huge pages, due to presence of MMIO holes.

Linux kernels have support for a feature called "transparent huge pages" (THP) which will attempt to proactively allocate huge pages to back application RAM allocations where practical to do so. A problem with relying on this feature is that the owner of the VM has no guarantee which of their guests will be allocated large pages and which will be allocated small pages. Certain workloads / use cases, such as NFV, will favour explicit huge page allocation in order to have guaranteed performance characteristics, while others may be satisfied by allowing the kernel to perform opportunistic huge page allocation.

Since RAM blocks are directly associated with specific NUMA nodes, by implication, large pages are also directly associated with NUMA nodes. Thus when placing guests on NUMA nodes, the compute service may need to take into account their large page needs when picking amongst possible hosts or NUMA nodes. ie two hosts may have NUMA nodes able to hold the guest, but only one host may have sufficient large pages free in the NUMA nodes.

Large pages can be enabled for guest RAM without any regard to whether the guest OS will use them or not. ie if the guest OS chooses not to use huge pages, it will merely see small pages as before. Conversely though, if a guest OS does intend to use huge pages, it is very important that the guest RAM be backed by huge pages otherwise the guest OS will not be getting the performance benefit it is expecting.

Dedicated resource

Compute nodes typically have defined over commit ratios for host CPUs and RAM. ie 16 pCPUs may allow execution of a total of 256 vCPUs, and 16 GB of RAM may allow execution of guests totalling 24 GB of RAM. The concept of over commit extends into basic NUMA placement. When large pages are added to the mix, however, over commit ceases to be an option for RAM. There must be a 1-1 mapping between guest RAM and host RAM for large page usage, and the host OS won't consider any huge pages allocated to the guest for swapping, so this precludes any RAM overcommit.
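
As a rough illustration of the figures above, the capacity implied by an over commit ratio is a simple multiplication. The ratios of 16 for CPUs and 1.5 for RAM are inferred from the example numbers in the text, and the helper name is hypothetical.

```python
def guest_capacity(host_amount, overcommit_ratio):
    # Total guest allocation a host will accept for one resource type.
    return host_amount * overcommit_ratio


vcpu_capacity = guest_capacity(16, 16.0)        # 16 pCPUs at 16:1
ram_capacity_gb = guest_capacity(16, 1.5)       # 16 GB of RAM at 1.5:1
hugepage_capacity_gb = guest_capacity(16, 1.0)  # large pages force 1:1
print(vcpu_capacity, ram_capacity_gb, hugepage_capacity_gb)
```

With large pages in use, the RAM ratio is effectively pinned to 1.0, which is why the last line yields no more than the physical RAM.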

Any use of the large pages feature will thus necessarily imply support for the concept of "dedicated resource" flavours for RAM at least, though at that point it would make sense to extend it for vCPUs too.

Even when doing dedicated resource allocation per guest with no overcommit of RAM or CPUs, there will be a need for CPU + RAM reservations to run host OS services. If using large pages, an explicit decision must be made as to how much RAM to reserve for host OS usage. With CPUs there is more flexibility, since host OS services can always steal time from guests even if the guests have been allocated dedicated pCPUs to execute on. It may nonetheless be desirable to reserve certain pCPUs exclusively for host OS services, to avoid OS services having an unpredictable impact on performance of the guests. It is already possible to tell Nova to reserve a subset of CPUs for OS services, a facility which will continue to be supported and perhaps even enhanced.

In the event that a host is using memory over commit and the guests actually consume all the RAM they are assigned, the host will resort to swapping guests. Swapping can have a significant I/O impact on the host as a whole, so it may not make sense to mix guests with memory-overcommit and guests with dedicated RAM on the same host if strict isolation of these different workloads is required.

In the event that the host is using vCPU over commit and the guests all contend for pCPU time, there can still be an impact on the performance of dedicated CPU guests, due to cache effects, particularly if they are all within the same NUMA node. Thus if strict isolation of workloads is required, it will be desirable to isolate dedicated CPU vs overcommit CPU guests on separate NUMA nodes, if not separate hosts.

Memory sharing / compression

Linux kernels include a feature known as "kernel shared memory" (KSM) in which RAM pages with identical contents can be shared across different processes. The kernel will proactively scan memory pages to identify identical pages and then merge them. Copy-on-write is used to unshare the page again if any process dirties a shared page. KSM can provide significant improvements in the utilization of RAM when many identical guest OSes are run on the same host, or guests otherwise have identical memory page contents. The cost of KSM is increased CPU usage from the memory scanning, and a potential for spikes in memory pressure if guests suddenly perform writes which trigger lots of unsharing of pages. The virtualization management layer must thus actively monitor the memory pressure situation and be prepared to migrate existing guests to other hosts if memory pressure increases to levels that cause an unacceptable amount of swap activity or even risk OOM.

The zswap feature allows for compression of memory pages prior to being written out to the swap device. This reduces the amount of I/O performed to the swap device and thus reduces the performance degradation inherent in swapping of host memory pages.

Related resources (PCI)

Decisions made wrt placement of guests on host pCPU/RAM may in turn affect decisions to be made about allocation of other host resources related to the guest VM. For example, PCI devices have affinity with NUMA nodes, such that DMA operations initiated by the PCI device are best performed with RAM on the local NUMA node. Thus the decision about which NUMA node to allocate a guest's vCPUs or RAM from will directly influence which PCI devices and/or functions are acceptable to assign to the guest in order to maximise performance and utilization.

Technology availability

This section focuses on availability of technology in various hypervisors to support the concepts described.

Libvirt / KVM

As of Apr 2014, libvirt supports

vCPU<->pCPU pinning
Host NUMA memory allocation placement + policy (ie which node to allocate guest RAM from, and whether this is mandatory or merely preferred policy)
Large pages for backing of VM RAM
Guest NUMA topology
Guest vCPU topology
Association of VMs into resource groups (using cgroups), which allows NUMA or scheduler policies to be set for entire groups of guests at once.

A number of aspects are missing from libvirt, however, including

Association of guest NUMA nodes to host NUMA nodes
Control of large page allocation wrt NUMA nodes (depends on guest/host NUMA mapping)
Reporting on availability of free large pages / large page sizes
Control of large page size used for allocation
APIs to create/define/control VM resource groups (must be done by OS admin ahead of time)
Reporting of NUMA nodes associated with PCI devices

To allow dedicated resource allocation to specific guests there are a number of caveats/constraints

Dedicated pCPU. As well as setting the desired pCPU affinity on the guest in question, all other guests on the host must be forced to avoid the dedicated pCPU of the first guest. There are several approaches to achieve this
- Create two resource groups for guests at host provisioning time, and split pCPU resources between the 2 groups. Start dedicated resource guests in one group and overcommit resource guests in the other group
- Have hosts which are used solely for dedicated resource guests with no overcommit
- Dynamically update the pCPU affinity of all existing guests when starting a dedicated resource guest
- Set up-front pCPU affinity on all guests, to reserve some portion of pCPUs for later usage by dedicated guests
- Set fixed scheduler timeslices for the guests, but allow them to float freely across pCPUs

Dedicated RAM. There are again several approaches to achieve this
- Use large pages for dedicated resource guests. This requires that the host have sufficient large pages free, and that the guest RAM be a multiple of large page size.
- Create two resource groups for guests at host provisioning time, and split RAM resources between the 2 groups. Start dedicated resource guests in one group and overcommit resource guests in the other group

A complication of dedicated RAM allocation is that KVM has many different needs for RAM allocations beyond the primary guest RAM. There is guest video RAM, and arbitrarily sized allocations needed by the KVM process when processing I/O requests. To a lesser extent this also affects vCPU needs, since there are KVM emulator threads that do work on behalf of the guest. Further, the host OS in general requires both CPU and RAM resources.

VMware

TBD

XenAPI

Limited vCPU topology by setting cores-per-socket value. No hyperthread count support http://support.citrix.com/article/CTX126524

Design

Permissions

As a general rule, any time there are finite resources that are consumed by execution of a VM, the cloud administrator must have absolute control over the resource allocation. This in turn implies that the majority of the configuration work will be at the host level (nova.conf, etc) or at the flavour level. The only time where it is appropriate to permit end user image level config is for aspects which don't impact resource usage beyond what the flavour already allows for. From this it can be seen that the only parameters that are likely to be permissible at the image level are those related to vCPU topology, since that has negligible impact on host resource utilization, primarily being a mechanism for complying with software licensing restrictions.

Configuration

It should be clear from the background information that to maximise utilization of host resources, it is important to make full use of facilities such as NUMA and large pages. It follows from this that, even with zero configuration out of the box, it is desirable for Nova to make an effort to do best NUMA placement for guests, taking into account large pages where available. Explicit configuration should only be required in the subset of deployments which want to make a different set of performance/guest fit tradeoffs to suit specific requirements, or where the cloud provider wishes to artificially restrict placement to fit with different pricing tiers.

vCPU topology

The end user should have the ability to express the constraints their OS image has wrt socket vs cores choice
- To restrict topology (eg max_sockets==2) used by the guest to comply with OS licensing needs.
The cloud administrator should have the ability to express the preferred or mandatory vCPU topology for guests against flavours
- To place limits on the topologies an end user can specify, to prevent the user defining topologies that force sub-optimal NUMA placement.
- To setup a default topology (eg max_sockets==2) to ensure guest OS images comply with common OS licensing needs without needing per-user image properties
Where there is a conflict between user image constraints and administrator flavour constraints, the flavour might take priority
- ie if the flavour guest RAM is known to span multiple host NUMA nodes, the user's max_sockets=1 setting must be overridden by a flavour's min_sockets=2 setting to ensure that the scheduler isn't forced to do poor NUMA placement which would waste host resources

As noted in earlier discussion, the only time it makes sense to configure a guest with threads != 1 is if the guest vCPUs are being strictly bound to host pCPUs. This isn't something that an end user needs to consider, but an administrator may wish to be able to set up flavours which explicitly avoid placement on a host with threads. This can be achieved by configuring host groups using scheduler aggregates.

From this it could follow that the following parameters are relevant to vCPU topology:

image settings
- sockets=N (actual number of desired sockets. Calculate from other settings if omitted)
- cores=N (actual number of desired cores. Calculate from other settings if omitted)
- max_sockets=N (maximum supported number of sockets, assume==INF if omitted)
- max_cores=N (maximum supported number of cores, assume==INF if omitted)
flavour settings
- sockets=N (default number of sockets, assume ==vcpus or == vcpus/cores if omitted)
- cores=N (default number of cores, assume ==vcpus/sockets if omitted)
- max_sockets=N (maximum number of permitted sockets, assume==INF if omitted)
- max_cores=N (maximum number of permitted cores, assume==INF if omitted)

The flavour settings will always override the image settings, if both are specified.

Typical usage:

Zero config setup
- N flavour vCPUs == N sockets
- allows maximum flexibility with NUMA placement
Administrator sets flavour sockets=2
- cores is calculated by dividing vcpu count by socket count, eg a 6 vcpu flavour gets 2 sockets with 3 cores each
- Windows OS licensing works out of the box
User sets image max_sockets=2
- max_sockets causes preference for cores if flavour vCPU count is greater than 2.
- Windows OS licensing works
User sets image cores=4
- Guest will always use 4 cores, provided it is below flavour max_cores value
- If flavour has 4 vCPUs, then guest will be in 1 socket and thus confined to 1 NUMA node.
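
The typical usage cases above can be sketched as a small resolver. This is a minimal illustration under stated assumptions, not Nova's implementation: the function name is hypothetical, it takes one already merged set of settings (how image and flavour values are combined is left out), it always uses threads == 1, and when nothing is fixed it prefers the largest socket count that divides the vCPU count.

```python
def pick_topology(vcpus, sockets=None, cores=None,
                  max_sockets=None, max_cores=None):
    """Return (sockets, cores) for a guest, always with threads == 1."""
    if sockets is not None and cores is not None:
        if sockets * cores != vcpus:
            raise ValueError("sockets * cores must equal vcpus")
    elif cores is not None:
        if vcpus % cores:
            raise ValueError("vcpus must be a multiple of cores")
        sockets = vcpus // cores
    elif sockets is not None:
        if vcpus % sockets:
            raise ValueError("vcpus must be a multiple of sockets")
        cores = vcpus // sockets
    else:
        # Zero config prefers sockets; max_sockets shifts the rest to cores.
        limit = vcpus if max_sockets is None else min(vcpus, max_sockets)
        sockets = next(s for s in range(limit, 0, -1) if vcpus % s == 0)
        cores = vcpus // sockets
    if max_sockets is not None and sockets > max_sockets:
        raise ValueError("topology exceeds max_sockets")
    if max_cores is not None and cores > max_cores:
        raise ValueError("topology exceeds max_cores")
    return sockets, cores


print(pick_topology(4))                 # zero config
print(pick_topology(6, sockets=2))      # administrator sets sockets=2
print(pick_topology(8, max_sockets=2))  # user sets max_sockets=2
print(pick_topology(4, cores=4))        # user sets cores=4
```

Raising an error when a limit is exceeded is one possible policy; a real driver might instead fall back to another viable topology.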

NUMA topology

Administrator can define guest NUMA nodes against flavour
- To force guest RAM to be considered as multiple NUMA nodes to allow more efficient placement on host NUMA nodes
- Administrator should declare vCPU topology to satisfy sockets-per-node needs. ie if setting 2 nodes, then at least set min_sockets=2

A simple approach is to specify only the number of NUMA nodes desired. The RAM and sockets would be divided equally across nodes. This minimises complexity of configuration parameters. If no NUMA node count was defined, then the hypervisor is free to use whatever NUMA topology it wants to in the guest, if any. It might be that there are several viable NUMA configurations depending on the host chosen by the scheduler; however, the admin may wish to cap the number of NUMA nodes used.

Zero config setup
- Hypervisor chooses how many NUMA nodes to setup as it sees fit based on how the guest RAM / vCPU allocation best fits into host RAM/vCPU availability
Flavour administrator sets numa_nodes=1
- Hypervisor never sets any NUMA topology for the guest, even if guest RAM/vCPU allocation exceeds host RAM/vCPU availability in a single node.
Flavour administrator sets numa_max_nodes=2
- Hypervisor will pick a host where the guest is spread across at most 2 NUMA nodes. So the guest may be placed in 1 single NUMA node, or in 2 NUMA nodes, but will never be spread across 4 NUMA nodes.
Flavour administrator sets numa_nodes=2
- Hypervisor sets up 2 guest NUMA nodes and spreads RAM + vCPUs equally across nodes. It will not use a host where the guest fits in 1 NUMA node, nor 4 NUMA nodes.
Flavour administrator sets vcpus=6,numa_nodes=2,vcpus.0=0,1,vcpus.1=2,3,4,5,mem.0=2,mem.1=4
- Hypervisor sets up 2 NUMA nodes, the first with vcpus 0 & 1 and 2 GB of RAM, the second node with vcpus 2, 3, 4, 5 and 4 GB of RAM.
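
A minimal sketch of the two styles of guest NUMA definition described above: an equal split derived from a node count, and an explicit asymmetric layout. The function names and the dictionary layout are hypothetical, chosen only for illustration.

```python
def split_guest_numa(vcpus, ram_gb, numa_nodes):
    """Divide vCPU ids and RAM equally across guest NUMA nodes."""
    if vcpus % numa_nodes or ram_gb % numa_nodes:
        raise ValueError("vcpus and RAM must divide evenly across nodes")
    cpus_per_node = vcpus // numa_nodes
    ram_per_node = ram_gb // numa_nodes
    return [
        {"vcpus": list(range(n * cpus_per_node, (n + 1) * cpus_per_node)),
         "mem_gb": ram_per_node}
        for n in range(numa_nodes)
    ]


def layout_is_valid(vcpus, layout):
    # Every vCPU id from 0 to vcpus - 1 must appear in exactly one node.
    assigned = sorted(cpu for node in layout for cpu in node["vcpus"])
    return assigned == list(range(vcpus))


# numa_nodes=2 on a 4 vCPU, 8 GB flavour: equal split.
equal = split_guest_numa(4, 8, 2)

# The explicit 6 vCPU layout from the last scenario above.
explicit = [
    {"vcpus": [0, 1], "mem_gb": 2},
    {"vcpus": [2, 3, 4, 5], "mem_gb": 4},
]
print(equal, layout_is_valid(6, explicit))
```

Neither structure says anything about host NUMA nodes; that mapping is left to the hypervisor.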

Note that the administrator never defines anything about how the guest is placed into host NUMA nodes. The hypervisor will always decide this as best it can based on how the guest topology is configured. The scheduler would be able to provide some rules for picking hosts whose NUMA topology best fits the needs of the flavour.

Large pages

Administrator can define large page usage policy against flavour
- To define a high performance flavour that is guaranteed 1 GB pages
- To prevent use of large pages by a flavour, to ensure they are available for other flavours

Zero config setup
- Hypervisor chooses whether or not to make use of large pages as it sees fit
Administrator sets page_sizes=large
- Hypervisor will not start the guest unless large pages are available
Administrator sets page_sizes=any
- Hypervisor will try to find largest pages first, but fallback to smaller pages if not available
Administrator sets page_sizes=small
- Hypervisor will never use large pages for the guest, even if available
Administrator sets page_sizes=1GB
- Hypervisor will not start the guest unless it can find 1 GB large pages. Will not use 2 MB large pages even if available
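
The page_sizes policies above can be sketched as a selection function. This is an illustration under assumptions rather than a definitive implementation: the function and constant names are hypothetical, small pages are assumed to be 4 KB and always available, guest RAM must be a whole multiple of the chosen page size, and the zero config case (hypervisor decides freely) is not modelled.

```python
SMALL_PAGE_KB = 4
NAMED_SIZES_KB = {"2MB": 2 * 1024, "1GB": 1024 * 1024}


def choose_page_size(policy, guest_ram_kb, free_pages):
    """Return the page size in KB to back the guest with, or None to refuse.

    free_pages maps a large page size in KB to the number of free pages
    of that size on the host.
    """
    def fits(size_kb):
        # Guest RAM must be a whole number of pages, with enough of them free.
        pages_needed, remainder = divmod(guest_ram_kb, size_kb)
        return remainder == 0 and free_pages.get(size_kb, 0) >= pages_needed

    if policy == "small":
        return SMALL_PAGE_KB
    if policy in NAMED_SIZES_KB:
        size_kb = NAMED_SIZES_KB[policy]
        return size_kb if fits(size_kb) else None
    if policy not in ("large", "any"):
        raise ValueError("unknown page_sizes policy: %r" % policy)
    # Try the largest page size first, then progressively smaller ones.
    for size_kb in sorted(free_pages, reverse=True):
        if fits(size_kb):
            return size_kb
    return SMALL_PAGE_KB if policy == "any" else None


host_free = {2 * 1024: 1024, 1024 * 1024: 0}   # 1024 free 2 MB pages, no 1 GB pages
guest_kb = 2 * 1024 * 1024                     # 2 GB guest
print(choose_page_size("large", guest_kb, host_free))
print(choose_page_size("1GB", guest_kb, host_free))
```

A return value of None corresponds to the "will not start the guest" outcomes in the list above.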

Dedicated resources

Administrator can define that a flavour has dedicated pCPUs
- To guarantee a flavour which has zero contention from other guests
Administrator can define that a flavour has exclusive pCPUs
- To guarantee a flavour which has zero contention from other guests or OS service/kernel threads

Based on this some configuration scenarios are

Zero config
- Hypervisor will freely overcommit RAM or vCPUs
Admin sets overcommit_ram=0 on flavour
- Hypervisor will assign dedicated RAM for the guest, but can still overcommit vCPUs
Admin sets overcommit_vcpus=0 on flavour
- Hypervisor will assign dedicated vCPUs for the guest, but can still overcommit RAM
Admin sets overcommit_ram=0,overcommit_vcpus=0 on flavour
- Hypervisor will assign dedicated vCPUs and RAM for the guest

Scheduler

Currently, libvirt and other drivers (xenapi ?) will report their CPU info. Libvirt uses this information at the moment to check for compatibility between source and destination hypervisors when live-migrating, but nowhere else. This data also does not include any NUMA information, nor does it include any usage info. In order to make this useful for scheduling, we would need to make sure that compute hosts are exposing the needed information to the scheduler.

It would also be good to make the format of the data that is currently kept in the database as a JSON blob better defined and standardized across virt drivers. It may also be necessary to change the way this information is stored in the database for performance reasons, which can prove important, especially for scheduling.

From the background information regarding execution of instances with dedicated resources, it is clear that, at minimum, the scheduler needs to be able to assign instances to hosts according to whether the host runs dedicated resource workloads or overcommit workloads. Not all deployments, however, will require or desire strict separation of dedicated resource workloads from overcommit workloads, since it leads to less flexible / efficient utilization of compute hosts. It is thus also valid to allow the scheduler to mix dedicated resource and overcommit instances on a single host. eg if a host has 8 GB of RAM and 2 GB of huge pages reserved, it can run 2 GB of dedicated resource guests fairly easily and still have 6 GB available for overcommit workloads.

vCPU topology

The libvirt driver currently exposes the pCPU topology (ie sockets, cores, threads), but there is no general utilization information for CPU resource. Within the context of a single NUMA node there is no significant performance differentiation between sockets and cores, so the scheduler should not need to be concerned with matching host/guest core/socket counts. The core/socket count can be determined by the compute driver, once the scheduler has made the decision based on NUMA requirements.

If a flavour expresses anti-affinity for threads, then the scheduler will want to avoid placing the VM on hosts which have threads>1 for their pCPUs.

NUMA placement

The scheduler needs to take into account the 'numa_nodes' setting on the flavour when deciding where to place guests.

If 'numa_nodes' is not set, then the scheduler is free to make an arbitrary decision as to where to run the guest, regardless of whether the flavour RAM fits into a single NUMA node on the target host or not. It will still probably want to prefer hosts where it could fit into a single NUMA node though.

If 'numa_nodes' is set to 1, then the scheduler must only place a guest on a host where the flavour RAM fits into a single NUMA node.

If 'numa_nodes' is > 1, then the scheduler should place the guest on a host where flavour RAM / numa_nodes fits into the size of the host's NUMA nodes.
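
The three 'numa_nodes' cases can be sketched as a host check. This is only a sketch: the function name and the list-of-free-RAM input are hypothetical, only RAM is considered (not vCPUs), and for numa_nodes > 1 it assumes each guest node needs its own distinct host node, which the text does not state explicitly.

```python
def host_passes(flavour_ram_mb, numa_nodes, host_node_free_mb):
    """Check one host; host_node_free_mb lists free RAM per host NUMA node."""
    if numa_nodes is None:
        # Unset: any host with enough total free RAM is acceptable.
        return sum(host_node_free_mb) >= flavour_ram_mb
    # Set: each guest node needs flavour RAM / numa_nodes on a host node.
    per_node_mb = flavour_ram_mb / numa_nodes
    usable = [free for free in host_node_free_mb if free >= per_node_mb]
    return len(usable) >= numa_nodes


host = [4096, 4096]   # two host NUMA nodes with 4 GB free each
print(host_passes(6144, None, host))
print(host_passes(6144, 1, host))
print(host_passes(6144, 2, host))
```

The preference for single node fits when 'numa_nodes' is unset would be a weighting step on top of this check and is not shown.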

The compute nodes will need to expose information about their NUMA topologies (ie CPUs and RAM per node) and the current utilization of resources in these topologies. This data will need to be added to the compute_host data model.

Large pages

Unlike NUMA, huge pages, if asked for by the flavor, will need to be on a host that allows for them and has pre-allocated them. It is likely that this will be required on the same hosts that will be dedicated to NUMA as well, so we might want to make this an explicit dependency. For example, if a host has huge pages configured, it is also considered strict from a NUMA perspective.

THP can be used on hosts that allow oversubscription, and the scheduling can take this into account if there is a request for huge pages that is not a hard rule but best effort.

Hosts can then report whether they support pre-allocated (reserved) huge pages or THP and, in the case of strict placement, the number of free pages.

Based on the above we can have flavors that have huge_pages set to 'strict', which will mean that the scheduler will fail the instance if no host has enough huge pages free to satisfy the hard memory requirement in a single NUMA node.
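
A strict huge page check along these lines could look roughly like the following. The function name, the host report format (free huge pages per NUMA node) and the default 2 MB page size are assumptions made for the sketch.

```python
def hosts_for_strict_huge_pages(hosts, guest_ram_mb, page_size_mb=2):
    """Return names of hosts with enough free huge pages in one NUMA node.

    hosts maps a host name to a list of free huge page counts, one entry
    per NUMA node on that host.
    """
    # Round up so a partially filled last page is still counted.
    pages_needed = -(-guest_ram_mb // page_size_mb)
    return [
        name for name, free_per_node in hosts.items()
        if any(free >= pages_needed for free in free_per_node)
    ]


reported = {
    "host-a": [512, 0],       # at most 1 GB of 2 MB pages in one node
    "host-b": [2048, 1024],   # 4 GB available in the first node
}
candidates = hosts_for_strict_huge_pages(reported, guest_ram_mb=4096)
print(candidates)
```

An empty result corresponds to the scheduler failing the instance under huge_pages='strict'.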
