VirtDriverGuestCPUMemoryPlacement

Virtualization Driver Guest CPU / Memory Placement

Scope

This page considers all the issues relating to CPU / memory resources when doing guest placement between and within virtualization hosts.

Terminology

  • vCPU - aka virtual CPU - a logical CPU assigned to a guest. A single vCPU may be either a socket, core or thread, according to the guest topology
  • pCPU - aka physical CPU - a logical CPU present on a virtualization host. A single pCPU may be either a socket, core or thread, according to the host topology.
  • NUMA - Non Uniform Memory Access architecture - access time depends on the locality between the memory page and processor core
  • node - a unit within a NUMA system holding either CPUs or memory or both
  • cell - a synonym for node, used by libvirt APIs
  • socket - a discrete CPU chip / package within a NUMA node.
  • core - a processing core within a CPU socket
  • thread - aka hyper thread - a processing pipeline within a CPU core.
  • KSM - Kernel Shared Memory - Linux memory page sharing technology
  • THP - Transparent Huge Pages - Linux technology for proactively using huge pages for process RAM allocations

Background

vCPU topology

Each virtualization driver in OpenStack has its own approach to defining the CPU topology seen by guest virtual machines. The libvirt driver will expose all vCPUs as individual sockets, with 1 core and no hyper-threads. While operating systems are generally technically capable of using any/all vCPU count / topology combinations, there are a few important caveats:

  • Licensing - OS vendors' licensing rules may restrict the number of sockets an OS will use. This can force a preference for cores instead of sockets
  • Performance - not all topologies are equal in their performance characteristics. For example, 2 host threads on the same core will offer less performance than 2 host threads on different cores. Similarly some sockets have internal NUMA topology, where half of their cores are in one NUMA node and half in another node.

The performance implications mean that the cloud administrator may wish to retain some level of control over how guest OSes use the host topology, or over which guest topologies are exposed. That said, the variation in performance between different vCPU topologies is likely to be dwarfed by the variation allowed by NUMA and large page configurations, so it is valid to leave this entirely up to the end user. The only time it makes sense to give a guest threads > 1 is if every guest vCPU is being strictly bound to host pCPUs and some of those host pCPUs are themselves threads.

The licensing implications mean that a user uploading images to glance may need to indicate some topology constraints for execution of their image. The cloud administrator may also wish to change the hypervisor defaults so that users don't hit common license restrictions, ie for a 4+ vCPU guest, limit to 2 sockets max by default, with the remaining vCPUs set to be cores, so Windows images will be handled correctly "out of the box".
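
As an illustration of how such a default might be applied, the following is a minimal sketch (plain Python; the function and parameter names are hypothetical, not an existing Nova API) of deriving a sockets/cores split from a vCPU count and an optional socket limit:

  def pick_topology(vcpus, max_sockets=None):
      """Pick a (sockets, cores, threads) split for a guest.

      Prefers one socket per vCPU, but caps the socket count at
      max_sockets (eg 2 for common Windows licensing limits) and
      converts the remaining vCPUs into cores. Threads stay at 1
      unless vCPUs are strictly pinned to host threads.
      """
      if max_sockets is None or vcpus <= max_sockets:
          return vcpus, 1, 1
      # largest socket count <= max_sockets that divides the vCPUs evenly
      for sockets in range(max_sockets, 0, -1):
          if vcpus % sockets == 0:
              return sockets, vcpus // sockets, 1
      return 1, vcpus, 1

  # eg pick_topology(6, max_sockets=2) -> (2, 3, 1)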

NUMA topology

Host NUMA

Any virtualization host in the modern era will have NUMA topology with RAM and pCPU sockets spread across 2 or more NUMA nodes. While some CPU models introduce a second level of NUMA, these will not be considered further in this document, since this only has minimal cache effects on performance.

The key factors driving usage of NUMA are memory bandwidth, efficient cache usage and locality of PCIe I/O devices. For example, each NUMA node would have a dedicated bus to the memory that is local to the node, while access to remote RAM may be across a bus that is shared with all nodes. Consider a 4 NUMA node system with 1 GB/s of memory bandwidth per node, and 1 GB/s for the shared bus. If all processes always use local RAM then there is 4 GB/s of potential memory bandwidth. If all processes always use remote RAM, there is only 1 GB/s of potential memory bandwidth. Use of that shared bus might also trigger unintended cache synchronization among the NUMA nodes, leading to a significant performance impact for memory-intensive workloads. When I/O performance is critical, assigning devices attached to a remote PCIe bus (ie attached to a different NUMA node) can severely degrade performance, adding wasted bandwidth on the shared inter-node bus to the cache inefficiency.

Thus, incorrect placement of guests on NUMA nodes, or incorrect choice of PCI devices to assign, leads to a serious waste of virtualization host resources. The impact of this will dwarf any benefits from other memory/cpu setup decisions made such as vCPU topology, pCPU/vCPU pinning or use of large pages. Thus a standard policy will be to place guests such that they are entirely confined within a single NUMA node.
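
As a concrete illustration, the host NUMA topology that such placement decisions depend on is already discoverable via the libvirt capabilities XML; a minimal sketch using the libvirt python bindings (illustrative only, not a complete parser):

  import libvirt
  import xml.etree.ElementTree as ET

  def host_numa_topology(uri="qemu:///system"):
      """Return {cell_id: {"cpus": [...], "memory_kb": int}} for the host."""
      conn = libvirt.open(uri)
      caps = ET.fromstring(conn.getCapabilities())
      topology = {}
      for cell in caps.findall("./host/topology/cells/cell"):
          cell_id = int(cell.get("id"))
          cpus = [int(c.get("id")) for c in cell.findall("./cpus/cpu")]
          mem = cell.find("memory")
          topology[cell_id] = {
              "cpus": cpus,
              "memory_kb": int(mem.text) if mem is not None else 0,
          }
      conn.close()
      return topology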

Guest NUMA

If the guest vCPU/RAM allocation is too large to fit inside a single NUMA node, or insufficient PCIe devices are available to assign from a NUMA node, a policy decision must be made. Either execution of the guest on the host in question would be rejected in favour of a host with larger NUMA nodes, or the guest may be allowed to span across multiple NUMA nodes. This decision may be changed if the guest is later relocated, for example, if evacuating a host for maintenance it may be desirable to push a guest off to a host with sub-optimal NUMA placement and accept the temporary performance impact. The answer to the placement question will depend on the use cases involved in the guest deployment. For example, NFV deployments will favour strict NUMA placement, where execution is rejected if the guest cannot fit in the desired NUMA node.

If it is decided to have a guest span multiple NUMA nodes, then to enable the guest OS to maximise utilization of the resources it has been allocated, NUMA topology must also be exposed to the guest. The guest NUMA nodes should then be directly mapped to host NUMA nodes. This will entail mapping guest RAM chunks to host RAM nodes, and setting vCPU affinity to pCPUs. For example if a guest has 4 vCPUs and will be placed across 2 host NUMA nodes, then vCPUs 0+1 will be tied to the first host NUMA node, and vCPUs 2+3 will be tied to a second host NUMA node. It is not mandatory for vCPUs 0+1 to be tied to specific pCPUs within the host NUMA node - they can be free floating within the node at the will of the host OS scheduler. If the host has hyperthreading enabled, however, then it is desirable to expose hyperthreading to the guest and at the same time strictly set vCPU<->pCPU affinity even within the node - ie do not allow any free-floating of vCPUs.

Guest NUMA topology can be configured with little regard to the guest OS being run. If a guest OS does not support NUMA, then it would simply ignore the data tables exposed in the virtual BIOS. That said, the guest vCPU topology constraints will influence which particular NUMA topologies are viable to expose. ie if the guest only supports a max of 2 sockets, then there is little point in setting up 4 NUMA nodes with 1 socket, 1 core in each. It would have to have 2 NUMA nodes with 1 socket, 2 cores in each.
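
A sketch of the even split described above, dividing a guest's vCPUs and RAM across its guest NUMA nodes before each guest node is mapped onto a host node (the function name is hypothetical):

  def split_guest_numa(vcpus, ram_mb, nodes):
      """Divide a guest's vCPUs and RAM evenly across guest NUMA nodes.

      Returns one (vcpu_list, ram_mb) tuple per guest node, eg
      split_guest_numa(4, 4096, 2) -> [([0, 1], 2048), ([2, 3], 2048)]
      """
      layout = []
      base, extra = divmod(vcpus, nodes)
      next_vcpu = 0
      for n in range(nodes):
          count = base + (1 if n < extra else 0)   # spread any remainder
          node_vcpus = list(range(next_vcpu, next_vcpu + count))
          next_vcpu += count
          node_ram = ram_mb // nodes + (1 if n < ram_mb % nodes else 0)
          layout.append((node_vcpus, node_ram))
      return layout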

Large pages

Most CPUs in the modern era have support for multiple memory page sizes, ranging from 4k through 2MB/4MB up to as large as 1 GB. Typically the smallest page size will be used by default for all processes. If a non-negligible amount of RAM can be set up as large pages, however, the size of the CPU page tables can be significantly reduced, which improves the hit rate of the page table caches and thus overall memory access latency. With the operating system using small pages by default, over time the physical RAM can become fragmented, making it harder to find the contiguous blocks of RAM required to allocate large pages. This problem becomes worse as the size of large pages increases. Thus if there is a desire to use large pages it is preferable to instruct the host kernel to reserve them at initial boot time. Current Linux kernels do not allow this reservation to be made against specific NUMA nodes, but this limitation will be lifted in the near future. A further restriction is that the first 1 GB of host RAM cannot be used for 1GB huge pages, due to presence of MMIO holes.

Linux kernels have support for a feature called "transparent huge pages" (THP) which will attempt to proactively allocate huge pages to back application RAM allocations where it is practical to do so. A problem with relying on this feature is that the owner of the VM has no guarantee which of their guests will be allocated large pages and which will be allocated small pages. Certain workloads / use cases, such as NFV, will favour explicit huge page allocation in order to have guaranteed performance characteristics, while others may be satisfied by allowing the kernel to perform opportunistic huge page allocation.

Since RAM blocks are directly associated with specific NUMA nodes, by implication, large pages are also directly associated with NUMA nodes. Thus when placing guests on NUMA nodes, the compute service may need to take into account their large page needs when picking amongst possible hosts or NUMA nodes. ie two hosts may have NUMA nodes able to hold the guest, but only one host may have sufficient large pages free in the NUMA nodes.
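
The per-node free page counts that such a placement decision needs are exported by the Linux kernel in sysfs; a minimal sketch of reading them directly (the standard sysfs layout is assumed; whether Nova reads this itself or via a future libvirt API is an open question):

  import glob
  import os

  def free_hugepages_per_node(page_kb=2048):
      """Return {numa_node_id: free_huge_pages} for the given page size in KiB."""
      result = {}
      for node_dir in glob.glob("/sys/devices/system/node/node[0-9]*"):
          node_id = int(os.path.basename(node_dir)[len("node"):])
          path = os.path.join(node_dir, "hugepages",
                              "hugepages-%dkB" % page_kb, "free_hugepages")
          try:
              with open(path) as f:
                  result[node_id] = int(f.read())
          except IOError:   # this node has no pool of that page size
              result[node_id] = 0
      return result

  # eg a 2 node host might return {0: 512, 1: 384}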

Large pages can be enabled for guest RAM without any regard to whether the guest OS will use them or not. ie if the guest OS chooses not to use huge pages, it will merely see small pages as before. Conversely though, if a guest OS does intend to use huge pages, it is very important that the guest RAM be backed by huge pages otherwise the guest OS will not be getting the performance benefit it is expecting.

Dedicated resource

Compute nodes typically have defined over commit ratios for host CPUs and RAM. ie 16 pCPUs may allow execution of a total of 256 vCPUs, and 16 GB of RAM may allow execution of guests totalling 24 GB of RAM. The concept of over commit extends into basic NUMA placement; however, when large pages are added to the mix, over commit ceases to be an option for RAM. There must be a 1-1 mapping between guest RAM and host RAM for large page usage, and the host OS won't consider any huge pages allocated to the guest for swapping, so this precludes any RAM overcommit.
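
The arithmetic behind those figures is simply the configured allocation ratios; a trivial sketch using Nova-style ratio names (treated here as plain parameters, with defaults matching the example above):

  def schedulable_capacity(pcpus, ram_gb,
                           cpu_allocation_ratio=16.0,
                           ram_allocation_ratio=1.5):
      """Capacity visible to the scheduler under the given overcommit ratios.

      16 pCPUs * 16.0 = 256 vCPUs and 16 GB * 1.5 = 24 GB, as in the text.
      For large page backed (dedicated) RAM the ratio must drop to 1.0.
      """
      return pcpus * cpu_allocation_ratio, ram_gb * ram_allocation_ratio

  # schedulable_capacity(16, 16) -> (256.0, 24.0)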

Any use of the large pages feature will thus necessarily imply support for the concept of "dedicated resource" flavours for RAM at least, though at that point it would make sense to extend it to vCPUs too.

Even when doing dedicated resource allocation per guest with no overcommit of RAM or CPUs, there will be a need for CPU + RAM reservations to run host OS services. If using large pages, an explicit decision must be made as to how much RAM to reserve for host OS usage. With CPUs there is more flexibility, since host OS services can always steal time from guests even if the guests have been allocated dedicated pCPUs to execute on. It may nonetheless be desirable to reserve certain pCPUs exclusively for host OS services, to avoid OS services having an unpredictable impact on the performance of the guests. It is already possible to tell Nova to reserve a subset of CPUs for OS services, a facility which will continue to be supported and perhaps even enhanced.

In the event that a host is using memory over commit and the guests actually consume all the RAM they are assigned, the host will resort to swapping guests. Swapping has a significant I/O impact on the host as a whole, so it does not make sense to mix guests with memory-overcommit and guests with dedicated RAM on the same host.

In the event that the host is using vCPU over commit and the guests all contend for vCPU, there can still be an impact on the performance of dedicated CPU guests, due to cache effects, particularly if they are all within the same NUMA node. At the very least it is desirable to isolate dedicated CPU vs overcommit CPU guests on separate NUMA nodes, if not separate hosts.

Memory sharing / compression

Linux kernels include a feature known as "kernel shared memory" (KSM) in which RAM pages with identical contents can be shared across different processes. The kernel will proactively scan memory pages to identify identical pages and then merge them. Copy-on-write is used to unshare the page again if any process dirties a shared page. KSM can provide significant improvements in the utilization of RAM when many identical guest OSes are run on the same host, or guests otherwise have identical memory page contents. The cost of KSM is increased CPU usage from the memory scanning, and a potential for spikes in memory pressure if guests suddenly do writes which trigger lots of unsharing of pages. The virtualization management layer must thus actively monitor the memory pressure situation and be prepared to migrate existing guests to other hosts if memory pressure increases to levels that cause an unacceptable amount of swap activity or even risk OOM.
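
The counters such a monitoring layer would watch are exported by the kernel under /sys/kernel/mm/ksm; a minimal sketch (what threshold should trigger migration is a policy decision left open here):

  def ksm_stats():
      """Read the kernel's KSM counters (all values are page counts)."""
      stats = {}
      for name in ("pages_shared", "pages_sharing",
                   "pages_unshared", "full_scans"):
          with open("/sys/kernel/mm/ksm/%s" % name) as f:
              stats[name] = int(f.read())
      return stats

  def ksm_sharing_ratio(stats):
      """Rough benefit estimate: duplicates merged onto each shared page."""
      if stats["pages_shared"] == 0:
          return 0.0
      return float(stats["pages_sharing"]) / stats["pages_shared"]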

The zswap feature allows for compression of memory pages prior to being written out to the swap device. This reduces the amount of I/O performed to the swap device and thus reduces the performance degradation inherent in swapping of host memory pages.

Related resources (PCI)

Decisions made wrt placement of guests on host pCPU/RAM may in turn affect decisions to be made about allocation of other host resources related to the guest VM. For example, PCI devices have affinity with NUMA nodes, such that DMA operations initiated by the PCI device are best performed with RAM on the local NUMA node. Thus the decision about which NUMA node to allocate a guest's vCPUs or RAM from will directly influence which PCI devices and/or functions are acceptable to assign to the guest in order to maximise performance and utilization.
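
The NUMA affinity of a PCI device is visible in sysfs, so the placement logic can correlate assignable devices with the node chosen for the guest; a tiny sketch (the PCI address in the comment is a made-up example):

  def pci_numa_node(pci_addr):
      """Return the NUMA node for a PCI device, or None if not reported.

      pci_addr is a full PCI address such as "0000:81:00.0" (example only).
      A sysfs value of -1 means the platform did not report any affinity.
      """
      with open("/sys/bus/pci/devices/%s/numa_node" % pci_addr) as f:
          node = int(f.read())
      return None if node < 0 else node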

Technology availability

This section focuses on the availability of technology in the various hypervisors to support the concepts described.

Libvirt / KVM

As of Apr 2014, libvirt supports

  • vCPU<->pCPU pinning
  • Host NUMA memory allocation placement + policy (ie which node to allocate guest RAM from, and whether this is mandatory or merely preferred policy)
  • Large pages for backing of VM RAM
  • Guest NUMA topology
  • Guest vCPU topology
  • Association of VMs into resource groups (using cgroups), which allows NUMA or scheduler policies to be set for entire groups of guests at once.

A number of aspects are missing from libvirt, however, including

  • Association of guest NUMA nodes to host NUMA nodes
  • Control of large page allocation wrt NUMA nodes (depends on guest/host NUMA mapping)
  • Reporting on availability of free large pages / large page sizes
  • Control of large page size used for allocation
  • APIs to create/define/control VM resource groups (must be done by OS admin ahead of time)
  • Reporting of NUMA nodes associated with PCI devices

To allow dedicated resource allocation to specific guests there are a number of caveats/constraints

  • Dedicated pCPU. As well as setting the desired pCPU affinity on the guest in question, all other guests on the host must be forced to avoid the dedicated pCPU of the first guest. There are several approaches to achieve this
    • Create two resource groups for guests at host provisioning time, and split pCPU resources between the 2 groups. Start dedicated resource guests in one group and overcommit resource guests in the other group (see the cpuset sketch after this list)
    • Have hosts which are used solely for dedicated resource guests with no overcommit
    • Dynamically update the pCPU affinity of all existing guests when starting a dedicated resource guest
    • Set up-front pCPU affinity on all guests, to reserve some portion of pCPUs for later usage by dedicated guests
    • Set fixed scheduler timeslices for the guests, but allow them to float freely across pCPUs
  • Dedicated RAM. There are again several approaches to achieve this
    • Use large pages for dedicated resource guests. This requires that the host have sufficient large pages free, and that the guest RAM be a multiple of large page size.
    • Create two resource groups for guests at host provisioning time, and split RAM resources between the 2 groups. Start dedicated resource guests in one group and overcommit resource guests in the other group
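
To make the resource group options above concrete, here is a sketch of how the host provisioning step might carve out a cpuset cgroup for dedicated guests. It assumes the cgroup v1 cpuset controller is mounted at /sys/fs/cgroup/cpuset; the group name and CPU/node ranges are examples only:

  import os

  def make_cpuset_group(name, cpus, mems, root="/sys/fs/cgroup/cpuset"):
      """Create a cpuset cgroup confining its members to `cpus` and `mems`.

      eg make_cpuset_group("dedicated", "8-15", "1") reserves pCPUs 8-15 and
      NUMA node 1 for dedicated resource guests; an "overcommit" group would
      be given the remaining pCPUs / nodes.
      """
      path = os.path.join(root, name)
      if not os.path.exists(path):
          os.mkdir(path)
      with open(os.path.join(path, "cpuset.cpus"), "w") as f:
          f.write(cpus)
      with open(os.path.join(path, "cpuset.mems"), "w") as f:
          f.write(mems)
      return path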

A complication of dedicated RAM allocation is that KVM has many different needs for RAM allocations beyond the primary guest RAM. There is guest video RAM, and arbitrarily sized allocations needed by the KVM process when processing I/O requests. To a lesser extent this also affects vCPU needs, since there are KVM emulator threads that do work on behalf of the guest. Further, the host OS in general requires both CPU and RAM resources.

VMWare

TBD

XenAPI

Design

Permissions

As a general rule, any time there are finite resources that are consumed by execution of a VM, the cloud administrator must have absolute control over the resource allocation. This in turn implies that the majority of the configuration work will be at the host level (nova.conf, etc) or at the flavour level. The only time where it is appropriate to permit end user image level config is for aspects which don't impact resource usage beyond what the flavour already allows for. From this it can be seen that the only parameters that are likely to be permissible at the image level are those related to vCPU topology, since that has negligible impact on host resource utilization, primarily being a mechanism for complying with software licensing restrictions.

Configuration

It should be clear from the background information that to maximise utilization of host resources, it is important to make full use of facilities such as NUMA and large pages. It follows from this that, even with zero configuration out of the box, it is desirable for Nova to make an effort to do the best NUMA placement for guests, taking into account large pages where available. Explicit configuration should only be required in the subset of deployments which want to make a different set of performance/guest fit tradeoffs to suit specific requirements, or where the cloud provider wishes to artificially restrict placement to fit with different pricing tiers.

vCPU topology

  • The end user should have the ability to express the constraints their OS image has wrt vCPU topology.
    • To restrict topology (eg max_sockets==2) used by the guest to comply with OS licensing needs.
  • The cloud administrator should have the ability to express the preferred or mandatory vCPU topology for guests against flavours
    • To place limits on the topologies an end user can specify, to prevent the user defining topologies that force sub-optimal NUMA placement.
    • To setup a default topology (eg max_sockets==2) to ensure guest OS images comply with common OS licensing needs without needing per-user image properties
  • Where there is a conflict between user image constraints and administrator flavour constraints, the flavour might take priority
    • ie if the flavour guest RAM is known to span multiple host NUMA nodes, the user's max_sockets=1 setting must be overridden by a flavour's min_sockets=2 setting to ensure that the scheduler isn't forced to do poor NUMA placement which would waste host resources

As noted in earlier discussion, the only time it makes sense to configure a guest with threads != 1 is if the guest vCPUs are being strictly bound to host pCPUs. As such there is no compelling reason to expose the ability to configure thread counts to the user. The virtualization driver should use threads if it is appropriate to do so based on the vCPU<->pCPU binding policy.

From this it could follow that there are 3 sets of parameters required to express vCPU topology, 1 set at the image and 2 sets at the flavour

  • image settings - constraints related to the OS associated with the image
    • max_sockets=N (maximum supported number of sockets, assume==INF if omitted)
    • max_cores=N (maximum supported number of cores, assume==INF if omitted)
  • flavour settings - providing default behaviour, and minimum requirements
    • sockets=N (default number of sockets, assume ==vcpus or == vcpus/cores if omitted)
    • cores=N (default number of cores, assume ==vcpus/sockets if omitted)
    • min_sockets=N (minimum required number of sockets, assume==1 if omitted)
    • min_cores=N (minimum required number of cores, assume==1 if omitted)


Priority of config parameters (lowest to highest): flavour default, image maximum, flavour minimum

Typical usage:

  • Zero config setup
    • N flavour vCPUs == N sockets
    • allows maximum flexibility with NUMA placement
  • Administrator sets flavour sockets=2
    • cores is calculated by dividing the vcpu count by the socket count, eg a 6 vcpu flavour gets 2 sockets, 3 cores
    • Windows OS licensing works out of the box
  • User sets image max_sockets=2
    • max_sockets overrides the flavour default 'sockets' unless it violates the flavour min_sockets
    • Windows OS licensing works, if the user launches it with a flavour that doesn't have conflicting min_sockets
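
A sketch of that resolution order (flavour default, then image maximum, then flavour minimum) using the parameter names proposed above; this is illustrative pseudo-logic, not the eventual Nova implementation:

  def resolve_sockets(vcpus, flavour, image):
      """Resolve the socket/core counts from flavour and image settings.

      flavour and image are plain dicts using the parameter names above, eg
      flavour={"sockets": 2, "min_sockets": 1}, image={"max_sockets": 2}.
      """
      sockets = flavour.get("sockets", vcpus)                     # flavour default
      sockets = min(sockets, image.get("max_sockets", sockets))   # image maximum
      sockets = max(sockets, flavour.get("min_sockets", 1))       # flavour minimum
      cores = max(1, vcpus // sockets)
      return sockets, cores

  # Zero config:          resolve_sockets(4, {}, {})                 -> (4, 1)
  # Flavour sockets=2:    resolve_sockets(6, {"sockets": 2}, {})     -> (2, 3)
  # Image max_sockets=2:  resolve_sockets(4, {}, {"max_sockets": 2}) -> (2, 2)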

NUMA topology

  • Administrator can define guest NUMA nodes against flavour
    • To force guest RAM to be considered as multiple NUMA nodes to allow more efficient placement on host NUMA nodes
    • Administrator should declare vCPU topology to satisfy sockets-per-node needs. ie if setting 2 nodes, then at least set min_sockets=2

A simple approach is to specify only the number of NUMA nodes desired. The RAM and sockets would be divided equally across nodes. This minimises the complexity of the configuration parameters. If no NUMA node count is defined, then the hypervisor is free to use whatever NUMA topology it wants to in the guest, if any.

  • Zero config setup
    • Hypervisor chooses how many NUMA nodes to setup as it sees fit based on how the guest RAM / vCPU allocation best fits into host RAM/vCPU availability
  • Administrator sets numa_nodes=1
    • Hypervisor never sets any NUMA topology for the guest, even if guest RAM/vCPU allocation exceeds host RAM/vCPU availability in a single node.
  • Administrator sets numa_nodes=4
    • Hypervisor sets up 4 guest NUMA nodes and spreads RAM + vCPUs equally across nodes
  • Administrator sets vcpus=6,numa_nodes=2,vcpus.0=0,1,vcpus.1=2,3,4,5,mem.0=2,mem.1=4
    • Hypervisor sets up 2 NUMA nodes, the first with vcpus 0 & 1 and 2 GB of RAM, the second node with vcpus 2, 3, 4, 5 and 4 GB of RAM.
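
A sketch of how such flavour settings might be parsed into a per-node layout (the key names follow the example above; the exact extra spec syntax is still only a proposal):

  def parse_numa_extra_specs(specs):
      """Turn flavour settings like the example above into a node layout.

      eg {"numa_nodes": "2", "vcpus.0": "0,1", "vcpus.1": "2,3,4,5",
          "mem.0": "2", "mem.1": "4"}
      -> [{"vcpus": [0, 1], "mem_gb": 2}, {"vcpus": [2, 3, 4, 5], "mem_gb": 4}]
      """
      nodes = int(specs.get("numa_nodes", 1))
      layout = []
      for n in range(nodes):
          vcpus = [int(v) for v in specs["vcpus.%d" % n].split(",")]
          layout.append({"vcpus": vcpus, "mem_gb": int(specs["mem.%d" % n])})
      return layout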

Note that the administrator never defines anything about how the guest is placed into host NUMA nodes. The hypervisor will always decide this as best it can based on how the guest topology is configured. The scheduler would be able to apply some rules for picking hosts whose NUMA topology best fits the needs of the flavour.

Large pages

  • Administrator can define large page usage policy against flavour
    • To define a high performance flavour that is guaranteed 1 GB pages
    • To prevent use of large pages by a flavour, to ensure they are available for other flavours
  • Zero config setup
    • Hypervisor chooses whether or not to make use of large pages as it sees fit
  • Administrator sets page_sizes=large
    • Hypervisor will not start the guest unless sufficient large pages are available
  • Administrator sets page_sizes=any
    • Hypervisor will try to find the largest pages first, but fall back to smaller pages if not available
  • Administrator sets page_sizes=small
    • Hypervisor will never use large pages for the guest, even if available
  • Administrator sets page_sizes=1GB
    • Hypervisor will not start the guest unless it can find 1 GB large pages. Will not use 2 MB large pages even if available
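
A sketch of how a driver might interpret the page_sizes policy against the page sizes actually free on the chosen NUMA node (sizes in KiB; the policy strings match the scenarios above, everything else is illustrative):

  def choose_page_size(policy, free_sizes_kb):
      """Pick a page size in KiB for guest RAM, or refuse if policy can't be met.

      policy: "small", "large", "any", or an explicit size such as "1GB".
      free_sizes_kb: page sizes with free pages on the target node, eg [4, 2048].
      """
      large = sorted((s for s in free_sizes_kb if s > 4), reverse=True)
      if policy == "small":
          return 4
      if policy == "large":
          if not large:
              raise RuntimeError("no large pages available - do not start guest")
          return large[0]
      if policy == "any":
          return large[0] if large else 4
      # explicit size, eg "2MB" or "1GB"
      wanted = {"2MB": 2048, "1GB": 1048576}[policy]
      if wanted not in free_sizes_kb:
          raise RuntimeError("%s pages not available - do not start guest" % policy)
      return wanted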

Dedicated resources

  • Administrator can define that a flavour has dedicated pCPUs
    • To guarantee a flavour which has zero contention from other guests
  • Administrator can define that a flavour has exclusive pCPUs
    • To guarantee a flavour which has zero contention from other guests or OS service/kernel threads

Based on this some configuration scenarios are

  • Zero config
    • Hypervisor will freely overcommit RAM or vCPUs
  • Admin sets overcommit_ram=0 on flavour
    • Hypervisor will assign dedicated RAM for the guest, but can still overcommit vCPUs
  • Admin sets overcommit_vcpus=0 on flavour
    • Hypervisor will assign dedicated vCPUs for the guest, but can still overcommit RAM
  • Admin sets overcommit_ram=0,overcommit_vcpus=0 on flavour
    • Hypervisor will assign dedicated vCPUs and RAM for the guest

Scheduler

Currently, libvirt and other drivers (xenapi ?) will report their CPU info. Libvirt uses this information at the moment to check for compatibility between source and destination hypervisors when live-migrating, but nowhere else. This data also does not include any NUMA information, nor does it include any usage info. In order to make this useful for scheduling, we would need to make sure that compute hosts are exposing the needed information to the scheduler.

It would also be good to better define and standardize across virt drivers the format of the data that is currently kept in the database as a JSON blob. It may also be necessary to change the way we store this information in the database for performance reasons, which can prove important especially for scheduling.

Based on the above discussion, especially regarding resource dedication, it seems that it would make sense to allow for two distinct types of compute hosts:

  • Allow oversubscription
  • Do not allow oversubscription (the host is reserved for strict placements regarding memory and CPU).

vCPU topology

As discussed above, some of this data is already exposed, however no utilisation information is being exposed. We will need to keep track of the usage of cores per NUMA node. When it comes to threads, we will want to keep treating a thread as a vCPU as we do now, since there is not a huge performance gain compared to being confined to a NUMA node, but still expose the topology to the guest when requested. vCPU topology plays a role in scheduling when using flavors with no oversubscription, and can be used for weighing based on NUMA placement when scheduling a non-strict guest.

NUMA placement

Correct NUMA placement should be attempted for all guests. It is likely that we will want to weight hosts based on whether an instance can be placed on a single NUMA node, as well as allow for stacking and spreading (like we do with vcpu and memory at the moment). As flavors are fully controlled by the admin, they should express whether an instance requires dedicated NUMA placement.

Hosts that are meant to be used without oversubscription should expose this to the scheduler through a new column in the compute_host table. This can in turn be set at deployment time through a config option. Although this makes scheduling and writing additional filters to support this easier, it can complicate deployment, as whether a host will be used for dedicated NUMA placement must be decided at deploy time.

The data that nodes will need to expose is a list of NUMA nodes with free and available sockets/cores and memory. We will need to consider CPU topology in both flavor and image when scheduling against this. This data will also need to be added to the compute_host data model.
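
A sketch of the kind of filter/weigher logic this data would enable: check whether the requested vCPUs and RAM fit inside any single NUMA node on the host, and prefer hosts where they do (the data structures are hypothetical stand-ins for what the compute_host model would expose):

  def fits_single_node(host_nodes, req_vcpus, req_mem_mb):
      """host_nodes: list of {"free_cpus": int, "free_mem_mb": int}, one per node."""
      return any(n["free_cpus"] >= req_vcpus and n["free_mem_mb"] >= req_mem_mb
                 for n in host_nodes)

  def numa_weight(host_nodes, req_vcpus, req_mem_mb):
      """Higher is better: strongly prefer a single node fit, otherwise fall
      back to total free capacity across all nodes."""
      if fits_single_node(host_nodes, req_vcpus, req_mem_mb):
          return 1000
      total_cpus = sum(n["free_cpus"] for n in host_nodes)
      total_mem = sum(n["free_mem_mb"] for n in host_nodes)
      if total_cpus >= req_vcpus and total_mem >= req_mem_mb:
          return 100    # only fits by spanning NUMA nodes
      return 0          # does not fit; a strict flavour would be rejected here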

Large pages

Unlike NUMA, huge pages, if asked for by the flavor, will need to be on a host that allows for them and has pre-allocated them. It is likely that this will be required on the same hosts that will be dedicated to NUMA as well, so we might want to make this an explicit dependency. For example, if a host has huge pages configured, it is also considered as strict from a NUMA perspective.

THP can be used on hosts that allow oversubscription, and the scheduling can take this into account if there is a request for huge pages that is best effort rather than a hard rule.

Hosts can then report whether they support pre-allocated (reserved) huge pages or THP, and in the case of strict placement they will report the number of free pages.

Based on the above we can have flavors that have huge_pages set to 'strict' which will mean that scheduler will fail the instance if no host has enough huge pages free to satisfy the hard memory requirement in a single NUMA node.

Dedicated resources