Enhanced-platform-awareness-pcie

Background:
There is a growing movement in the telecommunications industry to transform the network. This transformation includes the distinct, but mutually beneficial disciplines of Software Defined Networking and Network Functions Virtualization. One of the challenges of virtualizing appliances in general, and virtualizing network functions in particular, is to deliver near native (i.e. non-virtualized) performance.

Many virtual appliances have intense I/O requirements, many can benefit from access to high performance instructions or accelerators for workloads such as cryptography, and others would like to get direct access to GPUs.

Network Functions Virtualization
Network Functions Virtualization (NFV) is an Industry Specification Group (ISG) in the European Telecommunications Standards Institute (ETSI). Leading network operators, TEMs, IT vendors and technology providers have joined the ISG. One of their goals is to “Develop requirements and architecture specifications for the hardware and software infrastructure required to support these virtualized functions, as well as guidelines for developing network functions.”[Ref 1]

One of the challenges of virtualizing functions in general, and virtualizing network functions in particular, is to deliver near native (i.e. non-virtualized) performance. To help address this challenge, developers of virtualized network functions leverage virtualization technologies traditionally associated with IT deployments. They also develop their applications with specific drivers for platform features such as cryptographic accelerators, and may have optimized code that utilizes specific core instructions.

Enhanced Platform Awareness
There is a growing demand for the cloud OS to have greater awareness of the capabilities of the platforms it controls. The Enhanced Platform Awareness (EPA) related updates proposed to OpenStack aim to facilitate better informed decision making related to VM placement and to drive tangible improvements for cloud tenants.

This proposal focuses on how to leverage PCIe devices in cloud infrastructure, and looks in particular at Single Root IO Virtualization (SR-IOV) as one technology that can be used to dramatically improve the performance available to the virtual machine. This proposal builds upon ideas already proposed in several forum discussions and the following blueprints:
 * nova/xenapi-gpu-passthrough (https://blueprints.launchpad.net/nova/+spec/xenapi-gpu-passthrough)
 * nova/pci-passthrough (https://blueprints.launchpad.net/nova/+spec/pci-passthrough)
 * nova/pci-passthrough-and-sr-iov (https://blueprints.launchpad.net/nova/+spec/pci-passthrough-and-sr-iov)

Sample Use Case
A sample use case for the feature described herein relates to the deployment of a virtualized Evolved Packet Core node (part of the wireless core infrastructure) in a private cloud.

The Evolved Packet Data Gateway (ePDG) is used to secure connections to User Equipment over untrusted non-3GPP access infrastructure via IPsec tunnels. Access to hardware acceleration and/or software optimization for this crypto workload could significantly improve throughput. The IPsec stack in the ePDG could be developed to leverage cryptographic algorithm acceleration via a virtualized function driver, or it could have an optimized implementation leveraging specific core instructions; if neither is available, it would revert to an unoptimized solution with a considerable performance reduction.

From the perspective of deploying this application and its Virtual Machine in a private heterogeneous cloud, three of the salient challenges are:
 * Find a platform with the specific platform feature (e.g. instruction or accelerator) that the application has been developed for.
 * In the accelerator case, allocate an SR-IOV Virtual Function to the VM running the ePDG application.
 * Enable live-migration.

Existing solutions in OpenStack?
Host Aggregates offer a potential partial solution: the administrator can sub-divide an Availability Zone into some number of aggregates, set metadata (key-value pairs) on each aggregate, and assign compute nodes to it. However, this requires that the administrator knows where each capability exists, which becomes unwieldy and error prone in large deployments, particularly as the number of capabilities the admin may want to provide to their users grows. Even with host aggregates, another aspect of a solution is missing: VM.xml generation support does not exist to allocate a virtual function to the VM.

One intended benefit of this blueprint proposal is to ease the burden on the administrator through automatic detection of the platform capabilities.

Proposed Updates to OpenStack

 * Initial focus leverages libvirt and KVM although it should be extensible to other hypervisors.
 * The initial user is most likely to be in a private cloud deployment.
 * Note: Although Intel developed technologies and devices have been mentioned in this blueprint as examples of usage, it is intended that the design proposal is also applicable to solutions from other vendors.

Step 1a: Retrieve

 * Retrieve the list of PCIe devices in XML format from libvirt via the listDevices call.
 * For each device returned by the first call, the additional info for that device can be determined with a call to nodeDeviceLookupByName.
 * Convert the XML list to a Python dict.
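The retrieval steps above can be sketched as follows. This is a minimal illustration using the Python standard library: the sample XML mirrors the shape of the document libvirt returns for a PCI node device (via nodeDeviceLookupByName().XMLDesc()), and the device name and IDs are illustrative, not taken from a real host.

```python
import xml.etree.ElementTree as ET

# Sample node-device XML of the form libvirt returns; the name and
# IDs below are illustrative.
DEVICE_XML = """
<device>
  <name>pci_0000_0b_10_0</name>
  <capability type='pci'>
    <domain>0</domain>
    <bus>11</bus>
    <slot>16</slot>
    <function>0</function>
    <product id='0x0443'>DH8910 QuickAssist VF</product>
    <vendor id='0x8086'>Intel Corporation</vendor>
  </capability>
</device>
"""

def device_xml_to_dict(xml_str):
    """Convert one libvirt node-device XML document to a flat dict."""
    root = ET.fromstring(xml_str)
    cap = root.find("capability")
    return {
        "name": root.findtext("name"),
        "capability_type": cap.get("type"),
        "bus": int(cap.findtext("bus")),
        "slot": int(cap.findtext("slot")),
        "function": int(cap.findtext("function")),
        "vendor_id": cap.find("vendor").get("id"),
        "product_id": cap.find("product").get("id"),
    }

dev = device_xml_to_dict(DEVICE_XML)
```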

Step 1b: Whitelist

 * A whitelist concept is proposed to offer the administrator an option to specify which devices may be exposed to the scheduler.
 * The whitelist will be applied to the device dict.
 * The whitelist is specified as an additional entry in nova.conf.
 * If it is specified, only devices in the list are reported.
 * If the whitelist is not specified, all devices are reported.
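A sketch of how the whitelist might be applied to the device dicts. The whitelist format shown (vendor/product ID pairs) is an assumption for illustration; this proposal does not fix the nova.conf syntax.

```python
# Hypothetical whitelist as parsed from nova.conf: a list of
# (vendor_id, product_id) pairs. The format is an assumption.
WHITELIST = [("0x8086", "0x0443")]

def apply_whitelist(devices, whitelist):
    """Report only whitelisted devices; if no whitelist is
    configured, report all devices."""
    if not whitelist:
        return devices
    allowed = set(whitelist)
    return [d for d in devices
            if (d["vendor_id"], d["product_id"]) in allowed]

devices = [
    {"vendor_id": "0x8086", "product_id": "0x0443"},
    {"vendor_id": "0x10de", "product_id": "0x1db4"},
]
```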

Step 1c: Alias

 * To simplify and abstract the PCIe info that needs to be shared with the controller scheduler and stored in nova-db, an alias mechanism is proposed.
 * The alias mechanism keeps libvirt-specific formats in the libvirt driver and allows a more generic view to be exposed to the scheduler.
 * The alias mapping for this string could be provided by vendors and added to nova.conf by the administrator.
 * The alias presented to the scheduler will be a simplified key/value pair:
 * --key= --value=
 * The alias name would be translated in the libvirt driver to a string describing a PCIe device such as:
 * capabilities:devices: (capability_type=pci & capability: (vendor_id= & product_id=) & device_type=)
 * Note: The keywords above are: capabilities, devices, capability_type, capability, product_id, vendor_id, device_type.
 * device_type selected from set {NIC, ACCEL, GPU, etc.}
 * This setting permits different types of libvirt allocation of the Virtual Function.
 * Example alias mapping for the Intel® QuickAssist Integrated Accelerator functionality in the Intel® Communications Chipset 8910 (Intel® DH8910 PCH):
 * QuickAssist=capabilities:devices: (capability_type=pci & capability: (product_id=0x443 & vendor_id=0x8086) & device_type=ACCEL)
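The alias mapping line above could be decomposed along these lines. This is a simplified sketch: the parsing helper and the flat field extraction are illustrative, not a proposed implementation.

```python
import re

# The example alias mapping from this proposal.
ALIAS_LINE = ("QuickAssist=capabilities:devices: (capability_type=pci & "
              "capability: (product_id=0x443 & vendor_id=0x8086) & "
              "device_type=ACCEL)")

def parse_alias(line):
    """Split an alias mapping line into its alias name and a dict of
    the key=value fields in the device description string."""
    name, spec = line.split("=", 1)
    fields = dict(re.findall(r"(\w+)=(\w+)", spec))
    return name, fields

name, fields = parse_alias(ALIAS_LINE)
```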

Step 1d: Store PCIe info

 * The PCIe info identified above, in addition to the alias mapping, needs to be stored on the compute node.
 * It is proposed that the tracking of this info and the allocation status (free or allocated) of PCIe devices should happen in the Resource Tracker on the compute node.
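A minimal sketch of the proposed tracking, assuming a simple per-alias pool of Virtual Functions. The class and method names are illustrative, not actual Nova code.

```python
class PciDeviceTracker:
    """Per-host PCIe VF accounting sketch, mirroring how the Resource
    Tracker accounts for vCPUs and memory. Names are illustrative."""

    def __init__(self):
        # alias name -> {"total": n, "free": n}
        self.pools = {}

    def add_device(self, alias, num_vfs):
        """Register num_vfs Virtual Functions under an alias."""
        pool = self.pools.setdefault(alias, {"total": 0, "free": 0})
        pool["total"] += num_vfs
        pool["free"] += num_vfs

    def claim(self, alias, count=1):
        """Allocate VFs to a VM, failing if too few remain free."""
        pool = self.pools.get(alias)
        if pool is None or pool["free"] < count:
            raise ValueError("no free virtual functions for %s" % alias)
        pool["free"] -= count

    def release(self, alias, count=1):
        """Return VFs to the pool when a VM is deleted."""
        self.pools[alias]["free"] += count

tracker = PciDeviceTracker()
tracker.add_device("QuickAssist", 32)
tracker.claim("QuickAssist")
```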

Step 1e: Relay PCIe existence and allocation state to the controller

 * Notify nova-db of PCIe devices and their allocation status.
 * Follow the same design pattern as related to notification of info for vCPUs, memory etc.
 * Info to be stored in the DB includes:
 * The alias name, number of virtual functions, number of free virtual functions.
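The per-alias entries relayed to nova-db might take a shape like the following (field names are illustrative, not a proposed schema):

```python
def pci_stats(pools):
    """Build the per-host stats entries to relay to nova-db, one per
    alias: alias name, total VFs, and free VFs (names illustrative)."""
    return [{"alias": alias,
             "vfs_total": pool["total"],
             "vfs_free": pool["free"]}
            for alias, pool in sorted(pools.items())]

stats = pci_stats({"QuickAssist": {"total": 32, "free": 30}})
```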

Step 2: Extend Nova Scheduler to handle PCIe device allocation
Solution targets the ComputeCapabilitiesFilter and ImagePropertiesFilter.
 * Taking the ComputeCapabilitiesFilter as an example:
 * In order to associate a PCIe device with a VM, the administrator would associate an extra_spec key/value pair with a flavor. An example of requesting one Virtual Function on the Intel QuickAssist Integrated Accelerator would be:
 * --key=QuickAssist --value=1
 * When the tenant or administrator launches an instance of the flavor, during scheduling nova will find a compute node with the specified device and number of Virtual Functions available.
 * Once the compute node has been selected, the request is sent to the compute node libvirt driver to create the VM with the PCIe device Virtual Functions allocated.
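The scheduling check could be sketched as follows, assuming the per-host stats of Step 1e and a flavor extra_spec such as QuickAssist=1. The function and field names are illustrative, not the actual filter code.

```python
def host_passes(host_pci_stats, flavor_extra_specs):
    """ComputeCapabilitiesFilter-style check (sketch): the flavor
    requests VFs by alias name, and the host passes only if enough
    free Virtual Functions remain for every requested alias."""
    for alias, wanted in flavor_extra_specs.items():
        free = host_pci_stats.get(alias, {}).get("vfs_free", 0)
        if free < int(wanted):
            return False
    return True

# A host reporting 4 free QuickAssist VFs (illustrative numbers).
host = {"QuickAssist": {"vfs_total": 32, "vfs_free": 4}}
```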

Step 2 (continued): Future Extensions

 * Although not covered in this submission, it is envisioned that greater levels of abstraction could be layered on top of this low level view of PCIe devices. This could be particularly interesting for public cloud deployments, where a request for a PCIe capability would be more appropriately positioned at the “10G Ethernet” or “Crypto Accelerator” level.

Step 3: Configure the VM for Deployment
Extend VM.xml generation (in virt/libvirt/driver.py, and virt/libvirt/config.py) to include allocation of a PCIe device virtual function to a VM.
 * Pre-allocate the number of requested Virtual Functions associated with Physical PCIe device for the VM.
 * Follow the same design semantics as the pre-allocation of vCPUs.
 * The libvirt driver can use the device_type setting to choose the device assignment mechanism that it will use. For an accelerator device, a libvirt XML entry such as the following would be applicable (address values are illustrative):

   <hostdev mode='subsystem' type='pci' managed='yes'>
     <source>
       <address domain='0x0000' bus='0x0b' slot='0x10' function='0x0'/>
     </source>
   </hostdev>

 * The bus, slot and function values can be determined by the nova-compute libvirt driver by looking up the PCIe device records in the nova-compute Resource Tracker.
 * Choosing the “managed” setting in libvirt will help to support live-migration.
 * Libvirt will detach the Virtual Function from the VM, and re-attach it to the host before migration.
 * On the destination compute node, libvirt will attempt to attach an identical virtual function to the VM.
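The guest XML generation described in this step could be sketched as follows: building the managed hostdev entry from the bus, slot and function values held for the device. This is a simplified illustration using the standard library, not the actual driver code.

```python
import xml.etree.ElementTree as ET

def hostdev_xml(bus, slot, function, domain=0):
    """Build the <hostdev> element the libvirt driver would append to
    the guest XML for an assigned Virtual Function. managed='yes' is
    chosen to help support live-migration, as described above."""
    hostdev = ET.Element("hostdev", mode="subsystem", type="pci",
                         managed="yes")
    source = ET.SubElement(hostdev, "source")
    ET.SubElement(source, "address",
                  domain="0x%04x" % domain, bus="0x%02x" % bus,
                  slot="0x%02x" % slot, function="0x%x" % function)
    return ET.tostring(hostdev).decode()

# Address values are illustrative.
xml = hostdev_xml(bus=11, slot=16, function=0)
```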

Not covered by this blueprint
It is envisioned that this technique should also be applicable to GPU allocation and NIC allocation. Allocating a NIC SR-IOV Virtual Function into the VM via a libvirt entry such as <interface type='hostdev'> introduces several network related considerations that need to be catered for. Additions to the mechanism for creating ports in Quantum could give Quantum a trigger for executing solutions (e.g. a plugin) designed to address the additional network requirements.