Meetings/Passthrough

Discussion
- Nova boot to launch VM with SR-IOV port without port being precreated in Neutron
- Existing Abandoned patches:
  - https://review.openstack.org/#/c/67503/
  - https://review.openstack.org/#/c/67500/

Paris Summit Task List
- Kilo task list with priorities from Paris Summit Meetup

Agenda on Oct. 28th, 2014

Bugs:

Reviews:

Blueprints:
- Live migration with macvtap SR-IOV

Agenda on May 26th, 2014

Nova SR-IOV networking
- nova spec review
- patch submission soon
- tempest tests
Bugs
- Init PCI manager after compute node initialization
- makes sure correct PCI device allocation
Other works/features
- Live migration
- Attach SR-IOV port to an existing instance
- possibly changing the names of existing vnic-types
- associate a default vnic-type with a neutron network
- admin controlled knobs
- HA
- others

Agenda on March 25th, 2014

PCI SR-IOV Networking use case

Refer to https://docs.google.com/document/d/1zgMaXqrCnad01-jQH7Mkmf6amlghw9RMScGLBrKslmw/edit

Recap of Discussions

Current PCI Passthrough

How to Use Wiki: https://wiki.openstack.org/wiki/Pci_passthrough

PCI whitelist: defines all the PCI passthrough devices that are available on a compute node. It's currently based on <vendor_id> <product_id>
PCI stats group: defines the keys based on which a PCI device is accounted. It's currently based on the keys: <vendor_id> <product_id> <extra_info or simply PCI address of a physical function>.
PCI Alias: specifies a list of PCI device requirements. A PCI requirement specifies a dictionary with keys: <alias name> <vendor_id> <product_id> <device_type>. Multiple requirements can be specified per PCI Alias with the same <device_type> (which may not work yet due to bugs). <alias name> is required. Others are optional.
Nova Server Flavor: PCI requirements can be added into a nova flavor as it's extra-spec in the syntax "pci_passthrough:alias"="<PCI Alias Name>:<count>{,<PCI Alias Name>:<count>}"
PCI Passthrough filter for nova scheduler: this filter works based on the PCI stats group and PCI aliases referenced in the nova flavor. If mutliple aliases exist in the flavor, all of the them have to be satified. If multiple PCI requirements exist in one PCI alias, only one of the requirements defined in the alias needs to be satisfied. Suppose a PCI requirement is represented as R, and a PCI alias as (R1 OR R2 OR ...):count. Further assuming two PCI aliases, with the first one having two PCI requirements, with the second having one PCI requirement. Logically speaking, in order to choose a host as candidate, it must satisfy

       (R11 OR R12):count AND R21:count

Given a PCI requirement, it's used to match against PCI stats groups until one or multiple matches are found to satisfy the count. Note that matching is based on the keys existing in the PCI requirement. Therefore it's possible for a PCI requirement to be matching multiple PCI stats groups.

Motivation for enhancement

PCI whitelist only uses <vendor_id> and <product_id>. This doesn't work if a compute node has multiple vNICs from the same vendor that are not used in the same way.
PCI requirement only uses <vendor_id> and <product_id>. This doesn't work when the vNIC ports are connected to different physical networks . In other words, PCI devices from the same vendor may not be treated equally when it comes to choose vNIC ports for a VM.

What we have discussed so far to satisfy the above requirements

PCI Group

Refer to https://docs.google.com/document/d/1EMwDg9J8zOxzvTnQJ9HwZdiotaVstFWKIuKrPse6JOs/edit#heading=h.30de7p6sgoxp. The doc becomes messy and almost unreadable. But refer to https://wiki.openstack.org/wiki/Meetings/Passthrough#Agenda_on_Jan._8th.2C_2014 for the main ideas behind it.
PCI Group: a collection of PCI devices that share the same functions or belong to the same subsystem in a cloud
PCI whitelist: each entry is defined as a tuple of <device-filter, pci-group-name>. The device-filter uses vendor_id, product_id and PCI addresses to specify a collection of PCI devices.
PCI stats group: each PCI group is a PCI stats group
PCI Alias: alias is no longer needed
Nova Server Flavor: Use G to represent a PCI group. An exampel of PCI requirements can be specified as:

                 (G1 OR G2):count AND G3:count

It defines that the VM needs "count" of PCI devices that are either from G1 or G2, and 'count" of PCI devices from G3.

It supports only one well-defined tag called PCI group, which is considered too limited by some of the folks. It simplifies the existing implementation without losing any of its capabilities (or use cases the current implementation can support).

PCI Flavor

Refer to https://docs.google.com/document/d/1vadqmurlnlvZ5bv3BlUbFeXRS_wh-dsgi5plSjimWjU/edit

PCI whitelist: it's called PCI information in the above wiki
PCI stats group: not clearly defined
PCI flavor Attributes: refer to the wiki
PCI flavor API: refer to the wiki

This can be considered a generalized version of PCI Group as it supports arbitrary number of arbitrary tags. But system behaviors (PCI stats groups and scheduling, for example) are not clearly specified. Use cases to justify it is yet to be established.

Host Aggregate

It was brought up during the discussion that the existing host aggregate may be used to support the new requirement. A couple of issues with host aggregate were discussed:

it doesn't support dynamic host join to a host aggregate
scheduling with host aggregate is not stats based.

Nic Type/Flavor

Refer to http://lists.openstack.org/pipermail/openstack-dev/2014-January/023981.html and http://lists.openstack.org/pipermail/openstack-dev/2013-December/022737.html

A compromised implementation

to speed up the development and facilitate integration testing with neutron. Refer to https://blueprints.launchpad.net/nova/+spec/pci-passthrough-sriov. Also refer to the submitted patch.

BPs & Patches

Neutron Side

RKukura (binding:vif_details and binding:profile)
- https://blueprints.launchpad.net/neutron/+spec/vif-details
- https://blueprints.launchpad.net/neutron/+spec/ml2-binding-profile
Sadasu (Cisco MD)
- https://blueprints.launchpad.net/neutron/+spec/ml2-ucs-manager-mechanism-driver
Irenab (vnic_type and mlnx MD)
Baoli (for anything else that's needed for SRIOV in neutron)
- https://blueprints.launchpad.net/neutron/+spec/pci-passthrough-sriov

Nova Side

Itzik's patch to add physical network in VIF

https://review.openstack.org/#/c/59093/

Agenda on Feb. 13th, 2014

Discuss the Neutron SR-IOV's requirements on the nova generic PCI support, and their availability
- in the PCI information definition:
  - the pci device filter expression (I don't see it's given a name in https://wiki.openstack.org/wiki/PCI_passthrough_SRIOV_support_Icehouse) should support (which may be done as an enhancement).
    - specifying a physical function in the form of domain:bus:slot.func
    - specifying a physical function in the form of a ethernet interface name.
  - tagging with an attribute called "net-group" or simply "physical_network". The tag name "phyical_network" now may make more sense based on the consensus reached on Feb. 12th.
- PCI stats based on the tag "physical_network"
- An API to retrieve a tag's value on a per PCI device basis, something like: get_pci_device_tag_value(pci_dev, tag_name)
- An API to create a PCI request for scheduling purpose
- An API to retrieve a PCI device that is linked to the original PCI request.
- The existing pci_manager.get_instance_pci_devs(instance) shouldn't return PCI devices that are allocated as a result of the aforementioned PCI requests.

Feb. 12th, 2014 Recap

Consensus reached during this meeting:

The binding:profile dictionary will now have these keys defined: 'physical_network', 'pci_vendor_info', 'pci_slot'.
- physical_network: its value is the physical network name that has been chosen for the nic (neutron port) to attach to. ML2 MD will use this information:
  - In the non-SR-IOV case with the agent-based MDs, the ML2 plugin's port binding code iterates over the registered MDs to try to bind, calling bind_port() on each. Within the AgentMechanismDriverBase.bind_port() implementation, it iterates over the network segments, calling check_segment_for_agent() on the derived class for each segment. The 1st segment that is tried for which the agent on the node identified by binding:host_id has a mapping for the segment's physical_network is used.
  - in the non-SR_IOV case, need to add a line of code in the existing agent-based MDs check_segment_for_agent() to make sure vnic_type == 'virtio', so it won't bind when SR-IOV is required. Irenab's vnic BP will take care of this.
  - in the SR-IOV case, the physical network names that a neutron port can potentially attach to will be used for scheduling, and the change to the scheduling filter in the case of multiprovidernet extension is TBD.
  - in the SR-IOV case, SR-IOV MDs will need to iterate over the segments looking for the 1st one that has network_type of 'vlan' and that has the physical_network specified in binding:profile:physical_network; the SR-IOV MDs can include the segment's segmentation_id within the binding:vif_details so that VIF driver can put that into the libvirt XML
- pci_vendor_info: its value is a string with the format "vendor_id:product_id". Both vendor_id and product_id corresponds to the PCI device's vendor_id and product_id
- pci_slot: its value is a string with the format "domain:bus:slot.func" that correspond's to the PCI device's slot as named on a linux system.
The binding:vif_details will have keys depending on the neutron port's vif type
- profileid: this key will be used to support the vif type VIF_TYPE_802_1QBH
- vlan_id: this key will be used to support the vif type VIF_TYPE_HW_VEB [irenab - neutron uses 'segmentation_id'. Let's pass it in the binding:vif_details]
interface config and resulting interface XML will be generated based on both vnic_type and vif_type. The vnic_type, if not present as a key in the top level port dictionary, defaults to 'virtio'

Agenda on Jan 28th, 2014

nova-neutron-sriov

Openstack Icehouse schedule: https://wiki.openstack.org/wiki/Icehouse_Release_Schedule
Work Items for the initial release:
- Nova: generic PCI-passthrough [Nova folks to add details]
  - pci_information support, add tag/extra information to pci devices.
    - pci_information = { { 'device_id': "8086", 'vendor_id': "000[1-2]" }, { 'e.group' :'gpu' } }
  - pci flavor define attr can be used in the pci flavor and how pci stats report it's pool
    - pci_flavor_attrs = ['e.group']
  - pci schduler to support corresponding extra information.
- Nova: SRIOV
  - Dependencies on Nova Generic PCI-Passthrough
    - Support of PCI attribute sriov_group in the PCI passthrough device list
    - Support of PCI stats based on sriov_group
  - overal change breakdown
    - Nova server and scheduler changes:
      - requested network and SRIOV request spec management
    - Nova compute
      - SRIOV request spec--vif association
      - neutronv2 API in nova that supports interaction with neutron. Particularly speaking:
        to support the enhanced port binding [irenab: vnic_type, pci slot record]
        
        to support an enhanced vif dictionary (e.g., vlan id is missing from the dictionary)
      - libvirt driver to support sriov
      - vif driver to support sriov, especially to generate config for sriov device and to generate interface/network xml
      - to support live migration assuming per interface network xml
- Neutron:
  - - enhance neutron port-create to support vnic-type and port profile [ irenab: pci slot record]
    - to support the enhanced port binding and as a result API changes in the main plugin
    - to enhance the neutron client in support of the enhanced interaction between nova and neutron [Might not be needed]
    - to enhance various plugins, especially ml2 plugin to support the enhanced port binding (particularly vnic-type)

Agenda on Jan 27th, 2014

Main focus is on SRIOV
References:
- https://docs.google.com/document/d/1RfxfXBNB0mD_kH9SamwqPI8ZM-jg797ky_Fze7SakRc/edit?pli=1
- https://docs.google.com/document/d/1vadqmurlnlvZ5bv3BlUbFeXRS_wh-dsgi5plSjimWjU/edit?pli=1#
Some ideas that came out of discussions
- Assuming there is an attribute (based on pci flavor terminology) for sriov, which name is yet to be determined. Call it sriov_group temporarily for reference in below
- Assuming PCI stats will be supported based on this attribute
- Assuming for each physical network that's supported in the cloud, there is a corresponding sriov_group. To make it simple, say if there are two physical networks: physnet1 and physnet2, then there are two sriov_groups named physnet1 and physnet2
- sriov devices in the pci passthrough device list (the pci whitelist) will be tagged by "sriov_group: physnet1" or "sriov_group: physnet2" based on which physical network the device belongs to.
- Initially, we don't have to change nova CLI/API. Instead, the neutron port-create will be enhanced to support the following new arguments:
  - --vnic-type = “vnic” | “direct” | “macvtap”. OPTIONAL with default as "vnic". The actually names of the three possible vnic types for further discussion
  - port-profile=port-profile-name for support of IEEE802.1BR. OPTIONAL.
  - pci-flavor=pci-flavor-name. Support of this argument is debatable. If sriov devices can be used outside of the physical network, or it's desirable to further partition a physical network, it might be necessary to introduce this argument. In that case, pci flavors need to be explicitly created once it's available. This argument may not be needed initially.
- to create a VM with sriov NICs
  - Configuration on the controller node: pci attribute sriov_group
  - configurations on a compute node: physical networks as required by the neutron plugin, pci passthrough device list that are properly tagged with sriov_group.
  - neutron net-create that may be provided with the provider network name, or otherwise uses default physical network name as configured in the neutron plugin
  - neutron port-create --vnic-type direct <net-uuid>
  - nova boot --flavor m1.large --image <image-uuid> --nic port-id=<port-uuid-from-port-create>

Possible Work Items
- Existing BPs that are relevant based on the above
- The above BPs may be redefined or new BPs can be created based on the actual work items.
- Roughly speaking, there are the following work items
  - nova side: a proof of concept patch is captured in here: https://docs.google.com/document/d/196pcKK0iQBJwQfCP0MRaXndjnO-RGfgW1zfeo1YcY4A/edit?pli=1 and https://review.openstack.org/67500
    - Nova server and scheduler changes:
      - although no changes to --nic options for now, the semantics of the port-id parameter is enhanced to support sriov. Basically, if it's a sriov port, request_specs for scheduling should be created and managed.
      - how do we address this requirement that a compute node may support SRIOV port only, a new filter or enhancement to the existing pci passthrough filter? [irenab - May use host aggregate ]
    - Nova compute
      - Nova compute manager to associate a requested network with a request spec, thus corresponds a requested network with a PCI device that is allocated for it
      - neutronv2 API in nova that supports interaction with neutron. Particularly speaking:
        to support the enhanced port binding
        
        to support an enhanced vif dictionary (e.g., vlan id is missing from the dictionary)
      - libvirt driver to support sriov
      - vif driver to support sriov, especially to generate config for sriov device and to generate interface xml
      - to support live migration assuming per interface network xml
  - neutron side:
    - to support the enhanced port binding in the main plugin
    - to enhance the neutron client in support of the enhanced interaction between nova and neutron [irenab - if we use binding:profile, there should be no changes to existing support]
    - to enhance various plugins, especially ml2 plugin to support the enhanced port binding (particularly vnic-type)

Agenda on Jan 15th, 2014

Ian's proposal for flavor and backend tagging: https://docs.google.com/document/d/1vadqmurlnlvZ5bv3BlUbFeXRS_wh-dsgi5plSjimWjU/edit#
Review this against known use cases
Document any use cases not in that document

Agenda on Jan. 14th 2014

PCI group versus PCI flavor: let's sort out what exactly they are, APIs around them, and pros and cons of each.
Please check [1]
Division of works

POC Implementation

See POC implementation

Definitions

A specific PCI attachment (could by a virtual function) is described by:

vendor_id
product_id
address

There is a whitelist (at the moment):

which devices on a specific hypervisor host can be exposed

There is an alias (at the moment):

groupings of PCI devices

The user view of system

For GPU passthrough, we need things like:

user request a "large" GPU
could be from various vendors or product versions

For network, we need things like:

user requests a nic for a specific neutron network
they want to say if it's virtual (the default type) or passthrough (super fast, slow, etc)
this includes groups by address (virtual function, etc) so it's specific to a particular _group_ of neutron network, each with specific configurations (e.g. VLAN id, a NIC attached to a specific provider network)
or it involves a NIC that can be programmatically made to attach to a specific neutron network

The user view of requesting things

For GPU passthrough:

user requests a flavor extra specs *imply* which possible PCI devices can be connected
nova boot --image some_image --flavor flavor_that_has_big_GPU_attached some_name

The admin would expose a flavor that gives you, for example, one large GPU and one small GPU:

nova flavor-key m1.large set "pci_passthrough:alias"=" large_GPU:1,small_GPU:1"
TODO - this may change in the future

For SRIOV:

in the most basic case, the user may be given direct access to a network card, just like we do with GPU, but this is less useful than...
user requests neutron nics, on specific neutron networks, but connected in a specific way (i.e. high speed SRIOV vs virtual)
note that some of the nics may be virtual, some may be passthrough, and some might be a different type of passthrough
nova boot --flavor m1.large --image <image_id> --nic net-id=<net-id>,nic-type=<slow | fast | foobar> <vm-name>
(where slow is a virtual connection, fast is a PCI passthrough, and foobar is some other type of PCI passthrough)
consider several nics, of different types: nova boot --flavor m1.large --image <image_id> --nic net-id=<net-id-1> --nic net-id=<net-id-2>,nic-type=fast --nic net-id=<net-id-3>,nic-type=faster <vm-name>
when hot-plugging hot-unplugging, we also need to specify vnic-type in a similar way
also, this should work nova boot --flavor m1.large --image <image_id> --nic port-id=<port-id>, given
quantum port-create --fixed-ip subnet_id=<subnet-id>,ip_address=192.168.57.101 <net-id> --nic-type=<slow | fast | foobar>

TODO: need agreement, but one idea for admin...

pci_alias_1='{"name":"Cisco.VIC", devices:[{"vendor_id":"1137","product_id":"0071", address:"*", "attach-type":"macvtap"}],"nic-type":"fast"}'
pci_alias_2='{"name":"Fast",devices:[{"vendor_id":"1137","product_id":"0071", address:"*","attach-type":"direct"}, {"vendor_id":"123","product_id":"0081", address:"*","attach-type":"macvtap"}],"nic-type":"faster", }'

New Proposal for admin view

Whitelist:

only certain devices exposed to Nova
just a list of addresses that are allowed (including wildcards)
by default, nothing is allowed
this is assumed to be (mostly) static for the lifetime of the machine
contained in nova.conf

PCI flavors:

specify groups of PCI devices, to be used in Neutron port descriptions or Server flavor extra specs
configured using host aggregates API:
- a combination of whitelist, alias and group
- raw device passthrough (grouped by device_id and product_id)
- network device passthrough (grouped by device address also)
- note: there might be several options for each (GPU v3 and GPU v4 in a single flavor)
only servers in the aggregate will be considered by the scheduler for each PCI flavor
these are shared across the whole child cell (or if no cells, whole nova deploy)

Scheduler updates:

on periodic update, report current status of devices
if any devices are in the whitelist, look up host aggregates to check what device types to report
report the number of free devices per PCI flavor
device usage tracked by resource manager as normal, looking at all devices in whitelist

On attach of PCI device:

scheduler picks host it things has a free device
check with resource manage in usual way
assign device to VM
ignoring migration for now

On attach of VIF device (through boot or otherwise):

TBD... very sketchy plan...
ideally neutron port contains associated PCI flavor/alias, or its assumed to be a virtual port
neutron supplies the usual information, VLAN-id, etc
neutron and nova negotiate which VIF driver to use, in usual way, given extra info about nic-type from PCI alias settings, etc
VIF driver given a hypervisor agnostic lib to attach the PCI device, extracted from Nova attach PCI device code
VIF driver is free to configure the specific PCI device before attaching it using the callback into the Nova driver (or modify Nova code to extend the create API)

Agenda on Jan. 8th, 2014

Let's go over the key concepts and use cases. In the use cases, neutron or neutron plugin specific configurations are not mentioned.

Key Concepts

Note that paragraphs with Itablic text were added by Iawells as comments.

PCI Groups

A PCI group is a collection of PCI devices that share the same functions or belong to the same subsystem in a cloud.

In fact, two proposals exist for PCI group definition - via API, with the implication that they're stored centrally in the database, and via config, specifically a (match expression -> PCI group name) in the compute node configuration. A competing proposal is PCI aliases, which work on the current assumption that all PCI device data is returned to the database and PCI devices can be selected by doing matching at schedule time and thus a name -> match expression mapping is all that need be saved. Thus the internal question of "should all device information be returned to the controller" drives some of the design options.

it's worth mentioning that using an API to define PCI groups make them owned by the tenant who creates them.

Pre-defined PCI Groups

For each PCI device class that openstack supports, a PCI group is defined and associated with the PCI devices belonging to that device class. For example, for the PCI device class net, there is a predefined PCI group named net

User-defined PCI Groups

User can define PCI groups using a Nova API.

PCI Passthrough List (whitelist)

Specified on a compute node to define all the PCI passthrough devices and their associated PCI groups that are available on the node.
blacklist (exclude list) may be added later if deemed necessary.

vnic_type:

virtio: a virtual port that is attached to a virtual switch
direct: SRIOV without macvtap
macvtap: SRIOV with macvtap

This configuration item is not essential to PCI passthrough. It's also a Neutron configuration item.

nova boot: new parameters in --nic option

vnic-type=“vnic” | “direct” | “macvtap”
pci-group=pci-group-name
port-profile=port-profile-name This property is not related directly to use of PCI passthrough for networks. It is a requirement of 802.1BR-based systems.

neutron port-create: new arguments

--vnic-type “vnic” | “direct” | “macvtap”
--pci-group pci-group-name
port-profile port-profile-name

Nova SRIOV Configuration

vnic_type = <vnic-type>: specified in controller node to indicate the default vnic-type that VMs will be booted with. default value is "vnic"
sriov_auto_all = <on | off>: specified in compute nodes to indicate that all sriov capable ports are added into the ‘net’ PCI group.
sriov_only = <on | off>: specified in compute nodes to indicate that nova can only place VMs with sriov vnics onto these nodes. Default value is on for nodes with SRIOV ports.
sriov_pci_group = <pci-group-name>: specified in compute nodes in which all of its SRIOV ports belong to a single pci group.

The SRIOV cofiguration items are enhancements to the base proposal that make it much easier to configure compute hosts where it is known that all VFs will be available to cloud users.

Use Cases

These use cases do not include non-network passthrough cases.

SRIOV based cloud

All the compute nodes are identical and all the NICs are SRIOV based
All the NICs are connected to the same physical network

In this cloud, the admin only needs to specify vnic_type=direct on the controller and sriov_auto_all=on on the compute nodes in the nova configuration file. In addition, the new arguments introduced in the nova boot command are not required.

A cloud with mixed Vnics

On compute nodes with sriov ports only, set sriov_auto_all = on
On compute nodes without sriov ports, no change is required

In such a cloud, when booting a VM with sriov vnic, the nova boot command would look like:

   nova boot --flavor m1.large --image <image_id>
                          --nic net-id=<net-id>,vnic-type=direct <vm-name>

This will require some minimum change in the existing applications.

A Cloud that requires multiple SRIOV PCI groups

create all the pci-groups in the cloud by invoking a Nova API
on compute nodes that support a single pci group and in which all of the SRIOV ports belong to this group, set sriov_auto_all=on, sriov_pci_group=<group_name>
on compute nodes that support multiple pci groups, define the pci-passthrough-list

In such a cloud, when booting a VM with sriov macvtap, the nova boot command would look like:

    nova boot --flavor m1.large --image <image_id> 
                   --nic net-id=<net-id>,vnic-type=macvtap,pci-group=<group-name> <vm-name>

Introducing new compute nodes with SRIOV into an existing cloud

Depending on the cloud and the compute node being introduced:

it could be as simple as adding sriov_auto_all=on into the nova config file
it could be setting sriov_auto_all=on and pci_group=<group_name>
it could be defining the pci-passthrough-list.

NIC hot plug

Evolving Design Doc

https://docs.google.com/document/d/1EMwDg9J8zOxzvTnQJ9HwZdiotaVstFWKIuKrPse6JOs/edit?usp=sharing

Ian typed up a complete proposal in two sections in that document, which is pulled out here: https://docs.google.com/document/d/1svN89UXKbFoka0EF6MFUP6OwdNvhY4OkdjEZN-rD-0Q/edit# - this proposal takes the 'PCI groups via compute node config' approach and makes no attempt at proposing APIs.

Previous Meetings

http://eavesdrop.openstack.org/meetings/pci_passthrough_meeting/2013/pci_passthrough_meeting.2013-12-24-14.02.log.html http://eavesdrop.openstack.org/meetings/pci_passthrough/

Meeting log on Dec. 17th, 2013

Meetings/Passthrough/dec-17th-2013.log

Meeting log on June. 17th, 2014

http://eavesdrop.openstack.org/meetings/passthrough/2014/passthrough.2014-06-17-13.07.log.html