PCI passthrough SRIOV support
Background
This design is based on the PCI passthrough IRC meetings: https://wiki.openstack.org/wiki/Meetings/Passthrough
This document was used to finalise the design: https://docs.google.com/document/d/1vadqmurlnlvZ5bv3BlUbFeXRS_wh-dsgi5plSjimWjU/edit#
Link back to the blueprint: https://blueprints.launchpad.net/nova/+spec/pci-extra-info
PCI devices have standard PCI properties such as address (BDF), vendor_id and product_id. Virtual functions also have a property referring to the address of their physical function. Application-specific or installation-specific extra information can be attached, such as physical network connectivity for use by Neutron SRIOV, or any other property.
All of these PCI properties should be well classified, and the scope each property belongs to should be well defined. Our design focuses on several PCI modules that provide PCI pass-through SRIOV support. Their current functionality is:
- on the compute node, the whitelist defines a spec that filters PCI devices, giving the set of PCI devices available for allocation.
- the PCI compute code also reports PCI stats information to the scheduler. PCI stats contain several pools; each pool is defined by a set of PCI properties (vendor_id, product_id, extra_info).
- the PCI alias defines the user's view of a PCI device selector: an alias provides a set of (k, v) pairs forming specs that select among the available devices (those passed by the whitelist).
PCI NEXT overall design
PCI flavor used to select available devices
Users use a PCI flavor to select available devices. The PCI flavor is global to the whole cloud and can be configured via the API. Users treat a PCI flavor as a human-readable name such as 'oldGPU', 'FastGPU', '10GNIC' or 'SSD'.
Define the flavors on the control node
The control node has flavors that allow the administrator to package up devices for users. Flavors have a name and a matching expression that selects available devices (those offered by the whitelist). Flavors can overlap; that is, the same device on the same machine may be matched by multiple flavors.
A PCI flavor is defined by a set of (k, v) pairs, where k is a *well defined* PCI property. Not every PCI property is available to PCI flavors: only a specific set of PCI properties can be used to define a PCI flavor, and this set is global to the cloud. These PCI properties are defined via a global configuration option:
pci_flavor_attrs = vendor_id, product_id, ...
Only global attributes, such as vendor and product, should be listed in pci_flavor_attrs; per-device attributes such as the 'host' and 'BDF' of a PCI device should not be used as pci_flavor_attrs. This is explicitly an optimization to reduce scheduling complexity; the PCI stats pools are also constructed based on the pci_flavor_attrs list.
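Because only attributes listed in pci_flavor_attrs may appear in a flavor, the API can reject other keys when a flavor is created or updated. The following is a minimal Python sketch of that check, assuming a flavor's matching expression is held as a plain dict; `validate_flavor` is a hypothetical helper, not an existing Nova function.

```python
def validate_flavor(flavor_spec, pci_flavor_attrs):
    """Return flavor_spec unchanged if every key is a configured flavor attr.

    Hypothetical helper: flavor_spec holds only the matching (k, v) pairs,
    not metadata such as the flavor's name or description.
    """
    allowed = set(pci_flavor_attrs)
    rejected = sorted(k for k in flavor_spec if k not in allowed)
    if rejected:
        # e.g. 'address' (the BDF) or 'host' are per-device, not global.
        raise ValueError(
            "attributes not in pci_flavor_attrs: " + ", ".join(rejected))
    return flavor_spec


attrs = ["vendor_id", "device_id"]
validate_flavor({"vendor_id": "8086", "device_id": "0001"}, attrs)  # accepted
```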
Compute node offers up devices via local config
Compute nodes offer available PCI devices for pass-through. Since the list of devices doesn't usually change unless someone tinkers with the hardware, the matching expression used to create this list of offered devices is stored in the compute node config.
* the device information (device ID, vendor ID, BDF, etc.) is discovered from the device and stored as part of the PCI device, as in the current implementation.
* on the compute node, additional arbitrary information, in the form of key-value pairs, can be added to the config and is included in the PCI device.
This is achieved by extending the PCI whitelist to:
pci_information = { pci-regex, pci-extra-attrs }
pci-regex is a dict of { string-key: string-value } pairs; it can only match device properties such as vendor_id, address and product_id. pci-extra-attrs is a dict of { string-key: string-value } pairs whose values can be arbitrary. The total size of the extra attrs may be restricted.
[irenab] I would suggest changing pci_information to a more meaningful name. What you mean here is the list of PCI devices, with attached information, that are available for allocation on this compute node. I think it should be something like 'available_pci_devices_list', but any other name that makes its role clearer would be fine.
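The matching and tagging that pci_information performs can be sketched in Python as follows. This is an illustration under assumptions, not the Nova implementation: devices are plain dicts of string properties, each pci_information entry is modelled as a (pci-regex, pci-extra-attrs) tuple, Python's `re.fullmatch` stands in for the reduced regular expression syntax, and `PCI_INFORMATION`, `matches` and `offer_devices` are hypothetical names.

```python
import re

# Hypothetical config: each entry pairs a pci-regex dict (device properties
# only) with a pci-extra-attrs dict that is attached to matching devices.
PCI_INFORMATION = [
    ({"vendor_id": "8086", "device_id": "000[1-2]"}, {"e.group": "gpu"}),
]


def matches(regex_spec, device):
    """True if every key in regex_spec fully matches the device property."""
    for key, pattern in regex_spec.items():
        value = device.get(key)
        if value is None or re.fullmatch(pattern, value) is None:
            return False
    return True


def offer_devices(devices, pci_information):
    """Return the devices offered for passthrough, with extra attrs attached."""
    offered = []
    for dev in devices:
        for regex_spec, extra_attrs in pci_information:
            if matches(regex_spec, dev):
                offered.append(dict(dev, extra_info=dict(extra_attrs)))
                break  # assumption: the first matching entry wins
    return offered
```

With this sketch, a device at 8086:0003 is not offered, while a device at 8086:0001 is offered and carries {'e.group': 'gpu'} as its extra info.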
PCI NEXT Config
Compute host
pci_information = [ {pci-regex},{pci-extra-attrs} ]
Control node
pci_flavor_attrs=attr,attr,attr
For instance, when using device and vendor ID this would read:
pci_flavor_attrs=device_id,vendor_id
When the backend adds an arbitrary ‘group’ attribute to all PCI devices:
pci_flavor_attrs=e.group
When you wish to find an appropriate device and perhaps also filter by the connection tagged on that device, which you specify with an extra-info attribute in the compute node config:
pci_flavor_attrs=device_id,vendor_id,e.connection
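A parser for the pci_flavor_attrs value (the text to the right of `=`) might look like the sketch below. It assumes that extra-info attributes are recognised only by their `e.` prefix and that the BDF is held in a field named `address`; `parse_pci_flavor_attrs` and `FORBIDDEN_ATTRS` are hypothetical names.

```python
# Per-device attributes that must not be flavor attrs (assumption: the BDF
# is stored in a field called 'address').
FORBIDDEN_ATTRS = {"host", "address"}


def parse_pci_flavor_attrs(raw):
    """Split a comma-separated pci_flavor_attrs value into (base, extra)."""
    attrs = [item.strip() for item in raw.split(",") if item.strip()]
    for attr in attrs:
        if attr in FORBIDDEN_ATTRS:
            raise ValueError("%s cannot be used in pci_flavor_attrs" % attr)
    # Extra-info attributes are told apart by their 'e.' prefix.
    base = [a for a in attrs if not a.startswith("e.")]
    extra = [a for a in attrs if a.startswith("e.")]
    return base, extra
```

For example, `device_id,vendor_id,e.connection` yields (['device_id', 'vendor_id'], ['e.connection']).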
flavor API
- overall
nova pci-flavor-list
nova pci-flavor-show name|UUID <name|UUID>
nova pci-flavor-create name|UUID <name|UUID> description <desc>
nova pci-flavor-update name|UUID <name|UUID> set 'description'='xxxx' 'e.group'='A'
nova pci-flavor-delete name|UUID <name|UUID>
- list available PCI flavors (whitelist)
nova pci-flavor-list
GET v2/{tenant_id}/os-pci-flavors
data: os-pci-flavors { [ { 'UUID':'xxxx-xx-xx', 'description':'xxxx', 'vendor_id':'8086', .... 'name':'xxx', }, ] }
- get detailed information about one pci-flavor:
nova pci-flavor-show <UUID>
GET v2/{tenant_id}/os-pci-flavor/<UUID>
data: os-pci-flavor: { 'UUID':'xxxx-xx-xx', 'description':'xxxx', .... 'name':'xxx', }
- create pci flavor
nova pci-flavor-create name 'GetMePowerfulldevice' description "xxxxx"
API: POST v2/{tenant_id}/os-pci-flavors
data: pci-flavor: { 'name':'GetMePowerfulldevice', 'description': "xxxxx" }
action: create a database entry for this flavor.
- update the pci flavor
nova pci-flavor-update UUID set 'description'='xxxx' 'e.group'='A'
PUT v2/{tenant_id}/os-pci-flavors/<UUID> with data: { 'action': "update", 'pci-flavor': { 'description':'xxxx', 'vendor': '8086', 'e.group': 'A', .... } }
action: set this as the new definition of the PCI flavor.
- delete a pci flavor
nova pci-flavor-delete <UUID>
DELETE v2/{tenant_id}/os-pci-flavor/<UUID>
nova command extension: --nic with pci-flavor
Attaches a virtual NIC to the Neutron network and the VM:
nova boot --nic net-id=neutron-network,vnic-type=macvtap,pci-flavor=xxx
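The --nic value is a comma-separated list of key=value items. The Python sketch below shows one way to split it; it is not python-novaclient's actual parser, `parse_nic_arg` is a hypothetical name, and a pci-flavor value that itself contained commas would not survive this naive split.

```python
def parse_nic_arg(value):
    """Parse 'k1=v1,k2=v2,...' from a --nic option into a dict."""
    nic = {}
    for item in value.split(","):
        key, sep, val = item.partition("=")
        if not sep or not key.strip():
            raise ValueError("malformed --nic item: %r" % item)
        nic[key.strip()] = val.strip()
    return nic
```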
Use cases
General PCI pass through
Given compute nodes that contain 1 GPU with vendor:device 8086:0001:
- on the compute nodes, configure pci_information
pci_information = { { 'vendor_id': "8086", 'device_id': "0001" }, {} }
- on controller
pci_flavor_attrs = ['device_id', 'vendor_id']
The compute node would report PCI stats grouped by ('device_id', 'vendor_id'). PCI stats will report one pool:
{'device_id':'0001', 'vendor_id':'8086', 'count': 1 }
- create PCI flavor
nova pci-flavor-create name 'bigGPU' description "passthrough Intel's on-die GPU"
nova pci-flavor-update name 'bigGPU' set 'vendor_id'='8086' 'device_id'='0001'
- create flavor and boot with it
nova flavor-key m1.small set pci_passthrough:pci_flavor='1:bigGPU'
nova boot mytest --flavor m1.small --image=cirros-0.3.1-x86_64-uec
General PCI pass through with multi PCI flavor candidate
Given compute nodes that contain 2 types of GPU, with vendor:device 8086:0001 or vendor:device 8086:0002:
- on the compute nodes, configure pci_information
pci_information = { { 'vendor_id': "8086", 'device_id': "000[1-2]" }, {} }
- on controller
pci_flavor_attrs = ['device_id', 'vendor_id']
The compute node would report PCI stats grouped by ('device_id', 'vendor_id'). PCI stats will report 2 pools:
{'device_id':'0001', 'vendor_id':'8086', 'count': 1 }
{'device_id':'0002', 'vendor_id':'8086', 'count': 1 }
- create PCI flavor
nova pci-flavor-create name 'bigGPU' description "passthrough Intel's on-die GPU"
nova pci-flavor-update name 'bigGPU' set 'vendor_id'='8086' 'device_id'='0001'
nova pci-flavor-create name 'bigGPU2' description "passthrough Intel's on-die GPU"
nova pci-flavor-update name 'bigGPU2' set 'vendor_id'='8086' 'device_id'='0002'
- create flavor and boot with it
nova flavor-key m1.small set pci_passthrough:pci_flavor='1:bigGPU,bigGPU2;'
nova boot mytest --flavor m1.small --image=cirros-0.3.1-x86_64-uec
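The examples use the extra spec value format `<count>:<flavor>[,<flavor>...]`, with an optional trailing `;`. Reading this as "count devices, each taken from any of the listed flavors", with `;` separating independent requests, is an assumption; the Python sketch below parses the value under that reading. `parse_pci_flavor_request` is a hypothetical name.

```python
def parse_pci_flavor_request(value):
    """Parse e.g. '1:bigGPU,bigGPU2;' into [(1, ['bigGPU', 'bigGPU2'])]."""
    requests = []
    # Tolerate surrounding shell quotes and whitespace.
    for part in value.strip().strip("'\"").split(";"):
        part = part.strip()
        if not part:
            continue  # trailing ';' leaves an empty element
        count, sep, names = part.partition(":")
        if not sep:
            raise ValueError("expected <count>:<flavor>[,<flavor>...]: %r" % part)
        flavors = [name.strip() for name in names.split(",") if name.strip()]
        if not flavors:
            raise ValueError("no flavor named in %r" % part)
        requests.append((int(count.strip()), flavors))
    return requests
```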
General PCI pass through with a wildcard PCI flavor
Given compute nodes that contain 2 types of GPU, with vendor:device 8086:0001 or vendor:device 8086:0002:
- on the compute nodes, configure pci_information
pci_information = { { 'vendor_id': "8086", 'device_id': "000[1-2]" }, {} }
- on controller
pci_flavor_attrs = ['device_id', 'vendor_id']
The compute node would report PCI stats grouped by ('device_id', 'vendor_id'). PCI stats will report 2 pools:
{'device_id':'0001', 'vendor_id':'8086', 'count': 1 }
{'device_id':'0002', 'vendor_id':'8086', 'count': 1 }
- create PCI flavor
nova pci-flavor-create name 'bigGPU' description "passthrough Intel's on-die GPU"
nova pci-flavor-update name 'bigGPU' set 'vendor_id'='8086' 'device_id'='000[1-2]'
- create flavor and boot with it
nova flavor-key m1.small set pci_passthrough:pci_flavor='1:bigGPU;'
nova boot mytest --flavor m1.small --image=cirros-0.3.1-x86_64-uec
PCI pass through with a grouping tag
Given compute nodes that contain 2 types of GPU, with vendor:device 8086:0001 or vendor:device 8086:0002:
- on the compute nodes, configure pci_information
pci_information = { { 'vendor_id': "8086", 'device_id': "000[1-2]" }, { 'e.group' => 'gpu' } }
[irenab] should it be {'e.group' : 'gpu'} ?
- on controller
pci_flavor_attrs = ['e.group']
The compute node would report PCI stats grouped by ('e.group'). PCI stats will report 1 pool:
{'e.group':'gpu', 'count': 2 }
- create PCI flavor
nova pci-flavor-create name 'bigGPU' description "passthrough Intel's on-die GPU"
nova pci-flavor-update name 'bigGPU' set 'e.group'='gpu'
- create flavor and boot with it
nova flavor-key m1.small set pci_passthrough:pci_flavor='1:bigGPU;'
nova boot mytest --flavor m1.small --image=cirros-0.3.1-x86_64-uec
PCI SRIOV with tagged flavor
Given compute nodes that contain 5 PCI NICs with vendor:device 8086:0022, connected to physical network "X":
- on the compute nodes, configure pci_information
pci_information = { { 'vendor_id': "8086", 'device_id': "0022" }, { 'e.physical_network' => 'X' } }
- on controller
pci_flavor_attrs = ['e.physical_network']
The compute node would report PCI stats grouped by ('e.physical_network'). PCI stats will report 1 pool:
{'e.physical_network':'X', 'count': 5 }
- create PCI flavor
nova pci-flavor-create name 'phyX_NIC' description 'passthrough NIC connected to physical network X'
nova pci-flavor-update name 'phyX_NIC' set 'e.physical_network'='X'
[irenab] I guess we all agree this is a very short-term solution; we should follow plan B (a Neutron-aware scheduler) to do it right. The user definitely should not have to deal with connectivity concerns.
- create flavor and boot with it
nova boot mytest --flavor m1.tiny --image=cirros-0.3.1-x86_64-uec --nic net-id=network_X,pci-flavor='1:phyX_NIC;'
Implementing the PCI NEXT design
Concepts introduced here:
* spec: a filter defined by (k, v) pairs where k is one of the PCI object fields, i.e. the (k, v) pairs are PCI device properties such as vendor_id, 'address', pci-type, etc.
* extra_spec: a filter defined by (k, v) pairs where k is not one of the PCI object fields.
pci utils/objects support extra tags
* pci utils k, v matching supports reduced regular expressions for the address
* objects provide a class-level extract interface to extract the base spec and the extra spec
* extra information should also use the schema 'e.name'
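The text describes a class-level extract interface on the PCI objects; the sketch below is a module-level Python stand-in that splits a (k, v) spec into a spec and an extra_spec. `PCI_DEVICE_FIELDS` is a hypothetical, abbreviated field list rather than the real object's fields, and rejecting keys that are neither fields nor `e.`-prefixed is an assumed policy.

```python
# Hypothetical stand-in for the PCI object's field names; the real list
# lives on the Nova PCI device object and may differ.
PCI_DEVICE_FIELDS = frozenset(
    {"address", "vendor_id", "product_id", "device_id", "dev_type", "status"})


def extract(spec):
    """Split a (k, v) spec into (base spec, extra spec)."""
    base, extra = {}, {}
    for key, value in spec.items():
        if key in PCI_DEVICE_FIELDS:
            base[key] = value  # matched against the PCI object's fields
        elif key.startswith("e."):
            extra[key] = value  # matched against object.extra_info
        else:
            raise ValueError("unknown PCI attribute: %s" % key)
    return base, extra
```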
PCI information (extended whitelist) supports extra tags
* PCI information supports reduced regular expression comparison to match PCI devices
* PCI information supports storing any other (k, v) pairs as the PCI device's extra info
* every extra tag's k and v are strings
pci_flavor_attrs
* implement the attrs parser
support pci-flavor
* pci-flavor is stored in the DB
* pci-flavor is configured via the API
* the PCI manager uses the extract method to extract the specs and extra_specs, and matches them against the PCI object and object.extra_info
PCI scheduler filter
When scheduling, the matcher should apply the regular expressions stored in the named flavor, which are read from the DB.
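A host-level check in the spirit of this filter is sketched below. It assumes each host reports stats pools as dicts of attribute values plus a `count`, that flavors have been loaded from the DB into a name-to-spec dict, and that requests are (count, [flavor names]) tuples; `pool_matches` and `host_passes` are hypothetical names, and `re.fullmatch` stands in for the reduced regular expressions.

```python
import re


def pool_matches(flavor_spec, pool):
    """True if every (k, pattern) in the flavor fully matches the pool."""
    for key, pattern in flavor_spec.items():
        value = pool.get(key)
        if value is None or re.fullmatch(pattern, str(value)) is None:
            return False
    return True


def host_passes(pools, requests, flavors):
    """requests: [(count, [flavor names])]; flavors: {name: spec}."""
    for count, names in requests:
        # Devices in any pool matched by any of the candidate flavors.
        available = sum(
            pool["count"]
            for pool in pools
            if any(pool_matches(flavors[name], pool) for name in names))
        if available < count:
            return False
    return True
```

For simplicity, each request is checked independently; a real filter would also need to avoid counting the same pool twice for two requests.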
PCI stats grouping devices based on pci_flavor_attrs
* current grouping is based on [vendor_id, product_id, extra_info]
* grouping will be by the keys specified in pci_flavor_attrs
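Grouping by pci_flavor_attrs amounts to counting devices per distinct tuple of attribute values. A minimal Python sketch, assuming devices are dicts whose extra info sits under an `extra_info` key with `e.`-prefixed names (`build_pools` is a hypothetical name):

```python
from collections import Counter


def build_pools(devices, pci_flavor_attrs):
    """Count devices per distinct combination of pci_flavor_attrs values."""
    counts = Counter()
    for dev in devices:
        merged = dict(dev)
        merged.update(dev.get("extra_info", {}))  # expose 'e.*' keys
        key = tuple(merged.get(attr) for attr in pci_flavor_attrs)
        counts[key] += 1
    pools = []
    for key, count in counts.items():  # first-seen order is kept
        pool = dict(zip(pci_flavor_attrs, key))
        pool["count"] = count
        pools.append(pool)
    return pools
```

With two GPUs tagged 'e.group': 'gpu', grouping by ['e.group'] gives one pool of count 2, while grouping by ['device_id', 'vendor_id'] gives two pools of count 1, matching the use cases above.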
Transition from config file to API
- the config files for alias and whitelist definitions are going to be deprecated
- the new config option pci_information will replace the whitelist
- PCI flavor will replace alias
- the whitelist/alias schema will still work
* a deprecation notice will be given; alias will be phased out and removed starting from the next release.
With this solution, we move the PCI flavor definition from the config file to the API.
[irenab] What if pci_attrs is not defined?
DB for PCI configuration
Each PCI flavor is a set of (k, v) pairs, and the PCI attrs usable by a PCI flavor are configurable, so the (k, v) pairs are stored in the DB rather than as a PCI flavor object. Both k and v are strings, and the value may be a reduced regular expression supporting wildcard and range operators.
Table: pci_flavor {
UUID: which pci-flavor the k, v belongs to
name: the PCI flavor's name; this field is needed to index the DB by flavor name
key
value (may be a simple string value or a reduced regular expression)
}
DB interface:
get_pci_flavor_by_name
get_pci_flavor_by_UUID
get_pci_flavor_all
update_pci_flavor_by_name
update_pci_flavor_by_UUID
delete_pci_flavor_by_name
delete_pci_flavor_by_UUID
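The key/value table and part of the DB interface above can be sketched with Python's built-in sqlite3 module. This is an illustration, not Nova's DB API: the column names `attr_key`/`attr_value`, the helpers `init_db` and `create_pci_flavor`, and the choice to treat an update as a full replacement (following the PUT action described in the flavor API) are assumptions, and flavor names are not enforced to be unique.

```python
import sqlite3
import uuid

# One row per (flavor, key) pair; attr_key/attr_value are assumed names
# for the key and value columns, chosen to avoid SQL keywords.
SCHEMA = """
CREATE TABLE pci_flavor (
    uuid       TEXT NOT NULL,
    name       TEXT NOT NULL,
    attr_key   TEXT NOT NULL,
    attr_value TEXT NOT NULL,
    PRIMARY KEY (uuid, attr_key)
)
"""
INSERT = ("INSERT INTO pci_flavor (uuid, name, attr_key, attr_value)"
          " VALUES (?, ?, ?, ?)")


def init_db():
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    return conn


def create_pci_flavor(conn, name, attrs):
    """Hypothetical helper: store a new flavor's (k, v) pairs under a UUID."""
    flavor_uuid = str(uuid.uuid4())
    with conn:  # commits on success
        conn.executemany(
            INSERT, [(flavor_uuid, name, k, v) for k, v in attrs.items()])
    return flavor_uuid


def get_pci_flavor_by_name(conn, name):
    rows = conn.execute(
        "SELECT attr_key, attr_value FROM pci_flavor WHERE name = ?",
        (name,)).fetchall()
    return dict(rows)


def update_pci_flavor_by_name(conn, name, attrs):
    """Replace the flavor's whole definition with attrs."""
    row = conn.execute(
        "SELECT uuid FROM pci_flavor WHERE name = ? LIMIT 1",
        (name,)).fetchone()
    if row is None:
        raise KeyError(name)
    with conn:
        conn.execute("DELETE FROM pci_flavor WHERE uuid = ?", (row[0],))
        conn.executemany(
            INSERT, [(row[0], name, k, v) for k, v in attrs.items()])


def delete_pci_flavor_by_name(conn, name):
    with conn:
        conn.execute("DELETE FROM pci_flavor WHERE name = ?", (name,))
```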