PCI passthrough SRIOV support

Background

This design is based on the PCI passthrough meeting and its discussion document:

https://wiki.openstack.org/wiki/Meetings/Passthrough

https://docs.google.com/document/d/1vadqmurlnlvZ5bv3BlUbFeXRS_wh-dsgi5plSjimWjU/edit#

Link back to the blueprint: https://blueprints.launchpad.net/nova/+spec/pci-extra-info

PCI devices have standard PCI properties like address (BDF), vendor_id, product_id, etc., and device-level properties like each Virtual Function's physical address. It should also be possible to attach application-specific extra information to a device, such as the physical network connectivity used by neutron SRIOV, or any other property.

All of these PCI properties should be properly classified, and the scope each property belongs to should be well defined. Our design focuses on several PCI modules to provide PCI pass-through SRIOV support; their current functionality is:

  • on the compute node, the white-list defines a spec that filters PCI devices, giving the set of PCI devices available for allocation.
  • the PCI compute code also reports PCI stats information to the scheduler. PCI stats contain several pools, each pool defined by several PCI properties (vendor_id, product_id, extra_info).
  • the PCI alias defines the PCI device selector from the user's point of view: an alias provides a set of (k,v) pairs forming specs that select among the available devices (filtered by the white-list).

PCI NEXT overall design

PCI flavor used to select available device

Users use a PCI flavor to select an available device. The PCI flavor is global to the whole cloud and can be configured via API. Users treat the PCI flavor as a readable name like 'oldGPU', 'FastGPU', '10GNIC', 'SSD'.


Define the flavors on the control node

The control node holds the flavors, which allow the administrator to package up devices for users. Flavors have a name and a matching expression that selects among the available (white-list offered) devices. Flavors can overlap - that is, the same device on the same machine may be matched by multiple flavors.

A PCI flavor is defined by a set of (k,v) pairs, where each k is a *well defined* PCI property. Not every PCI property is available to PCI flavors; only a specific set of properties can be used to define them. That set is given by a global configuration option:

    pci_flavor_attrs = [ vendor_id, product_id, ...]

Only global attrs like vendor and product should appear in pci_flavor_attrs; per-device values such as the 'host' and the 'BDF' of a PCI device should not be used. This is explicitly an optimization to limit scheduling complexity. We may change this to an API-changeable value in the future, though we believe it will rarely be necessary to change it.

Compute node offers up devices via local config

The compute nodes offer up the PCI devices available for pass-through. Since the list of devices doesn't usually change unless someone tinkers with the hardware, the matching expression used to create this list of offered devices is stored in the compute node config.

   * the device information (device ID, vendor ID, BDF etc.) is discovered from the device and stored as part of the PCI device, as in the current implementation.
   * on the compute node, additional arbitrary information, in the form of key-value pairs, can be added to the config and is included in the PCI device.

This is achieved by extending the PCI white-list to:

    pci_information = { pci-regex, pci-extra-attrs }

pci-regex is a dict of { string-key: string-value } pairs; it can only match device properties, like vendor_id, address, product_id, etc. pci-extra-attrs is a dict of { string-key: string-value } pairs whose values can be arbitrary; the total size of the extra attrs may be restricted.
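To make the matching behaviour concrete, here is a minimal Python sketch of how a compute node could apply one pci_information entry; the function name and the sample device dicts are hypothetical illustrations, not Nova code:

    import re

    # Hypothetical sketch: apply one pci_information entry to discovered devices.
    # Every pci-regex key must match the device property; pci-extra-attrs are
    # then attached to each matched device as extra_info.
    def apply_pci_information(devices, pci_regex, extra_attrs):
        offered = []
        for dev in devices:
            if all(re.fullmatch(pattern, str(dev.get(key, '')))
                   for key, pattern in pci_regex.items()):
                offered.append(dict(dev, extra_info=dict(extra_attrs)))
        return offered

    devices = [
        {'address': '0000:01:00.1', 'vendor_id': '8086', 'product_id': '0001'},
        {'address': '0000:02:00.1', 'vendor_id': '10de', 'product_id': '0640'},
    ]
    # offers only the Intel device, tagged with the extra 'group' attribute
    print(apply_pci_information(devices,
                                {'vendor_id': '8086', 'product_id': '000[1-2]'},
                                {'group': 'gpu'}))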

PCI NEXT Config

Compute host

pci_information = { pci-regex,pci-extra-attrs }

Control node

    pci_flavor_attrs = [attr, attr, attr]

For instance, when using device and vendor ID this would read:

    pci_flavor_attrs=device_id,vendor_id

When the backend adds an arbitrary ‘group’ attribute to all PCI devices:

    pci_flavor_attrs=e.group

When you wish to find an appropriate device and perhaps also filter by the connection tagged on that device (which you specify via an extra-info attribute in the compute node config):

    pci_flavor_attrs=device_id,vendor_id,e.connection
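The following rough sketch shows how offered devices might be rolled up into pci stats pools keyed by pci_flavor_attrs; the 'e.' prefix handling follows the convention above, and all names here are illustrative assumptions:

    from collections import Counter

    # Illustrative sketch: group offered devices into pci stats pools keyed by
    # the configured pci_flavor_attrs; 'e.<key>' reads the device's extra_info.
    def pci_stats(devices, pci_flavor_attrs):
        pools = Counter()
        for dev in devices:
            key = tuple((attr,
                         dev['extra_info'].get(attr[2:], '')
                         if attr.startswith('e.') else dev.get(attr, ''))
                        for attr in pci_flavor_attrs)
            pools[key] += 1
        return [dict(key, count=count) for key, count in pools.items()]

    devices = [
        {'vendor_id': '8086', 'product_id': '0001', 'extra_info': {'group': 'gpu'}},
        {'vendor_id': '8086', 'product_id': '0002', 'extra_info': {'group': 'gpu'}},
    ]
    print(pci_stats(devices, ['vendor_id', 'product_id']))  # two pools of one
    print(pci_stats(devices, ['e.group']))                  # one pool of two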

flavor API

  • overall

 nova pci-flavor-list
 nova pci-flavor-show <name|UUID>
 nova pci-flavor-create name <name> description <desc>
 nova pci-flavor-update <name|UUID> set 'description'='xxxx' 'e.group'='A'
 nova pci-flavor-delete <name|UUID>


* list available pci flavors (white-list)
   nova pci-flavor-list
   GET v2/{tenant_id}/os-pci-flavors
   data:
    os-pci-flavors: {
                [
                           {
                               'UUID':'xxxx-xx-xx',
                               'description':'xxxx',
                               'vendor_id':'8086',
                                 ....
                               'name':'xxx',
                           },
                ]
    }


  • get detailed information about one pci-flavor:
     nova pci-flavor-show <UUID>
    GET v2/{tenant_id}/os-pci-flavor/<UUID>
    data:
       os-pci-flavor: {
                               'UUID':'xxxx-xx-xx',
                               'description':'xxxx',
                                  ....
                               'name':'xxx',
         }
  • create pci flavor
 nova pci-flavor-create  name 'GetMePowerfulldevice'  description "xxxxx"
 API:
 POST  v2/​{tenant_id}​/os-pci-flavors
 data: 
     pci-flavor: { 
            'name':'GetMePowerfulldevice',
             description: "xxxxx" 
     }
 action:  create database entry for this flavor.


  • update the pci flavor
    nova pci-flavor-update UUID  set    'description'='xxxx'   'e.group'= 'A'
    PUT v2/​{tenant_id}​/os-pci-flavors/<UUID>
    with data  :
        { 'action': "update", 
          'pci-flavor':
                         { 
                            'description':'xxxx',
                            'vendor': '8086',
                            'e.group': 'A',
                             ....
                         }
        }
   action: set this as the new definition of the pci flavor.
  • delete a pci flavor
  nova pci-flavor-delete <UUID>
  DELETE v2/​{tenant_id}​/os-pci-flavor/<UUID>
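Since this REST API is still a draft, here is a hedged python-requests sketch of driving the proposed endpoints end to end; NOVA_URL and the token are placeholders, and the assumption that the create call echoes back the flavor's UUID is not something the draft specifies:

    import requests

    # Sketch against the *proposed* os-pci-flavors API; nothing here exists yet.
    NOVA_URL = 'http://nova-api:8774/v2/mytenant'
    HEADERS = {'X-Auth-Token': 'PLACEHOLDER-TOKEN',
               'Content-Type': 'application/json'}

    # create a flavor (assumes the response echoes the new flavor's UUID)
    resp = requests.post(NOVA_URL + '/os-pci-flavors', headers=HEADERS,
                         json={'pci-flavor': {'name': 'bigGPU',
                                              'description': 'on-die GPU'}})
    uuid = resp.json()['pci-flavor']['UUID']

    # update its matching expression
    requests.put('%s/os-pci-flavors/%s' % (NOVA_URL, uuid), headers=HEADERS,
                 json={'action': 'update',
                       'pci-flavor': {'vendor_id': '8086',
                                      'product_id': '0001'}})

    # list all flavors, then delete the one we created
    print(requests.get(NOVA_URL + '/os-pci-flavors', headers=HEADERS).json())
    requests.delete('%s/os-pci-flavor/%s' % (NOVA_URL, uuid), headers=HEADERS)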

nova command extension: --nic with pci-flavor

Attach a virtual NIC to the Neutron network and the VM:

    nova boot --nic net-id=neutron-network,vnic-type=macvtap,pci-flavor=xxx

Use cases

General PCI pass-through

Given compute nodes that contain one GPU with vendor:device 8086:0001:

  • on the compute nodes, configure the pci_information
   pci_information = { { 'vendor_id': "8086", 'device_id': "0001" }, {} }
  • on the controller
  pci_flavor_attrs = ['device_id', 'vendor_id']

The compute node will report PCI stats grouped by ('device_id', 'vendor_id'); pci stats will report one pool:

    {'device_id':'0001', 'vendor_id':'8086', 'count': 1 }

  • create PCI flavor

nova pci-flavor-create name 'bigGPU' description "passthrough Intel's on-die GPU"
nova pci-flavor-update name 'bigGPU' set 'vendor_id'='8086' 'product_id'='0001'

  • create flavor and boot with it

nova flavor-key m1.small set pci_passthrough:pci_flavor='1:bigGPU'
nova boot mytest --flavor m1.small --image=cirros-0.3.1-x86_64-uec

General PCI pass-through with multiple PCI flavor candidates

Given compute nodes that contain two types of GPU, vendor:device 8086:0001 or 8086:0002:

  • on the compute nodes, configure the pci_information
   pci_information = { { 'vendor_id': "8086", 'device_id': "000[1-2]" }, {} }
  • on the controller
  pci_flavor_attrs = ['device_id', 'vendor_id']

The compute node will report PCI stats grouped by ('device_id', 'vendor_id'); pci stats will report two pools:

{'device_id':'0001', 'vendor_id':'8086', 'count': 1 }

{'device_id':'0002', 'vendor_id':'8086', 'count': 1 }

  • create PCI flavor

nova pci-flavor-create name 'bigGPU' description "passthrough Intel's on-die GPU"
nova pci-flavor-update name 'bigGPU' set 'vendor_id'='8086' 'product_id'='0001'

nova pci-flavor-create name 'bigGPU2' description "passthrough Intel's on-die GPU"
nova pci-flavor-update name 'bigGPU2' set 'vendor_id'='8086' 'product_id'='0002'

  • create flavor and boot with it

nova flavor-key m1.small set pci_passthrough:pci_flavor='1:bigGPU,bigGPU2;'
nova boot mytest --flavor m1.small --image=cirros-0.3.1-x86_64-uec
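The exact extra-spec syntax is still a draft; assuming the 'count:flavor,flavor;' form used in these examples, a small parsing sketch (parse_pci_flavor_spec is a hypothetical helper, not Nova code):

    # Hypothetical parser for the draft 'pci_passthrough:pci_flavor' value,
    # e.g. '1:bigGPU,bigGPU2;2:SSD;' -> [(1, ['bigGPU', 'bigGPU2']), (2, ['SSD'])]
    def parse_pci_flavor_spec(value):
        requests = []
        for chunk in value.split(';'):
            chunk = chunk.strip()
            if not chunk:
                continue
            count, names = chunk.split(':', 1)
            requests.append((int(count), [n.strip() for n in names.split(',')]))
        return requests

    print(parse_pci_flavor_spec('1:bigGPU,bigGPU2;'))  # [(1, ['bigGPU', 'bigGPU2'])]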

General PCI pass-through with a wildcard PCI flavor

Given compute nodes that contain two types of GPU, vendor:device 8086:0001 or 8086:0002:

  • on the compute nodes, configure the pci_information
   pci_information = { { 'vendor_id': "8086", 'device_id': "000[1-2]" }, {} }
  • on the controller
  pci_flavor_attrs = ['device_id', 'vendor_id']

The compute node will report PCI stats grouped by ('device_id', 'vendor_id'); pci stats will report two pools:

{'device_id':'0001', 'vendor_id':'8086', 'count': 1 }

{'device_id':'0002', 'vendor_id':'8086', 'count': 1 }

  • create PCI flavor

nova pci-flavor-create name 'bigGPU' description "passthrough Intel's on-die GPU"
nova pci-flavor-update name 'bigGPU' set 'vendor_id'='8086' 'product_id'='000[1-2]'

  • create flavor and boot with it

nova flavor-key m1.small set pci_passthrough:pci_flavor='1:bigGPU;'
nova boot mytest --flavor m1.small --image=cirros-0.3.1-x86_64-uec

PCI pass-through with a grouping tag

Given compute nodes that contain two types of GPU, vendor:device 8086:0001 or 8086:0002, which we want to offer as one 'gpu' group:

  • on the compute nodes, configure pci_information with a 'group' extra attribute
   pci_information = { { 'vendor_id': "8086", 'device_id': "000[1-2]" }, { 'group': 'gpu' } }
  • on the controller
  pci_flavor_attrs = ['e.group']

The compute node makes one PCI stat per 'group' value rather than per unique (vendor_id, product_id) tuple; pci stats will report one pool:

{'e.group':'gpu', 'count': 2 }

  • create a PCI flavor that matches on the group tag
nova pci-flavor-create name 'bigGPU' description "passthrough Intel's on-die GPU"
nova pci-flavor-update name 'bigGPU' set 'e.group'='gpu'

  • create flavor and boot with it
nova flavor-key m1.small set pci_passthrough:pci_flavor='1:bigGPU;'
nova boot mytest --flavor m1.small --image=cirros-0.3.1-x86_64-uec


The grouping-tag flow above follows this excerpt from the PCI passthrough IRC discussion:

<ijw> In this, pci_attrs is 'group' and we're adding a 'group' attr at the pci_information stage


<ijw> So, pci_information={ { device_id: 1, vendor_id: 0x8086}, { group => 'gpu' } }
<baoli> what is device_id?
<ijw> The device ID of the whitelisted PCI device
<ijw> Do I mean device ID?
<ijw> Yes, that's the right term


<baoli> but we don't have a device_id
<heyongli> product_id , i think it is
<heyongli> the standard pci property
<ijw> OK, sorry - I'm looking at a (non-Openstack) page which calls it device ID, my apologies
<ijw> So, compute host gets 'group' for attrs when it requests it
<ijw> And makes one PCI stat per 'group' value, rather than unique (vendor_id, product_id) tuple and reports that to the scheduler in this case
<ijw> PCI flavor is going to be name gpu match expression { group: 'gpu' }
<ijw> Scheduling is as before
<ijw> Sorry, it will be 'e.group': 'gpu' per the spec about marking extra info when we use it


<ijw> Difficult cases.
<ijw> Scheduling is harder when PCI flavors are vague and match multiple pci_stats rows, and particularly when they overlap


<ijw> So, if I have two flavors, one is vendor:8086, device: 1 or 2 and one of which is vendor: 8086, device: 1 then you have more problems and the worst case is when you use both flavors in a single instance
<ijw> So, say I have the above two flavors, 'vague' and 'specific' for the sake of argument, and I want to start an instance with one device from each
<irenab> ijw: can we go over networking case taking into account phy_net connectivity?
<ijw> OK, in a sec, I'll finish this case first
<irenab> ijw: afraid to be out of time
<ijw> When I try and schedule, let's assume I have one product_id of each type


<ijw> Then there's a case that I can't use the device '1' for 'vague' because I would not be able to allocate anything for 'specific'. The problem is hard, it's not insoluble, you just have to try multiple combinations until you succeed.
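The combination search ijw describes can be sketched as a small backtracking allocator over the pci_stats pools; this is purely illustrative of the problem, not the scheduler's actual code:

    # Illustrative backtracking allocator for overlapping flavors (ijw's
    # 'vague' vs 'specific' case): try each pool, undo the claim on failure.
    def allocate(matchers, pools):
        if not matchers:
            return True
        matcher, rest = matchers[0], matchers[1:]
        for pool in pools:
            if pool['count'] > 0 and matcher(pool):
                pool['count'] -= 1          # tentatively take a device
                if allocate(rest, pools):
                    return True
                pool['count'] += 1          # backtrack and try the next pool
        return False

    pools = [{'vendor_id': '8086', 'product_id': '0001', 'count': 1},
             {'vendor_id': '8086', 'product_id': '0002', 'count': 1}]
    vague = lambda p: p['product_id'] in ('0001', '0002')
    specific = lambda p: p['product_id'] == '0001'

    # succeeds only because 'vague' backtracks away from the 0001 pool
    print(allocate([vague, specific], pools))  # True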


Transition config file to API

  1. the config file definitions for alias and whitelist are going to be deprecated.
  2. if the database is not empty, the configuration file is ignored and a deprecation warning is given.
  3. if the database is empty, the config is read from the file:
    * the white-list/alias schema still works
    * a deprecation notice is also given; the alias will fade out and be removed starting from the next release.

With this solution, we move the PCI config from file to API. A minimal sketch of the fallback logic follows.
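This sketch uses hypothetical names (Nova's real config plumbing differs); it only illustrates the rule above that database definitions win and the config file is honoured, with a warning, while the database is still empty:

    import warnings

    # Sketch of the migration rule: DB flavors win; the legacy config file is
    # only read (with a deprecation warning) while the DB is still empty.
    def load_pci_flavors(db_flavors, conf_whitelist):
        if db_flavors:
            if conf_whitelist:
                warnings.warn('pci whitelist/alias config is ignored because '
                              'flavors already exist in the database',
                              DeprecationWarning)
            return db_flavors
        if conf_whitelist:
            warnings.warn('pci whitelist/alias config is deprecated; alias '
                          'support will be removed in the next release',
                          DeprecationWarning)
        # convert each legacy whitelist entry into a flavor-like record
        return [dict(spec, name='legacy-%d' % i)
                for i, spec in enumerate(conf_whitelist)]

    print(load_pci_flavors([], [{'vendor_id': '8086', 'product_id': '0001'}]))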

DB for PCI configuration

Each PCI flavor will be stored as a set of (k,v) rows, and different PCI flavors don't need to contain the same keys. Another problem this definition tries to solve: SRIOV may also want feature autodiscovery (under discussion), and without a (k,v) layout the flavor would need a dedicated 'feature' column. The (k,v) pair layout lets arbitrary extra information be stored against a PCI device.

  table: pci_flavor {
               id    : database id of this (k,v) pair
               UUID  : which pci-flavor this (k,v) pair belongs to
               key
               value   (may be a simple value or a reduced regular expression)
           }
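To make the row layout concrete, a small sketch showing how one flavor's UUID spans several (k,v) rows and how a flavor dict could be reassembled; the rows and helper are illustrative, not the actual schema code:

    # Illustrative rows: one pci flavor = several rows sharing a UUID.
    rows = [
        {'id': 1, 'uuid': 'aaaa-01', 'key': 'name',       'value': 'bigGPU'},
        {'id': 2, 'uuid': 'aaaa-01', 'key': 'vendor_id',  'value': '8086'},
        {'id': 3, 'uuid': 'aaaa-01', 'key': 'product_id', 'value': '000[1-2]'},
        {'id': 4, 'uuid': 'bbbb-02', 'key': 'name',       'value': 'privateNIC'},
        {'id': 5, 'uuid': 'bbbb-02', 'key': 'e.group',    'value': 'phy-net-1'},
    ]

    def flavors_from_rows(rows):
        # fold the (k,v) rows back into one dict per flavor UUID
        flavors = {}
        for row in rows:
            flavors.setdefault(row['uuid'], {})[row['key']] = row['value']
        return flavors

    print(flavors_from_rows(rows))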


Requirements from SRIOV

  • group devices
  for SRIOV, all VFs belonging to the same PF share the same physical network reachability. So if you want to, say, deploy a VLAN network, you need to choose a VF from the right PF, otherwise the network will not work for you. The PCI flavor does this job well.
  • mark the device allocated to the flavor
  networking and other special devices are not as simple as passing the device through to the VM; more configuration is needed. To achieve this, the SRIOV code must know which device was allocated to the specific flavor.

Implement the grouping

Concepts introduced here:

      * spec: a filter defined by (k,v) pairs whose keys are PCI object fields, i.e. PCI device properties like vendor_id, address, pci-type, etc.
      * extra_spec: a filter defined by (k,v) pairs whose keys are not PCI object fields.

pci utils/objects support grouping

      * pci utils (k,v) matching supports the reduced regular expression form for addresses
      * objects provide a class-level extract interface to split a request into its base spec and extra spec (see the sketch below)
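A possible shape for that extract interface, assuming a fixed set of PCI object fields; the field list and the function are hypothetical:

    # Hypothetical sketch of the extract interface: split a request's (k,v)
    # pairs into a base spec (PCI object fields) and an extra spec (the rest).
    PCI_OBJECT_FIELDS = {'vendor_id', 'product_id', 'address', 'pci-type'}

    def extract_specs(request):
        spec = {k: v for k, v in request.items() if k in PCI_OBJECT_FIELDS}
        extra = {k: v for k, v in request.items() if k not in PCI_OBJECT_FIELDS}
        return spec, extra

    print(extract_specs({'vendor_id': '8086', 'address': '0000:01:*.7',
                         'e.group': 'gpu'}))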

pci-flavor (white-list) supports address sets

      * the white-list supports 'address' comparison using reduced regular expressions (see the sketch below)
      * the white-list supports any other (k,v) pair to group devices or store special information
      * objects extract specs and extra_info; the specs are used as the white-list spec, and the extra info is written to the device's extra_info field
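One way the reduced regular expression for addresses could work, treating '*' as a wildcard; this translation is an assumption for illustration, not the final syntax:

    import re

    # Sketch: match a reduced address expression like '0000:01:*.7' against a
    # device BDF by translating '*' into a regex wildcard first.
    def address_matches(pattern, address):
        regex = re.escape(pattern).replace(r'\*', '.*')
        return re.fullmatch(regex, address) is not None

    print(address_matches('0000:01:*.7', '0000:01:00.7'))  # True
    print(address_matches('0000:01:*.7', '0000:02:00.7'))  # False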

enable instance flavor support for pci-flavor

      * the pci-flavor's name is set in the extra spec of the instance type
      * the PCI manager uses the extract method to split the specs and extra_specs, and matches them against the PCI object and object.extra_info

pci stats group devices based on pci-flavor

        * current grouping is based on [vendor_id, product_id, extra_info]
        * we are going to use the 'pci-flavor' to group the devices
        * compatibility is kept by default, with a new config option to switch to the new grouping policy

Implement device marking from the pci-flavor

Here is how a user can identify which device was allocated for the pci-flavor:

    * while defining the flavor, put a marker (a network uuid) into the flavor; it is then stored in the device's extra_info field
    * after allocation finishes, the user can search an instance's PCI devices to find the specific device and do further configuration
    * the marker data travels from the user to the device via the pci_request, which is converted from the pci-flavor (see the sketch below)
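A small sketch of that marker flow under the assumptions above; make_pci_request and allocate_device are hypothetical stand-ins for the pci_request plumbing:

    # Hypothetical marker flow: the flavor carries a network uuid, the
    # pci_request copies it, and allocation stamps it into extra_info.
    flavor = {'name': 'privateNIC', 'vendor_id': '8086',
              'e.neutron-network-uuid': 'uuid-1'}

    def make_pci_request(flavor, count=1):
        return {'count': count, 'spec': flavor,
                'marker': flavor.get('e.neutron-network-uuid')}

    def allocate_device(request, device):
        if request['marker']:
            device.setdefault('extra_info', {})['neutron-network-uuid'] = request['marker']
        return device

    dev = allocate_device(make_pci_request(flavor),
                          {'address': '0000:01:00.7', 'vendor_id': '8086'})
    print(dev['extra_info'])  # networking code can now identify the right VF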