HeterogeneousTileraSupport


 * Launchpad Entry: NovaSpec:heterogeneous-tilera-architecture-support
 * Created: Brian Schott
 * Maintained: Mikyung Kang
 * Contributors: USC Information Sciences Institute

Summary
This blueprint proposes to add support for the Tilera tiled-processor machines as an alternative machine type in OpenStack. This blueprint is dependent on the schema changes described in the HeterogeneousInstanceTypes blueprint and the scheduler in HeterogeneousArchitectureScheduler.

The target release for this is Essex:
 * Nova upstream: https://github.com/openstack/nova/tree/stable/essex
 * ISI github: https://github.com/usc-isi/nova/tree/hpc-trunk-essex

The folsom branch is here:
 * Nova upstream: https://github.com/openstack/nova
 * ISI github: https://github.com/usc-isi/nova (working)

This blueprint is related to the HeterogeneousInstanceTypes blueprint here:
 * http://wiki.openstack.org/HeterogeneousInstanceTypes

We are also drafting blueprints for other machine types:
 * http://wiki.openstack.org/HeterogeneousGpuAcceleratorSupport
 * http://wiki.openstack.org/HeterogeneousSgiUltraVioletSupport

An etherpad for discussion of this blueprint is available at http://etherpad.openstack.org/heterogeneoustilerasupport

Release Note
Nova has been extended to support non-virtualizable architectures, for example the Tilera TILEmpower platform (TILEPro64 processor).

Rationale
See HeterogeneousInstanceTypes.

User stories
Jackie has an application that she wants to run on a CPU that does not support KVM or Xen, for example a TILEmpower board. She chooses the tp64.8x8 instance type, which provides access to a TILEmpower board with a 64-core TILEPro64 processor.

$ euca-run-instances -k $key -t tp64.8x8 ami-5c86a016

$ ssh -i key.pem $Given_Tilera_IP_address

Assumptions
This blueprint depends on tp64.8x8 being a selectable instance type and on the scheduler knowing that such an instance must be routed to a TILEmpower board. See HeterogeneousArchitectureScheduler.

Supporting non-x86 architectures
Some non-x86 machine architectures of interest to technical computing users have either poor or non-existent support for virtualization. For example, our heterogeneous target, Tilera Linux (MDE-3.0), does not yet support KVM or Xen virtualization.

One alternative to using virtualization to provision hardware in a cloud environment is to do bare-metal provisioning: rebooting the machine to a fresh system image before handing over control to the user, and wiping the local hard drive when the user is done with the resources.

To support the Tilera architecture through OpenStack, we developed a proxy compute node implementation, where our customized nova-compute service acts as a front-end that proxies requests for nodes to a Tilera-specific back-end that does the bare metal provisioning of the nodes as needed.

Our intention is to ultimately support different provisioning back-ends. Several provisioning tools are available, such as Dell's Crowbar (an extension of Opscode's Chef system), Argonne National Lab's Heckle, xCAT, Perceus, OSCAR, and ROCKS. These tools provide different bare-metal provisioning, deployment, resource management, and authentication methods for different architectures. They rely on standard interfaces such as PXE (Preboot Execution Environment) for booting and IPMI (Intelligent Platform Management Interface) for power-cycle management. For boards that do not support PXE and IPMI, such as the TILEmpower board, specific back-ends must be written.

To support a non-x86 architecture such as Tilera, a Proxy Compute Node has to be designed.

An x86 Proxy Compute Node is connected to the TILEmpower boards through the network, and one Proxy Compute Node may handle multiple TILEmpower boards. The boards are connected to the network such that a cloud user can ssh into them directly once an instance has started. Each TILEmpower board is configured to be tftp-bootable or nfs-bootable, and the Proxy Compute Node acts as the tftp/nfs server for the boards. After the Proxy Compute Node receives an instance image from the image server, it wakes up a TILEmpower board and controls its booting. Once the board has booted, the Proxy Compute Node does nothing further except terminate, reboot, power down, or power up the board. While a Tilera instance is running, the user accesses the TILEmpower board itself through ssh, not the Proxy Compute Node. Here we assume that the Proxy Compute Node can power the TILEmpower boards on and off remotely using a PDU (Power Distribution Unit). The block diagram shown below describes the procedure in detail.



The Proxy Compute Node keeps information about the systems that depend on it in the file /tftpboot/architecture_information_file (ex. tilera_boards). The file contains static information about each board, such as MAC address, processor type, memory size, and disk size. It is also used to track dynamic information, such as the status of the board (power off/booting/running/shutting down) and the instance_id if the board is running an instance. It is the Proxy Compute Node's responsibility to keep track of the status of the dependent systems.
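The per-board bookkeeping described above could be modelled roughly as follows. This is a minimal sketch and not the code in nova: the names BoardStatus, Board, assign and release are hypothetical, and the four states are simply the ones listed in the paragraph above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class BoardStatus(Enum):
    # The four states named in the text; the enum itself is hypothetical.
    POWER_OFF = "power off"
    BOOTING = "booting"
    RUNNING = "running"
    SHUTTING_DOWN = "shutting down"


@dataclass
class Board:
    # Static information about the board.
    board_id: int
    mac_address: str
    # Dynamic information tracked by the Proxy Compute Node.
    status: BoardStatus = BoardStatus.POWER_OFF
    instance_id: Optional[str] = None

    def assign(self, instance_id: str) -> None:
        """Reserve the board for an instance and mark it as booting."""
        if self.instance_id is not None:
            raise RuntimeError(
                f"board {self.board_id} already runs {self.instance_id}")
        self.instance_id = instance_id
        self.status = BoardStatus.BOOTING

    def release(self) -> None:
        """Forget the instance and mark the board as powered off."""
        self.instance_id = None
        self.status = BoardStatus.POWER_OFF
```

Because a board hosts at most one instance at a time, assign refuses a second instance instead of silently overwriting the first.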

1. TFTP setting for Proxy Compute Node

$ vi /etc/xinetd.d/tftp
service tftp
{
    protocol = udp
    port = 69
    socket_type = dgram
    wait = yes
    user = root
    server = /usr/sbin/in.tftpd
    server_args = /tftpboot
    disable = no
}
$ /etc/init.d/xinetd restart

2. File preparation in Proxy Compute Node:/tftpboot

Copy the following files to /tftpboot:
 * vmlinuz: Linux image
 * initrd: initial RAM file system
 * disk: file system image
 * architecture_information_file (ex. tilera_boards)
 * pdu_mgr

2.1 Example of architecture_information_file: /tftpboot/tilera_boards

The Proxy Compute Node manages TILEmpower board information (board_id, board_ip_address, board_mac_address, board_hw_description, etc.) using this tilera_boards file.

# board_id ip_address mac_address vcpus memory_mb local_gb memory_mb_used local_gb_used hv_type hv_ver cpu_info
0           10.0.2.1   00:1A:CA:00:57:90 10 16218 917 476 1 tilera_hv 1  \
{"vendor":"tilera","model":"TILEmpower","arch":"TILEPro64", \
"features":["8x8Grid","32bVLIW","5.6MBCache","443BOPS","37TbMesh", \
"700MHz-866MHz","4DDR2","2XAUIMAC/PHY","2GbEMAC"],"topology":{"cores":"64"}}
1           10.0.2.2   00:1A:CA:00:58:98 10 16218 917 476 1 tilera_hv 1  \
{"vendor":"tilera","model":"TILEmpower","arch":"TILEPro64", \
"features":["8x8Grid","32bVLIW","5.6MBCache","443BOPS","37TbMesh", \
"700MHz-866MHz","4DDR2","2XAUIMAC/PHY","2GbEMAC"],"topology":{"cores":"64"}}
2           10.0.2.3   00:1A:CA:00:58:50 10 16218 917 476 1 tilera_hv 1  \
{"vendor":"tilera","model":"TILEmpower","arch":"TILEPro64", \
"features":["8x8Grid","32bVLIW","5.6MBCache","443BOPS","37TbMesh", \
"700MHz-866MHz","4DDR2","2XAUIMAC/PHY","2GbEMAC"],"topology":{"cores":"64"}}
3           10.0.2.4   00:1A:CA:00:57:A8 10 16218 917 476 1 tilera_hv 1  \
{"vendor":"tilera","model":"TILEmpower","arch":"TILEPro64", \
"features":["8x8Grid","32bVLIW","5.6MBCache","443BOPS","37TbMesh", \
"700MHz-866MHz","4DDR2","2XAUIMAC/PHY","2GbEMAC"],"topology":{"cores":"64"}}
4           10.0.2.5   00:1A:CA:00:58:AA 10 16218 917 476 1 tilera_hv 1  \
{"vendor":"tilera","model":"TILEmpower","arch":"TILEPro64", \
"features":["8x8Grid","32bVLIW","5.6MBCache","443BOPS","37TbMesh", \
"700MHz-866MHz","4DDR2","2XAUIMAC/PHY","2GbEMAC"],"topology":{"cores":"64"}}
5           10.0.2.6   00:1A:CA:00:58:2C 10 16218 917 476 1 tilera_hv 1  \
{"vendor":"tilera","model":"TILEmpower","arch":"TILEPro64", \
"features":["8x8Grid","32bVLIW","5.6MBCache","443BOPS","37TbMesh", \
"700MHz-866MHz","4DDR2","2XAUIMAC/PHY","2GbEMAC"],"topology":{"cores":"64"}}
6           10.0.2.7   00:1A:CA:00:58:5C 10 16218 917 476 1 tilera_hv 1  \
{"vendor":"tilera","model":"TILEmpower","arch":"TILEPro64", \
"features":["8x8Grid","32bVLIW","5.6MBCache","443BOPS","37TbMesh", \
"700MHz-866MHz","4DDR2","2XAUIMAC/PHY","2GbEMAC"],"topology":{"cores":"64"}}
7           10.0.2.8   00:1A:CA:00:58:A4 10 16218 917 476 1 tilera_hv 1  \
{"vendor":"tilera","model":"TILEmpower","arch":"TILEPro64", \
"features":["8x8Grid","32bVLIW","5.6MBCache","443BOPS","37TbMesh", \
"700MHz-866MHz","4DDR2","2XAUIMAC/PHY","2GbEMAC"],"topology":{"cores":"64"}}
8           10.0.2.9   00:1A:CA:00:58:1A 10 16218 917 476 1 tilera_hv 1  \
{"vendor":"tilera","model":"TILEmpower","arch":"TILEPro64", \
"features":["8x8Grid","32bVLIW","5.6MBCache","443BOPS","37TbMesh", \
"700MHz-866MHz","4DDR2","2XAUIMAC/PHY","2GbEMAC"],"topology":{"cores":"64"}}
9           10.0.2.10  00:1A:CA:00:58:38 10 16385 1000 0 0 tilera_hv 1  \
{"vendor":"tilera","model":"TILEmpower","arch":"TILEPro64", \
"features":["8x8Grid","32bVLIW","5.6MBCache","443BOPS","37TbMesh", \
"700MHz-866MHz","4DDR2","2XAUIMAC/PHY","2GbEMAC"],"topology":{"cores":"64"}}
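A file in this layout can be read with a few lines of Python. The sketch below is illustrative only and is not the parser used in nova: the function name parse_boards and the FIELDS tuple are hypothetical, and it assumes that every record has ten whitespace-separated columns followed by a JSON cpu_info value, with a trailing backslash marking a continued line. The sample record is shortened for brevity.

```python
import json

# Column names taken from the header comment of the example file.
FIELDS = ("board_id", "ip_address", "mac_address", "vcpus", "memory_mb",
          "local_gb", "memory_mb_used", "local_gb_used", "hv_type", "hv_ver")
INT_FIELDS = {"board_id", "vcpus", "memory_mb", "local_gb",
              "memory_mb_used", "local_gb_used", "hv_ver"}


def parse_boards(text):
    """Parse tilera_boards-style text into a list of dicts (sketch)."""
    # Join physical lines that end with a backslash continuation.
    logical, pending = [], ""
    for raw in text.splitlines():
        line = raw.strip()
        if line.endswith("\\"):
            pending += line[:-1].strip() + " "
            continue
        logical.append(pending + line)
        pending = ""

    boards = []
    for line in logical:
        if not line or line.startswith("#"):
            continue  # skip blank lines and the header comment
        # The first ten columns are whitespace separated; the rest is JSON.
        *columns, cpu_info = line.split(None, len(FIELDS))
        record = dict(zip(FIELDS, columns))
        for name in INT_FIELDS:
            record[name] = int(record[name])
        record["cpu_info"] = json.loads(cpu_info)
        boards.append(record)
    return boards


# Shortened sample record; a raw string keeps the backslashes literal.
SAMPLE = r"""# board_id ip_address mac_address vcpus memory_mb local_gb memory_mb_used local_gb_used hv_type hv_ver cpu_info
0 10.0.2.1 00:1A:CA:00:57:90 10 16218 917 476 1 tilera_hv 1 \
{"vendor":"tilera","model":"TILEmpower","arch":"TILEPro64", \
"features":["8x8Grid"],"topology":{"cores":"64"}}
"""

boards = parse_boards(SAMPLE)
print(boards[0]["mac_address"], boards[0]["cpu_info"]["topology"]["cores"])
```

Note that the core count inside cpu_info is stored as a JSON string, so a caller that needs a number has to convert it explicitly.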

2.2 pdu_mgr: Expect script that controls the PDU (Power Distribution Unit) for remote power-on/off/reboot

Schema Changes in Bare metal implementation for TILEmpower boards
We propose adding the following default values to the instance_types table.

't64.8x8': dict(memory_mb=16384, vcpus=1, local_gb=500,
                flavorid=301,
                cpu_arch="tile64",
                cpu_info='{"geometry":"8x8"}'),
'tp64.8x8': dict(memory_mb=16384, vcpus=1, local_gb=500,
                 flavorid=302,
                 cpu_arch="tilepro64",
                 cpu_info='{"geometry":"8x8"}'),
'tgx.4x4': dict(memory_mb=16384, vcpus=1, local_gb=500,
                flavorid=303,
                cpu_arch="tile-gx16",
                cpu_info='{"geometry":"4x4"}'),
'tgx.6x6': dict(memory_mb=16384, vcpus=1, local_gb=500,
                flavorid=304,
                cpu_arch="tile-gx36",
                cpu_info='{"geometry":"6x6"}'),
'tgx.8x8': dict(memory_mb=16384, vcpus=1, local_gb=500,
                flavorid=305,
                cpu_arch="tile-gx64",
                cpu_info='{"geometry":"8x8"}'),
'tgx.10x10': dict(memory_mb=16384, vcpus=1, local_gb=500,
                  flavorid=306,
                  cpu_arch="tile-gx100",
                  cpu_info='{"geometry":"10x10"}')
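Each instance type is declared with vcpus=1, so the number of tiles is only visible through the geometry stored in cpu_info. As a rough illustration, the tile count can be derived as shown below. The helper name tile_count is hypothetical and not part of nova, and only two of the proposed entries are repeated here.

```python
import json

# Two of the proposed instance types, reduced to the fields used here.
INSTANCE_TYPES = {
    "tp64.8x8": dict(cpu_arch="tilepro64", cpu_info='{"geometry":"8x8"}'),
    "tgx.6x6": dict(cpu_arch="tile-gx36", cpu_info='{"geometry":"6x6"}'),
}


def tile_count(cpu_info):
    """Return rows * columns for a cpu_info JSON string with a geometry."""
    rows, cols = (int(n) for n in json.loads(cpu_info)["geometry"].split("x"))
    return rows * cols


for name, spec in INSTANCE_TYPES.items():
    print(name, spec["cpu_arch"], tile_count(spec["cpu_info"]))
```

For the tile-gx entries the result agrees with the number embedded in cpu_arch (for example 6x6 gives 36 for tile-gx36), which makes a convenient sanity check when new types are added.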

Proxy Compute Node in Bare metal implementation for TILEmpower boards (nova/virt/tilera.py)
The Tilera Compute Node shown below is an example of a Proxy Compute Node. After setting the status of the instance and domain to Pending, the Proxy Compute Node sets vmlinux for the board. If vmlinux is already stored in CF (Compact Flash) on the board, this step is not needed. The x in the file names stands for the board_id. The mboot-run over NFS root sets up the Tilera file system for the corresponding board in its NFS root directory. After that, the Proxy Compute Node sets the status of the instance and domain to Running, and the user can access the board through ssh.

Option1: nfs-bootable (Current nova code supports option1)

For option 2 below, the x in the file names stands for the board_id, and 1 denotes the first boot, which uses a vmlinux image that sets TLR_ROOT=tmpfs. By default the rootfs is copied to a tmpfs whose size limit is half of total memory. After the first mboot-run, in which vmlinux is downloaded over tftp, the Proxy Compute Node uploads the compressed Tilera file system into memory, mounts /dev/sda1 on /mnt, and uncompresses the uploaded file system into /mnt. The Proxy Compute Node then copies vmlinux_x_2 to vmlinux_x, where 2 denotes the second boot, which uses a vmlinux image that sets TLR_ROOT=/dev/sda1. After the second mboot-run, the rootfs is /dev/sda1. The Proxy Compute Node then sets the status of the instance and domain to Running, and the user can access the board through ssh.

Option2: tftp-bootable (Previous nova code supports option2)
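The ordering of the tftp boot (option 2) can be summarised as a list of steps. The sketch below only builds that list and does not touch any board. The function name tftp_boot_plan and the step labels are hypothetical, and the file name vmlinux_x_1 for the first-boot image, together with the initial copy of it to vmlinux_x, is an assumption made by analogy with vmlinux_x_2.

```python
def tftp_boot_plan(board_id):
    """Return the ordered steps of the two-stage tftp boot for one board."""
    first = f"vmlinux_{board_id}_1"   # assumed name: sets TLR_ROOT=tmpfs
    second = f"vmlinux_{board_id}_2"  # sets TLR_ROOT=/dev/sda1
    active = f"vmlinux_{board_id}"    # image the board downloads over tftp
    return [
        ("set_state", "pending"),
        ("copy", f"{first} -> {active}"),   # assumed first-boot preparation
        ("mboot_run", active),              # first boot, rootfs in tmpfs
        ("upload_rootfs", "compressed file system into memory"),
        ("mount", "/dev/sda1 on /mnt"),
        ("uncompress", "file system into /mnt"),
        ("copy", f"{second} -> {active}"),
        ("mboot_run", active),              # second boot, rootfs on /dev/sda1
        ("set_state", "running"),
    ]


for action, detail in tftp_boot_plan(3):
    print(f"{action:14s} {detail}")
```

Keeping the plan as data makes it easy to see that the board boots twice from the same file name, with the image behind that name swapped between the two boots.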

Implementation
The USC-ISI team has a functional prototype: https://github.com/usc-isi/nova/tree/hpc-testing

The Proxy Compute Node should be implemented as virt/baremetal/proxy.py, virt/baremetal/dom.py, and an architecture-specific module (for example, tilera.py or arm.py). proxy.py would define the Connection class for the non-virtualizable architecture, dom.py would contain the domain-related code, and both may call into the architecture-specific module.



UI Changes
The following will be available as new default instance types.

Tilera TILEPro64

 * API name: tp64.8x8
 * TILEPro64 processor: 64 (8x8) cores
 * 16 GB RAM (16384 MB)
 * 1 TB of instance storage
 * http://www.tilera.com/

(Only one Tilera instance type for now. When KVM support appears, we will add additional types to support partitioning into smaller instances)

[Code Changes]
This code has already been merged into nova upstream.


 * nova/virt/connection.py
   * Adds the tilera connection_type.
 * nova/virt/baremetal/__init__.py
 * nova/virt/baremetal/proxy.py (later renamed to driver.py to stay consistent with the libvirt/xen connection code)
   * Similar to nova/virt/libvirt/connection.py except for domain processing.
 * nova/virt/baremetal/dom.py
   * Domain management is separated out into BareMetalDom [dom.py]. BareMetalDom manages domain status in the way libvirt does for KVM or Xen. Because bare-metal domains are not supported by libvirt, the domain management was newly written using file I/O. It can be upgraded or replaced by a new domain manager.
 * nova/virt/baremetal/nodes.py
   * Selects the back-end, for example tilera.py, fake.py, arm.py, or pxe.py.
 * nova/virt/baremetal/tilera.py
   * The Tilera driver [tilera.py] is one example among several possible bare-metal types, such as tilera, arm, or another PXE-booted node. Currently only the tilera driver is written; tilera.py manages the actual booting and terminating of a Tilera node. Other types can be attached and reuse the same BareMetalConnection and BareMetalDom. The baremetal_driver flag selects which bare-metal type is used.
 * nova/virt/baremetal/fake.py
 * nova/tests/baremetal/__init__.py
 * nova/tests/baremetal/test_proxy_bare_metal.py
 * nova/tests/baremetal/test_tilera.py
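The way a flag value can pick a back-end while the connection and domain code stay unchanged is sketched below. This is not the contents of nodes.py: FakeNodes, BACKENDS, select_backend and the method names are hypothetical, and only a fake back-end is registered so that the example runs without any hardware.

```python
class FakeNodes:
    """Stand-in back-end that only records what it was asked to do."""

    def __init__(self):
        self.calls = []

    def activate_node(self, board_id):
        self.calls.append(("activate", board_id))

    def deactivate_node(self, board_id):
        self.calls.append(("deactivate", board_id))


# Hypothetical registry: flag value -> back-end class.
# A real deployment would add entries such as "tilera" here.
BACKENDS = {"fake": FakeNodes}


def select_backend(baremetal_driver):
    """Return a back-end instance for the given baremetal_driver value."""
    try:
        factory = BACKENDS[baremetal_driver]
    except KeyError:
        raise ValueError(
            f"unknown baremetal_driver: {baremetal_driver!r}") from None
    return factory()


nodes = select_backend("fake")
nodes.activate_node(3)
print(nodes.calls)
```

Since callers only depend on the methods of the returned object, adding another bare-metal type amounts to registering one more class.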

[nova.conf]
Most of the nova.conf options are shown at:
 * http://docs.openstack.org/essex/openstack-compute/admin/content/compute-options-reference.html

For the Tilera bare-metal driver, set the following options in nova.conf:
 * --connection_type=baremetal
 * --baremetal_driver=tilera
 * --cpu_arch=tilepro64
 * --tile_monitor=
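A quick way to check such a flag file before starting nova-compute is sketched below. The helper names parse_flags and missing_flags are hypothetical, the parser handles only the --name=value form shown above, and treating connection_type, baremetal_driver and cpu_arch as the required set is an assumption made for illustration.

```python
def parse_flags(lines):
    """Turn '--name=value' lines into a dict; other forms are rejected."""
    flags = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # ignore blank lines and comments
        if not line.startswith("--") or "=" not in line:
            raise ValueError(f"not a --name=value flag: {line!r}")
        name, value = line[2:].split("=", 1)
        flags[name] = value
    return flags


# Assumed set of flags that must be non-empty for the Tilera driver.
REQUIRED = ("connection_type", "baremetal_driver", "cpu_arch")


def missing_flags(flags):
    """Return the required flag names that are absent or empty."""
    return [name for name in REQUIRED if not flags.get(name)]


conf = parse_flags([
    "--connection_type=baremetal",
    "--baremetal_driver=tilera",
    "--cpu_arch=tilepro64",
    "--tile_monitor=",
])
print(missing_flags(conf))
```

A flag written without a value, such as the tile_monitor line above, is kept with an empty string so that a later check can tell it apart from a flag that is absent.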

[tftpboot directory]

 * /tftpboot/tilera_boards: tilera bare-metal node information
 * /tftpboot/fs_*: NFS mount directory (ex. fs_1=tilera board id#1)

Migration
Very little needs to change in terms of the way deployments will use this if we set sane defaults like "x86_64" as assumed today.

Test/Demo Plan
This need not be added or completed until the specification is nearing beta.

Unresolved issues
One of the challenges we have is that the flavorid field in the instance_types table isn't auto-increment. We've selected high numbers to avoid collisions, but the community should discuss how flavorid behaves and the best approach for adding future new instance types.

BoF agenda and discussion
Use this section to take notes during the BoF; if you keep it in the approved spec, use it for summarising what was discussed and note any options that were rejected.