Baremetal

Summary

This blueprint proposes a general bare-metal provisioning framework for OpenStack.

The target release is Folsom. USC/ISI and NTT docomo are working on integrating a bare-metal provisioning implementation that supports the following:

  • Support PXE and non-PXE bare-metal machines (Review#1)
  • Support several architecture types such as x86_64, tilepro64, and arm (Review#1)
  • Support fault-tolerance of bare-metal nova-compute node (Review#2)

The USC/ISI team has a branch containing the general bare-metal provisioning framework and non-PXE support.

An etherpad for discussion of this blueprint is available at http://etherpad.openstack.org/FolsomBareMetalCloud

Building on that work, the NTT docomo and USC/ISI teams have collaborated on the new general bare-metal provisioning framework.

Code changes

    nova/nova/virt/baremetal/*
    nova/nova/tests/baremetal/*
    nova/bin/bm*
    nova/nova/scheduler/baremetal_host_manager.py
    nova/nova/tests/scheduler/test_baremetal_host_manager.py


Overview

1) A user requests a baremetal instance.

  • Non-PXE (Tilera):
    euca-run-instances -t tp64.8x8 -k my.key ami-CCC    
  • PXE
    euca-run-instances -t baremetal.small --kernel aki-AAA --ramdisk ari-BBB ami-CCC


2) nova-scheduler selects a baremetal nova-compute with the following configuration.

  • Here we assume that:
    $IP
      The IP address of the machine running MySQL for the baremetal DB (127.0.0.1 in the examples below). Change it if MySQL runs on a different machine.

    $ID
      Replace with the MySQL user name.

    $Password
      Replace with the MySQL password.


  • Non-PXE (Tilera) [nova.conf]:
    baremetal_sql_connection=mysql://$ID:$Password@$IP/nova_bm
    compute_driver=nova.virt.baremetal.driver.BareMetalDriver
    baremetal_driver=tilera
    power_manager=tilera_pdu
    instance_type_extra_specs=cpu_arch:tilepro64
    baremetal_tftp_root = /tftpboot
    scheduler_host_manager=nova.scheduler.baremetal_host_manager.BaremetalHostManager


  • PXE [nova.conf]:
    baremetal_sql_connection=mysql://$ID:$Password@$IP/nova_bm
    compute_driver=nova.virt.baremetal.driver.BareMetalDriver
    baremetal_driver=pxe
    power_manager=ipmi
    instance_type_extra_specs=cpu_arch:x86_64
    baremetal_tftp_root = /tftpboot
    scheduler_host_manager=nova.scheduler.baremetal_host_manager.BaremetalHostManager
    baremetal_deploy_kernel = xxxxxxxxxx
    baremetal_deploy_ramdisk = yyyyyyyy
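
  • The baremetal_deploy_kernel and baremetal_deploy_ramdisk values are the Glance image IDs of the deployment kernel and ramdisk. For example, with the IDs returned by 'glance add' in the 'Packages B: PXE' section below (your IDs will differ):
    baremetal_deploy_kernel = d76012fc-4055-485c-a978-f748679b89a9
    baremetal_deploy_ramdisk = e99775cb-f78d-401e-9d14-acd86e2f36e3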


3) The bare-metal nova-compute selects a bare-metal node from its pool based on hardware resources and the instance type (number of CPUs, memory, and local disk).

4) Deployment images and configuration are prepared.

  • Non-PXE (Tilera):
    • The key-injected file system is prepared and the NFS directory is configured for the bare-metal nodes. The kernel is already stored on the CF (Compact Flash) card of each Tilera board, and no ramdisk is used for the Tilera bare-metal nodes. For NFS mounting, /tftpboot/fs_x (x = node_id) must be set up before launching instances.
  • PXE:
    • The deployment kernel and ramdisk, as well as the user-specified kernel and ramdisk, are placed on the TFTP server, and PXE is configured for the baremetal host (a sketch of such a configuration follows).
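    • As a rough sketch (names below are illustrative, not the exact entries the PXE driver writes), the file placed under /tftpboot/pxelinux.cfg/ is a standard pxelinux configuration pointing at the deployment kernel and ramdisk:
      # illustrative pxelinux entry; the driver adds further deployment parameters
      default deploy
      label deploy
        kernel deploy_kernel
        append initrd=deploy_ramdisk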

5) The baremetal nova-compute powers on the baremetal node through:

  • Non-PXE (Tilera): PDU (Power Distribution Unit)
  • PXE: IPMI
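
  • For IPMI, the power-on is roughly equivalent to the ipmitool commands below (address and credentials are the ones from the bm_node_create example later in this document; the lanplus interface type is an assumption):
    # example values only; -I lanplus is an assumed interface type
    $ ipmitool -I lanplus -H 172.27.2.116 -U test -P password chassis power on
    $ ipmitool -I lanplus -H 172.27.2.116 -U test -P password chassis power status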

6) The image is deployed to the bare-metal node.

  • Non-PXE (Tilera): nova-compute mounts the AMI into the NFS directory corresponding to the ID of the selected Tilera bare-metal node.
  • PXE: The host runs the deployment kernel and ramdisk, and the baremetal nova-compute writes the AMI to the host's local disk via iSCSI (a sketch of the iSCSI steps follows).
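
  • The iSCSI attach/detach is plain open-iscsi usage; a minimal sketch (the target address is taken from the dnsmasq example below, and the image path and disk device are hypothetical):
    # discover and log in to the iSCSI target exported by the deployment ramdisk
    $ sudo iscsiadm -m discovery -t sendtargets -p 192.168.175.100
    $ sudo iscsiadm -m node -p 192.168.175.100 --login
    # write the AMI to the disk that appears (/dev/sdb and the image path are hypothetical)
    $ sudo dd if=/tmp/ami.img of=/dev/sdb bs=1M
    $ sudo iscsiadm -m node -p 192.168.175.100 --logout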

7) The bare-metal node is booted.

  • Non-PXE (Tilera):
    • The bare-metal node is configured for networking, SSH, and iptables rules.
    • Done.
  • PXE:
    • The host is rebooted.
    • Next, the host boots with the user-specified kernel and ramdisk, using its local disk.
    • Done.

Packages A: Non-PXE (Tilera)

  • This procedure is for RHEL. Reading 'tilera-bm-instance-creation.txt' may make this document easier to understand.
  • TFTP, NFS, EXPECT, and Telnet installation:
    $ yum install nfs-utils.x86_64 expect.x86_64 tftp-server.x86_64 telnet


  • TFTP configuration:
    $ cat /etc/xinetd.d/tftp
    # default: off
    # description: The tftp server serves files using the trivial file transfer \
    #       protocol.  The tftp protocol is often used to boot diskless \
    #       workstations, download configuration files to network-aware printers, \
    #       and to start the installation process for some operating systems.
    service tftp  
    {
          socket_type             = dgram
          protocol                = udp
          wait                    = yes
          user                    = root
          server                  = /usr/sbin/in.tftpd
          server_args             = -s /tftpboot
          disable                 = no
          per_source              = 11
          cps                     = 100 2
          flags                   = IPv4
    }
    $ /etc/init.d/xinetd restart


  • NFS configuration:
    $ mkdir /tftpboot
    $ mkdir /tftpboot/fs_x   # x: the ID of the Tilera board
    $ cat /etc/exports
    /tftpboot/fs_0 tilera0-eth0(sync,rw,no_root_squash,no_all_squash,no_subtree_check)
    /tftpboot/fs_1 tilera1-eth0(sync,rw,no_root_squash,no_all_squash,no_subtree_check) 
    /tftpboot/fs_2 tilera2-eth0(sync,rw,no_root_squash,no_all_squash,no_subtree_check)
    /tftpboot/fs_3 tilera3-eth0(sync,rw,no_root_squash,no_all_squash,no_subtree_check)
    /tftpboot/fs_4 tilera4-eth0(sync,rw,no_root_squash,no_all_squash,no_subtree_check)
    /tftpboot/fs_5 tilera5-eth0(sync,rw,no_root_squash,no_all_squash,no_subtree_check)
    /tftpboot/fs_6 tilera6-eth0(sync,rw,no_root_squash,no_all_squash,no_subtree_check)
    /tftpboot/fs_7 tilera7-eth0(sync,rw,no_root_squash,no_all_squash,no_subtree_check)
    /tftpboot/fs_8 tilera8-eth0(sync,rw,no_root_squash,no_all_squash,no_subtree_check)
    /tftpboot/fs_9 tilera9-eth0(sync,rw,no_root_squash,no_all_squash,no_subtree_check)
    $ sudo /etc/init.d/nfs restart
    $ sudo /usr/sbin/exportfs
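
  • To confirm that the fs_x directories are exported (showmount is part of nfs-utils):
    $ showmount -e localhost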


  • Install TileraMDE (TileraMDE-3.0.1.125620):
    $ cd /usr/local/
    $ tar -xvf tileramde-3.0.1.125620_tilepro.tar
    $ tar -xjvf tileramde-3.0.1.125620_tilepro_apps.tar.bz2
    $ tar -xjvf tileramde-3.0.1.125620_tilepro_src.tar.bz2
    $ mkdir /usr/local/TileraMDE-3.0.1.125620/tilepro/tile
    $ cd /usr/local/TileraMDE-3.0.1.125620/tilepro/tile/
    $ tar -xjvf tileramde-3.0.1.125620_tilepro_tile.tar.bz2
    $ ln -s /usr/local/TileraMDE-3.0.1.125620/tilepro/ /usr/local/TileraMDE


  • Install the 32-bit libraries required to run TileraMDE:
    $ yum install glibc.i686 glibc-devel.i686


Packages B: PXE

  • This procedure is for Ubuntu 12.04 x86_64. Reading 'baremetal-instance-creation.txt' may make this document easier to understand.
  • dnsmasq (PXE server for baremetal hosts)
  • syslinux (bootloader for PXE)
  • ipmitool (operates IPMI)
  • qemu-kvm (only for qemu-img)
  • open-iscsi (connects to the iSCSI target on baremetal hosts)
  • busybox (used in the deployment ramdisk)
  • tgt (used in the deployment ramdisk)
  • Example:
    $ sudo apt-get install dnsmasq syslinux ipmitool qemu-kvm open-iscsi
    $ sudo apt-get install busybox tgt


  • Build the deployment ramdisk with baremetal-initrd-builder:
    $ cd baremetal-initrd-builder
    $ ./baremetal-mkinitrd.sh <ramdisk output path> <kernel version>


  • Example:
    $ ./baremetal-mkinitrd.sh /tmp/deploy-ramdisk.img 3.2.0-26-generic
    working in /tmp/baremetal-mkinitrd.9AciX98N
    368017 blocks

  • Register the kernel and the ramdisk to Glance:


    $ glance add name="baremetal deployment ramdisk" is_public=true container_format=ari disk_format=ari < /tmp/deploy-ramdisk.img
    Uploading image 'baremetal deployment ramdisk'
    ===========================================[100%] 114.951697M/s, ETA  0h  0m  0s
    Added new image with ID: e99775cb-f78d-401e-9d14-acd86e2f36e3
    
    $ glance add name="baremetal deployment kernel" is_public=true container_format=aki disk_format=aki < /boot/vmlinuz-3.2.0-26-generic
    Uploading image 'baremetal deployment kernel'
    ===========================================[100%] 46.9M/s, ETA  0h  0m  0s
    Added new image with ID: d76012fc-4055-485c-a978-f748679b89a9


  • ShellInABox
  • Baremetal nova-compute uses ShellInABox (http://code.google.com/p/shellinabox/) so that users can access the baremetal host's console through a web browser.
  • Build from source and install:
    $ sudo apt-get install gcc make
    $ tar xzf shellinabox-2.14.tar.gz
    $ cd shellinabox-2.14
    $ ./configure
    $ sudo make install
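
  • To check the installation, shellinaboxd can be started by hand and reached from a browser; the port here is only an example, since nova-compute launches shellinaboxd per node on the --terminal_port registered with bm_node_create:
    # -t disables SSL, -b runs in the background; port 4200 is an arbitrary example
    $ sudo shellinaboxd -t -b -p 4200
    (then open http://<this host>:4200 in a web browser)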


  • PXE Boot Server
  • Prepare TFTP root directory:
    $ sudo mkdir /tftpboot
    $ sudo cp /usr/lib/syslinux/pxelinux.0 /tftpboot/
    $ sudo mkdir /tftpboot/pxelinux.cfg
  • Start dnsmasq. Example: start dnsmasq on eth1 with PXE and TFTP enabled:
    $ sudo dnsmasq --conf-file= --port=0 --enable-tftp --tftp-root=/tftpboot --dhcp-boot=pxelinux.0 --bind-interfaces --pid-file=/dnsmasq.pid --interface=eth1 --dhcp-range=192.168.175.100,192.168.175.254
    
    (If the dnsmasq service started by the package is already running, you may need to stop and disable it:)
    $ sudo /etc/init.d/dnsmasq stop
    $ sudo update-rc.d dnsmasq disable


  • How to create an image:
  • Example: create a partition image from ubuntu cloud images' Precise tarball:
$ wget http://cloud-images.ubuntu.com/precise/current/precise-server-cloudimg-amd64-root.tar.gz
$ dd if=/dev/zero of=u.img bs=1M count=0 seek=1024
$ mkfs -F -t ext4 u.img
$ sudo mount -o loop u.img /mnt/
$ sudo tar -C /mnt -xzf ~/precise-server-cloudimg-amd64-root.tar.gz
$ sudo rm /mnt/etc/resolv.conf
# (set a temporary DNS server so apt-get works inside the chroot; 8.8.8.8 is Google Public DNS)
$ echo nameserver 8.8.8.8 | sudo tee /mnt/etc/resolv.conf
$ sudo chroot /mnt apt-get install linux-image-3.2.0-26-generic vlan open-iscsi
$ sudo ln -sf ../run/resolvconf/resolv.conf /mnt/etc/resolv.conf
$ sudo umount /mnt
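
  • To boot this image later, register it to Glance in the same way as the deployment kernel and ramdisk (the image name here is only an example; the ami formats match the ami-CCC usage elsewhere in this document):
$ glance add name="precise-baremetal" is_public=true container_format=ami disk_format=ami < u.img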


Nova Directories

    $ sudo mkdir /var/lib/nova/baremetal
    $ sudo mkdir /var/lib/nova/baremetal/console
    $ sudo mkdir /var/lib/nova/baremetal/dnsmasq


Nova Database

  • Create the baremetal database. Grant all privileges to the user specified by the 'baremetal_sql_connection' flag. Example:
$ mysql -p
mysql> create database nova_bm;
mysql> grant all privileges on nova_bm.* to '$ID'@'%' identified by '$Password';
mysql> exit


  • Create tables:
$ bm_db_sync


Create Baremetal Instance Type

  • First, create an instance type in the normal way.
$ nova-manage instance_type create --name=tp64.8x8 --cpu=64 --memory=16218 --root_gb=917 --ephemeral_gb=0 --flavor=6 --swap=1024 --rxtx_factor=1
$ nova-manage instance_type create --name=bm.small --cpu=2 --memory=4096 --root_gb=10 --ephemeral_gb=20 --flavor=7 --swap=1024 --rxtx_factor=1
(For the --flavor value, see the 'How to choose the value for flavor' section below.)


  • Next, set baremetal extra_spec to the instance type:
$ nova-manage instance_type set_key --name=tp64.8x8 --key cpu_arch --value 's== tilepro64'
$ nova-manage instance_type set_key --name=bm.small --key cpu_arch --value 's== x86_64'


How to choose the value for flavor

  • Run 'nova-manage instance_type list' and find the maximum FlavorID in the output. Use that FlavorID+1 for the new instance_type.
$ nova-manage instance_type list
m1.medium: Memory: 4096MB, VCPUS: 2, Root: 40GB, Ephemeral: 0Gb, FlavorID: 3, Swap: 0MB, RXTX Factor: 1.0, ExtraSpecs {}
m1.small: Memory: 2048MB, VCPUS: 1, Root: 20GB, Ephemeral: 0Gb, FlavorID: 2, Swap: 0MB, RXTX Factor: 1.0, ExtraSpecs {}
m1.large: Memory: 8192MB, VCPUS: 4, Root: 80GB, Ephemeral: 0Gb, FlavorID: 4, Swap: 0MB, RXTX Factor: 1.0, ExtraSpecs {}
m1.tiny: Memory: 512MB, VCPUS: 1, Root: 0GB, Ephemeral: 0Gb, FlavorID: 1, Swap: 0MB, RXTX Factor: 1.0, ExtraSpecs {}
m1.xlarge: Memory: 16384MB, VCPUS: 8, Root: 160GB, Ephemeral: 0Gb, FlavorID: 5, Swap: 0MB, RXTX Factor: 1.0, ExtraSpecs {}
  • In the example above, the maximum Flavor ID is 5, so use 6 and 7.

Start Processes

(Currently, you might have trouble if you run these processes as a user other than the superuser...)
$ sudo bm_deploy_server & 
$ sudo nova-scheduler & 
$ sudo nova-compute &


Register Baremetal Host and NIC

  • First, register a baremetal node. Next, register the baremetal node's NICs.
  • To register a baremetal node, use 'bm_node_create'. 'bm_node_create' takes the parameters listed below.
--service_host: baremetal nova-compute's hostname
--cpus: number of CPU cores
--memory_mb: memory size in megabytes
--local_gb: local disk size in gigabytes
--pm_address: tilera node's IP address / IPMI address
--pm_user: IPMI username
--pm_password: IPMI password
--prov_mac: tilera node's MAC address / PXE NIC's MAC address
--terminal_port: TCP port for ShellInABox. Each node must use a unique TCP port. If you do not need console access, use 0.


$ bm_node_create --service_host=bm1 --cpus=64 --memory_mb=16218 --local_gb=917 --pm_address=10.0.2.1 --pm_user=test --pm_password=password --prov_mac=98:4b:e1:67:9a:4c --terminal_port=0
$ bm_node_create --service_host=bm1 --cpus=4 --memory_mb=6144 --local_gb=64 --pm_address=172.27.2.116 --pm_user=test --pm_password=password --prov_mac=98:4b:e1:67:9a:4c --terminal_port=8000


  • To verify the node registration, run 'bm_node_list':
$ bm_node_list
ID   SERVICE_HOST   INSTANCE_ID   CPUS   Memory   Disk   PM_Address     PM_User   TERMINAL_PORT   PROV_MAC            PROV_VLAN
1    bm1            None          64     16218    917    10.0.2.1       test      0               98:4b:e1:67:9a:4c   None
2    bm1            None          4      6144     64     172.27.2.116   test      8000            98:4b:e1:67:9a:4c   None


  • To register a NIC, use 'bm_interface_create'. 'bm_interface_create' takes the parameters listed below.
--bm_node_id: ID of the baremetal node that owns this NIC (the first column of 'bm_node_list')
--mac_address: this NIC's MAC address in the form xx:xx:xx:xx:xx:xx
--datapath_id: datapath ID of the OpenFlow switch this NIC is connected to
--port_no: OpenFlow port number this NIC is connected to
(--datapath_id and --port_no are used for network isolation. It is OK to put 0 if you do not have an OpenFlow switch.)


$ bm_interface_create --bm_node_id=1 --mac_address=98:4b:e1:67:9a:4e --datapath_id=0 --port_no=0
$ bm_interface_create --bm_node_id=2 --mac_address=98:4b:e1:67:9a:4e --datapath_id=0x123abc --port_no=24


  • To verify the NIC registration, run 'bm_interface_list':
$ bm_interface_list
ID        BM_NODE_ID        MAC_ADDRESS         DATAPATH_ID       PORT_NO
1         1                 98:4b:e1:67:9a:4e   0x0               0
2         2                 98:4b:e1:67:9a:4e   0x123abc          24


Run Instance

  • Run an instance using the baremetal instance type. Make sure to use a kernel and image that support the baremetal hardware (i.e., contain drivers for the baremetal hardware).
euca-run-instances -t tp64.8x8 -k my.key ami-CCC
euca-run-instances -t bm.small --kernel aki-AAA --ramdisk ari-BBB ami-CCC
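
  • The instance state can then be checked as usual, for example:
euca-describe-instances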