
Baremetal

Overview

The baremetal driver is a hypervisor driver for OpenStack Nova Compute. Within the OpenStack framework, it has the same role as the drivers for other hypervisors (libvirt, xen, etc.), yet it is presently unique in that the hardware is not virtualized: there is no hypervisor between the tenants and the physical hardware. It exposes hardware via OpenStack's API, using pluggable sub-drivers to deliver machine imaging (PXE) and power control (IPMI). With this, provisioning and management of physical hardware is accomplished using common cloud APIs and tools, such as Heat or salt-cloud. Due to this unique situation, however, using the baremetal driver requires some additional preparation of its environment.

This driver was added in the Grizzly release, but it should still be considered somewhat experimental. See the Bugs section for information and links to the Launchpad bug listings.

NOTE: The baremetal driver is being split out of Nova and refactored into a stand-alone project, Ironic. Once Ironic reaches a stable release and graduates from incubation, the baremetal driver will begin deprecation. The log of the TC discussion about this can be found here. The proposal for the split can be found here.

Terminology

The baremetal driver also introduces some terminology of its own.

  • Baremetal host and compute host are often used interchangeably to refer to the machine which runs the nova-compute and nova-baremetal-deploy-helper services (and possibly other services as well). This functions like a hypervisor, providing power management and imaging services.
  • Node and baremetal node refer to the physical machines which are controlled by the compute host. When a user requests that Nova start a baremetal instance, it is created on a baremetal node.
  • A baremetal instance is a Nova instance created directly on a physical machine, without any virtualization layer running underneath it. Nova retains power control (via IPMI) and, in some situations, may also retain network control (via Neutron and OpenFlow).
  • A deploy image is a pair of specialized kernel and ramdisk images which are used by the compute host to write the user-specified image onto the baremetal node.
  • Hardware is enrolled in the baremetal driver by adding its MAC addresses, physical characteristics (# CPUs, RAM, and disk space), and the IPMI credentials into the baremetal database. Without this information, the compute host has no knowledge of the baremetal node.

Features

The current implementation of the baremetal driver provides the following functionality.

  • A Nova API to enroll & manage hardware in the baremetal database
  • Power control of enrolled hardware via IPMI
  • PXE boot of the baremetal nodes
  • Support for common CPU architectures (i386, x86_64)
  • FlatNetwork environments are supported and well tested
    • OpenFlow-enabled environments should be supported, but are less well tested at this time
  • Cloud-init is used for passing user data into the baremetal instances after provisioning. Limited support for file-injection also exists, but is being deprecated.


Current limitations include:

  • A separate dnsmasq process must run on the baremetal compute host to control the PXE boot process. This conflicts with neutron-dhcp, which must therefore be disabled.
  • Cloud-init requires an instance's IP to be assigned by neutron, and without neutron-dhcp, this requires file injection to set the IP statically.


Future plans include:

  • Improve performance/scalability of PXE deployment process
  • Better support for complex non-SDN environments (e.g., static VLANs)
  • Better integration with neutron-dhcp
  • Support snapshot and migrate of baremetal instances
  • Support non-PXE image deployment
  • Support other architectures (arm, tilepro)
  • Support fault-tolerance of baremetal nova-compute node

Key Differences

There are several key differences between the baremetal driver and other hypervisor drivers (kvm, xen, etc).

  • There is no hypervisor running underneath the baremetal instances, so the tenant has full and direct access to the hardware, and that hardware is dedicated to a single instance.
  • Nova does not have any access to manipulate a baremetal instance beyond what is provided at the hardware level and exposed over the network, such as IPMI control. Therefore, some functionality implemented by other hypervisor drivers is not available via the baremetal driver, such as instance snapshots or attaching and detaching network volumes on a running instance.
  • It is also important to note that tenants' direct access to the network creates additional security concerns (e.g., MAC spoofing, packet sniffing).
    • Other hypervisors mitigate this with virtualized networking.
    • Neutron + OpenFlow can be used to much the same effect, if your network hardware supports it.
  • Public cloud images may not work on some hardware, particularly if your hardware requires additional drivers to be loaded.
  • The PXE driver requires a specialized ramdisk (and a corresponding kernel) for deployment, which is distinct from the cloud image's ramdisk. This can be built via the diskimage-builder project. The Glance UUIDs for these two images should be added to the extra_specs for any flavor (instance_type) that will be deployed onto a bare metal compute host. Alternatively, these UUIDs can also be added to the bare metal compute host's nova.conf file.
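If you choose the nova.conf route, the fragment might look like the following (the deploy_kernel and deploy_ramdisk option names are an assumption here; verify them against the [baremetal] options in your release before relying on this):

```
[baremetal]
# Glance UUIDs of the deploy kernel and ramdisk, used when the flavor
# does not supply baremetal:deploy_kernel_id / deploy_ramdisk_id
deploy_kernel = <DEPLOY_KERNEL_GLANCE_UUID>
deploy_ramdisk = <DEPLOY_RAMDISK_GLANCE_UUID>
```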

Use-cases

Here are a few ideas we have about potential use-cases for the baremetal driver. This isn't an exhaustive list -- there are doubtless many more interesting things which it can do!

  • High-performance computing clusters.
  • Computing tasks that require access to hardware devices which can't be virtualized.
  • Database hosting (some databases run poorly in a hypervisor).
  • Or, rapidly deploying a cloud infrastructure ....

We (the TripleO team) have a vision that OpenStack can be used to deploy OpenStack at a massive scale. We think the story of getting "from here to there" goes like this:

  • First, do simple hardware provisioning with a base image that contains configuration-management software (chef/puppet/salt/etc). The CMS checks in with a central server to determine what packages to install, then installs and configures your applications. All this happens automatically after first-boot of any baremetal node.
  • Then, accelerate provisioning by pre-installing your application software into the cloud image, but let a CMS still do all configuration.
  • Pre-install KVM and nova-compute into an image, and scale out your compute cluster by using the baremetal driver to deploy nova-compute images. Do the same for Swift, proxy nodes, software load balancers, and so on.
  • Use Heat to orchestrate the deployment of an entire cloud.
  • Finally, run a mixture of baremetal nova-compute and KVM nova-compute in the same cloud (shared keystone and glance, but different tenants). Continuously deploy the cloud from the cloud using a common API.


The Baremetal Deployment Process

This section is a stub and needs to be expanded.


Differences in Starting a Baremetal Cloud

This section aims to cover the technical aspects of creating a bare metal cloud without duplicating the information required in general to create an OpenStack cloud. It assumes you already have all the other services -- MySQL, Rabbit, Keystone, Glance, etc. -- up and running, and then covers:

  • Nova configuration changes
  • Additional package requirements
  • Extra services that need to be started
  • Images, Instance types, and metadata that need to be created and defined
  • Enrolling your hardware

Configuration Changes

The following nova configuration options should be set on the compute host, in addition to any others that your environment requires.

[DEFAULT]
scheduler_host_manager = nova.scheduler.baremetal_host_manager.BaremetalHostManager
firewall_driver = nova.virt.firewall.NoopFirewallDriver
compute_driver = nova.virt.baremetal.driver.BareMetalDriver
ram_allocation_ratio = 1.0
reserved_host_memory_mb = 0

[baremetal]
net_config_template = $pybasedir/nova/virt/baremetal/net-static.ubuntu.template
tftp_root = /tftpboot
power_manager = nova.virt.baremetal.ipmi.IPMI
driver = nova.virt.baremetal.pxe.PXE
instance_type_extra_specs = cpu_arch:{i386|x86_64}
sql_connection = mysql://{user}:{pass}@{host}/nova_bm

A few notes here:

  • the cpu_arch here is literally "{i386|x86_64}": the scheduler treats it as an opaque string that must exactly match the flavor's extra_specs, so you will need to use this same value again below. Don't try to pick just one!
  • the net_config_template here sets a static config, which is a good place to start; you could also use net-dhcp.ubuntu.template for a DHCP config.

Additional Packages

If using the default baremetal driver (PXE) and default power driver (IPMI), then the baremetal compute host(s) must have the following packages installed to enable image deployment and power management.

 dnsmasq ipmitool open-iscsi syslinux

Additionally, to support PXE image deployments, the following steps should be taken:

 sudo mkdir -p /tftpboot/pxelinux.cfg
 sudo cp /usr/lib/syslinux/pxelinux.0 /tftpboot/
 sudo chown -R $NOVA_USER /tftpboot
 
 sudo mkdir -p $NOVA_DIR/baremetal/dnsmasq
 sudo mkdir -p $NOVA_DIR/baremetal/console
 sudo chown -R $NOVA_USER $NOVA_DIR/baremetal

Services

At a minimum, Keystone, Nova, Glance, and Neutron must be up and running. The following additional services are currently required for baremetal deployment, and should be started on the nova compute host.

  • nova-baremetal-deploy-helper. This service assists with image deployment. It reads all necessary options from nova.conf.
  • dnsmasq. Currently, this must run on the nova compute host. The baremetal PXE driver interacts directly with the dnsmasq configuration file and modifies the TFTP boot files that dnsmasq serves.

Start dnsmasq as follows:

 # Stop any existing dnsmasq service
 sudo service dnsmasq stop; sudo pkill dnsmasq
 
 # Start dnsmasq for baremetal deployments. Change IFACE and RANGE as needed.
 # Note that RANGE must not overlap with the instance IPs assigned by Nova or Neutron.
 sudo dnsmasq --conf-file= --port=0 --enable-tftp --tftp-root=/tftpboot \
   --dhcp-boot=pxelinux.0 --bind-interfaces --pid-file=/var/run/dnsmasq.pid \
   --interface=$IFACE --dhcp-range=$RANGE

NOTE: This dnsmasq process must be the only process on the network answering DHCP requests from the MAC addresses of the enrolled bare metal nodes. If another DHCP server answers the PXE boot, deployment is likely to fail. This means that you must disable neutron-dhcp. Work on this limitation is planned for the Havana cycle.


A separate database schema must be created for the baremetal driver to store information about the enrolled hardware. Create it first:

 mysql> CREATE DATABASE nova_bm;
 mysql> GRANT ALL ON nova_bm.* TO 'nova_user'@'some_host' IDENTIFIED BY '$password';

Then initialize the database with:

 nova-baremetal-manage db sync

Image Requirements

The diskimage-builder project provides a toolchain for customizing and building both run-time images and the deployment images used by the PXE driver. Customization may be necessary if, for example, your hardware requires drivers that are not enabled or included in the default images.

Diskimage-builder requires the following packages be installed:

 python-lxml python-libvirt libvirt-bin qemu-system

To build images, clone the project and run the following:

 git clone https://github.com/openstack/diskimage-builder.git
 cd diskimage-builder
 
 # build the image your users will run
 bin/disk-image-create -u base -o my-image
 # and extract the kernel & ramdisk
 bin/disk-image-get-kernel -d ./ -o my -i $(pwd)/my-image.qcow2
 
 # build the deploy image (change -a to x86_64 if that matches your hardware)
 bin/ramdisk-image-create deploy -a i386 -o my-deploy-ramdisk

Load all of these images into Glance, noting the image UUID for each one as it is generated. These UUIDs are needed for associating the images with each other, and with the special baremetal flavor.

 glance image-create --name my-vmlinuz --public --disk-format aki  < my-vmlinuz
 glance image-create --name my-initrd --public --disk-format ari  < my-initrd
 glance image-create --name my-image --public --disk-format qcow2 --container-format bare \
     --property kernel_id=$MY_VMLINUZ_UUID --property ramdisk_id=$MY_INITRD_UUID < my-image
 
 glance image-create --name deploy-vmlinuz --public --disk-format aki < vmlinuz-$KERNEL
 glance image-create --name deploy-initrd --public --disk-format ari < my-deploy-ramdisk.initramfs

You will also need to create a special baremetal flavor in Nova, and associate both the deploy kernel and ramdisk with that flavor via the "baremetal" namespace.

 # pick a unique number
 FLAVOR_ID=123
 # change these to match your hardware
 RAM=1024
 CPU=2
 DISK=100
 nova flavor-create my-baremetal-flavor $FLAVOR_ID $RAM $DISK $CPU
 
 # associate the deploy images with this flavor
 # cpu_arch must match nova.conf, and of course, also must match your hardware
 nova flavor-key my-baremetal-flavor set \
   cpu_arch={i386|x86_64} \
   "baremetal:deploy_kernel_id"=$DEPLOY_VMLINUZ_UUID \
   "baremetal:deploy_ramdisk_id"=$DEPLOY_INITRD_UUID

Hardware Enrollment

The last step is to enroll your physical hardware with the baremetal cloud. To do this, give the baremetal driver some general information (# CPUs, RAM, and disk size) and specify every MAC address which might send a PXE/DHCP request. If you are using the IPMI power driver, you must also supply the IP address, username, and password for each node's IPMI interface. This can all be done via a Nova API admin extension. You must also tell the baremetal driver which Nova compute host should control the bare metal node.

 # create a "node" for each machine
 # extract the "id" from the result and use that in the next step
 nova baremetal-node-create --pm_address=... --pm_user=... --pm_password=... \
   $COMPUTE_HOST_NAME $CPU $RAM $DISK $FIRST_MAC
 
 # for each NIC on the node, including $FIRST_MAC, also create an interface
 nova baremetal-interface-add $ID $MAC
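Enrolling many machines one command at a time is tedious, so the two commands above can be scripted. The sketch below is a dry run that reads a simple inventory from stdin and only prints the commands it would issue; the inventory format and the bm-compute-01 host name are invented for illustration, and in a real run the $ID placeholder would come from the output of baremetal-node-create.

```shell
#!/bin/sh
# Dry-run enrollment sketch. Inventory format (one node per line, hypothetical):
#   pm_address pm_user pm_password cpus ram_mb disk_gb mac1[,mac2,...]
COMPUTE_HOST="${COMPUTE_HOST:-bm-compute-01}"

enroll_dry_run() {
  while read -r pm_addr pm_user pm_pass cpus ram disk macs; do
    first_mac="${macs%%,*}"
    # print, rather than run, the node-create command
    echo "nova baremetal-node-create --pm_address=$pm_addr" \
         "--pm_user=$pm_user --pm_password=$pm_pass" \
         "$COMPUTE_HOST $cpus $ram $disk $first_mac"
    # every NIC, including the first, gets a baremetal-interface-add;
    # $ID is left as a placeholder for the node id from the step above
    for mac in $(printf '%s' "$macs" | tr ',' ' '); do
      echo "nova baremetal-interface-add \$ID $mac"
    done
  done
}

# example: one node with two NICs
printf '192.168.7.4 ADMIN secret 2 1024 100 52:54:00:12:34:56,52:54:00:12:34:57\n' \
  | enroll_dry_run
```

Replacing each echo with the real command invocation (and capturing the node id from the first one) turns this into an actual enrollment loop.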

Once the hardware is enrolled in the baremetal driver, the Nova compute process will report the availability of the new compute resource to the Nova scheduler during its next periodic update, which by default occurs once a minute. After that, you will be able to provision the hardware with a command such as the following:

 nova boot --flavor my-baremetal-flavor --image my-image my-baremetal-node

Bugs

Bugs should be tagged with the keyword "baremetal" within the Nova project in Launchpad. To see the list of known baremetal bugs, go to https://bugs.launchpad.net/nova/+bugs?field.tag=baremetal+

When reporting bugs, please include any relevant information about your hardware and network environment (sanitize IPs and MAC addresses as necessary), and any relevant snippets from the nova-compute, nova-scheduler, and nova-baremetal-deploy-helper log files. Please also include the database records for the Nova instance, the compute node, and the baremetal node, as well as the TFTP configuration file. Below is a simple script to extract that information from the "nova" and "nova_bm" schemas, as well as from the filesystem on the nova-compute host.

 cat > get_baremetal_crash_info.sh <<'EOF'
 #!/bin/bash
 
 id=$1
 node=$(mysql nova -NBre "select node from instances where uuid='$id'")
 conf=$(mysql nova_bm -NBre "select pxe_config_path from bm_nodes where instance_uuid='$id'")
 
 echo "=========== COMPUTE NODE ===========" 
 mysql nova -e "select hypervisor_hostname, created_at, updated_at, deleted_at, vcpus, memory_mb, local_gb, vcpus_used, memory_mb_used, local_gb_used, hypervisor_type, cpu_info, free_ram_mb, free_disk_gb, running_vms from compute_nodes where hypervisor_hostname='$node'\G"
 echo
 echo "=========== COMPUTE INSTANCE ==========="
 mysql nova -e "select node, created_at, updated_at, deleted_at, image_ref, kernel_id, ramdisk_id, scheduled_at, launched_at, updated_at, launched_on, vm_state, power_state, task_state, memory_mb, vcpus, root_gb, ephemeral_gb from instances where uuid='$id'\G"
 echo
 echo "=========== BAREMETAL NODE ==========="
 mysql nova_bm -e "select uuid, created_at, updated_at, deleted_at, cpus, memory_mb, local_gb, root_mb, swap_mb, service_host, instance_uuid, instance_name, task_state, pxe_config_path from bm_nodes where instance_uuid='$id'\G"
 echo
 echo "=========== TFTP CONFIG ==========="
 [ -n "$conf" ] && cat "$conf"
 EOF
 
 chmod +x get_baremetal_crash_info.sh
 ./get_baremetal_crash_info.sh <your-instance-uuid-here>

Community

NOTE: Information regarding the work done in Folsom by USC/ISI and NTT-Docomo has been moved here.

Troubleshooting

1. If you're using nova to provision real hardware as a baremetal node, double check the baremetal node's information on the nova compute host.

- The PM address, PM username and PM password should match whatever is configured in your real hardware's BIOS/IPMI configuration.
- The PM address should be reachable from the nova compute host (in the example below, ping 192.168.7.4 generates a response).
$ nova baremetal-node-list
+----+------+------+-----------+---------+-------------------+-------------+-------------+-------------+---------------+
| ID | Host | CPUs | Memory_MB | Disk_GB | MAC Address       | PM Address  | PM Username | PM Password | Terminal Port |
+----+------+------+-----------+---------+-------------------+-------------+-------------+-------------+---------------+
| 5  | os2  | 1    | 1024      | 20      | 00:XX:XX:XX:XX:86 | 192.168.7.4 | ADMIN       |             | None          |
+----+------+------+-----------+---------+-------------------+-------------+-------------+-------------+---------------+
$

2. Do you have enough resources in the baremetal node to provision the new instance?

After "nova baremetal-node-list" shows your inventory, but before you run the "nova boot" command, look for entries in nova-compute.log such as:

2013-10-18 18:17:09,225.225 4466 INFO nova.compute.manager [-] Updating bandwidth usage cache
2013-10-18 18:22:52,753.753 4466 AUDIT nova.compute.resource_tracker [-] Auditing locally available compute resources
2013-10-18 18:22:52,787.787 4466 AUDIT nova.compute.resource_tracker [-] Free ram (MB): 2048
2013-10-18 18:22:52,787.787 4466 AUDIT nova.compute.resource_tracker [-] Free disk (GB): 10
2013-10-18 18:22:52,787.787 4466 AUDIT nova.compute.resource_tracker [-] Free VCPUS: 1

If Free VCPUS, Free ram, or Free disk shows 0, then nova boot will fail because of insufficient resources.

When you execute the "nova boot ..." command, the following actions take place:

3. Image/kernel/ramdisk/deploy-kernel/deploy-ramdisk copying

  • the deploy kernel/ramdisk and image kernel/ramdisk are copied from glance to /tftpboot/<UUID-of-baremetal-server>
  • the qcow2 image from glance is copied to /var/lib/nova/instances/instance-<ID>/disk.part
  • qemu-img is used to convert the qcow2 image to raw, saved as /var/lib/nova/instances/instance-<ID>/disk.converted

Look for messages in nova-compute.log like the following (in particular, look for "Result was 0"; a non-zero result usually indicates an error):

2013-10-23 12:08:01.542 DEBUG nova.virt.baremetal.pxe Fetching kernel and ramdisk for instance instance-00000067 _cache_tftp_images /usr/lib/python2.7/dist-packages/nova/virt/baremetal/pxe.py:257
2013-10-23 12:08:28.888 DEBUG nova.utils Running cmd (subprocess): env LC_ALL=C LANG=C qemu-img info /tftpboot/be38553e-2087-478b-b9ce-273e8183e2a6/kernel.part execute /usr/lib/python2.7/dist-packages/nova/utils.py:208
2013-10-23 12:08:28.900 DEBUG nova.utils Result was 0 execute /usr/lib/python2.7/dist-packages/nova/utils.py:232
2013-10-23 12:08:39.388 DEBUG nova.virt.baremetal.pxe Fetching image 115a0cb9-7e6e-48cd-9ec3-4a541153c3ed for instance instance-00000067 _cache_image /usr/lib/python2.7/dist-packages/nova/virt/baremetal/pxe.py:289

All this while, "nova list" shows the instance status as "BUILD".

4. If all five files are successfully copied and placed in the right locations, the BM node is power cycled using IPMI.

Look for entries in nova-compute.log like:

2013-10-18 11:48:16.710 16781 DEBUG nova.utils [-] Running cmd (subprocess): ipmitool -I lanplus -H 192.168.7.4 -U ADMIN -f /tmp/tmpLjEgmQ power status execute /usr/lib/python2.7/dist-packages/nova/utils.py:208
2013-10-18 11:48:16.809 16781 DEBUG nova.utils [-] Result was 0 execute /usr/lib/python2.7/dist-packages/nova/utils.py:232
2013-10-18 11:48:16.810 16781 DEBUG nova.virt.baremetal.ipmi [-] ipmitool stdout: 'Chassis Power is on
', stderr:  _exec_ipmitool /usr/lib/python2.7/dist-packages/nova/virt/baremetal/ipmi.py:135
2013-10-18 11:48:16.810 16781 DEBUG nova.utils [-] Running cmd (subprocess): ipmitool -I lanplus -H 192.168.7.4 -U ADMIN -f /tmp/tmp2pFcxW power off execute /usr/lib/python2.7/dist-packages/nova/utils.py:208
2013-10-18 11:48:16.826 16781 DEBUG nova.utils [-] Result was 0 execute /usr/lib/python2.7/dist-packages/nova/utils.py:232
2013-10-18 11:48:16.827 16781 DEBUG nova.virt.baremetal.ipmi [-] ipmitool stdout: 'Chassis Power Control: Down/Off
', stderr:  _exec_ipmitool /usr/lib/python2.7/dist-packages/nova/virt/baremetal/ipmi.py:135
2013-10-18 11:48:17.827 16781 DEBUG nova.utils [-] Running cmd (subprocess): ipmitool -I lanplus -H 192.168.7.4 -U ADMIN -f /tmp/tmpyuES1P power status execute /usr/lib/python2.7/dist-packages/nova/utils.py:208
2013-10-18 11:48:17.844 16781 DEBUG nova.utils [-] Result was 0 execute /usr/lib/python2.7/dist-packages/nova/utils.py:232
2013-10-18 11:48:17.844 16781 DEBUG nova.virt.baremetal.ipmi [-] ipmitool stdout: 'Chassis Power is on
2013-10-18 11:48:32.176 16781 DEBUG nova.utils [-] Running cmd (subprocess): ipmitool -I lanplus -H 192.168.7.4 -U ADMIN -f /tmp/tmp1mDmnB power status execute /usr/lib/python2.7/dist-packages/nova/utils.py:208
2013-10-18 11:48:32.194 16781 DEBUG nova.utils [-] Result was 0 execute /usr/lib/python2.7/dist-packages/nova/utils.py:232
2013-10-18 11:48:32.195 16781 DEBUG nova.virt.baremetal.ipmi [-] ipmitool stdout: 'Chassis Power is off


5. Once the BM node is powered back on (look for IPMI logs that keep checking the status to see if it is back on), the BM node gets a new IP address (not the 192.168.7.4 address used for IPMI) via DHCP. PXE boot kicks in and the deploy-kernel and deploy-ramdisk are used to boot the BM node. The BM node comes up with iscsid listening on port 3260, and its local storage device is exposed as an iSCSI target.

Look in nova-baremetal-deploy-helper.log for entries like:

2013-10-18 14:09:12.023 DEBUG nova.utils Running cmd (subprocess): sudo nova-rootwrap /etc/nova/rootwrap.conf iscsiadm -m discovery -t st -p 192.168.7.3:3260 execute /usr/lib/python2.7/dist-packages/nova/utils.py:208
2013-10-18 14:09:12.104 DEBUG nova.utils Result was 0 execute /usr/lib/python2.7/dist-packages/nova/utils.py:232
2013-10-18 14:09:12.104 DEBUG nova.utils Running cmd (subprocess): sudo nova-rootwrap /etc/nova/rootwrap.conf iscsiadm -m node -p 192.168.7.3:3260 -T iqn-eddbcb64-698b-4f7b-98a9-d585988c8e9e --login execute /usr/lib/python2.7/dist-packages/nova/utils.py:208
2013-10-18 14:09:12.656 DEBUG nova.utils Result was 0 execute /usr/lib/python2.7/dist-packages/nova/utils.py:232
2013-10-18 14:09:15.656 DEBUG nova.utils Running cmd (subprocess): sudo nova-rootwrap /etc/nova/rootwrap.conf fdisk /dev/disk/by-path/ip-192.168.7.3:3260-iscsi-iqn-eddbcb64-698b-4f7b-98a9-d585988c8e9e-lun-1 execute /usr/lib/python2.7/dist-packages/nova/utils.py:208
2013-10-18 14:09:15.803 DEBUG nova.utils Result was 0 execute /usr/lib/python2.7/dist-packages/nova/utils.py:232


If you see a non-zero Result here, check whether the file /etc/nova/rootwrap.d/baremetal-deploy-helper.filters is missing. If it is, you can get it here - https://github.com/openstack/nova/blob/master/etc/nova/rootwrap.d/baremetal-deploy-helper.filters

6. If there are no errors during the iSCSI phase, nova-baremetal-deploy-helper.log will show an entry "deployment to node <ID> done"


Running tail -f /var/log/upstart/nova-compute.log /var/log/upstart/nova-baremetal-deploy-helper.log can be very helpful. You may hit bug https://bugs.launchpad.net/nova/+bug/1177596; workaround instructions are in the bug report. If things error in the scheduler (with no nova-compute errors), be sure that:

  1. The architecture matches in nova.conf and the flavor.
  2. The hostname for the nova-compute process matches the baremetal node registration.
  3. nova-compute is running and has reported its resources to the scheduler.