= Baremetal Operations Specification =

This page outlines the work planned for the Havana cycle to add support for commonly requested operational tasks to Nova's baremetal driver.

== Overview ==
We (the TripleO team) have been discussing the baremetal driver's current functionality with several ops teams interested in using it for fleet management. Broadly speaking, their requirements are:
 * auto-discovery of hardware
 * initial hardware configuration (BIOS & RAID settings, etc)
 * hardware burn-in
 *  We need to account for failures during burn-in, and decide how to handle rediscovery of failed hardware once it has been repaired. --nobodycam (talk) 22:27, 6 March 2013 (UTC)
 * firmware updates
 * console access to problematic hardware

== The Plan ==
By using diskimage-builder to create task-specific ramdisks, I believe all of these requirements can be addressed with only small changes outside the baremetal driver:
 * Scheduler awareness of nodes and the ability to perform an action on a specific node, eg.
 * API call(s) to find a node by instance name or UUID, and to find an instance by node UUID.
 * A way to set / update the designated rescue image for a given instance.
 * Alternatively, this may be done via instance_type extra_specs, if we decide to set the rescue image per flavor instead of per instance.
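
To make the node and instance lookups above more concrete, here is a minimal in-memory sketch. Everything in it is an assumption for illustration only: the `Node` record, the `NodeIndex` class and its method names are hypothetical and do not correspond to any existing Nova API or database table.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Node:
    """Hypothetical record for one baremetal node."""
    node_uuid: str
    instance_uuid: Optional[str] = None
    instance_name: Optional[str] = None


class NodeIndex:
    """In-memory stand-in for the two lookups described above."""

    def __init__(self) -> None:
        self._by_node: Dict[str, Node] = {}

    def add(self, node: Node) -> None:
        self._by_node[node.node_uuid] = node

    def find_node(self, instance_ref: str) -> Optional[Node]:
        """Return the node hosting the instance with this name or UUID."""
        for node in self._by_node.values():
            if instance_ref in (node.instance_uuid, node.instance_name):
                return node
        return None

    def find_instance(self, node_uuid: str) -> Optional[str]:
        """Return the UUID of the instance on this node, if any."""
        node = self._by_node.get(node_uuid)
        return node.instance_uuid if node else None
```

A real implementation would presumably query the baremetal database rather than scan a dictionary; the sketch only shows the shape of the two lookups.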

There will also be some changes inside the baremetal driver.
 * Add new config options for a discovery kernel and ramdisk, which will be served in response to a DHCP request from unregistered MAC addresses.
 *  We may want an option to enroll a MAC address in a known-but-undiscovered state, so that not every unknown MAC is treated as a resource to be discovered. --nobodycam (talk) 22:12, 6 March 2013 (UTC)
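
The decision of which images to serve to a booting MAC address could look roughly like the sketch below. It is not the driver's implementation: `MacState`, `select_boot_images`, the `discover_unknown` switch and the two image names are all hypothetical, chosen only to show how the discovery options and the known-but-undiscovered state suggested above might interact.

```python
from enum import Enum
from typing import Dict, Optional, Tuple


class MacState(Enum):
    REGISTERED = "registered"  # belongs to a node already in the database
    PENDING = "pending"        # known to the operator, not yet discovered


# Hypothetical values for the proposed discovery kernel/ramdisk options.
DISCOVERY_KERNEL = "discovery-kernel"
DISCOVERY_RAMDISK = "discovery-ramdisk"


def select_boot_images(
    mac: str,
    known_macs: Dict[str, MacState],
    discover_unknown: bool = True,
) -> Optional[Tuple[str, str]]:
    """Return the (kernel, ramdisk) pair to serve to `mac`, or None.

    Keys of `known_macs` are assumed to be lower-case MAC addresses.
    Registered MACs return None because their boot configuration is
    assumed to come from the normal deploy path, not from discovery.
    """
    state = known_macs.get(mac.lower())
    if state is MacState.REGISTERED:
        return None
    # Pending MACs are always discovered; completely unknown MACs are
    # discovered only when the operator has allowed it.
    if state is MacState.PENDING or discover_unknown:
        return (DISCOVERY_KERNEL, DISCOVERY_RAMDISK)
    return None
```

With `discover_unknown=False`, only MAC addresses that were enrolled ahead of time receive the discovery ramdisk, which is the behaviour the comment above asks for.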


 * A means for the discovery ramdisk to post information back to the baremetal database, eg. via an HTTP POST to, or a call to the Nova baremetal API extension, or something similar. Additionally, this requires:
 * a logical separation between discovered hardware and enrolled hardware (this may just be a flag on the  table),
 * a config option to determine whether discovered hardware is automatically enrolled,
 * and an API call to toggle the 'enrolled' flag.
 * Add support to  and   for applying a rescue image to a running instance. This should:
 * update the PXE config for that node to refer to the rescue image,
 * reboot the node into the rescue image,
 * reset the PXE config to its previous state,
 * and finally set the nova state back to RUNNING when the rescue operation is complete.
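
The rescue steps above can be sketched as a single function. This is an illustration under stated assumptions, not the driver's code: `rescue_node`, the dictionaries standing in for the PXE configuration and the nova state, the intermediate `"RESCUING"` state name, and the `reboot` and `wait_for_rescue` callbacks are all hypothetical. In particular, `reboot` is assumed to return only after the node has fetched its PXE config, so that resetting the config immediately afterwards is safe.

```python
from typing import Callable, Dict


def rescue_node(
    node_uuid: str,
    rescue_image: str,
    pxe_configs: Dict[str, str],
    states: Dict[str, str],
    reboot: Callable[[str], None],
    wait_for_rescue: Callable[[str], None],
) -> None:
    """Boot a node into a rescue image, then restore its normal config."""
    previous = pxe_configs[node_uuid]
    states[node_uuid] = "RESCUING"  # hypothetical intermediate state name

    # 1. Point the node's PXE config at the rescue image.
    pxe_configs[node_uuid] = rescue_image
    try:
        # 2. Reboot the node so that it picks up the rescue image.
        reboot(node_uuid)
    finally:
        # 3. Restore the previous PXE config, even if the reboot failed,
        #    so that later boots use the normal image again.
        pxe_configs[node_uuid] = previous

    # 4. Once the rescue operation is complete, mark the node RUNNING.
    wait_for_rescue(node_uuid)
    states[node_uuid] = "RUNNING"
```

If `reboot` raises, the exception propagates and the state is left as `"RESCUING"`; how such a failure should be reported is outside the scope of this sketch.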