Heterogeneous instance types

Note: This has been superseded by ScheduleHeterogeneousInstances

Launchpad Entry: NovaSpec:heterogeneous-instance-types
Creator: Brian Schott
Current maintainer: Lorin Hochstein
Contributors: USC Information Sciences Institute

Summary

Nova should support CPU architectures, accelerator architectures, and network interfaces as part of the definition of an instance type (or flavor, in Rackspace API parlance). The target release is Diablo; however, the USC-ISI team intends to have a stable test branch and deployment by the Cactus release.

The USC-ISI team has a functional prototype here:

https://code.launchpad.net/~usc-isi/nova/hpc-trunk (usually in sync with nova/trunk)
https://code.launchpad.net/~usc-isi/nova/hpc-testing (a little older, but more stable)

The architecture-aware scheduler is blueprinted here:

HeterogeneousArchitectureScheduler

We are also drafting blueprints for three machine types:

An etherpad for discussion of this blueprint is available at http://etherpad.openstack.org/heterogeneousinstancetypes

Release Note

Nova has been extended to allow deployments to advertise and users to request specific processor, accelerator, and network interface options using instance_types (or flavors).

The nova-manage instance_types command supports additional fields:

cpu_arch - processor architecture. Ex: "x86_64", "i386", "P7", etc. (default "x86_64")
cpu_info - JSON-formatted extended processor information
xpu_arch - accelerator architecture. Ex: "fermi" (default "")
xpu_info - JSON-formatted extended accelerator information
xpus - number of accelerators or accelerator processors
net_arch - primary network interface. Ex: "ethernet", "infiniband", "myrinet"
net_info - JSON-formatted extended network information
net_mbps - allocated network bandwidth (megabits per second)

Amazon GPU Node Example:

22 GB of memory
33.5 EC2 Compute Units (2 x Intel Xeon X5570, quad-core “Nehalem” architecture)
2 x NVIDIA Tesla “Fermi” M2050 GPUs
1690 GB of instance storage
64-bit platform
I/O Performance: Very High (10 Gigabit Ethernet)
API name: cg1.4xlarge

cg1.4xlarge:
 * memory_mb = 22000
 * vcpus = 8
 * local_gb = 1690
 * cpu_arch = "x86_64"
 * cpu_info = '{"model":"Nehalem", "features":["rdtscp", "xtpr"]}'
 * xpu_arch = "fermi"
 * xpus = 2
 * xpu_info = '{"model":"Tesla 2050", "gcores":"448"}'
 * net_arch = "ethernet"
 * net_info = '{"encap":"Ethernet", "MTU":"8000"}'
 * net_mbps = 10000
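
As an illustration only, the accelerator and network part of the cg1.4xlarge definition above could be held in a plain Python dictionary, with the *_info strings decoded from JSON when needed. The dictionary layout and the helper name decode_info are assumptions made for this sketch; they are not part of nova.

```python
import json

# Values taken from the cg1.4xlarge example above (accelerator and
# network fields only). The dict layout is an assumption for this sketch.
CG1_4XLARGE = {
    "memory_mb": 22000,
    "vcpus": 8,
    "local_gb": 1690,
    "cpu_arch": "x86_64",
    "xpu_arch": "fermi",
    "xpus": 2,
    "xpu_info": '{"model":"Tesla 2050", "gcores":"448"}',
    "net_arch": "ethernet",
    "net_info": '{"encap":"Ethernet", "MTU":"8000"}',
    "net_mbps": 10000,
}


def decode_info(instance_type, field):
    """Return the JSON-decoded *_info field, or {} if it is missing or empty."""
    raw = instance_type.get(field, "")
    return json.loads(raw) if raw else {}


xpu_details = decode_info(CG1_4XLARGE, "xpu_info")
net_details = decode_info(CG1_4XLARGE, "net_info")
```

Note that values inside the JSON strings stay strings ("448", "8000") exactly as written in the example; any numeric interpretation would be up to the consumer.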

Rationale

Currently AWS supports two different CPU architecture types, "i386" and "x86_64". In addition, AWS describes many other instance type attributes by reference, such as: I/O Performance: (Moderate/High/Very High 10Gigabit Ethernet), extended CPU information (Intel Xeon X5570, quad-core “Nehalem” architecture), and now GPU accelerators (2 x NVIDIA Tesla “Fermi” M2050 GPUs). In order to implement similar functionality in nova, we need to capture this in a way that is accessible to advanced schedulers.

There are several related blueprints:

User stories

Mary manages a cloud datacenter. In addition to her x86 blades, she wants to advertise her power7 high-performance computing cloud with 40 Gbit QDR InfiniBand support to customers. Mary uses nova-manage instance_types create to define "p7.sippycup", "p7.tall", "p7.grande", and "p7.venti" with cpu_arch="power7" and increasing amounts of default memory, storage, cores, and reserved bandwidth. Mary also has a small number of GPU-accelerated systems, so she defines "p7f.grande" and "p7f.venti" options with xpu_arch="fermi", using xpus = 1 for grande and xpus = 2 for venti.

Fred wants to run an 8-core machine with one Fermi-based GPU accelerator. He reads the text descriptions on Mary's web site and decides on the p7f.grande virtual machine. He runs:

euca-run-instances -t p7f.grande -k fred-keypair emi-12345678

Assumptions

This assumes that someone has ported OpenStack to different processor architecture systems and that accelerators such as GPUs can be passed through to the virtual instance. The USC-ISI team is working on this. We have linked in related blueprints, but the goal is that this top-level cpu architecture awareness stands alone.

Design

We propose to add cpu_arch, cpu_info, xpu_arch, xpu_info, xpus, net_arch, net_info, and net_mbps as attributes to the instance_types, instances, and compute_nodes tables. Conceptually, this information is treated the same way as the existing memory_mb, local_gb, and vcpus fields. The values exist in "instance_types" and are copied into columns of the "instances" table as instances are created.

The architecture aware scheduler will compare these additional fields when selecting target compute_nodes (nova-compute services).

cpu_arch, xpu_arch, and net_arch are intended to be high-level label switches for fast row filtering (like "i386", "fermi", or "infiniband").
xpus and net_mbps are treated as quantity fields, exactly as schedulers use vcpus.
cpu_info, xpu_info, and net_info follow the instance_migration branch example, using a JSON-formatted string to capture arbitrary configurations.
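
The three kinds of fields above suggest three kinds of comparison. The following is a minimal sketch of such a filter, not the scheduler implementation: it assumes instances and compute nodes are plain dictionaries, that an empty label means "no preference", and that a requested *_info key must be present with an equal value on the node. The function names host_matches and filter_nodes are hypothetical.

```python
import json

LABEL_FIELDS = ("cpu_arch", "xpu_arch", "net_arch")
QUANTITY_FIELDS = ("xpus", "net_mbps")
INFO_FIELDS = ("cpu_info", "xpu_info", "net_info")


def _decode(raw):
    """Decode a JSON-formatted info string; empty strings become {}."""
    return json.loads(raw) if raw else {}


def host_matches(instance, node):
    """Return True if the node can satisfy the instance's requested fields."""
    # Labels: exact match, but only when the instance asks for one.
    for field in LABEL_FIELDS:
        wanted = instance.get(field, "")
        if wanted and wanted != node.get(field, ""):
            return False
    # Quantities: the node must offer at least what is requested.
    for field in QUANTITY_FIELDS:
        if instance.get(field, 0) > node.get(field, 0):
            return False
    # Info: every requested key must appear on the node with the same value
    # (an assumed matching rule for this sketch).
    for field in INFO_FIELDS:
        offered = _decode(node.get(field, ""))
        for key, value in _decode(instance.get(field, "")).items():
            if offered.get(key) != value:
                return False
    return True


def filter_nodes(instance, nodes):
    """Keep only the compute nodes that match the instance request."""
    return [node for node in nodes if host_matches(instance, node)]


nodes = [
    {"host": "x86-a", "cpu_arch": "x86_64", "xpu_arch": "", "xpus": 0},
    {"host": "gpu-a", "cpu_arch": "x86_64", "xpu_arch": "fermi", "xpus": 2,
     "xpu_info": '{"model":"Tesla 2050"}'},
    {"host": "gpu-b", "cpu_arch": "x86_64", "xpu_arch": "fermi", "xpus": 1},
]
request = {"cpu_arch": "x86_64", "xpu_arch": "fermi", "xpus": 2}
matching_hosts = [node["host"] for node in filter_nodes(request, nodes)]
```

Checking the cheap label and quantity fields before decoding any JSON keeps the common case fast, which is the stated purpose of the label switches.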

The context for these new fields changes slightly according to what table they are in.

In instance_types table, they represent advertised capabilities for the machine type, such as "this instance type provides 100 megabit bandwidth" or "this instance type supports Cortex-A9 processors".
In the instances table, they represent requested capabilities, such as "give me an instance with xpu_arch=fermi and xpus=2".
In the compute_nodes table, the fields represent the available resources of the host associated with the compute_nodes.

The processor architecture field cpu_arch is a no-brainer; many deployments will want it today. The cpu_info field is intended for many-core processors such as the Tilera systems in our project. We need to specify things like cpu_info = '{"geometry":"4x4"}' on an instance type to be able to spatially tile multiple virtual machines on the 8x8 tilemp processor. It's easiest to define what "tile.small", "tile.medium", and "tile.large" mean within instance types.
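
To make the tiling idea concrete, here is a small sketch of first-fit placement of rectangular instances on a grid of cores. It assumes cpu_info carries a "geometry" key of the form "ROWSxCOLS" and that the host is modelled as a list of lists of booleans; parse_geometry and place are hypothetical helpers, not nova code.

```python
import json


def parse_geometry(cpu_info):
    """Extract (rows, cols) from a cpu_info string such as '{"geometry":"4x4"}'."""
    rows, cols = json.loads(cpu_info)["geometry"].lower().split("x")
    return int(rows), int(cols)


def place(grid, rows, cols):
    """First-fit placement: claim a free rows x cols block of cores.

    Returns the (top, left) corner of the claimed block, or None if no
    free block of that shape exists. The grid is modified in place.
    """
    height, width = len(grid), len(grid[0])
    for top in range(height - rows + 1):
        for left in range(width - cols + 1):
            block = [(r, c)
                     for r in range(top, top + rows)
                     for c in range(left, left + cols)]
            if all(not grid[r][c] for r, c in block):
                for r, c in block:
                    grid[r][c] = True
                return (top, left)
    return None


# An 8x8 many-core processor with every core initially free.
grid = [[False] * 8 for _ in range(8)]
rows, cols = parse_geometry('{"geometry":"4x4"}')
placements = [place(grid, rows, cols) for _ in range(5)]
```

With a 4x4 geometry, four instances fit on the 8x8 grid and the fifth request finds no free block.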

Accelerators are also important, but instead of having dedicated GPU-related fields, the design tries to support other future accelerators such as FPGAs, optical processors, or any other dedicated hardware resource that can be passed through to the virtual machine. The xpus quantity field is fairly crude and can't easily handle a box with two different kinds of accelerators, but this could be broken out later as a separate one-to-many relational table. We are trying to minimize the schema changes.

The networking fields attempt to make network connectivity an equal of cores, memory, and disk when selecting the host on which instances are deployed. Enforcement of bandwidth at the VM would be nice, but even a "divide network bandwidth by number of instances" metric at the scheduler would be better than nothing. The networking service will add another layer of complexity, but at least with this blueprint the networking service will know how much bandwidth an instance is requesting or has been allocated on the host.
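
The "divide network bandwidth by number of instances" metric can be sketched in a few lines. This is only an illustration of the arithmetic under the assumption of an even split; the function names are hypothetical and nothing here enforces bandwidth at the VM.

```python
def per_instance_mbps(host_net_mbps, instance_count):
    """Even share of the host's bandwidth across its instances."""
    if instance_count <= 0:
        return host_net_mbps
    return host_net_mbps // instance_count


def can_host(host_net_mbps, current_instances, requested_mbps):
    """Admit a new instance only if the even share after adding it
    still covers the bandwidth it requests."""
    share = per_instance_mbps(host_net_mbps, current_instances + 1)
    return share >= requested_mbps
```

For example, on a host with net_mbps = 10000 that already runs three instances, a request for 2500 Mbps fits (10000 / 4 = 2500), while the same request on a host with four instances does not (10000 / 5 = 2000).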

We may want to consider additional top-level column fields in these tables for scheduler performance purposes, like cpu_model and xpu_model, but these are enhancements.

Supporting multiple accelerators

The proposed approach would only support one type of accelerator per machine. For example, you could have GPUs in the machine, or FPGAs, but not both. To support multiple accelerators, we would need either:

A separate table that contains accelerator information, or
The extra-data approach.

The separate table approach would make the implementation in the code simpler.
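
A rough sketch of the separate-table option follows, using an in-memory SQLite database purely for illustration. The table name compute_node_accelerators, its columns, and the helper hosts_with are assumptions for this sketch and do not come from the prototype.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE compute_nodes (
    id INTEGER PRIMARY KEY,
    host TEXT NOT NULL
);
-- Hypothetical one-to-many table: one row per accelerator type on a node.
CREATE TABLE compute_node_accelerators (
    id INTEGER PRIMARY KEY,
    compute_node_id INTEGER NOT NULL REFERENCES compute_nodes(id),
    xpu_arch TEXT NOT NULL,
    xpu_info TEXT NOT NULL DEFAULT '',
    xpus INTEGER NOT NULL DEFAULT 0
);
""")

# One node carrying both GPUs and an FPGA, which the single-column
# design cannot express.
conn.execute("INSERT INTO compute_nodes (id, host) VALUES (1, 'node-a')")
conn.executemany(
    "INSERT INTO compute_node_accelerators (compute_node_id, xpu_arch, xpus) "
    "VALUES (?, ?, ?)",
    [(1, "fermi", 2), (1, "fpga", 1)],
)


def hosts_with(conn, xpu_arch, xpus):
    """Return hosts offering at least `xpus` accelerators of type `xpu_arch`."""
    rows = conn.execute(
        "SELECT n.host FROM compute_nodes n "
        "JOIN compute_node_accelerators a ON a.compute_node_id = n.id "
        "WHERE a.xpu_arch = ? AND a.xpus >= ? ORDER BY n.host",
        (xpu_arch, xpus),
    )
    return [row[0] for row in rows]
```

The scheduler query becomes a join rather than a column comparison, which is the cost of lifting the one-accelerator-type restriction.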

Information Flow

The basic information flow through nova is as follows:

nova-compute starts on a host and registers architecture, accelerator, and networking capabilities in the ComputeNode table. This functionality is provided by the instance migration blueprint and is already merged. We need to add our new fields and populate them in the compute_nodes table using flags and/or information extracted from /proc.
nova-api receives a run-instances request with an instance_type string such as "m1.small" or "p7f.grande". No change here.
nova-api passes instance_type to compute/api.py create() from api/ec2/cloud.py run_instances() or api/openstack/servers.py create(). No change here.
nova-api compute/api.py create() reads from the instance_types table and adds rows to the instances table. We need to insert our new fields into the base_options arg that gets passed to instances.db.create(). This might also be a good place to insert a sanity check that the image's CPU architecture supports cpu_arch.
nova-api does an rpc.cast() to scheduler num_instances times, passing instance_id. No change here.
nova-scheduler selects a compute_service host that matches the options specified in the instances table fields. The simple scheduler will just work correctly and ignore these fields on a homogeneous deployment. We need to add an arch scheduler that filters available compute_nodes by comparing the instance's cpu_arch, cpu_info, xpu_arch, xpu_info, xpus, net_arch, net_info, and net_mbps against the same fields on each compute node.
nova-scheduler rpc.cast() to each selected compute service. No change here.
nova-compute receives rpc.cast() with instance_id, launches the virtual machine, etc. At this point, nova-compute has cpu_arch, cpu_info, xpu_arch, xpu_info, xpus, net_arch, net_info, and net_mbps fields in instance object and can configure libvirt as needed. No change required for existing compute service manager. USC-ISI team is adding GPU and other non-x86 architecture support (need to add blueprint references).
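
The compute/api.py step in the flow above, copying the new fields from the instance type into base_options and optionally checking the image architecture, might look roughly like this. It is a sketch under the assumption that instance types are dictionaries; build_base_options and its image_arch parameter are hypothetical and do not mirror the actual create() signature.

```python
ARCH_FIELDS = (
    "cpu_arch", "cpu_info",
    "xpu_arch", "xpu_info", "xpus",
    "net_arch", "net_info", "net_mbps",
)


def build_base_options(instance_type, image_arch=None):
    """Copy instance-type fields into the options used to create an instance.

    If image_arch is given, reject images whose architecture does not
    match the instance type's cpu_arch (the sanity check suggested above).
    """
    if image_arch is not None and image_arch != instance_type["cpu_arch"]:
        raise ValueError(
            "image architecture %r does not match cpu_arch %r"
            % (image_arch, instance_type["cpu_arch"])
        )
    base_options = {
        "instance_type": instance_type["name"],
        "memory_mb": instance_type["memory_mb"],
        "vcpus": instance_type["vcpus"],
        "local_gb": instance_type["local_gb"],
    }
    # The new fields are handled exactly like memory_mb, vcpus, and local_gb.
    for field in ARCH_FIELDS:
        base_options[field] = instance_type[field]
    return base_options
```

Failing early in the API, before any row is written or any cast reaches the scheduler, keeps a mismatched image from consuming scheduler and compute resources.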

Schema Changes

Three tables are extended:

InstanceTypes

The instance_types are now stored in their own table in nova trunk: ConfigureInstanceTypesDynamically

class InstanceTypes(BASE, NovaBase):
    """Represent possible instance_types or flavor of VM offered"""
    __tablename__ = "instance_types"
    id = Column(Integer, primary_key=True)
    name = Column(String(255), unique=True)
    memory_mb = Column(Integer)
    vcpus = Column(Integer)
    local_gb = Column(Integer)
    flavorid = Column(Integer, unique=True)
    swap = Column(Integer, nullable=False, default=0)
    rxtx_quota = Column(Integer, nullable=False, default=0)
    rxtx_cap = Column(Integer, nullable=False, default=0)
+    cpu_arch = Column(String(255), default='x86_64')
+    cpu_info = Column(String(255), default='')
+    xpu_arch = Column(String(255), default='')
+    xpu_info = Column(String(255), default='')
+    xpus = Column(Integer, nullable=False, default=0)
+    net_arch = Column(String(255), default='')
+    net_info = Column(String(255), default='')
+    net_mbps = Column(Integer, nullable=False, default=0)

Compute Nodes

The compute nodes table is being included by: https://code.launchpad.net/~nttdata/nova/live-migration

class ComputeNode(BASE, NovaBase):
    """Represents a running compute service on a host."""
...
    hypervisor_type = Column(Text, nullable=True)
    hypervisor_version = Column(Integer, nullable=True)
    cpu_info = Column(Text, nullable=True)
+    cpu_arch = Column(String(255), default='x86_64')
+    xpu_arch = Column(String(255), default='')
+    xpu_info = Column(String(255), default='')
+    xpus = Column(Integer, default=0)
+    net_arch = Column(String(255), default='')
+    net_info = Column(String(255), default='')
+    net_mbps = Column(Integer, default=0)

Instance

The instances table just carries the additional fields so that libvirt_conn can pick them up. The scheduler also uses them, in the same way as vcpus.

 class Instance(BASE, NovaBase):
     """Represents a guest vm."""
.... 
     instance_type = Column(String(255))
+    cpu_arch = Column(String(255), default='x86_64')
+    cpu_info = Column(String(255), default='')
+    xpu_arch = Column(String(255), default='')
+    xpu_info = Column(String(255), default='')
+    xpus = Column(Integer, default=0)
+    net_arch = Column(String(255), default='')
+    net_info = Column(String(255), default='')
+    net_mbps = Column(Integer, default=0)

Implementation

The USC-ISI team has a functional prototype: https://code.launchpad.net/~usc-isi/nova/hpc-trunk

UI Changes

There are no UI changes exposed to cloud users. They access the functionality through instance_types/flavors.

For administrators, we should add the fields to the "nova-manage instance_types create/list" commands. One open question is how to handle user entry of the JSON text fields, though plain text isn't too bad. We also need to decide whether other nova-manage commands that describe resources should show all of this to end users or hide it behind an advanced/verbose argument.

There are also additional flags available in nova.conf for specifying cpu_arch, xpu_arch, net_arch when a compute service is launched.

Code Changes

Summary of changes:

nova/db/sqlalchemy/models.py

   - Schema changes for ComputeNode, Instance, and InstanceTypes

nova/db/sqlalchemy/migrate_repo/versions/013_add_architecture_to_instance_types.py

   - Migration code is such fun.

nova/db/sqlalchemy/migrate_repo/versions/014_add_architecture_to_instances.py

   - Migration code is such fun.

nova/db/sqlalchemy/migrate_repo/versions/015_add_architecture_to_compute_node.py

   - Migration code is such fun.

nova/compute/manager.py

   - Flags for default values inserted in ComputeNode
   - Periodic updates to ComputeNode

nova/compute/api.py

   - Added fields to base_options copied into Instances table

Migration

Very little needs to change in the way deployments use this if we set sane defaults, such as "x86_64", which is what is assumed today.
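
The effect of sane defaults on an existing deployment can be illustrated with an in-memory SQLite database: rows created before the upgrade pick up "x86_64" and empty or zero values without being touched. This is a sketch of the idea only; it uses plain SQL rather than the migrate_repo scripts listed above, and the upgrade helper is hypothetical.

```python
import sqlite3

# New columns and their defaults, following the InstanceTypes schema above.
NEW_COLUMNS = [
    ("cpu_arch", "VARCHAR(255) DEFAULT 'x86_64'"),
    ("cpu_info", "VARCHAR(255) DEFAULT ''"),
    ("xpu_arch", "VARCHAR(255) DEFAULT ''"),
    ("xpu_info", "VARCHAR(255) DEFAULT ''"),
    ("xpus", "INTEGER NOT NULL DEFAULT 0"),
    ("net_arch", "VARCHAR(255) DEFAULT ''"),
    ("net_info", "VARCHAR(255) DEFAULT ''"),
    ("net_mbps", "INTEGER NOT NULL DEFAULT 0"),
]


def upgrade(conn, table):
    """Add the architecture columns to `table`, one ALTER TABLE per column."""
    for name, ddl in NEW_COLUMNS:
        conn.execute("ALTER TABLE %s ADD COLUMN %s %s" % (table, name, ddl))


# A pre-upgrade instance_types table with one existing flavor.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE instance_types ("
    "id INTEGER PRIMARY KEY, name VARCHAR(255), "
    "memory_mb INTEGER, vcpus INTEGER, local_gb INTEGER)"
)
conn.execute(
    "INSERT INTO instance_types (name, memory_mb, vcpus, local_gb) "
    "VALUES ('m1.small', 2048, 1, 20)"
)

upgrade(conn, "instance_types")
row = conn.execute(
    "SELECT cpu_arch, xpu_arch, xpus, net_mbps "
    "FROM instance_types WHERE name = 'm1.small'"
).fetchone()
```

After the upgrade, the existing m1.small row reads back as an x86_64 type with no accelerator and no reserved bandwidth, so a homogeneous deployment behaves as before.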

Test/Demo Plan

This need not be added or completed until the specification is nearing beta.

Unresolved issues

This should highlight any issues that should be addressed in further specifications, and not problems with the specification itself; since any specification with problems cannot be approved.

BoF agenda and discussion

Use this section to take notes during the BoF; if you keep it in the approved spec, use it for summarising what was discussed and note any options that were rejected.
