Difference between revisions of "HeterogeneousGpuAcceleratorSupport"
Revision as of 20:52, 18 April 2011
- Launchpad Entry: NovaSpec:heterogeneous-gpu-accelerator-support
- Created: Brian Schott
- Current maintainer: John Paul Walters
- Contributors: USC Information Sciences Institute
Summary
This blueprint proposes adding support for GPU-accelerated machines as an alternative machine type in OpenStack. It depends on the schema changes described in the HeterogeneousInstanceTypes blueprint and on the scheduler described in HeterogeneousArchitectureScheduler.
The target release for this is Diablo; however, the USC-ISI team intends to have a stable test branch and deployment at the Cactus release.
The USC-ISI team has a functional prototype here:
- https://code.launchpad.net/~usc-isi/nova/hpc-trunk
- https://code.launchpad.net/~usc-isi/nova/hpc-testing
This blueprint is related to the HeterogeneousInstanceTypes blueprint.
We are also drafting blueprints for other machine types:
- http://wiki.openstack.org/HeterogeneousSgiUltraVioletSupport
- http://wiki.openstack.org/HeterogeneousTileraSupport
An etherpad for discussion of this blueprint is available at http://etherpad.openstack.org/heterogeneousultravioletsupport
Release Note
Nova has been extended to make NVIDIA GPUs available to provisioned instances for CUDA programming.
Rationale
See HeterogeneousInstanceTypes.
The goal of this blueprint is to allow GPU-accelerated computing in OpenStack.
User stories
Jackie has a CUDA-accelerated application and wants to run it on an instance that has access to GPU hardware. She chooses the cg1.4xlarge instance type, which provides access to two NVIDIA Fermi GPUs:
euca-run-instances -t cg1.4xlarge -k jackie-keypair emi-12345678
Assumptions
This blueprint depends on cg1.4xlarge and cg1.8xlarge being selectable instance types, and on the scheduler knowing that such an instance must be routed to a machine with a GPU accelerator attached. See HeterogeneousArchitectureScheduler.
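The routing requirement above can be sketched as a host filter. This is an illustrative sketch only, assuming dictionary-like instance-type and compute-node records carrying the cpu_arch/xpu_arch/xpus attributes from the Design section; the function names are hypothetical and not the actual Nova scheduler API.

```python
# Hypothetical sketch of how HeterogeneousArchitectureScheduler might
# match a flavor's accelerator requirements against compute nodes.
# Records are plain dicts with the cpu_arch/xpu_arch/xpus attributes
# proposed in the Design section.

def host_satisfies(instance_type, compute_node):
    """Return True when the node can run the requested instance type."""
    for key in ("cpu_arch", "xpu_arch"):
        wanted = instance_type.get(key)
        if wanted and compute_node.get(key) != wanted:
            return False
    # Require at least as many accelerators as the flavor asks for.
    return compute_node.get("xpus", 0) >= instance_type.get("xpus", 0)

def filter_hosts(instance_type, compute_nodes):
    """Keep only the nodes that satisfy the flavor's requirements."""
    return [n for n in compute_nodes if host_satisfies(instance_type, n)]
```

Under this sketch, a cg1.4xlarge request would only ever reach nodes advertising a fermi xpu_arch with at least two GPUs.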
The only approach we know of that has been successful for CUDA access from a KVM virtual machine is gVirtuS [1]. We are actively looking for alternative approaches with KVM or Xen. We assume that the gVirtuS library has been installed.
Design
We propose to add cpu_arch, cpu_info, xpu_arch, xpu_info, xpus, net_arch, net_info, and net_mbps as attributes to instance_types, instances, and compute_nodes tables. See HeterogeneousInstanceTypes.
We have added the necessary gVirtuS hooks for libvirt and have augmented nova.virt.libvirt_conn to instantiate a GPU-enabled virtual machine when requested.
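Because gVirtuS communicates between guest and host over a virtual serial port (see Unresolved issues), the libvirt hook amounts to adding an extra serial device to the domain definition. The following is an illustrative sketch only, not the actual nova.virt.libvirt_conn change; the socket path and port number are hypothetical.

```python
# Hypothetical sketch: splice an extra virtual serial device for
# gVirtuS VM<->host communication into a libvirt domain XML document.
# The unix-socket path and target port are illustrative assumptions.

GVIRTUS_SERIAL = (
    "    <serial type='unix'>\n"
    "      <source mode='bind' path='/tmp/gvirtus-%(name)s.sock'/>\n"
    "      <target port='1'/>\n"
    "    </serial>\n"
)

def add_gvirtus_serial(domain_xml, instance_name):
    """Insert the gVirtuS serial device before the closing devices tag."""
    device = GVIRTUS_SERIAL % {"name": instance_name}
    return domain_xml.replace("</devices>", device + "</devices>", 1)
```

The real implementation would emit this through Nova's libvirt XML template rather than string replacement, but the resulting domain definition is equivalent.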
Schema Changes
See HeterogeneousInstanceTypes.
We propose adding the following default values to the instance_types table:
# x86 + GPU
# TODO: we need to identify a machine-readable string for xpu arch
'cg1.small':   dict(memory_mb=2048,  vcpus=1, local_gb=20,   flavorid=100, cpu_arch="x86_64", xpu_arch="fermi", xpus=1),
'cg1.medium':  dict(memory_mb=4096,  vcpus=2, local_gb=40,   flavorid=101, cpu_arch="x86_64", xpu_arch="fermi", xpus=1),
'cg1.large':   dict(memory_mb=8192,  vcpus=4, local_gb=80,   flavorid=102, cpu_arch="x86_64", xpu_arch="fermi", xpus=1, net_mbps=1000),
'cg1.xlarge':  dict(memory_mb=16384, vcpus=8, local_gb=160,  flavorid=103, cpu_arch="x86_64", xpu_arch="fermi", xpus=1, net_mbps=1000),
'cg1.2xlarge': dict(memory_mb=16384, vcpus=8, local_gb=320,  flavorid=104, cpu_arch="x86_64", xpu_arch="fermi", xpus=2, net_mbps=1000),
'cg1.4xlarge': dict(memory_mb=22000, vcpus=8, local_gb=1690, flavorid=105, cpu_arch="x86_64", cpu_info='{"model":"Nehalem"}', xpu_arch="fermi", xpus=2, xpu_info='{"model":"Tesla 2050", "gcores":"448"}', net_arch="ethernet", net_mbps=10000),
'cg1.8xlarge': dict(memory_mb=22000, vcpus=8, local_gb=1690, flavorid=106, cpu_arch="x86_64", cpu_info='{"model":"Nehalem"}', xpu_arch="fermi", xpus=4, xpu_info='{"model":"Tesla 2050", "gcores":"448"}', net_arch="ethernet", net_mbps=10000),
Implementation
The USC-ISI team has a functional prototype: https://code.launchpad.net/~usc-isi/nova/hpc-trunk
Our approach currently leverages the gVirtuS drivers: http://osl.uniparthenope.it/projects/gvirtus/
UI Changes
The following will be available as new default instance types.
GPUs (NVIDIA Teslas)
Available resources per physical node: 8 cores, 20 GB RAM (24 GB minus 4 GB reserved for the host), and 900 GB of disk (1000 GB minus 100 GB reserved). These match the non-GPU small, medium, large, xlarge, 2xlarge, and 4xlarge definitions. In addition, cg1.4xlarge matches the Amazon GPU instance definition. The cpu_arch is "x86_64" and the xpu_arch is "fermi".
GPU small
- API name: cg1.small
- 1 Fermi GPU
- 2 GB RAM (2048 MB)
- 1 virtual core
- 20 GB of instance storage
GPU medium
- API name: cg1.medium
- 1 Fermi GPU
- 4 GB RAM (4096 MB)
- 2 virtual cores
- 40 GB of instance storage
GPU large
- API name: cg1.large
- 1 Fermi GPU
- 8 GB RAM (8192 MB)
- 4 virtual cores
- 80 GB of instance storage
GPU xlarge
- API name: cg1.xlarge
- 1 Fermi GPU
- 16 GB RAM (16384 MB)
- 8 virtual cores
- 160 GB of instance storage
GPU 2xlarge
- API name: cg1.2xlarge
- 2 Fermi GPUs
- 16 GB RAM (16384 MB)
- 8 virtual cores
- 320 GB of instance storage
GPU 4xlarge
- API name: cg1.4xlarge
- 2 Fermi GPUs
- 22 GB RAM (22000 MB)
- 8 virtual cores
- 1690 GB (~1.7 TB) of instance storage
GPU 8xlarge
- API name: cg1.8xlarge
- 4 Fermi GPUs
- 22 GB RAM (22000 MB)
- 8 virtual cores
- 1690 GB (~1.7 TB) of instance storage
Code Changes
- db/sqlalchemy/migrate_repo/versions/013_add_architecture_to_instance_types.py
- add default GPU instance types
- nova/virt/libvirt_conn.py
- add code to support starting/stopping the gVirtuS driver
- also requires a supported qemu, the gVirtuS host/VM driver, and libserial
Migration
Very little needs to change in how deployments use Nova, provided we set sane defaults such as "x86_64", which is what is implicitly assumed today.
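The sane-default rule can be stated in one line; this is a minimal sketch (the helper name is hypothetical) of how code reading the new column might fall back for rows migrated without a value:

```python
# Minimal sketch of the "x86_64 by default" migration rule: rows that
# predate the migration have no cpu_arch, and keep today's behavior.
# The helper name is illustrative, not an actual Nova function.

def effective_cpu_arch(instance_type):
    """Treat a missing/NULL cpu_arch as the historical x86_64 default."""
    return instance_type.get("cpu_arch") or "x86_64"
```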
Test/Demo Plan
This need not be added or completed until the specification is nearing beta.
Unresolved issues
One challenge is that the flavorid field in the instance_types table is not auto-incrementing. We have selected high numbers to avoid collisions, but the community should discuss how flavorid should behave and the best approach for adding new instance types in the future.
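One possible interim workaround, sketched below under the assumption that defaults occupy a reserved range starting at 100 (as in the table above; the function and constant names are hypothetical), is to allocate the next id above both the reserved base and any existing row:

```python
# Illustrative workaround for the non-auto-incrementing flavorid:
# allocate the next id above a reserved base and all existing rows,
# so new flavors never collide with the shipped defaults.
# RESERVED_BASE and next_flavorid are hypothetical names.

RESERVED_BASE = 100  # the proposed defaults above use 100-106

def next_flavorid(existing_flavorids, base=RESERVED_BASE):
    """Pick the smallest unused flavorid at or above the reserved base."""
    return max(list(existing_flavorids) + [base - 1]) + 1
```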
A second issue is that gVirtuS currently requires a virtual serial port for VM<->host initialization. This forces us to use the serial port otherwise used by the Ajax terminal; as a consequence, VMs using the GPUs currently cannot start an Ajax console.
BoF agenda and discussion
Use this section to take notes during the BoF; if you keep it in the approved spec, use it for summarising what was discussed and note any options that were rejected.