Persistent resource claim

Summary

When a resource claim happens through an instance_claim() or resize_claim() request, the compute resource tracker will take the COMPUTE_RESOURCE_SEMAPHORE, allocate the resources, persist the allocation result (the claim) in the DB, and then return the ID of the claim.

When resources are given back to the compute resource tracker through drop_resize_claim() or abort_instance_claim(), the claim (or simply the claim ID) is passed into the resource tracker to free the resources.

See below for the benefits of this change.

Implementation Plan

Data Structure

  • claim object:
  • uuid: ID of the claim
  • host: the host that supports this claim
  • owner_type: the type of the resource claim owner; only 'server' is supported now.
  • owner_id: the ID of the resource claim owner; only an instance_uuid is supported now.
  • vcpu: the number of vCPUs claimed
  • memory: the MB of memory claimed
  • disk: the GB of disk claimed.
  • pci: A json blob of the PCI devices claimed (Nullable)
  • extra_resource: A json blob for other resource types (Nullable)
  • tag: a tag provided by the caller, to be interpreted by the caller.
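
For illustration, the claim record could be shaped like the following (a minimal sketch; the dataclass form and the exact types are assumptions, not the actual Nova object definition):

  from dataclasses import dataclass
  from typing import Optional

  @dataclass
  class Claim:
      """Illustrative shape of one persistent claim record."""
      uuid: str                  # ID of the claim
      host: str                  # host that backs this claim
      owner_type: str            # only 'server' is supported now
      owner_id: str              # only an instance_uuid is supported now
      vcpu: int                  # number of vCPUs claimed
      memory: int                # MB of memory claimed
      disk: int                  # GB of disk claimed
      pci: Optional[str] = None             # JSON blob of claimed PCI devices
      extra_resource: Optional[str] = None  # JSON blob for other resource types
      tag: Optional[str] = None             # caller-provided, caller-interpreted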

DB

  • A claims table will be created to track all claims.
It's an option to keep the claim information in the instance table (e.g. through system metadata), but I'd prefer a separate table.
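
A sketch of what the separate claims table could look like, written with SQLAlchemy (column names follow the data structure above; the exact types and indexes are assumptions, and Nova's real migration would also add the usual created_at/updated_at/deleted bookkeeping columns):

  from sqlalchemy import Column, Integer, MetaData, String, Table, Text

  metadata = MetaData()

  # Hypothetical schema for the proposed claims table.
  claims = Table(
      'claims', metadata,
      Column('id', Integer, primary_key=True),
      Column('uuid', String(36), nullable=False, unique=True),
      Column('host', String(255), nullable=False, index=True),
      Column('owner_type', String(16), nullable=False),  # 'server' only now
      Column('owner_id', String(36), nullable=False, index=True),
      Column('vcpu', Integer, nullable=False),
      Column('memory', Integer, nullable=False),      # MB
      Column('disk', Integer, nullable=False),        # GB
      Column('pci', Text, nullable=True),              # JSON blob
      Column('extra_resource', Text, nullable=True),   # JSON blob
      Column('tag', String(255), nullable=True),
  )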

API

Claim operations
  • get_claims_for_instance(instance_uuid): return the claims for the instance.
  • get_claims_for_host(host): return the claims for the host.
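
A minimal sketch of these two calls, reusing the claims table sketched in the DB section above (the raw-connection style is illustrative; in Nova these would go through the DB API layer):

  from sqlalchemy import select

  def get_claims_for_instance(conn, instance_uuid):
      # All claims owned by the given instance (owner_type 'server').
      return conn.execute(
          select(claims).where(claims.c.owner_id == instance_uuid)).fetchall()

  def get_claims_for_host(conn, host):
      # All claims backed by the given host, e.g. for usage accounting.
      return conn.execute(
          select(claims).where(claims.c.host == host)).fetchall()
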
Changes to resource tracker API
  • instance_claim/resize_claim:
Add a step to sync the claim information to the DB (see the sketch after this list).
Consideration: does the order of syncing the claims and syncing the compute node matter? It should not, because both happen under the lock.
  • update_available_resource:
This function may change a lot. The resource tracker gets the resource information from the hypervisor, then gets all claims from the DB, and updates the available resources by deducting the claims, as well as the usage of orphan instances, from the hypervisor's numbers.
Question: should we create claims in the DB for orphan instances? Presumably not.
  • drop_resize_claim()/abort_instance_claim():
Add a step to remove the claim information from the DB.
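
To make the changes above concrete, here is a toy, in-memory sketch of the new tracker behavior (a sketch only: _LOCK, ClaimStore, and ResourceTrackerSketch are illustrative stand-ins for COMPUTE_RESOURCE_SEMAPHORE, the claims table, and Nova's resource tracker):

  import threading
  import uuid

  _LOCK = threading.Lock()  # stand-in for COMPUTE_RESOURCE_SEMAPHORE

  class ClaimStore:
      """In-memory stand-in for the proposed claims table."""

      def __init__(self):
          self._claims = {}

      def create(self, claim):
          self._claims[claim['uuid']] = claim

      def delete(self, claim_uuid):
          self._claims.pop(claim_uuid)

      def for_host(self, host):
          return [c for c in self._claims.values() if c['host'] == host]

  class ResourceTrackerSketch:
      def __init__(self, host, store):
          self.host = host
          self.store = store

      def instance_claim(self, instance_uuid, vcpu, memory, disk):
          with _LOCK:
              claim = {'uuid': str(uuid.uuid4()), 'host': self.host,
                       'owner_type': 'server', 'owner_id': instance_uuid,
                       'vcpu': vcpu, 'memory': memory, 'disk': disk}
              self.store.create(claim)  # new step: persist the claim
              return claim['uuid']      # the caller keeps the claim ID

      def abort_instance_claim(self, claim_uuid):
          with _LOCK:
              self.store.delete(claim_uuid)  # new step: drop the claim row

      def update_available_resource(self, hv_vcpus, hv_memory_mb):
          with _LOCK:
              # Usage is derived from the persisted claims, not from the
              # instance state; orphan instances (not modeled here) would
              # also be deducted, without creating claim rows for them.
              mine = self.store.for_host(self.host)
              return {'vcpus': hv_vcpus - sum(c['vcpu'] for c in mine),
                      'memory_mb': hv_memory_mb - sum(c['memory'] for c in mine)}

  # Usage: claim, inspect availability, then free by claim ID.
  rt = ResourceTrackerSketch('node1', ClaimStore())
  claim_id = rt.instance_claim('some-instance-uuid', vcpu=2, memory=2048, disk=20)
  print(rt.update_available_resource(hv_vcpus=16, hv_memory_mb=32768))
  rt.abort_instance_claim(claim_id)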


Integration With Server Operation Flow

Claim creation/destruction should happen not only when a server is created/destroyed, but also when the instance's resource requirements change (e.g. resize) or its compute node changes (e.g. migration). Below is a detailed analysis of the changes to the different server actions.

  • Build

We don't need the context manager for the claim since it's persistent. Instead, when something goes wrong, the claim is cleaned up like other resources such as networks or BDMs. Thus we can set the instance host/node in the compute manager instead of in the compute resource tracker (see the sketch after this list).

  • Rebuild

No resource claim is needed, since neither the resources nor the host change with this operation.
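
To illustrate the Build point: a sketch of the build path once the claim is persistent (the function and attribute names are assumptions, not existing Nova code; today's Nova instead wraps the spawn in the claim's context manager and sets host/node inside the tracker):

  def build_instance(manager, rt, context, instance, limits):
      claim_id = rt.instance_claim(context, instance, limits)
      try:
          # Setting host/node moves from the resource tracker to the
          # compute manager, since the tracker's lock no longer needs
          # to be held around this bookkeeping.
          instance.host = manager.host
          instance.node = manager.nodename
          instance.save()
          manager._spawn(context, instance)
      except Exception:
          # On failure the claim is cleaned up like networks/BDMs,
          # rather than by a context manager's __exit__.
          rt.abort_instance_claim(context, claim_id)
          raise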

Benefits

The key benefit of this change is that update_available_resource() updates the resource usage information according to the claims in the DB instead of the instance state. Thus the caller of the resource tracker API (instance_claim/resize_claim/drop_claim etc.) can allocate the resources and then change the instance state, with no race condition against update_available_resource() anymore.

This benefit enables several potential changes:

  • We can move the migration object creation out of the resource tracker, by having the conductor claim the resources on the source/target hosts and then create the migration object.
  • We can move _set_instance_host_and_node out of the resource tracker, by having the conductor claim the resources and then set the host/node.
  • After moving migration object creation out of the resource tracker, the resource tracker no longer needs to understand the migration process, and we can combine instance_claim/resize_claim.
  • We can now support resource claims through rpcapi: the caller sends an RPC to the compute node, and the compute node invokes the resource tracker, which takes the COMPUTE_RESOURCE_SEMAPHORE, allocates the resources, persists the claim, and returns the ID of the claim (or the whole claim object); see the sketch after this list. With this change, we can move prep_resize etc. out of the compute manager to the conductor.
  • Also with this change, it becomes easy to check the resource usage for all instances, all hosts, etc., even including the overhead.
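
A sketch of what the rpcapi-driven claim mentioned above could look like on the compute side (all names here are hypothetical; no such RPC endpoint exists in Nova today):

  class ComputeManagerSketch:
      def __init__(self, resource_tracker):
          self.rt = resource_tracker

      def claim_resources(self, context, instance_uuid, vcpu, memory, disk):
          # Runs on the compute node: the tracker takes
          # COMPUTE_RESOURCE_SEMAPHORE, allocates, persists the claim in
          # the DB, and the claim ID travels back over RPC to the caller
          # (e.g. the conductor driving prep_resize).
          return self.rt.instance_claim(instance_uuid, vcpu, memory, disk)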

Potential Issues

  • We will have one more DB access for each instance claim on the compute node; it is not clear whether this has a performance impact.
  • One more table is created.

Upgrade/Compatibility

This change itself is local to the compute node and will have no upgrade/compatibility issues. In the next step, when we move the migration object creation/_set_instance_host_and_node out of the resource tracker, there may be an impact, but that should be discussed at that time.