<!-- ##master-page:[[ProposalTemplate]] -->
<!-- #format wiki -->

== [[SchedulerRaceReduction]] ==

'''Time: ''' 2012-06-11T20:42:43Z

'''Drafter: '''[[belliott]]
 
== Overview ==

The scheduler is subject to a race condition that can cause it to misjudge the resources available on a particular compute host. The problem occurs when multiple scheduler instances or threads concurrently issue an instance build request (i.e. ''run_instance'') to the same compute host. Together, these requests may oversubscribe the compute host and cause one or more ''run_instance'' requests to fail.
 
== Example ==

Compute host ''C'' has 3 GB of RAM free.

# Scheduler ''A'' sends a ''run_instance'' request ''R1'' to ''C'' to build a 2 GB instance.
# Scheduler ''B'' sends a ''run_instance'' request ''R2'' to ''C'' to build a 2 GB instance.
# Assume processing of ''R1'' and ''R2'' begins concurrently on ''C''.

Together the two requests need 4 GB, but ''C'' has only 3 GB free, so at least one of them will fail.
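
The interleaving in this example can be replayed deterministically. The following is a minimal sketch, not Nova code: the function name and the dictionary of requests are hypothetical, and it assumes that both requests check free memory before either one reserves any.

```python
# Hypothetical model of compute host C from the example; not Nova code.
HOST_FREE_GB = 3
REQUEST_GB = {"R1": 2, "R2": 2}


def unsynchronized_interleaving(free_gb, requests):
    """Replay the case where every request checks before any request reserves."""
    # Step 1: each request compares its need against the same, soon stale, value.
    passed_check = [name for name, need in requests.items() if need <= free_gb]
    # Step 2: every request that passed the check reserves its memory.
    for name in passed_check:
        free_gb -= requests[name]
    return passed_check, free_gb


passed, remaining = unsynchronized_interleaving(HOST_FREE_GB, REQUEST_GB)
print(passed)     # ['R1', 'R2']
print(remaining)  # -1
```

Both requests pass the check, yet the host ends up 1 GB short, which is the oversubscription described above.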

== Impact ==

Instance build requests may fail even when other compute hosts have enough free resources to service them.
 
== Solution ==

* Compute hosts should have the final say over whether a ''run_instance'' request can be properly serviced. To this end, the compute host must be capable of identifying whether it has free resources when a new ''run_instance'' request arrives.
* Compute hosts should verify the resources available for ''run_instance'' requests serially, so that concurrent callers cannot compete for the same resources.
* Schedulers should read the response to ''run_instance'' and, if the request was rejected, possibly retry it at a different compute host.
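
The three points above can be sketched together as follows. This is an illustration under stated assumptions rather than the Nova implementation: the ''ComputeHost'' class, the ''schedule'' function, the boolean return value of ''run_instance'' and the second host ''D'' are all hypothetical, and a simple lock stands in for whatever serialization mechanism the compute host actually uses.

```python
import threading


class ComputeHost:
    """Hypothetical stand-in for a compute host; not the Nova implementation."""

    def __init__(self, name, free_gb):
        self.name = name
        self.free_gb = free_gb
        self._lock = threading.Lock()

    def run_instance(self, needed_gb):
        """Check and reserve memory as one serialized step; True means accepted."""
        with self._lock:  # only one request is verified at a time
            if needed_gb > self.free_gb:
                return False  # the host has the final say and rejects
            self.free_gb -= needed_gb
            return True


def schedule(hosts, needed_gb):
    """Try candidate hosts in order; return the accepting host's name or None."""
    for host in hosts:
        if host.run_instance(needed_gb):
            return host.name
    return None


host_c = ComputeHost("C", free_gb=3)
host_d = ComputeHost("D", free_gb=4)  # hypothetical second host used for the retry
placements = []


def scheduler():
    # Each scheduler prefers C and falls back to D if C rejects the request.
    placements.append(schedule([host_c, host_d], needed_gb=2))


threads = [threading.Thread(target=scheduler) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(placements))  # ['C', 'D']
```

Whichever scheduler reaches ''C'' first is accepted; the other is rejected and retries at ''D'', so neither build fails and ''C'' is never oversubscribed.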
  
 
[https://blueprints.launchpad.net/nova/+spec/scheduler-resource-race Blueprint]

[https://bugs.launchpad.net/nova/+bug/1011852 Launchpad bug]

Latest revision as of 23:30, 17 February 2013

