Difference between revisions of "SchedulerRaceReduction"

Revision as of 21:50, 11 June 2012

SchedulerRaceReduction

Time: <<DateTime(2012-06-11T20:42:43Z)>>

Drafter: belliott

Overview

The scheduler is subject to a race condition which can cause it to incorrectly identify available resources on a particular compute host. The problem occurs if multiple scheduler instances/threads concurrently issue an instance build request (i.e. run_instance) to the same compute host. This situation may oversubscribe the given compute host and cause one or more run_instance requests to fail.

Example

Compute host C has 3 GB of RAM free.

Scheduler A sends a run_instance request R1 to C trying to build a 2 GB instance.
Scheduler B sends a run_instance request R2 to C trying to build a 2 GB instance.
Assume processing of R1 and R2 begins concurrently on C.

C cannot satisfy both requests, since together they need 4 GB and only 3 GB is free, so at least one will fail.
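
The arithmetic behind this example can be sketched in a few lines of Python. This is only an illustrative model under stated assumptions: the names FREE_RAM_GB, REQUESTS and snapshot are hypothetical and do not come from the scheduler code. It shows why each request looks valid on its own while the pair oversubscribes the host.

```python
# Hypothetical model of the example; names are illustrative, not scheduler code.
FREE_RAM_GB = 3                  # free RAM reported by compute host C
REQUESTS = {"R1": 2, "R2": 2}    # RAM (GB) requested by each run_instance call

# Both schedulers decide based on the same stale view of C's free RAM.
snapshot = FREE_RAM_GB
passed_check = [name for name, ram in REQUESTS.items() if ram <= snapshot]

# Each request fits individually, but the sum exceeds what C actually has.
total_requested = sum(REQUESTS[name] for name in passed_check)
oversubscribed = total_requested > FREE_RAM_GB

print(passed_check, total_requested, oversubscribed)
```

Both R1 and R2 pass the individual check, yet the combined demand of 4 GB exceeds the 3 GB available.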

Impact

Instance build requests may fail, even if other compute hosts are available with free resources.

Solution

Compute hosts should have the final say over whether a run_instance request can be properly serviced. To this end, the compute host must be capable of identifying whether it has free resources when a new run_instance request arrives.
Compute hosts should serially verify resources available for run_instance requests to avoid concurrent competition by multiple callers.
Schedulers should read the response to run_instance and possibly retry the request at a different compute host.
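
The three points above could be sketched roughly as follows. This is a minimal sketch, not the actual implementation: the ComputeHost class, its boolean return value, the schedule helper, and the second host D are all assumptions made for illustration. A lock stands in for whatever mechanism the compute host uses to verify resources serially.

```python
import threading


class ComputeHost:
    """Hypothetical stand-in for a compute host; not the real compute service."""

    def __init__(self, name, free_ram_gb):
        self.name = name
        self.free_ram_gb = free_ram_gb
        self._lock = threading.Lock()  # serializes resource verification

    def run_instance(self, ram_gb):
        """Return True if the request was accepted, False if it was rejected."""
        with self._lock:
            # The check and the reservation happen as one step under the lock,
            # so two callers can never both claim the same free RAM.
            if ram_gb > self.free_ram_gb:
                return False
            self.free_ram_gb -= ram_gb
            return True


def schedule(hosts, ram_gb):
    """Try each candidate host in order; return the accepting host's name or None."""
    for host in hosts:
        if host.run_instance(ram_gb):
            return host.name
    return None  # every candidate rejected the request


# Hosts C (3 GB free) and D (4 GB free, assumed) receive two concurrent 2 GB requests.
host_c = ComputeHost("C", free_ram_gb=3)
host_d = ComputeHost("D", free_ram_gb=4)
results = []


def scheduler_thread():
    results.append(schedule([host_c, host_d], ram_gb=2))


threads = [threading.Thread(target=scheduler_thread) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(results))
```

In this sketch, whichever request reaches C first is accepted, and the other is rejected by C and retried on D, so neither build fails.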

Blueprint

Launchpad bug

@@ Line 17: / Line 17: @@
 Compute host ''C'' has 3 GB of ram free.
-Scheduler ''A'' sends a ''run_instance'' request ''R1'' to ''C'' trying to build a 2GB instance.
+# Scheduler ''A'' sends a ''run_instance'' request ''R1'' to ''C'' trying to build a 2GB instance.
-Scheduler ''B'' sends a ''run_instance'' request ''R2'' to ''C'' trying to build a 2GB instance.
+# Scheduler ''B'' sends a ''run_instance'' request ''R2'' to ''C'' trying to build a 2GB instance.
+# Assume processing of ''R1'' and ''R2'' begins concurrently on ''C''.
-Assuming processing of ''R1'' and ''R2'' begins concurrently on ''C''.  Obviously ''C'' cannot handle both requests, so at least 1 will fail.
+Obviously ''C'' cannot handle both requests, so at least 1 will fail.
 == Impact ==