
Heat/Blueprints/ImproveAPIPolling

Revision as of 20:39, 8 January 2014 by Jason Dunsmore

Blueprint: https://blueprints.launchpad.net/heat/+spec/improve-api-polling

Heat polls a resource's API (e.g. Nova, Trove) every second during resource create/update/delete to check the status of the resource. Several concurrent stack create/update/delete operations can generate a high number of requests to the API and trigger its rate limiting, which results in a stack failure. The polling intervals should be spaced more appropriately to avoid excessive API requests.

Given that we don't know anything about how long a device might take to become ACTIVE, it is reasonable to poll the API every second in the beginning. However, once a device has been pending create/update/delete for more than 5-10 minutes, it no longer makes sense to poll the API every second. The probability of a device becoming ACTIVE in any given second over the course of 9 minutes might look something like:

Figure: Device-active-probabilities.png

The goal here is to maintain a relatively constant probability of the device status being ACTIVE upon each poll, so the polling interval will be calculated based on a geometric progression. At some point, the interval will be long enough so that we no longer have to worry about excessive API requests. An "interval maximum" can be provided to indicate the point at which the interval should not increase further.
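The geometric progression with a cap can be sketched as a small generator. This is only an illustration of the idea, not Heat's implementation; the function name polling_intervals and its parameter names are hypothetical, and the defaults mirror the values proposed on this page.

```python
from itertools import islice


def polling_intervals(start=1.0, factor=1.1, maximum=20.0):
    """Yield polling intervals in seconds (hypothetical helper).

    Each interval is the previous one multiplied by `factor`,
    and no interval is ever larger than `maximum`.
    """
    interval = min(start, maximum)
    while True:
        yield interval
        # Grow geometrically, but never past the interval maximum.
        interval = min(interval * factor, maximum)


# Look at the first 40 intervals produced with the default values.
first_intervals = list(islice(polling_intervals(), 40))
print([round(i, 2) for i in first_intervals[:5]])
print(first_intervals[-1])
```

With these defaults the interval grows by 10% per poll until it reaches the 20-second maximum, after which it stays constant.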

In each of the following graphs, the x-axis represents the Nth iteration of the TaskRunner job in the check_*_complete methods.

Current API polling

Currently, we make ~600 polls for a 10-minute resource creation.

Intervals (figure: Current polling intervals)

Running sum (figure: Current polling running sum)

Proposed API polling

The number of polls can be reduced by an order of magnitude with the proposed changes.

Intervals (figure: Proposed polling intervals)

Running sum (figure: Proposed polling running sum)

With the proposed intervals, we make ~50 polls for a 10-minute resource creation.
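The poll counts quoted for the current and proposed schemes can be checked with a short simulation. This is a rough sketch: count_polls is a hypothetical helper, it assumes each poll happens after waiting one interval, and it ignores the time the API call itself takes.

```python
def count_polls(duration, start=1.0, factor=1.1, maximum=20.0):
    """Count how many polls fit into `duration` seconds (hypothetical helper)."""
    elapsed = 0.0
    polls = 0
    interval = min(start, maximum)
    while elapsed + interval <= duration:
        elapsed += interval
        polls += 1
        # Geometric growth, capped at the interval maximum.
        interval = min(interval * factor, maximum)
    return polls


ten_minutes = 600
# A factor of 1 reproduces the current fixed one-second polling.
current = count_polls(ten_minutes, factor=1.0)
# The defaults proposed on this page: start 1, factor 1.1, maximum 20.
proposed = count_polls(ten_minutes)
print("current:", current, "proposed:", proposed)
```

Under these assumptions the simulation agrees with the figures above: about 600 polls with fixed one-second polling and about 50 with the proposed defaults.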

New TaskRunner arguments

Arguments that can be passed to TaskRunner.__call__() include:

  • Starting interval (default 1): The initial interval in seconds. wait_time can be used for this.
  • Interval increase factor (default 1.1): Each successive interval is the product of this value and the previous interval. Set this to 1 to restore the old behavior.
  • Interval maximum (default 20): The maximum interval value in seconds.

The graphs above used the default options.
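To illustrate how the three arguments interact, here is a minimal polling loop. It is not TaskRunner's code or signature; poll_until_complete, its parameter names, and the injectable sleep argument are all assumptions made for this sketch.

```python
import time


def poll_until_complete(check_complete, starting_interval=1.0,
                        increase_factor=1.1, interval_max=20.0,
                        sleep=time.sleep):
    """Call `check_complete` until it returns True (hypothetical sketch).

    Returns the number of polls that were made. `sleep` is injectable
    so the loop can be exercised without actually waiting.
    """
    interval = min(starting_interval, interval_max)
    polls = 0
    while True:
        polls += 1
        if check_complete():
            return polls
        sleep(interval)
        # An increase_factor of 1 keeps the interval fixed (old behavior).
        interval = min(interval * increase_factor, interval_max)


# Usage with a fake resource that becomes ACTIVE on the fifth poll.
statuses = iter(["BUILD", "BUILD", "BUILD", "BUILD", "ACTIVE"])
recorded_sleeps = []
polls_made = poll_until_complete(lambda: next(statuses) == "ACTIVE",
                                 sleep=recorded_sleeps.append)
print(polls_made, [round(s, 3) for s in recorded_sleeps])
```

Passing a recording function as sleep makes it easy to see which intervals a given set of arguments would produce.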

Other possible defaults

Interval increase factor 1.05

Figures: Interval-increase-factor-1 05-intervals.png, Interval-increase-factor-1 05-sum.png

Interval increase factor 1.2

An interval increase factor of 1.2 or greater is probably too aggressive.

Figures: Interval-increase-factor-1 2-intervals.png, Interval-increase-factor-1 2-sum.png
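One way to compare candidate factors is to look at how quickly each one reaches the interval maximum. The sketch below assumes a starting interval of 1 and a maximum of 20; ramp_up is a hypothetical helper, not part of Heat.

```python
def ramp_up(factor, start=1.0, maximum=20.0):
    """Return (polls, elapsed_seconds) spent below the interval maximum.

    Hypothetical helper: counts the intervals that are still smaller
    than `maximum` and sums them.
    """
    if factor <= 1.0:
        raise ValueError("factor must be greater than 1 to reach the maximum")
    interval = start
    polls = 0
    elapsed = 0.0
    while interval < maximum:
        elapsed += interval
        polls += 1
        interval *= factor
    return polls, elapsed


results = {factor: ramp_up(factor) for factor in (1.05, 1.1, 1.2)}
for factor, (polls, elapsed) in results.items():
    print(f"factor {factor}: {polls} polls, {elapsed:.0f}s before the cap")
```

A larger factor reaches the cap after fewer polls and less elapsed time, which means a resource that becomes ACTIVE within the first few minutes is more likely to be noticed late.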