Heat polls a resource's API (e.g. Nova, Trove) every second during resource create/update/delete to check the status of the resource. Several concurrent stack creates, updates, or deletes can generate enough requests to trigger the API's rate limiting, which causes the stack operation to fail. The polling intervals should be spaced out to avoid excessive API requests.
Given that we don't know anything about how long a device might take to become ACTIVE, it might be reasonable to poll the API every second in the beginning. However, after a device has been pending create/update/delete for over 5-10 minutes, it no longer makes sense to poll the API every second. The probability of a device becoming active in any given second over the course of 9 minutes might look something like:
The goal here is to maintain a relatively constant probability of the device status being ACTIVE upon each poll, so the polling interval will be calculated based on a geometric progression. At some point, the interval will be long enough so that we no longer have to worry about excessive API requests. An "interval maximum" can be provided to indicate the point at which the interval should not increase further.
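The geometric progression with a cap can be sketched as a small generator. This is an illustration only, not Heat code: the function name `polling_intervals` and its parameter names are hypothetical, and the default values are the ones proposed in the argument list further down.

```python
import itertools


def polling_intervals(start=1.0, factor=1.1, maximum=20.0):
    """Yield successive polling intervals in seconds.

    Each interval is the previous one multiplied by `factor`,
    and no interval is allowed to exceed `maximum`.
    """
    interval = start
    while True:
        yield min(interval, maximum)
        interval = min(interval * factor, maximum)


if __name__ == "__main__":
    # Show the first few intervals produced with the proposed defaults.
    for seconds in itertools.islice(polling_intervals(), 5):
        print(round(seconds, 3))
```

With `factor=1.0` the generator yields the starting interval forever, which corresponds to the current fixed one-second polling.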
In each of the following graphs, the x-axis represents the Nth iteration of the TaskManager job in the check_*_complete methods.
Current API polling
Currently, we make ~600 polls for a 10-minute resource creation.
Proposed API polling
With the proposed changes, the number of polls is reduced by an order of magnitude: ~50 polls for a 10-minute resource creation.
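The two poll counts can be checked with a short simulation. This is a rough sketch under stated assumptions: `count_polls` is a hypothetical helper, it assumes each poll happens at the end of its interval, and it ignores the time the API call itself takes.

```python
def count_polls(duration, start=1.0, factor=1.1, maximum=20.0):
    """Count the polls issued before `duration` seconds have elapsed."""
    elapsed = 0.0
    interval = start
    polls = 0
    while elapsed < duration:
        elapsed += interval  # wait for the current interval, then poll
        polls += 1
        # Grow the interval geometrically, but never past the maximum.
        interval = min(interval * factor, maximum)
    return polls


if __name__ == "__main__":
    ten_minutes = 10 * 60
    print("fixed 1 s interval:", count_polls(ten_minutes, factor=1.0))
    print("proposed defaults: ", count_polls(ten_minutes))
```

Setting `factor=1.0` reproduces the current behavior of one poll per second, while the proposed defaults give a count of roughly 50.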
New TaskRunner arguments
Arguments that can be passed to TaskRunner.__call__() include:
- Starting interval (default 1): The initial interval in seconds. The existing wait_time argument can be used for this.
- Interval increase factor (default 1.1): Each successive interval is the previous interval multiplied by this value. Set this to 1 to restore the old behavior.
- Interval maximum (default 20): The maximum interval value in seconds.
The graphs above used the default options.
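As an illustration of how the three arguments interact, here is a minimal standalone polling loop. It is not the TaskRunner implementation: `poll_until_complete`, its parameter names, and the injectable `sleep` callable are assumptions made for this sketch.

```python
import time


def poll_until_complete(check_complete, start_interval=1.0,
                        increase_factor=1.1, interval_max=20.0,
                        timeout=None, sleep=time.sleep):
    """Call check_complete() until it returns True, backing off between polls.

    Returns the number of polls made. Raises TimeoutError if the next
    wait would push the total waiting time past `timeout` seconds.
    """
    interval = start_interval
    waited = 0.0
    polls = 0
    while True:
        polls += 1
        if check_complete():
            return polls
        if timeout is not None and waited + interval > timeout:
            raise TimeoutError("resource did not complete in time")
        sleep(interval)
        waited += interval
        # Geometric growth, capped at the interval maximum.
        interval = min(interval * increase_factor, interval_max)
```

Passing `sleep` as a parameter is a design choice for this sketch: it lets the loop be exercised in tests without real delays.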
Other possible defaults
Interval increase factor 1.05
Interval increase factor 1.2
An interval increase factor >=1.2 is probably too aggressive.
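One way to compare candidate factors is to ask how soon the interval reaches the maximum. The sketch below uses the closed form of the geometric progression. The helper names are hypothetical, and reading "too aggressive" as "reaches the maximum too quickly" is an assumption, not something the proposal states.

```python
import math


def polls_until_max(factor, start=1.0, maximum=20.0):
    """Number of intervals shorter than `maximum` before the cap applies."""
    return math.ceil(math.log(maximum / start) / math.log(factor))


def seconds_until_max(factor, start=1.0, maximum=20.0):
    """Total time spent in those uncapped intervals (geometric series sum)."""
    n = polls_until_max(factor, start, maximum)
    return start * (factor ** n - 1) / (factor - 1)


if __name__ == "__main__":
    for factor in (1.05, 1.1, 1.2):
        print(factor, polls_until_max(factor),
              round(seconds_until_max(factor)))
```

With a starting interval of 1 and a maximum of 20, a factor of 1.2 reaches the 20-second maximum in under two minutes, whereas 1.05 takes more than six minutes to get there.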