Support-retry-with-idempotency

Background
Currently, Heat does not retry API calls when creating, updating, or deleting a stack. If an API request fails, Heat changes the stack status to "XXX_FAILED" (or starts the rollback process if the rollback flag is set).

However, I think there are scenarios where retrying an API call is appropriate (e.g. a 503 response, or a timeout caused by server failover).

I believe an API retry function can improve the reliability of Heat's task processing. Note: providing a retry capability for the Heat API itself is out of scope for this proposal.

Our definition of API retry
By API retry we mean "retry of a single API request": Heat resends one failed request to the service it was calling.



A retry that spans multiple API requests is outside the scope of this discussion.



The necessity of API retry
We think the impact on end users must be minimized. With API retry, an API failure caused by a temporary system problem need not affect the end user.




 * 1) Stack creation fails because of a temporary system problem (e.g. failover, temporary overload).
 * 2) Heat retries a few times (max_attempts can be defined in the config).
 * 3) If the retry limit is reached, Heat changes the stack status to CREATE_FAILED, which is the same state transition as today.

Heat can avoid returning an error to the end user if the temporary system problem is resolved before the retry limit is reached.
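The three steps above can be sketched as a small retry loop. This is only an illustration, not Heat code: `TransientError`, `call_with_retry`, and `flaky_create` are hypothetical names, and only `max_attempts` and `retry_interval` come from this proposal.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for a temporary failure (e.g. 503 or timeout)."""


def call_with_retry(request, max_attempts=3, retry_interval=0.0, sleep=time.sleep):
    """Call `request` until it succeeds or max_attempts is exhausted.

    Returns ("COMPLETE", result) on success or ("FAILED", last_error)
    once the retry limit is reached.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return "COMPLETE", request()
        except TransientError as exc:
            last_error = exc
            # Wait before the next attempt, but not after the last one.
            if attempt < max_attempts:
                sleep(retry_interval)
    return "FAILED", last_error


# Usage: a request that fails twice and then succeeds.
calls = {"n": 0}


def flaky_create():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("503 Service Unavailable")
    return "server-1"


status, result = call_with_retry(flaky_create, max_attempts=3, retry_interval=0)
print(status, result)
```

With `max_attempts=2` the same request would end in "FAILED", which corresponds to the existing CREATE_FAILED transition.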

The necessity of idempotency
API retry requires idempotency. Retrying an API call without idempotency may create duplicate resources, and currently there is no way to cope with this situation.



Retrying with idempotency solves this. If Heat adds a ClientToken (idempotency token) to the request header, Nova would not create a duplicate instance.



Combining API retry with idempotency is therefore the appropriate approach to retry processing.
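To make the receiver-side behaviour concrete, here is a minimal sketch of a service that honours a client token. `FakeComputeService` and its method are hypothetical and do not represent the Nova API; the point is only that a repeated token returns the existing resource instead of creating a second one.

```python
import itertools


class FakeComputeService:
    """Hypothetical stand-in for a service that honours a client token."""

    def __init__(self):
        self._ids = itertools.count(1)
        self._by_token = {}   # client token -> resource id
        self.instances = []   # every instance actually created

    def create_instance(self, name, client_token=None):
        # A repeated token returns the resource made by the first request.
        if client_token is not None and client_token in self._by_token:
            return self._by_token[client_token]
        instance_id = "instance-%d" % next(self._ids)
        self.instances.append(instance_id)
        if client_token is not None:
            self._by_token[client_token] = instance_id
        return instance_id


service = FakeComputeService()
first = service.create_instance("web", client_token="token-abc")
retry = service.create_instance("web", client_token="token-abc")
print(first == retry, len(service.instances))
```

Calling `create_instance` twice without a token would create two instances, which is the duplicate-resource problem described above.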

Retry Policy
We see two kinds of API retry policy in Heat:


 * Retry policy for HTTP methods
 * Retry policy for HTTP responses

HTTP methods
A retry policy should be defined for each HTTP method.

for POST methods
We propose that Heat support a ClientToken when retrying a POST request.
 * 1) Heat sends a POST request to Nova (or another service) but does not get a response for some reason.
 * 2) In fact, Nova (or the other service) received the request and created a resource.
 * 3) However, Heat does not know the resource id, so it cannot check the status.
 * 4) Heat retries the POST request with the same ClientToken until it receives a response or reaches the retry limit. If it reaches the retry limit, it sets the stack status to CREATE_FAILED or starts the rollback process.
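The POST scenario above, seen from the sender's side, might look like the following sketch. `ResponseLost`, `LossyService`, and `post_with_token` are hypothetical names used for illustration. The token is generated once and reused on every attempt, so the resource created by the first (unanswered) request is the one returned by the retry.

```python
import uuid


class ResponseLost(Exception):
    """Hypothetical error: the request may have been processed, but no response arrived."""


class LossyService:
    """Hypothetical service that creates the resource but drops its first response."""

    def __init__(self):
        self.created = {}       # client token -> resource id
        self._drop_next = True

    def post(self, client_token):
        # Create the resource only if this token has not been seen before.
        new_id = "res-%d" % (len(self.created) + 1)
        resource_id = self.created.setdefault(client_token, new_id)
        if self._drop_next:
            self._drop_next = False
            raise ResponseLost()
        return resource_id


def post_with_token(service, max_attempts=3):
    token = uuid.uuid4().hex    # generated once, reused on every retry
    for _ in range(max_attempts):
        try:
            return service.post(client_token=token)
        except ResponseLost:
            continue
    return None  # the caller would set CREATE_FAILED or start rollback


svc = LossyService()
resource_id = post_with_token(svc)
print(resource_id, len(svc.created))
```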

for PUT methods
We believe PUT methods are naturally idempotent, so Heat can retry them safely. A ClientToken is not needed for PUT retries.
 * 1) Heat sends a PUT request to Nova (or another service) but does not get a response for some reason.
 * 2) In fact, Nova (or the other service) received the request and updated the resource.
 * 3) Heat does not know the result, but it already knows the resource id (from when it created the resource).
 * 4) Retrying the PUT request results in the same state as in 2). Heat therefore retries the request until it receives a response or reaches the retry limit. If it reaches the retry limit, it sets the stack to the corresponding failed status (e.g. UPDATE_FAILED) or starts the rollback process.

for DELETE methods
HTTP defines DELETE as idempotent with respect to server state, but the response to a repeated DELETE is not the same (typically 404 instead of 2xx). Heat can retry a DELETE anyway and use the response to learn what happened to the previous request. A ClientToken is not needed for DELETE retries.
 * 1) Heat sends a DELETE request to Nova (or another service) but does not get a response for some reason.
 * 2) In fact, Nova (or the other service) received the request and deleted the resource.
 * 3) Heat does not know the result, but it already knows the resource id (from when it created the resource).
 * 4) Retrying the DELETE request gets one of the following responses, both of which mean the same final state (deleted).
   * response 2xx (usually 204) -> the delete action succeeded (deleted)
   * response 404 -> the delete action failed because the resource was already deleted (deleted)
 * Heat can retry DELETE requests until it gets a 2xx or 404 response or reaches the retry limit. If it reaches the retry limit, it sets the stack status to DELETE_FAILED.
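The DELETE rule above can be expressed as a short sketch. `delete_with_retry` and `send_delete` are hypothetical; `send_delete` is assumed to return the HTTP status code, or `None` when no response arrived.

```python
def delete_with_retry(send_delete, max_attempts=3):
    """Retry a DELETE until a 2xx or 404 response, or until the retry limit.

    `send_delete()` returns an HTTP status code, or None if no response
    was received.
    """
    for _ in range(max_attempts):
        status = send_delete()
        # Both 2xx and 404 mean the resource no longer exists.
        if status is not None and (200 <= status < 300 or status == 404):
            return "DELETED"
    return "DELETE_FAILED"


# Usage: the first response is lost, the retry finds the resource gone.
responses = iter([None, 404])
outcome = delete_with_retry(lambda: next(responses))
print(outcome)
```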

for GET methods
Same as for PUT methods: GET is idempotent, so Heat can retry it safely without a ClientToken.

HTTP responses
A retry policy should also be defined for each class of HTTP response.

got HTTP response 2xx
No Problem. API retry is not necessary.

got HTTP response 4xx (ClientError)
API retry is not appropriate in this case.
 * Heat knows that the resource was not created.
 * The error is not transient.
 * The request will never succeed in this case.

got HTTP response 5xx (ServerError)
API retry may solve the problem.
 * Heat knows that the resource was not created.
 * The error may be transient in this case.

couldn't get HTTP response
Two different circumstances exist in this case.
 * The HTTP request was lost
   * The resource was not created.
 * The HTTP request was accepted but the HTTP response was lost
   * The resource may or may not exist.
 * The error may be transient in this case. It can be caused by a network switch or server failover, or by temporary overload.
 * API retry may work in these situations.

Heat does not know whether the resource exists. Therefore, idempotency on the API receiver's side (e.g. Nova or other services) is necessary for safe API retry in this case.
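The response-based policy in this section can be summarized as a single decision function. This is a sketch under assumptions: `retry_decision` and its return values are hypothetical, and `idempotent` stands for "repeating the request is safe" (PUT, GET, DELETE, or POST with a ClientToken). Status classes not discussed above (1xx, 3xx) are treated as failures here only for the sake of a complete example.

```python
def retry_decision(status, idempotent):
    """Map an HTTP outcome to "done", "retry", "fail", or "unsafe".

    status: HTTP status code, or None when no response was received.
    idempotent: True if repeating the request cannot duplicate a resource.
    """
    if status is None:
        # The resource may or may not exist, so retry only if that is safe.
        return "retry" if idempotent else "unsafe"
    if 200 <= status < 300:
        return "done"    # success, no retry needed
    if 400 <= status < 500:
        return "fail"    # client error, not transient
    if 500 <= status < 600:
        return "retry"   # server error, may be transient
    return "fail"        # not covered by the policy above


for outcome in [(204, True), (404, True), (503, False), (None, True), (None, False)]:
    print(outcome, retry_decision(*outcome))
```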

Parameters
Heat already has a "Timeout" parameter, so we do not need to add a new parameter for that. We want to add the following parameters:
 * max_attempts (number of attempts)
 * retry_interval (seconds)

[Note] The above might be wrong. The "Timeout" parameter in Heat is a timeout for stack creation, not a timeout for an individual API call. What we need is a timeout for waiting on an API response, which does not exist right now. Heat needs it in order to handle retries.

Configuration
"max_attempts" and "retry_interval" should be system-wide parameters whose values can be chosen based on the system architecture and environment (e.g. the estimated duration of server failovers). On the other hand, the time required to create a resource varies with its type and size, so "max_attempts" and "retry_interval" should also be configurable per resource. We propose that the parameters be configurable as follows:
 * Global max_attempts and retry_interval parameters in heat.conf (mandatory)
 * max_attempts and retry_interval can be set per resource in heat.conf (optional)
 * If the per-resource parameters are defined, Heat uses them in place of the global ones
 * max_attempts and retry_interval cannot be specified in templates or API request parameters.
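The lookup order proposed above (per-resource value first, global value otherwise) could be sketched as follows. The section layout is purely an assumption for illustration: the `[resource:...]` section name is hypothetical, and the example uses Python's standard `configparser` rather than Heat's actual configuration machinery.

```python
import configparser

# Hypothetical layout: global values in [DEFAULT], optional overrides
# in one section per resource type.
SAMPLE_CONF = """
[DEFAULT]
max_attempts = 3
retry_interval = 10

[resource:OS::Nova::Server]
max_attempts = 5
"""


def retry_settings(parser, resource_type):
    """Return (max_attempts, retry_interval) for a resource type."""
    section = "resource:%s" % resource_type
    # configparser falls back to [DEFAULT] for keys missing from a section.
    if parser.has_section(section):
        source = parser[section]
    else:
        source = parser.defaults()
    return int(source["max_attempts"]), int(source["retry_interval"])


parser = configparser.ConfigParser()
parser.read_string(SAMPLE_CONF)
print(retry_settings(parser, "OS::Nova::Server"))
print(retry_settings(parser, "OS::Cinder::Volume"))
```

In this sketch the server type overrides only `max_attempts` and inherits the global `retry_interval`, while a type without its own section uses both global values.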

Plan to implement
We plan to start implementation after idempotency has been implemented. The need for idempotency is under discussion in the Nova project.

Other information

 * Blueprint of this proposal: https://blueprints.launchpad.net/heat/+spec/support-retry-with-idempotency
 * Closely related blueprint: https://blueprints.launchpad.net/nova/+spec/idempotentcy-client-token
 * IceHouse summit discussion: https://etherpad.openstack.org/p/icehouse-summit-heat-convergence
 * (Deprecated) Implementation detail etherpad: https://etherpad.openstack.org/p/kgpc00uuQr