slug | id | title | date | comments | tags | description | references | |
---|---|---|---|---|---|---|---|---|
43-how-to-design-robust-and-predictable-apis-with-idempotency |
43-how-to-design-robust-and-predictable-apis-with-idempotency |
How to design robust and predictable APIs with idempotency? |
2018-09-12 12:55 |
true |
|
APIs can be unreliable and unpredictable. To solve the problem, observe three principles: the client retries to ensure consistency, retries are made idempotent, and retries use exponential backoff with random jitter. |
Why might APIs be unreliable and unpredictable?
- Networks are unreliable.
- Servers are more reliable but may still fail.
How to solve the problem? Three principles:

- The client retries to ensure consistency.
- Retry with idempotency, using idempotency keys that let clients pass a unique value with each request.
  - In RESTful APIs, the PUT and DELETE verbs are idempotent.
  - However, POST may cause a ==“double-charge” problem in payments==, so we use an ==idempotency key== to identify the request.
  - If the failure happens before the request reaches the server, the client retries, and the server sees the request for the first time and processes it normally.
  - If the failure happens inside the server, an ACID database ensures that the transaction identified by the idempotency key either commits completely or is rolled back, so a retry with the same key is safe.
  - If the failure happens after the server’s reply, the client retries, and the server simply replies with a cached result of the successful operation.
- Retry with ==exponential backoff and random jitter==. Be mindful of the ==thundering herd problem==: servers may be stuck in a degraded state, and a burst of simultaneous retries may hurt the system further.

For example, Stripe’s client calculates the retry delay like this:

```ruby
def self.sleep_time(retry_count)
  # Apply exponential backoff with initial_network_retry_delay on the
  # number of attempts so far as inputs. Do not allow the number to exceed
  # max_network_retry_delay.
  sleep_seconds = [Stripe.initial_network_retry_delay * (2 ** (retry_count - 1)), Stripe.max_network_retry_delay].min

  # Apply some jitter by randomizing the value in the range of (sleep_seconds
  # / 2) to (sleep_seconds).
  sleep_seconds = sleep_seconds * (0.5 * (1 + rand()))

  # But never sleep less than the base sleep seconds.
  sleep_seconds = [Stripe.initial_network_retry_delay, sleep_seconds].max

  sleep_seconds
end
```
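
To make the idempotency-key behavior concrete, here is a minimal sketch in Ruby. It is not Stripe's implementation: `PaymentServer`, `charge`, and `charges_made` are hypothetical names, and an in-memory hash guarded by a mutex stands in for the ACID database. It only illustrates the case where the reply is lost and the client retries with the same key.

```ruby
require "securerandom"

# Hypothetical in-memory stand-in for a payment server (illustration only).
class PaymentServer
  attr_reader :charges_made

  def initialize
    @results = {}        # idempotency key => cached response
    @charges_made = 0    # number of charges actually performed
    @lock = Mutex.new    # stands in for the database's transactional guarantees
  end

  # Performs the charge the first time a key is seen. Any later call with
  # the same key returns the cached response without charging again.
  def charge(idempotency_key, amount_cents)
    @lock.synchronize do
      @results[idempotency_key] ||= begin
        @charges_made += 1
        { charge_id: "ch_#{@charges_made}", amount_cents: amount_cents }
      end
    end
  end
end

server = PaymentServer.new
key = SecureRandom.uuid               # the client generates one key per logical request

first   = server.charge(key, 500)
retried = server.charge(key, 500)     # simulated retry after a lost reply

puts first == retried                 # the retry receives the cached result
puts server.charges_made              # only one charge was performed
```

The client must reuse the same key for every retry of one logical request and generate a fresh key for each new request; otherwise the server cannot tell a retry from a second, intentional charge.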