Backport timeout request and balancing during callro #486

Add a new option, `request_timeout`, for call, callro, callre, callbro and callbre. For callrw (and call with 'write' mode) this option is a no-op. The option sets the maximum number of seconds a single request within a call may take. It defaults to the value of the `timeout` option and must always be <= `timeout`. The router retries read-only requests when a recoverable error happens (e.g. WRONG_BUCKET or a timeout). By default a timeout error makes vshard's call exit with an error, because request_timeout = timeout. However, if request_timeout < timeout, several requests are made within one call. These requests are balanced between replicas, so a replica that did not respond is not tried again within the same call. Balancing is introduced in the following commit. This may be useful for mission-critical systems, where the number of failed requests must be minimized.

Part of tarantool#484
NO_DOC=<ticket was already created>
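The interaction between `timeout` and `request_timeout` described above can be illustrated with a small simulation. This is only a sketch in Python, not vshard code: `FakeClock`, `call_ro` and the `(name, response_time)` replica tuples are hypothetical names invented for the example, and stopping once every replica has been tried is an assumption of the sketch.

```python
class FakeClock:
    """Deterministic clock so the sketch runs without real waiting."""

    def __init__(self):
        self.now = 0.0

    def advance(self, seconds):
        self.now += seconds


def call_ro(replicas, clock, timeout, request_timeout=None):
    """Try each replica once, in order, until one answers or time runs out.

    `replicas` is a list of (name, response_time) pairs; a response_time of
    None models a replica that never answers. Returns (winner, attempts).
    """
    if request_timeout is None:
        request_timeout = timeout  # default described in the commit message
    if request_timeout > timeout:
        raise ValueError("request_timeout must be <= timeout")

    deadline = clock.now + timeout
    attempts = []
    for name, response_time in replicas:
        if clock.now >= deadline:
            break  # the whole call has run out of time
        # A single request may use at most request_timeout seconds,
        # and never more than what is left of the call's timeout.
        budget = min(request_timeout, deadline - clock.now)
        attempts.append(name)
        if response_time is not None and response_time <= budget:
            clock.advance(response_time)
            return name, attempts
        clock.advance(budget)  # this attempt timed out, move on
    return None, attempts


replicas = [("r1", None), ("r2", 0.1)]  # r1 hangs, r2 answers in 0.1 s

# Default: request_timeout == timeout, so the hanging replica eats the call.
print(call_ro(replicas, FakeClock(), timeout=1.0))

# request_timeout < timeout leaves time to retry on another replica.
print(call_ro(replicas, FakeClock(), timeout=1.0, request_timeout=0.3))
```

In the first call the single request consumes the whole `timeout`, so the call fails; in the second, the hanging replica is abandoned after 0.3 seconds and the next replica answers.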

This commit introduces balancing for requests where balance mode is not set. The motivation is that we should do our best to make the call succeed, which is important for mission-critical systems. When balance mode is set, replicas are balanced between requests: subsequent calls do not go to the same replica. Balancing follows the round-robin strategy, and weights do not affect it. When balance is false, callro and callre are now balanced as well, but only within the scope of a single call: subsequent calls still go to the same, most prioritized replica first. A retry happens when a network error occurs (e.g. a timeout). During such balancing callro does not distinguish between master and replica: if the master has the highest priority according to the config, callro goes to the master first and only then to a replica. callre, on the other hand, first tries all replicas and falls back to the master only if all of those requests fail.

Closes tarantool#484
NO_DOC=<ticket was already created>
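The difference between callro and callre within a single call can be sketched as an ordering function. This is an illustrative Python model, not the router's implementation: `Node`, `attempt_order` and the numeric `priority` field (lower means more preferred) are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    priority: int  # lower value = more preferred (assumption of this sketch)
    is_master: bool = False


def attempt_order(nodes, mode):
    """Return the order in which nodes are tried within one call."""
    by_priority = sorted(nodes, key=lambda n: n.priority)
    if mode == "callro":
        # callro does not distinguish master from replica: priority only.
        return [n.name for n in by_priority]
    if mode == "callre":
        # callre tries every replica first and falls back to the master.
        replicas = [n.name for n in by_priority if not n.is_master]
        masters = [n.name for n in by_priority if n.is_master]
        return replicas + masters
    raise ValueError(f"unknown mode: {mode}")


nodes = [
    Node("master", priority=0, is_master=True),
    Node("replica_a", priority=1),
    Node("replica_b", priority=2),
]
print(attempt_order(nodes, "callro"))
print(attempt_order(nodes, "callre"))
```

With the master configured as the most preferred node, callro starts with the master, while callre reaches it only after both replicas have been tried.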

Previously the prioritized replica was changed only if it had been disconnected for FAILOVER_DOWN_TIMEOUT seconds. However, a connection shown as 'connected' does not necessarily work: it must be pingable in order to be operational. This commit makes failover temporarily lower a replica's priority if FAILOVER_DOWN_SEQUENTIAL_FAIL sequential requests to it fail. All vshard internal requests (including the failover ping) and all user calls affect the number of sequentially failed requests. Note that a request is considered failed when the net.box connection is not operational (conn.call cannot be made, e.g. the connection is not yet established or the timeout is reached); user functions throwing errors do not affect the prioritized replica.

The behavior of failover after this commit is the following:

1. Failover pings all prioritized replicas. If a ping does not succeed, the connection is recreated. This is needed when a user function returns a very large value: no other request can be served until that value has been transferred. A failed ping affects the number of sequentially failed requests.
2. If the connection is down for >= FAILOVER_DOWN_TIMEOUT or the number of sequentially failed requests is >= FAILOVER_DOWN_SEQUENTIAL_FAIL, then we take a replica with lower priority as the main one.
3. If failover has not tried to use the more prioritized replica (according to weights) for more than FAILOVER_UP_TIMEOUT, then we try to set a new replica as the prioritized one. Note that we do not set it if the ping to it did not succeed during the ping round in (1).

Closes tarantool#483
NO_DOC=bugfix
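The step-down rule from item 2 can be modelled roughly as follows. This is a Python sketch under stated assumptions, not vshard's failover code: `ReplicaHealth`, `should_step_down` and the outcome labels are hypothetical, the threshold values passed in are placeholders rather than the real constants, and leaving the counter unchanged on a user-function error is one possible reading of "do not affect".

```python
class ReplicaHealth:
    """Tracks connection-level health of one replica."""

    def __init__(self):
        self.sequential_fails = 0
        self.down_since = None  # time the connection went down, if it did

    def record(self, outcome):
        if outcome == "ok":
            self.sequential_fails = 0  # any success breaks the sequence
        elif outcome == "net_error":
            # Connection not operational: not established, or timed out.
            self.sequential_fails += 1
        elif outcome == "user_error":
            # The user function raised, but the connection itself worked.
            # Assumption of this sketch: the counter is left unchanged.
            pass
        else:
            raise ValueError(f"unknown outcome: {outcome}")


def should_step_down(health, now, down_timeout, sequential_fail_limit):
    """True if a lower-priority replica should become the main one."""
    down_too_long = (
        health.down_since is not None
        and now - health.down_since >= down_timeout
    )
    too_many_fails = health.sequential_fails >= sequential_fail_limit
    return down_too_long or too_many_fails


health = ReplicaHealth()
for outcome in ("net_error", "net_error", "user_error", "net_error"):
    health.record(outcome)
# Placeholder thresholds, chosen only for the example.
print(should_step_down(health, now=0, down_timeout=5, sequential_fail_limit=3))
```

Either condition alone is enough to trigger the switch, which is what lets failover react to a connection that stays 'connected' but no longer answers.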

Commits on Aug 16, 2024