Backport timeout request and balancing during callro #486

Merged: 3 commits, Aug 16, 2024

Serpentian (Contributor)

No description provided.

New option for call, callro, callre, callbro, callbre: `request_timeout`.
Note that for callrw (and call with 'write' mode) this option is a no-op.

The option sets the maximum number of seconds a single request within a
call may take. By default it is equal to the `timeout` option. It must
always be <= timeout.
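
The defaulting and validation rule above can be modelled in a few lines.
This is a Python sketch of the rule as described, not vshard code; the
function name `resolve_request_timeout` and the choice of exception are
hypothetical.

```python
def resolve_request_timeout(timeout, request_timeout=None):
    """Return the effective per-request timeout for one call.

    Hypothetical helper: models the rule from the commit message only.
    """
    if request_timeout is None:
        # Not set: a single request may use the whole call timeout.
        return timeout
    if request_timeout > timeout:
        # The commit message requires request_timeout <= timeout.
        raise ValueError("request_timeout must be <= timeout")
    return request_timeout
```

With `timeout=10` and no `request_timeout`, the helper returns 10, so the
call behaves as before the option was introduced.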

The router retries read-only requests when a recoverable error happens
(e.g. WRONG_BUCKET or a timeout). By default, a timeout error makes
vshard's call exit with an error, because request_timeout = timeout.
However, if request_timeout < timeout, several requests are made within
one call. These requests are balanced between replicas: we won't go back
to a replica that is not responding within the same call. Balancing is
introduced in the following commit.
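
To illustrate why request_timeout < timeout allows several attempts, here
is a simplified Python model of one call. It is a sketch under stated
assumptions, not the router's implementation: `call_ro`, `send` and
`fake_send` are hypothetical names, time is simulated through the
`elapsed` value that `send` returns, and the sketch stops once every
replica has been tried once.

```python
def call_ro(replicas, send, timeout, request_timeout=None):
    """Model one read-only call as a series of per-request attempts.

    `send(replica, budget)` is a hypothetical transport hook that
    returns (ok, value, elapsed_seconds).
    """
    if request_timeout is None:
        request_timeout = timeout
    if request_timeout > timeout:
        raise ValueError("request_timeout must be <= timeout")
    remaining = timeout
    tried = []
    for replica in replicas:
        if remaining <= 0:
            break
        # A single request never gets more than request_timeout,
        # and never more than what is left of the whole call.
        budget = min(request_timeout, remaining)
        ok, value, elapsed = send(replica, budget)
        tried.append(replica)
        remaining -= elapsed
        if ok:
            return value, tried
    raise TimeoutError("call failed after trying: " + ", ".join(tried))


def fake_send(replica, budget):
    # Replica "a" never answers and uses up its whole budget;
    # every other replica answers quickly.
    if replica == "a":
        return False, None, budget
    return True, "pong from " + replica, 0.1
```

In this model, `call_ro(["a", "b"], fake_send, timeout=3, request_timeout=1)`
spends one second on "a" and then succeeds on "b", while the same call
without `request_timeout` spends the whole timeout on "a" and fails.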

This may be useful for mission-critical systems, where the number of
failed requests must be minimized.

Part of tarantool#484

NO_DOC=<ticket was already created>

This commit introduces balancing for requests when balance mode is not
set. The motivation is that we should do our best to make the call
succeed, which is important for mission-critical systems.

When balance mode is set, requests are balanced between replicas:
consecutive calls won't go to the same replica. Balancing follows the
round-robin strategy; weights don't affect it.
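
As a reminder of what round-robin means here, the following Python sketch
hands out replicas in a fixed rotation and ignores weights. The class
name `RoundRobin` is hypothetical and the code only models the strategy
named above; it is not taken from vshard.

```python
import itertools


class RoundRobin:
    """Hand out replicas in a fixed rotation, ignoring any weights."""

    def __init__(self, replicas):
        self._cycle = itertools.cycle(list(replicas))

    def next_replica(self):
        # Each call moves one step forward in the rotation.
        return next(self._cycle)
```

With three replicas, four consecutive picks visit each replica once and
then wrap around to the first one.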

However, when balance is false, callro and callre are now balanced too.
This balancing happens only within the scope of a single call:
consecutive calls still go to the same, highest-priority replica first.
A retry happens when a network error occurs (e.g. a timeout).

During such balancing, callro doesn't distinguish between master and
replica. If the master has the highest priority according to the config,
callro goes to the master first and only then to a replica. callre, on
the other hand, first tries all replicas and falls back to the master
only if all those requests fail.
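
The difference between the two orders can be sketched in Python as
follows. This is a model of the description above, not vshard code: the
`Node` type, both function names, and the convention that a lower
`priority` number means a more preferred node are assumptions made for
the example.

```python
from typing import NamedTuple


class Node(NamedTuple):
    name: str
    priority: int      # assumption: lower number = more preferred
    is_master: bool


def callro_order(nodes):
    # callro: purely by priority, master and replicas treated alike.
    return sorted(nodes, key=lambda n: n.priority)


def callre_order(nodes):
    # callre: every replica first (by priority), master only as fallback.
    replicas = sorted((n for n in nodes if not n.is_master),
                      key=lambda n: n.priority)
    masters = [n for n in nodes if n.is_master]
    return replicas + masters
```

If the master has the best priority, `callro_order` puts it first while
`callre_order` still puts it last.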

Closes tarantool#484

NO_DOC=<ticket was already created>

Previously, the prioritized replica was changed only if it had been
disconnected for FAILOVER_DOWN_TIMEOUT seconds. However, a connection
shown as 'connected' does not necessarily work. The connection must be
pingable in order to be operational.

This commit makes failover temporarily lower a replica's priority if
FAILOVER_DOWN_SEQUENTIAL_FAIL requests to it fail in a row. All vshard
internal requests (including the failover ping) and all user calls
affect the number of sequentially failed requests. Note that a request
is considered failed when the net.box connection is not operational
(conn.call cannot be made, e.g. the connection is not yet established or
the timeout is reached); user functions throwing errors don't affect the
prioritized replica.
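
The counting rule can be modelled with a small Python class. It is a
sketch, not vshard's implementation: the class and outcome names are
hypothetical, the threshold is passed in rather than taken from
FAILOVER_DOWN_SEQUENTIAL_FAIL, and treating a user-function error as
proof that the connection works (so that it resets the count) is an
assumption.

```python
class FailureTracker:
    """Count network-level failures in a row for one replica."""

    def __init__(self, max_sequential_fails):
        self.max_sequential_fails = max_sequential_fails
        self.sequential_fails = 0

    def record(self, outcome):
        """outcome is 'ok', 'user_error' or 'network_error'."""
        if outcome == "network_error":
            # The connection could not carry the request at all.
            self.sequential_fails += 1
        else:
            # Assumption: a completed round trip, even one that ends in
            # a user-function error, breaks the failure streak.
            self.sequential_fails = 0

    def should_lower_priority(self):
        return self.sequential_fails >= self.max_sequential_fails
```

Only an unbroken run of network-level failures reaches the threshold in
this model; user-function errors never do.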

After this commit, failover behaves as follows:

1. Failover pings all prioritized replicas. If a ping doesn't succeed,
   the connection is recreated. This is needed when a user function
   returns too big a value: no other request can be served until that
   value is returned. A failed ping counts toward the number of
   sequentially failed requests.

2. If the connection is down for >= FAILOVER_DOWN_TIMEOUT or the number
   of sequentially failed requests is >= FAILOVER_DOWN_SEQUENTIAL_FAIL,
   then we take a replica with lower priority as the main one.

3. If failover hasn't tried to use a more prioritized replica (according
   to weights) for more than FAILOVER_UP_TIMEOUT, then we try to set a
   new replica as the prioritized one. Note that we don't set it if the
   ping to it didn't succeed during the ping round in (1).
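
Steps 2 and 3 reduce to two predicates, sketched here in Python. The
function and parameter names are hypothetical, and the thresholds are
passed in as arguments because the commit message does not state the
values of the FAILOVER_* constants.

```python
def should_step_down(down_for, sequential_fails,
                     down_timeout, max_sequential_fails):
    """Step 2: switch to a replica with lower priority?"""
    return (down_for >= down_timeout
            or sequential_fails >= max_sequential_fails)


def may_step_up(since_last_attempt, ping_ok, up_timeout):
    """Step 3: try to promote a more prioritized replica?"""
    # A replica whose ping failed in step 1 is never promoted.
    return since_last_attempt > up_timeout and ping_ok
```

Either condition in step 2 is enough to step down, while step 3 needs
both the elapsed time and a successful ping.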

Closes tarantool#483

NO_DOC=bugfix

@Serpentian merged commit 9fc976d into tarantool:master on Aug 16, 2024
8 checks passed