Backport timeout request and balancing during callro #486

Merged: 3 commits, Aug 16, 2024

Commits on Aug 16, 2024

  1. router: introduce request_timeout for read-only calls

    New option for call, callro, callre, callbro, callbre: `request_timeout`.
    Note that for callrw (and call with 'write' mode) this option is a no-op.

    The option specifies the maximum number of seconds a single request
    within a call may take. By default it equals the `timeout` option. It
    must always be <= `timeout`.

    The router retries read-only requests when a recoverable error happens
    (e.g. WRONG_BUCKET or timeout). By default a timeout error makes
    vshard's call exit with an error, because request_timeout = timeout.
    However, if request_timeout < timeout, several requests are made within
    one call. These requests are balanced between replicas: within one call
    we won't go back to a replica that did not respond. Balancing is
    introduced in the following commit.

    This may be useful for mission-critical systems, where the number of
    failed requests must be minimized.
    
    Part of tarantool#484
    
    NO_DOC=<ticket was already created>
    Serpentian committed Aug 16, 2024
    5cee8a1
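
To illustrate the `request_timeout` semantics described in commit 5cee8a1, here is a minimal Python sketch of a retry loop with an overall deadline and a per-request budget. It is not vshard's implementation (vshard is written in Lua). The names `call_with_retries`, `send`, `RequestTimeout` and the `clock` parameter are hypothetical, and skipping replicas that already timed out is an assumption based on the commit text.

```python
import time


class RequestTimeout(Exception):
    """Hypothetical error: one request exceeded its time budget."""


def call_with_retries(replicas, send, timeout, request_timeout=None,
                      clock=time.monotonic):
    """Try replicas in order until one answers or `timeout` expires.

    `send(replica, budget)` is a caller-supplied function (hypothetical)
    that either returns a result or raises RequestTimeout.
    `clock` is injectable only to make the sketch easy to test.
    """
    if request_timeout is None:
        # Default from the commit text: request_timeout equals timeout,
        # so the first timed-out request consumes the whole call budget.
        request_timeout = timeout
    if request_timeout > timeout:
        raise ValueError("request_timeout must be <= timeout")

    deadline = clock() + timeout
    failed = set()       # replicas that already timed out in this call
    last_error = None
    while True:
        remaining = deadline - clock()
        if remaining <= 0:
            break
        candidates = [r for r in replicas if r not in failed]
        if not candidates:
            break
        replica = candidates[0]
        try:
            # A single request never gets more than request_timeout,
            # and never more than what is left of the whole call.
            return send(replica, min(request_timeout, remaining))
        except RequestTimeout as err:
            failed.add(replica)
            last_error = err
    if last_error is None:
        last_error = RequestTimeout("call timed out")
    raise last_error
```

With `timeout=10` and `request_timeout=2`, a replica that hangs costs the call 2 seconds and leaves 8 seconds for other replicas. With the default, the same hang uses the whole budget and the call fails.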
  2. replicaset: introduce stateless balancing for read-only requests

    This commit introduces balancing for requests where balance mode is not
    set. The motivation is that we should do our best to make a call
    succeed, which is important for mission-critical systems.

    When balance mode is set, replicas are balanced between requests:
    consecutive calls won't go to the same replica. Balancing follows the
    round-robin strategy; weights don't affect it.

    However, when balance is false, callro and callre are now balanced too.
    This balancing happens only within the scope of a single call:
    consecutive calls still go to the same, highest-priority replica first.
    A retry happens when a network error occurs (e.g. timeout).

    During such balancing the callro method doesn't distinguish between
    master and replica. If the master has the highest priority according to
    the config, callro goes to the master first and only after that to a
    replica. The callre method, on the other hand, first tries all
    replicas and falls back to the master only if all of those requests
    fail.
    
    Closes tarantool#484
    
    NO_DOC=<ticket was already created>
    Serpentian committed Aug 16, 2024
    b39bd6f
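
The difference between callro and callre within a single call, as described in commit b39bd6f, can be sketched as an ordering of attempts. This is an illustrative Python model, not vshard code. The `Replica` type, the `attempt_order` function, and the convention that a smaller `priority` number means a more preferred replica are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    name: str
    priority: int            # assumption: smaller number = more preferred
    is_master: bool = False


def attempt_order(replicas, prefer_replica):
    """Order in which one call would try the instances.

    prefer_replica=False models callro: pure priority order, the master
    is treated like any other instance.
    prefer_replica=True models callre: all non-master replicas first
    (by priority), the master only as a fallback.
    """
    by_priority = sorted(replicas, key=lambda r: r.priority)
    if not prefer_replica:
        return by_priority
    non_masters = [r for r in by_priority if not r.is_master]
    masters = [r for r in by_priority if r.is_master]
    return non_masters + masters
```

For a replicaset where the master has the highest priority, the callro-style order starts at the master, while the callre-style order reaches it last.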
  3. router: calls affect temporary prioritized replica

    Previously the prioritized replica was changed only if it was
    disconnected for FAILOVER_DOWN_TIMEOUT seconds. However, a connection
    shown as 'connected' does not necessarily work. The connection must be
    pingable in order to be operational.

    This commit makes failover temporarily lower a replica's priority if
    FAILOVER_DOWN_SEQUENTIAL_FAIL requests to it fail in a row. All vshard
    internal requests (including failover ping) and all user calls affect
    the number of sequentially failed requests. Note that we consider a
    request failed when the net.box connection is not operational (cannot
    make conn.call, e.g. connection is not yet established or timeout is
    reached); user functions throwing errors won't affect the prioritized
    replica.

    After this commit, failover behaves as follows:

    1. Failover pings all prioritized replicas. If a ping doesn't succeed,
       the connection is recreated. This is needed if the user returns too
       big values from the functions: in such a case no other request can
       be done until this value is returned. A failed ping counts toward
       the number of sequentially failed requests.

    2. If the connection is down for >= FAILOVER_DOWN_TIMEOUT or if the
       number of sequentially failed requests is >=
       FAILOVER_DOWN_SEQUENTIAL_FAIL, then we take a replica with lower
       priority as the main one.

    3. If failover didn't try to use the more prioritized replica
       (according to weights) for more than FAILOVER_UP_TIMEOUT, then we
       try to set a new replica as the prioritized one. Note that we don't
       set it if the ping to it didn't succeed during the ping round in (1).
    
    Closes tarantool#483
    
    NO_DOC=bugfix
    Serpentian committed Aug 16, 2024
    b36e909
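
The demotion rule from step 2 of commit b36e909 can be sketched as a small health tracker. This Python model is only an illustration. The numeric values of the two constants are placeholders (the commit text does not state them), the `ReplicaHealth` class is hypothetical, and resetting the counter on a successful request is an assumption implied by the word "sequentially".

```python
from dataclasses import dataclass
from typing import Optional

# Placeholder values for illustration; the real constants are defined
# in vshard and are not given in the commit message.
FAILOVER_DOWN_TIMEOUT = 5.0
FAILOVER_DOWN_SEQUENTIAL_FAIL = 3


@dataclass
class ReplicaHealth:
    down_since: Optional[float] = None  # time the connection went down
    sequential_fails: int = 0

    def on_request(self, connection_ok):
        """Record one request (ping, internal request, or user call).

        Only connection-level failures count. A user function that
        throws an error still means the connection worked, so the
        caller would pass connection_ok=True in that case.
        """
        if connection_ok:
            self.sequential_fails = 0   # assumption: success resets the run
        else:
            self.sequential_fails += 1

    def should_demote(self, now):
        """True if a lower-priority replica should become the main one."""
        down_too_long = (
            self.down_since is not None
            and now - self.down_since >= FAILOVER_DOWN_TIMEOUT
        )
        too_many_fails = self.sequential_fails >= FAILOVER_DOWN_SEQUENTIAL_FAIL
        return down_too_long or too_many_fails
```

Either condition alone is enough: a connection that looks connected but keeps timing out is demoted by the counter, without waiting for the disconnect timeout.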