-
Notifications
You must be signed in to change notification settings - Fork 392
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Handoff and TCP RECV timeout #994
Comments
After an initial edit of the code, realised there already existed the helpful https://github.com/basho/riak_core/blob/riak_kv-3.0.12/src/riak_core_handoff_sender.erl#L141 This appears to be more useful than setting the timer. If we want to ensure that the receiver is up to date with the sender, then setting this to a much lower value may be easier. Using a handoff_acksync_threshold of 1 will cause a sync of every batch (as discussed in the issue), but the sync will hold up the fold producing the next batch. It would be better to wait until we're at the point of sending a new batch to test to see if the sync is working. That way folds can continue in parallel to results being processed by the receiver, but never get ahead. Whatever we do should support those that have already configured |
After further review, the meaning of To achieve this the This is will required in any change. |
Summary of Confusion/Problem:
Handoffs on busy riak_kv clusters often fail, with the reported issue being TCP RECV timeouts.
The code of handoff includes references to support very old versions:
https://github.com/basho/riak_core/blob/riak_kv-3.0.12/src/riak_core_handoff_sender.erl#L129-L133
Also there is reference to a re-write to resolve confusion happening soon:
https://github.com/basho/riak_core/blob/riak_kv-3.0.12/src/riak_core_handoff_sender.erl#L198-L202
Handoffs timing out are disruptive, as the handoff must start from the beginning (in terms of the fold over the vnode), and re-send all of the data at the next attempt.
When leaving/joining this can be controlled through handoff concurrency, but this isn't perfect, and still can lead to continuous failures.
Hinted handoffs can be even more problematic, particular when receiving vnodes are already subject to high loads. There is very little tuning/testing that can be done to predict what a safe transfer limit might be. There is also a vicious circle that can form - as if hinted handoffs fail, the receiving cluster may be subject to increasing amounts of anti-entropy and read-repair load that will cause hinted handoffs to continue to fail.
This issue is exacerbated that the timeout's reported as TCP RECV timeouts are not necessarily low-level network timeouts, but are more likely as a result of the receiver failing to respond to a OSI L7 SYNC message. There is also a confusion of timeout settings:
Calling the function
get_handoff_receive_timeout/0
returns the handoff_timeout, which defaults to the TCP_TIMEOUT:https://github.com/basho/riak_core/blob/riak_kv-3.0.12/src/riak_core_handoff_sender.erl#L514-L515
There is also a different
handoff_receive timeout
on the sender, based on a receiver setting:https://github.com/basho/riak_core/blob/riak_kv-3.0.12/src/riak_core_handoff_sender.erl#L293_L298
Where the default is to be a third of the receiver setting, if set, but by default it is not set, and does not default to a third of the default chosen when unset on the receiver side:
https://github.com/basho/riak_core/blob/riak_kv-3.0.12/src/riak_core_handoff_receiver.erl#L78
How are these timeouts to be tuned to prevent timeouts? A lower handoff_receive_timeout on the sender side, will lead to more frequent SYNCs - and hence make a backlog leading to a timeout less likely; but have different consequences within the receiver.
Proposal:
There are some broad changes proposed:
Remove all legacy configuration messages, and legacy options (e.g. non-batched sending).
Less haste even if that means less speed - is the cost of a round-trip worth paying every batch if it prevents time outs? i.e. don't send a batch until you have received a sync to confirm the last batch is processed. Completing a transfer first time will take less time than running faster and having to repeat a transfer. Speed is dependent on coordination.
Make timeout configurations easier to reason about;
Log process in transfers, so proactive running of
riak-admin transfers
is not required to see progress, or to look-back on progress post-event.Potentially compress batches before sending (i.e. using term_to_binary with the
compressed
option for compatibility).The aim must be to make this changes without adding another generation of forward/backward compatibility checking. Avoid making any change that requires version checking.
Related:
There was some debate, some time ago, about the wisdom of performing read-repairs to fallbacks. There was intention at some stage to add a configuration option to read repair to primaries only. This is related to hinted handoff performance - as the volume of changes which are handed off during a hinted handoff is adversely impacted by read-repairs. i.e. is a node recovers with data in tact, it will receive in handoff not just the data missed since it went down, but all the data read since it went down.
There is some similarity between this issue and basho/riak_repl#817. There may be factor here in changes to gen_tcp within OTP, especially with regards to timeout defaults. Using the
{active, once}
setting and allowing for large backlogs to form can lead to undexpected and hard-to-troubleshoot failures. It is better to avoid the consequences of backlogs buffering in TCP receive windows.The text was updated successfully, but these errors were encountered: