Soft retry limits - AAE fullsync #772

martinsumner · 2017-10-18T11:49:00Z

When performing AAE full-sync, if the vnode elected to co-ordinate with is doing a tree rebuild (and given multi-hour reload times this would seem to be a common event), the co-ordination attempt will soft exit.

The soft exit is captured and handled here:

https://github.com/basho/riak_repl/blob/2.1.8/src/riak_repl2_fscoordinator.erl#L537-L554

This checks against a retry limit, and presumably prompts a retry if the limit is not reached. The default soft retry limit is set to infinity.

In production we see very rapid retries, which escalate the system process count and ultimately impact the stability of the cluster.

Attempts are underway to control this behaviour by setting a max retry limit. However, the default setting appears to be unsafe and should probably be changed. Perhaps there should also be a wait between retries. There may also be an underlying issue causing the process count to stack up with the retries.
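To illustrate the two mitigations suggested above (a finite retry limit and a wait between retries), here is a minimal Python sketch. It is not the riak_repl implementation, which is Erlang; `SoftExit`, `run_with_soft_retries`, and every parameter name and default value are hypothetical and chosen only for illustration.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class SoftExit(Exception):
    """Hypothetical stand-in for a soft exit, e.g. the peer vnode is rebuilding its tree."""


def run_with_soft_retries(
    attempt: Callable[[], T],
    max_soft_retries: int = 5,       # assumed finite default, instead of infinity
    base_wait_s: float = 1.0,        # assumed wait before the first retry
    max_wait_s: float = 60.0,        # assumed cap on the wait between retries
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call attempt(); on SoftExit, wait and retry up to max_soft_retries times."""
    retries = 0
    while True:
        try:
            return attempt()
        except SoftExit:
            if retries >= max_soft_retries:
                # Limit reached: give up rather than retrying forever.
                raise
            # Exponential backoff, capped, so retries cannot fire in a tight loop.
            wait = min(max_wait_s, base_wait_s * (2 ** retries))
            sleep(wait)
            retries += 1
```

With these assumed defaults, a peer that stays busy is retried after 1, 2, 4, 8 and 16 seconds and then abandoned, so retries are spaced out rather than issued back to back.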

russelldb · 2018-03-05T14:00:06Z

@martinsumner just to be clear, the process growth is on the source, or the sink?

martinsumner · 2018-03-08T14:33:23Z

As discussed in a side-channel: the AAE full-sync code is difficult to trace through, the experience in production is that testing has so far been deficient, and the solution is so brittle as to imply there can only be a small handful of production users (perhaps only one).

The intention for 3.0 is to replace it with something altogether simpler, so extensive effort to unpick the code and improve test coverage for 2.2.5 doesn't make sense at this stage.

So if we can improve this behaviour for soft exits only, to reduce the chance of crashes, that will be good enough. There's no need to invest significant effort in refactoring code or tests to move this forward for the long term.

martinsumner changed the title ~~Soft retry limits in AAE full-sync~~ Soft retry limits - AAE fullsync Oct 23, 2017

russelldb mentioned this issue Mar 15, 2018

Sticking plaster fix for basho/riak_repl#772 #779

Merged

russelldb added a commit that referenced this issue Mar 22, 2018

Merge pull request #779 from basho/bug/rdb/gh772

f6e650e

Sticking plaster fix for #772

martinsumner mentioned this issue Oct 7, 2019

Failure to complete full-sync on hitting soft retry limit #799

Closed
