Soft retry limits - AAE fullsync #772

Open
martinsumner opened this issue Oct 18, 2017 · 2 comments

Comments

@martinsumner
Contributor

When performing AAE full-sync, if the vnode elected to co-ordinate with is doing a tree rebuild (and given multi-hour reload times this would seem to be a common event), the co-ordination attempt will soft exit.

The soft exit is captured and handled here:

https://github.com/basho/riak_repl/blob/2.1.8/src/riak_repl2_fscoordinator.erl#L537-L554

This checks against a retry limit, and presumably prompts a retry if the limit is not reached. The default soft retry limit is set to infinity.

In production we see very rapid retries, which escalate the sys process count and ultimately impact the stability of the cluster.

Attempts are underway to control this behaviour by setting a max retry limit. However, the default setting appears to be unsafe and should probably be changed. Perhaps there should also be a wait between retries. There may also be an underlying issue causing the process count to stack up with the retries.
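
To illustrate the two mitigations suggested above (a finite soft retry limit and a wait between retries), here is a minimal Python sketch. It is not the riak_repl code: `SoftExit`, `run_with_soft_retries`, and every parameter name and default are hypothetical, chosen only to show the shape of the behaviour. Passing `max_soft_retries=None` models the current "infinity" default.

```python
import time
from typing import Callable, Optional


class SoftExit(Exception):
    """Hypothetical signal: the partner is temporarily busy (e.g. rebuilding a tree)."""


def run_with_soft_retries(
    attempt: Callable[[], str],
    max_soft_retries: Optional[int] = 5,  # None models an unlimited ("infinity") limit
    base_wait: float = 1.0,               # seconds before the first retry (assumed value)
    max_wait: float = 60.0,               # upper bound on any single wait (assumed value)
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call attempt() until it succeeds or the soft retry limit is exhausted."""
    retries = 0
    while True:
        try:
            return attempt()
        except SoftExit:
            # Give up once the configured number of retries has been used.
            if max_soft_retries is not None and retries >= max_soft_retries:
                raise
            # Exponential backoff, capped, so retries cannot fire back-to-back.
            wait = min(max_wait, base_wait * 2 ** min(retries, 30))
            sleep(wait)
            retries += 1
```

Even with an unlimited retry count, the capped wait keeps the retry rate bounded, which is the property that appears to be missing when retries happen in rapid succession.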

[Attached images: sys_process_count, retry_logs]

@martinsumner changed the title from "Soft retry limits in AAE full-sync" to "Soft retry limits - AAE fullsync" on Oct 23, 2017
@russelldb
Member

@martinsumner just to be clear, the process growth is on the source, or the sink?

@martinsumner
Contributor Author

As discussed in side-channel: the AAE full-sync code is difficult to trace through, the experience in production is that testing has so far been deficient, and the solution is so brittle as to imply there can only be a small handful of production users (perhaps only one).

The intention for 3.0 is to replace it with something altogether simpler, so extensive effort to unpick the code and improve test coverage for 2.2.5 doesn't make sense at this stage.

So if we can improve this behaviour for soft exits only, to reduce the chance of crashes, that will be good enough. There's no need to invest significant effort in refactoring code or tests to move this forward for the long term.
