Failure to complete full-sync on hitting soft retry limit #799

Closed
martinsumner opened this issue Oct 7, 2019 · 1 comment

@martinsumner

Potentially related to #772.

The test https://github.com/basho/riak_test/blob/develop-2.9/tests/repl_aae_fullsync_blocked.erl fails intermittently.

The test uses an intercept to stop a full-sync from working on some vnodes, and checks on completion that the right number of vnode sync failures has occurred.

When the test fails, it fails because the full-sync is never considered complete. The difference between success and failure comes down to the order in which the full-sync tries the vnodes. If the last vnode to be synced does sync OK (because it is not one with an intercepted function), the test passes and the correct number of vnode failures is reported. If the last vnode to be synced is one of those that fails to sync, the full-sync is never recorded as complete, even though the same work has completed or failed.

The cause of this appears to be that on hitting the soft retry limit:

    SoftRetryLimit < ErrorCount ->
        lager:info("Discarding partition ~p since it has reached the soft exit retry limit of ~p",
                   [Partition#partition_info.index, SoftRetryLimit]),
        ErrorExits1 = State4#state.error_exits + 1,
        Dropped = [Partition#partition_info.index | State4#state.dropped],
        Purgatory = queue:filter(fun({P, _}) -> P =/= Partition end,
                                 State4#state.purgatory),
        {noreply, State4#state{error_exits = ErrorExits1,
                               purgatory = Purgatory,
                               dropped = Dropped}};

the function maybe_complete_fullsync/2 isn't called. This is unlike a hard failure:

    maybe_complete_fullsync(Running, State4#state{dropped = Dropped});

and unlike a success:

    maybe_complete_fullsync(Running, State2)
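To make the difference concrete, here is a minimal toy model of the behaviour described above. It is a sketch only and not riak_repl code: the module name, the simplified `#state{}` record, and the `soft_limit_buggy/3` and `soft_limit_fixed/3` helpers are all hypothetical, and `maybe_complete_fullsync/2` here is a stand-in that assumes completion simply means "nothing running and nothing queued". The "fixed" variant shows one possible change (routing the soft-limit branch through the completion check); it is an assumption, not a description of what any actual fix does.

```erlang
%% Toy model of the soft retry limit branch. All names except
%% maybe_complete_fullsync/2 are hypothetical, and that function is a
%% simplified stand-in for the real one.
-module(soft_limit_sketch).
-export([demo/0]).

-record(state, {queue = []       :: [integer()],  % partitions still waiting
                dropped = []     :: [integer()],  % partitions given up on
                complete = false :: boolean()}).

%% Assumed completion rule: done once nothing is running or queued.
maybe_complete_fullsync([], State = #state{queue = []}) ->
    {noreply, State#state{complete = true}};
maybe_complete_fullsync(_Running, State) ->
    {noreply, State}.

%% Behaviour described in the issue: record the dropped partition and
%% return directly, without checking for completion.
soft_limit_buggy(Partition, _Running, State) ->
    Dropped = [Partition | State#state.dropped],
    {noreply, State#state{dropped = Dropped}}.

%% One possible change: record the dropped partition, then run the same
%% completion check used on hard failure and on success.
soft_limit_fixed(Partition, Running, State) ->
    Dropped = [Partition | State#state.dropped],
    maybe_complete_fullsync(Running, State#state{dropped = Dropped}).

demo() ->
    %% The last partition (index 7) has just hit the soft limit, so
    %% nothing else is running or queued.
    Last = #state{},
    {noreply, Buggy} = soft_limit_buggy(7, [], Last),
    {noreply, Fixed} = soft_limit_fixed(7, [], Last),
    {Buggy#state.complete, Fixed#state.complete}.
```

Calling `soft_limit_sketch:demo()` returns `{false, true}`: in this model, when the final partition is the one that gets discarded, only the variant that calls the completion check marks the full-sync as complete.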

@martinsumner

#800
