I managed to hit an AssertionError in AppendEntriesRequestProcessor. Apparently, there is a race.
The crash was observed on a custom branch, though the code in master is the same.
First, the code:
AppendEntriesRequestProcessor.PeerExecutorSelector has the following code (unrelated lines intentionally removed):
public Executor select(final String reqClass, final Object reqHeader) {
    // ...
    final Node node = NodeManager.getInstance().get(groupId, peer);
    if (node == null || !node.getRaftOptions().isReplicatorPipeline()) {
        return executor();
    }
    // The node enable pipeline, we should ensure bolt support it.
    RpcFactoryHelper.rpcFactory().ensurePipeline();
    final PeerRequestContext ctx = getOrCreatePeerRequestContext(groupId, pairOf(peerId, serverId), null);
    return ctx.executor;
}
getOrCreatePeerRequestContext looks as follows:
PeerRequestContext getOrCreatePeerRequestContext(final String groupId, final PeerPair pair, final Connection conn) {
    ConcurrentMap<PeerPair, PeerRequestContext> groupContexts = this.peerRequestContexts.get(groupId);
    // ....
    PeerRequestContext peerCtx = groupContexts.get(pair);
    if (peerCtx == null) {
        synchronized (Utils.withLockObject(groupContexts)) {
            peerCtx = groupContexts.get(pair);
            // double check in lock
            if (peerCtx == null) {
                // only one thread to process append entries for every jraft node
                final PeerId peer = new PeerId();
                final boolean parsed = peer.parse(pair.local);
                assert (parsed);
                final Node node = NodeManager.getInstance().get(groupId, peer);
                assert (node != null); // <<<<<<<<<<<<<<AssertionError here!
                peerCtx = new PeerRequestContext(groupId, pair, node.getRaftOptions()
                    .getMaxReplicatorInflightMsgs());
                groupContexts.put(pair, peerCtx);
            }
        }
    }
    // ...
    return peerCtx;
}
Execution flow
I don't have specific code to reproduce this issue, but the flow is simple. I observed a slight delay in messaging/threads, which ended in this error.
My assumptions regarding the execution flow are:
1. select is called. NodeManager.getInstance().get(groupId, peer) returns a non-null result, so execution continues to getOrCreatePeerRequestContext.
2. Another thread stops the app, and NodeManager.getInstance().remove() is called for this node.
3. Inside getOrCreatePeerRequestContext, the result of final Node node = NodeManager.getInstance().get(groupId, peer); is null, since the node was removed moments ago.
4. Execution crashes on the following line: assert (node != null);
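The interleaving described in these steps can be reproduced deterministically in a small standalone sketch. Everything in it is hypothetical: the Node and Context classes, the two maps, and the betweenLookups hook stand in for NodeManager, PeerRequestContext, and the competing shutdown thread, and are not the jraft classes. It contrasts looking the node up twice with reusing the reference from the first lookup, which is only one possible way to avoid the second null result, not a confirmed fix.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-ins for NodeManager and PeerRequestContext; not the jraft classes.
final class NodeLookupRaceSketch {
    static final class Node {
        final int maxInflightMsgs;
        Node(int maxInflightMsgs) { this.maxInflightMsgs = maxInflightMsgs; }
    }

    static final class Context {
        final int maxInflightMsgs;
        Context(int maxInflightMsgs) { this.maxInflightMsgs = maxInflightMsgs; }
    }

    final Map<String, Node> nodes = new ConcurrentHashMap<>();
    final Map<String, Context> contexts = new ConcurrentHashMap<>();

    // Mirrors the reported flow: one lookup in select, a second one while creating the context.
    // betweenLookups simulates what another thread does in the gap between the two lookups.
    Context selectWithSecondLookup(String key, Runnable betweenLookups) {
        if (nodes.get(key) == null) {
            return null; // the caller would fall back to the default executor
        }
        betweenLookups.run();
        final Node node = nodes.get(key); // second lookup: may be null by now
        if (node == null) {
            throw new IllegalStateException("node removed between lookups");
        }
        return contexts.computeIfAbsent(key, k -> new Context(node.maxInflightMsgs));
    }

    // Possible alternative (an assumption): keep the reference obtained by the first lookup.
    Context selectWithSingleLookup(String key, Runnable betweenLookups) {
        final Node node = nodes.get(key);
        if (node == null) {
            return null;
        }
        betweenLookups.run();
        return contexts.computeIfAbsent(key, k -> new Context(node.maxInflightMsgs));
    }

    public static void main(String[] args) {
        NodeLookupRaceSketch twice = new NodeLookupRaceSketch();
        twice.nodes.put("group/peer", new Node(256));
        try {
            twice.selectWithSecondLookup("group/peer", () -> twice.nodes.remove("group/peer"));
        } catch (IllegalStateException e) {
            System.out.println("second lookup failed: " + e.getMessage());
        }

        NodeLookupRaceSketch once = new NodeLookupRaceSketch();
        once.nodes.put("group/peer", new Node(256));
        Context ctx = once.selectWithSingleLookup("group/peer", () -> once.nodes.remove("group/peer"));
        System.out.println("single lookup created a context: " + (ctx != null));
    }
}
```

Note that the single-lookup variant still creates a context for a node that has just been removed, so a real change would also need to decide whether such a context should be discarded or the request rejected.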
I agree with you; that situation could definitely happen. I suspect the problem arises from sharing the RpcServer. The RaftGroupService closes before the RpcServer, causing a delay in processing new requests because the raft service is already closed by then.