BCI-2450: Health check for MultiNode to ensure that node is not in syncing state #11765
Conversation
I see that you haven't updated any README files. Would it make sense to do so?
-// Pure verify: does not mutate node "state" field.
-func (n *node[CHAIN_ID, HEAD, RPC]) verify(callerCtx context.Context) (err error) {
+// Pure verifyChainID: does not mutate node "state" field.
+func (n *node[CHAIN_ID, HEAD, RPC]) verifyChainID(callerCtx context.Context, lggr logger.Logger) nodeState {
Why does this return a state now?
My initial thought, from the name of the method, would be an error return.
@samsondav
This way we unify error handling and logging across multiple callers, and unify the signature with other methods that perform a similar action: verifying a new connection.
Previously verify was called directly in multiple places: outOfSyncLoop, unreachableLoop, invalidChainIDLoop and start. The new syncingLoop also needs to check that the chainID matches.
All the callers had to log the error and do the corresponding state transition.
With the new syncing check we would end up with something like this:
if err := n.rpc.Dial(n.nodeCtx); err != nil {
	lggr.Errorw("Failed to dial out-of-sync RPC node", "nodeState", n.State())
	n.declareUnreachable()
	return
}
if err := n.verify(n.nodeCtx); errors.Is(err, errInvalidChainID) {
	n.lfcLog.Errorw("Verify failed: Node has the wrong chain ID", "err", err)
	n.declareInvalidChainID()
	return
} else if err != nil {
	n.lfcLog.Errorw(fmt.Sprintf("Verify failed: %v", err), "err", err)
	n.declareUnreachable()
	return
}
if isSyncing, err := n.verifyIsSyncing(n.nodeCtx); err != nil {
	n.lfcLog.Errorw(fmt.Sprintf("Verify is syncing failed: %v", err), "err", err)
	n.declareUnreachable()
	return
} else if isSyncing {
	n.lfcLog.Errorw("Node is still syncing")
	n.declareIsSyncing()
	return
}
Instead, we can do this:
state := n.createVerifiedConn(n.nodeCtx, lggr)
if state != nodeStateAlive {
	n.declareState(state)
	return
}
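For concreteness, here is a hypothetical sketch of the two helpers named in the snippet above, createVerifiedConn and declareState, built only from the signatures already shown (the actual implementations in the PR may differ, and nodeStateSyncing is an assumed name for the new state):

// Sketch: dial, verify chain ID, then check syncing status; each failure
// maps to the nodeState the caller should transition into.
func (n *node[CHAIN_ID, HEAD, RPC]) createVerifiedConn(ctx context.Context, lggr logger.Logger) nodeState {
	if err := n.rpc.Dial(ctx); err != nil {
		lggr.Errorw("Dial failed: Node is unreachable", "err", err, "nodeState", n.State())
		return nodeStateUnreachable
	}
	if state := n.verifyChainID(ctx, lggr); state != nodeStateAlive {
		return state
	}
	isSyncing, err := n.rpc.IsSyncing(ctx)
	if err != nil {
		lggr.Errorw("Unexpected error while verifying RPC node synchronization status", "err", err, "nodeState", n.State())
		return nodeStateUnreachable
	}
	if isSyncing {
		return nodeStateSyncing
	}
	return nodeStateAlive
}

// Sketch: dispatch a verification outcome to the matching transition, so
// callers no longer have to log and branch individually.
func (n *node[CHAIN_ID, HEAD, RPC]) declareState(state nodeState) {
	switch state {
	case nodeStateInvalidChainID:
		n.declareInvalidChainID()
	case nodeStateUnreachable:
		n.declareUnreachable()
	case nodeStateSyncing:
		n.declareIsSyncing()
	case nodeStateAlive:
		n.declareAlive()
	default:
		panic(fmt.Sprintf("unsupported state: %s", state))
	}
}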
@poopoothegorilla yes, the naming is not perfect, but I do not have better options. What do you suggest?
Yeah, I don't think the name is that big of a deal... I guess one option could be to return both the nodeState and the error; that way the caller could handle the logging and you might not have to pass the lggr through all the calls.
It's been a while since I read the PR, sorry.
yeah this makes sense
// Need to redial since invalid chain ID nodes are automatically disconnected
state := n.createVerifiedConn(n.nodeCtx, lggr)
if state != nodeStateInvalidChainID {
	n.declareState(state)
	return
}
Why was this added? We never had a bug with it before.
On transitionToInvalidChainID we call n.disconnectAll(), which closes the ws connection with the RPC, but we never reconnect. As a result we perform the following transitions: invalidChainID -> alive -> unreachable -> alive.
I was able to reproduce the issue locally by increasing the chainIDRedialBackoff. Here are the logs:
[ERROR] Failed to verify RPC node; remote endpoint returned the wrong chain ID client/node_lifecycle.go:433 chainID=11155111 err=rpc ChainID doesn't match local chain ID: RPC ID=5, local ID=11155111, node name=Sepolia: invalid chain id logger=EVM.11155111.Node.Lifecycle.InvalidChainID node=(primary)Sepolia:ws://127.0.0.1:8546:http://127.0.0.1:8545 nodeName=Sepolia nodeOrder=100 nodeTier=primary
Note: at this point we restart the RPC with the proper chainID
[WARN] Failed to subscribe to heads on chain headtracker/head_listener.go:190 chainID=11155111 err=Client#SubscribeNewHead: no live nodes available logger=EVM.11155111.HeadTracker.HeadListener version=unset@unset
[DEBUG] Subscribing to new heads on chain 11155111 headtracker/head_listener.go:180 logger=EVM.11155111.HeadTracker.HeadListener version=unset@unset
[ERROR] Failed to verify RPC node; remote endpoint returned the wrong chain ID client/node_lifecycle.go:433 chainID=11155111 err=rpc ChainID doesn't match local chain ID: RPC ID=5, local ID=11155111, node name=Sepolia: invalid chain id logger=EVM.11155111.Node.Lifecycle.InvalidChainID node=(primary)Sepolia:ws://127.0.0.1:8546:http://127.0.0.1:8545 nodeName=Sepolia nodeOrder=100 nodeTier=primary
[WARN] Failed to subscribe to heads on chain headtracker/head_listener.go:190 chainID=11155111 err=Client#SubscribeNewHead: no live nodes available logger=EVM.11155111.HeadTracker.HeadListener version=unset@unset
[DEBUG] Subscribing to new heads on chain 11155111 headtracker/head_listener.go:180 logger=EVM.11155111.HeadTracker.HeadListener version=unset@unset
[INFO] Successfully verified RPC node. Node was offline for 45.008292458s client/node_lifecycle.go:440 chainID=11155111 logger=EVM.11155111.Node.Lifecycle.InvalidChainID node=(primary)Sepolia:ws://127.0.0.1:8546:http://127.0.0.1:8545 nodeName=Sepolia nodeOrder=100 nodeState=InvalidChainID nodeTier=primary version=unset@unset
[INFO] RPC Node is online client/node_fsm.go:133 chainID=11155111 logger=EVM.11155111.Node.Lifecycle node=(primary)Sepolia:ws://127.0.0.1:8546:http://127.0.0.1:8545 nodeName=Sepolia nodeOrder=100 nodeState=Alive nodeTier=primary version=unset@unset
[DEBUG] RPC call: evmclient.Client#EthSubscribe client/rpc_client.go:352 args=["newHeads"] client=(primary)eth-primary-rpc-0:ws://127.0.0.1:8546:http://127.0.0.1:8545 clientName=eth-primary-rpc-0 clientTier=primary evmChainID=11155111 logger=EVM.11155111.Client.RPC requestID=cf22320f-8d5f-4288-848f-292d4a829a75 version=unset@unset
[DEBUG] evmclient.Client#EthSubscribe RPC call failure client/rpc_client.go:212 args=["newHeads"] callName=EthSubscribe client=(primary)eth-primary-rpc-0:ws://127.0.0.1:8546:http://127.0.0.1:8545 clientName=eth-primary-rpc-0 clientTier=primary duration=0.000009458 err=client is closed evmChainID=11155111 logger=EVM.11155111.Client.RPC requestID=cf22320f-8d5f-4288-848f-292d4a829a75 rpcDomain=127.0.0.1:8545 version=unset@unset
[ERROR] Initial subscribe for heads failed client/node_lifecycle.go:103 chainID=11155111 logger=EVM.11155111.Node.Lifecycle.Alive noNewHeadsTimeoutThreshold=180 node=(primary)Sepolia:ws://127.0.0.1:8546:http://127.0.0.1:8545 nodeName=Sepolia nodeOrder=100 nodeState=Alive nodeTier=primary pollFailureThreshold=5 pollInterval=10
[ERROR] RPC Node is unreachable client/node_fsm.go:211 chainID=11155111 logger=EVM.11155111.Node.Lifecycle node=(primary)Sepolia:ws://127.0.0.1:8546:http://127.0.0.1:8545 nodeName=Sepolia nodeOrder=100 nodeState=Unreachable nodeTier=primary
[DEBUG] Trying to revive unreachable RPC node client/node_lifecycle.go:362 chainID=11155111 logger=EVM.11155111.Node.Lifecycle.Unreachable node=(primary)Sepolia:ws://127.0.0.1:8546:http://127.0.0.1:8545 nodeName=Sepolia nodeOrder=100 nodeState=Unreachable nodeTier=primary version=unset@unset
[DEBUG] RPC dial: evmclient.Client#dial client/rpc_client.go:124 client=(primary)eth-primary-rpc-0:ws://127.0.0.1:8546:http://127.0.0.1:8545 clientName=eth-primary-rpc-0 clientTier=primary evmChainID=11155111 httpuri=http://127.0.0.1:8545 logger=EVM.11155111.Client.RPC version=unset@unset wsuri=ws://127.0.0.1:8546
[DEBUG] RPC dial: evmclient.Client#dial client/rpc_client.go:152 client=(primary)eth-primary-rpc-0:ws://127.0.0.1:8546:http://127.0.0.1:8545 clientName=eth-primary-rpc-0 clientTier=primary evmChainID=11155111 httpuri=ws://127.0.0.1:8546 logger=EVM.11155111.Client.RPC version=unset@unset
[INFO] Successfully redialled and verified RPC node (primary)Sepolia:ws://127.0.0.1:8546:http://127.0.0.1:8545. Node was offline for 1.003383459s client/node_lifecycle.go:393 chainID=11155111 logger=EVM.11155111.Node.Lifecycle.Unreachable node=(primary)Sepolia:ws://127.0.0.1:8546:http://127.0.0.1:8545 nodeName=Sepolia nodeOrder=100 nodeState=Dialed nodeTier=primary version=unset@unset
[INFO] RPC Node is online client/node_fsm.go:133 chainID=11155111 logger=EVM.11155111.Node.Lifecycle node=(primary)Sepolia:ws://127.0.0.1:8546:http://127.0.0.1:8545 nodeName=Sepolia nodeOrder=100 nodeState=Alive nodeTier=primary version=unset@unset
[DEBUG] RPC call: evmclient.Client#EthSubscribe client/rpc_client.go:352 args=["newHeads"] client=(primary)eth-primary-rpc-0:ws://127.0.0.1:8546:http://127.0.0.1:8545 clientName=eth-primary-rpc-0 clientTier=primary evmChainID=11155111 logger=EVM.11155111.Client.RPC requestID=50bee0a1-ab24-4548-a001-f654eee6cf06 version=unset@unset
[DEBUG] Head liveness checking enabled client/node_lifecycle.go:115 chainID=11155111 logger=EVM.11155111.Node.Lifecycle.Alive noNewHeadsTimeoutThreshold=180 node=(primary)Sepolia:ws://127.0.0.1:8546:http://127.0.0.1:8545 nodeName=Sepolia nodeOrder=100 nodeState=Alive nodeTier=primary pollFailureThreshold=5 pollInterval=10 version=unset@unset
[DEBUG] Polling enabled client/node_lifecycle.go:125 chainID=11155111 logger=EVM.11155111.Node.Lifecycle.Alive noNewHeadsTimeoutThreshold=180 node=(primary)Sepolia:ws://127.0.0.1:8546:http://127.0.0.1:8545 nodeName=Sepolia nodeOrder=100 nodeTier=primary pollFailureThreshold=5 pollInterval=10 version=unset@unset
[DEBUG] Polling liveness checking enabled client/node_lifecycle.go:133 chainID=11155111 logger=EVM.11155111.Node.Lifecycle.Alive noNewHeadsTimeoutThreshold=180 node=(primary)Sepolia:ws://127.0.0.1:8546:http://127.0.0.1:8545 nodeName=Sepolia nodeOrder=100 nodeTier=primary pollFailureThreshold=5 pollInterval=10 version=unset@unset
ah interesting. Good catch!
@@ -471,21 +474,24 @@ func TestUnit_NodeLifecycle_outOfSyncLoop(t *testing.T) {
		return node.State() == nodeStateUnreachable
	})
})
-t.Run("if fail to get chainID, transitions to invalidChainID", func(t *testing.T) {
+t.Run("if fail to get chainID, transitions to unreachable", func(t *testing.T) {
Why this change in logic?
This way it's possible to unify the reconnection behaviour and reuse it across multiple state transitions, and it becomes easier to add additional verification steps.
Before the change, error handling for a chainID verification error in the outOfSync loop looked like this:
if err := n.verify(n.nodeCtx); err != nil {
	lggr.Errorw(fmt.Sprintf("Failed to verify out-of-sync RPC node: %v", err), "err", err)
	n.declareInvalidChainID()
	return
}
It was not consistent with the rest of the loops and produced a state transition sequence that seems redundant: outOfSync -> invalidChainID -> unreachable instead of outOfSync -> unreachable.
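With the unified helper, that same spot collapses to the pattern shown earlier, and a chainID fetch failure maps straight to unreachable (a sketch under the same naming assumptions as above):

// Sketch: any non-alive verification outcome becomes a single direct
// transition, avoiding the redundant outOfSync -> invalidChainID ->
// unreachable detour.
state := n.createVerifiedConn(n.nodeCtx, lggr)
if state != nodeStateAlive {
	n.declareState(state)
	return
}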
Co-authored-by: Sam <[email protected]>
Looks pretty good.
Just a couple comments.
common/client/node_lifecycle.go
Outdated
case <-time.After(zombieNodeCheckInterval(n.noNewHeadsThreshold)):
	if n.nLiveNodes != nil {
		if l, _, _ := n.nLiveNodes(); l < 1 {
			lggr.Critical("RPC endpoint is still syncing, but there are no other available nodes. This RPC node will be forcibly moved back into the live pool in a degraded state")
Is this helpful?
Do we know if a syncing node can operate even in a degraded state?
For example, if it is not caught up to the latest heads, then it can't serve heads, check for receipts, or verify transactions sent to it.
Isn't it better to just let it sync first?
common/client/node.go
Outdated
	return state
}

isSyncing, err := n.rpc.IsSyncing(ctx)
Can we put an additional config setting here?
NodeIsSyncingEnabled = false [default]
This will allow us to skip IsSyncing() calls for chains/clients that don't reliably implement this functionality, or have bugs around this behavior.
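A hypothetical sketch of how such a gate could wrap the syncing check, assuming a NodeIsSyncingEnabled accessor on the node pool config (both the accessor name and its placement are illustrative):

// Hypothetical: skip the syncing check entirely when disabled in config,
// so chains whose clients implement eth_syncing unreliably are unaffected.
if !n.nodePoolCfg.NodeIsSyncingEnabled() {
	return nodeStateAlive
}
isSyncing, err := n.rpc.IsSyncing(ctx)
if err != nil {
	lggr.Errorw("Unexpected error while verifying RPC node synchronization status", "err", err, "nodeState", n.State())
	return nodeStateUnreachable
}
if isSyncing {
	return nodeStateSyncing
}
return nodeStateAlive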
…e/BCI-2450-multinode-syncing-state' of github.com:smartcontractkit/chainlink into feature/BCI-2450-multinode-syncing-state
# Conflicts:
#	docs/CHANGELOG.md
docs/CHANGELOG.md
Outdated
@@ -25,6 +26,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
echo "Baz=Val" >> median.env
CL_MEDIAN_ENV="median.env"
```
>>>>>>> develop |
Is this an artifact left over from the merge?
yep, thanks
	return node.State() == nodeStateAlive
})
})
}
nice job on the tests
common/client/node_fsm_test.go
Outdated
func TestNodeState_String(t *testing.T) {
	for _, ns := range allNodeStates {
		assert.NotContains(t, ns.String(), strconv.FormatInt(int64(ns), 10), "Expected node state to have readable name")
what is this testing?
Added a description to the test.
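For context, a plausible reading of the assertion: every state in allNodeStates must render a human-readable name, i.e. String() must not fall back to the raw integer. A hypothetical illustration of the failure mode it guards against (the real String() implementation may differ):

// Hypothetical: a state missing from the String() switch would fall back
// to its numeric value, so ns.String() would contain
// strconv.FormatInt(int64(ns), 10) and the assertion would fail.
func (n nodeState) String() string {
	switch n {
	case nodeStateAlive:
		return "Alive"
	// ... a newly added state missing here falls through ...
	default:
		return fmt.Sprintf("nodeState(%d)", n)
	}
}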
	return node.State() == nodeStateAlive
})
})
t.Run("on successful verification without isSyncing becomes alive", func(t *testing.T) {
nit: This same description is repeated multiple times for different test cases
Yes, but each of them is located in a different test function, and they cover similar transitions: from unreachable to alive, from invalidChainID to alive, etc.
common/client/node_lifecycle.go
Outdated
@@ -483,14 +484,6 @@ func (n *node[CHAIN_ID, HEAD, RPC]) syncingLoop() {
	lggr.Infow(fmt.Sprintf("Successfully verified RPC node. Node was syncing for %s", time.Since(syncingAt)), "nodeState", n.State())
	n.declareAlive()
	return
case <-time.After(zombieNodeCheckInterval(n.noNewHeadsThreshold)):
What was the reason for removing this?
Are we now not forcing the use of a syncing node when it's the only node?
Following your suggestion, I've run some tests, and it turns out that the RPC is not able to properly serve requests while it's syncing, and marking such a node as Alive leads to a lot of error noise.
This came up as a follow-up to testing on canary by QA teams.
If an RPC node is newly set up and still syncing the blockchain, it may not be ready to take client requests. Currently, our MultiNode client marks this node as OutOfSync and logs error statements, which is confusing for some NOPs.
Changes:
- Added a new node state syncing to differentiate it from the outOfSync state.
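A minimal sketch of what the new syncingLoop could look like, assembled from the snippets quoted above (the recheck interval, logger scoping, and exact log wording are assumptions):

// Assumed polling interval; the PR may use a backoff helper instead.
const syncRecheckInterval = 5 * time.Second

// Sketch: poll the RPC until it reports that syncing has finished, then
// declare the node alive; unexpected errors send it to unreachable.
func (n *node[CHAIN_ID, HEAD, RPC]) syncingLoop() {
	syncingAt := time.Now()
	lggr := n.lfcLog // the PR may scope this further, e.g. Named("Syncing")
	for {
		select {
		case <-n.nodeCtx.Done():
			return
		case <-time.After(syncRecheckInterval):
			isSyncing, err := n.rpc.IsSyncing(n.nodeCtx)
			if err != nil {
				lggr.Errorw("Unexpected error while verifying RPC node synchronization status", "err", err, "nodeState", n.State())
				n.declareUnreachable()
				return
			}
			if isSyncing {
				lggr.Errorw("Verification failed: Node is syncing", "nodeState", n.State())
				continue
			}
			lggr.Infow(fmt.Sprintf("Successfully verified RPC node. Node was syncing for %s", time.Since(syncingAt)), "nodeState", n.State())
			n.declareAlive()
			return
		}
	}
}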