
BCI-2450: Health check for MultiNode to ensure that node is not in syncing state #11765

Merged
merged 15 commits into develop from feature/BCI-2450-multinode-syncing-state on Feb 27, 2024

Conversation

Collaborator

@dhaidashenko dhaidashenko commented Jan 12, 2024

This came up as a follow-up to canary testing by the QA teams.

If an RPC node is newly set up and still syncing the blockchain, it may not be ready to take client requests. Currently, our MultiNode client marks such a node as OutOfSync and logs errors, which is confusing for some NOPs.

Changes:

  • Added a separate node state, syncing, to differentiate it from the outOfSync state (a rough sketch of the underlying check is shown below).
  • Fixed a reconnection bug for nodes with invalidChainID: we close the WebSocket connection when transitioning a node to this state, but never reconnect.
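
To make the kind of check this PR introduces concrete, here is a minimal standalone sketch (not the MultiNode code itself) that asks an EVM node whether it is still syncing via eth_syncing, using go-ethereum's ethclient. The helper name isNodeSyncing and the endpoint URL are assumptions for illustration only.

// Standalone illustration of a syncing health check against an EVM RPC node.
package main

import (
    "context"
    "fmt"
    "time"

    "github.com/ethereum/go-ethereum/ethclient"
)

// isNodeSyncing reports whether the node at url says it is still syncing.
func isNodeSyncing(ctx context.Context, url string) (bool, error) {
    client, err := ethclient.DialContext(ctx, url)
    if err != nil {
        return false, fmt.Errorf("dial %s: %w", url, err)
    }
    defer client.Close()

    // SyncProgress wraps eth_syncing; it returns nil when the node reports it is not syncing.
    progress, err := client.SyncProgress(ctx)
    if err != nil {
        return false, fmt.Errorf("eth_syncing: %w", err)
    }
    return progress != nil, nil
}

func main() {
    ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
    defer cancel()

    // Endpoint chosen to match the logs later in this thread; adjust as needed.
    syncing, err := isNodeSyncing(ctx, "ws://127.0.0.1:8546")
    if err != nil {
        fmt.Println("health check failed:", err)
        return
    }
    fmt.Println("node syncing:", syncing)
}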

Contributor

I see that you haven't updated any README files. Would it make sense to do so?

@dhaidashenko dhaidashenko changed the title syncing state for multinode BCI-2450: Health check for MultiNode to ensure that node is not in syncing state Jan 12, 2024
@dhaidashenko dhaidashenko marked this pull request as ready for review January 12, 2024 20:35
common/client/node_fsm.go Outdated
// Pure verify: does not mutate node "state" field.
func (n *node[CHAIN_ID, HEAD, RPC]) verify(callerCtx context.Context) (err error) {
// Pure verifyChainID: does not mutate node "state" field.
func (n *node[CHAIN_ID, HEAD, RPC]) verifyChainID(callerCtx context.Context, lggr logger.Logger) nodeState {
Collaborator

Why does this return a state now?

Contributor

From the name of the method, my initial thought would be that it returns an error.

Collaborator Author

@samsondav
This way we unify error handling and logging across multiple callers, and unify the signature with other methods that perform a similar action: verifying a new connection.

Previously verify was called directly in multiple places: outOfSyncLoop, unreachableLoop, invalidChainIDLoop and start. The new syncingLoop also needs to check that the chainID matches.
All the callers had to log the error and perform the corresponding state transition.
With the new syncing check we would end up with something like this:

if err := n.rpc.Dial(n.nodeCtx); err != nil {
    lggr.Errorw("Failed to dial out-of-sync RPC node", "nodeState", n.State(), "err", err)
    n.declareUnreachable()
    return
}

if err := n.verify(n.nodeCtx); errors.Is(err, errInvalidChainID) {
    n.lfcLog.Errorw("Verify failed: Node has the wrong chain ID", "err", err)
    n.declareInvalidChainID()
    return
} else if err != nil {
    n.lfcLog.Errorw(fmt.Sprintf("Verify failed: %v", err), "err", err)
    n.declareUnreachable()
    return
}

if isSyncing, err := n.verifyIsSyncing(n.nodeCtx); err != nil {
    n.lfcLog.Errorw(fmt.Sprintf("Verify is syncing failed: %v", err), "err", err)
    n.declareUnreachable()
    return
} else if isSyncing {
    n.lfcLog.Errorw("Node is still syncing")
    n.declareIsSyncing()
    return
}

Instead, we can do this:

state := n.createVerifiedConn(n.nodeCtx, lggr)
if state != nodeStateAlive {
    n.declareState(state)
    return
}

@poopoothegorilla yes, the naming is not perfect, but I do not have better options. What do you suggest?
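
For reference, a rough sketch of what such a unified createVerifiedConn could look like, assembled only from the calls that appear in the snippets in this thread (n.rpc.Dial, the renamed verifyChainID, n.rpc.IsSyncing). The exact body, log messages, and the state name nodeStateSyncing are assumptions, not the merged implementation.

// Assumed sketch, not the merged code: dial, verify the chain ID, verify the
// syncing status, and return the nodeState the caller should declare.
func (n *node[CHAIN_ID, HEAD, RPC]) createVerifiedConn(ctx context.Context, lggr logger.Logger) nodeState {
    if err := n.rpc.Dial(ctx); err != nil {
        lggr.Errorw("Dial failed: node is unreachable", "err", err, "nodeState", n.State())
        return nodeStateUnreachable
    }

    // Chain ID check (the renamed verifyChainID from the diff above).
    if state := n.verifyChainID(ctx, lggr); state != nodeStateAlive {
        return state
    }

    // New syncing check.
    isSyncing, err := n.rpc.IsSyncing(ctx)
    if err != nil {
        lggr.Errorw("Failed to check syncing status", "err", err, "nodeState", n.State())
        return nodeStateUnreachable
    }
    if isSyncing {
        lggr.Errorw("Node is still syncing", "nodeState", n.State())
        return nodeStateSyncing // assumed state name
    }

    return nodeStateAlive
}

With a helper like this, every lifecycle loop collapses its dial/verify/transition boilerplate into the single declareState call shown above.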

Contributor

@poopoothegorilla poopoothegorilla Feb 9, 2024

Yeah, I don't think the name is that big of a deal... I guess one option could be to return both the nodeState and the error; that way the caller could handle the logging and you might not have to pass the lggr through all the calls.

It's been a while since I read the PR, sorry.

Collaborator

yeah this makes sense

Comment on lines +407 to +413
// Need to redial since invalid chain ID nodes are automatically disconnected
state := n.createVerifiedConn(n.nodeCtx, lggr)
if state != nodeStateInvalidChainID {
    n.declareState(state)
    return
}

Collaborator

Why was this added? We never had a bug with it before.

Collaborator Author

On transitionToInvalidChainID we call n.disconnectAll(), which closes the WebSocket connection to the RPC, but we never reconnect. As a result we go through the following transitions: invalidChainID -> alive -> unreachable -> alive.
I was able to reproduce the issue locally by increasing the chainIDRedialBackoff. Here are the logs:

[ERROR] Failed to verify RPC node; remote endpoint returned the wrong chain ID client/node_lifecycle.go:433     chainID=11155111 err=rpc ChainID doesn't match local chain ID: RPC ID=5, local ID=11155111, node name=Sepolia: invalid chain id logger=EVM.11155111.Node.Lifecycle.InvalidChainID node=(primary)Sepolia:ws://127.0.0.1:8546:http://127.0.0.1:8545 nodeName=Sepolia nodeOrder=100 nodeTier=primary 


Note: at this point we restart the RPC with the proper chainID


[WARN]  Failed to subscribe to heads on chain              headtracker/head_listener.go:190 chainID=11155111 err=Client#SubscribeNewHead: no live nodes available logger=EVM.11155111.HeadTracker.HeadListener version=unset@unset 
[DEBUG] Subscribing to new heads on chain 11155111         headtracker/head_listener.go:180 logger=EVM.11155111.HeadTracker.HeadListener version=unset@unset 
[ERROR] Failed to verify RPC node; remote endpoint returned the wrong chain ID client/node_lifecycle.go:433     chainID=11155111 err=rpc ChainID doesn't match local chain ID: RPC ID=5, local ID=11155111, node name=Sepolia: invalid chain id logger=EVM.11155111.Node.Lifecycle.InvalidChainID node=(primary)Sepolia:ws://127.0.0.1:8546:http://127.0.0.1:8545 nodeName=Sepolia nodeOrder=100 nodeTier=primary 
[WARN]  Failed to subscribe to heads on chain              headtracker/head_listener.go:190 chainID=11155111 err=Client#SubscribeNewHead: no live nodes available logger=EVM.11155111.HeadTracker.HeadListener version=unset@unset 
[DEBUG] Subscribing to new heads on chain 11155111         headtracker/head_listener.go:180 logger=EVM.11155111.HeadTracker.HeadListener version=unset@unset 
[INFO]  Successfully verified RPC node. Node was offline for 45.008292458s client/node_lifecycle.go:440     chainID=11155111 logger=EVM.11155111.Node.Lifecycle.InvalidChainID node=(primary)Sepolia:ws://127.0.0.1:8546:http://127.0.0.1:8545 nodeName=Sepolia nodeOrder=100 nodeState=InvalidChainID nodeTier=primary version=unset@unset 
[INFO]  RPC Node is online                                 client/node_fsm.go:133           chainID=11155111 logger=EVM.11155111.Node.Lifecycle node=(primary)Sepolia:ws://127.0.0.1:8546:http://127.0.0.1:8545 nodeName=Sepolia nodeOrder=100 nodeState=Alive nodeTier=primary version=unset@unset 
[DEBUG] RPC call: evmclient.Client#EthSubscribe            client/rpc_client.go:352         args=["newHeads"] client=(primary)eth-primary-rpc-0:ws://127.0.0.1:8546:http://127.0.0.1:8545 clientName=eth-primary-rpc-0 clientTier=primary evmChainID=11155111 logger=EVM.11155111.Client.RPC requestID=cf22320f-8d5f-4288-848f-292d4a829a75 version=unset@unset 
[DEBUG] evmclient.Client#EthSubscribe RPC call failure     client/rpc_client.go:212         args=["newHeads"] callName=EthSubscribe client=(primary)eth-primary-rpc-0:ws://127.0.0.1:8546:http://127.0.0.1:8545 clientName=eth-primary-rpc-0 clientTier=primary duration=0.000009458 err=client is closed evmChainID=11155111 logger=EVM.11155111.Client.RPC requestID=cf22320f-8d5f-4288-848f-292d4a829a75 rpcDomain=127.0.0.1:8545 version=unset@unset 
[ERROR] Initial subscribe for heads failed                 client/node_lifecycle.go:103     chainID=11155111 logger=EVM.11155111.Node.Lifecycle.Alive noNewHeadsTimeoutThreshold=180 node=(primary)Sepolia:ws://127.0.0.1:8546:http://127.0.0.1:8545 nodeName=Sepolia nodeOrder=100 nodeState=Alive nodeTier=primary pollFailureThreshold=5 pollInterval=10 
[ERROR] RPC Node is unreachable                            client/node_fsm.go:211           chainID=11155111 logger=EVM.11155111.Node.Lifecycle node=(primary)Sepolia:ws://127.0.0.1:8546:http://127.0.0.1:8545 nodeName=Sepolia nodeOrder=100 nodeState=Unreachable nodeTier=primary 
[DEBUG] Trying to revive unreachable RPC node              client/node_lifecycle.go:362     chainID=11155111 logger=EVM.11155111.Node.Lifecycle.Unreachable node=(primary)Sepolia:ws://127.0.0.1:8546:http://127.0.0.1:8545 nodeName=Sepolia nodeOrder=100 nodeState=Unreachable nodeTier=primary version=unset@unset 
[DEBUG] RPC dial: evmclient.Client#dial                    client/rpc_client.go:124         client=(primary)eth-primary-rpc-0:ws://127.0.0.1:8546:http://127.0.0.1:8545 clientName=eth-primary-rpc-0 clientTier=primary evmChainID=11155111 httpuri=http://127.0.0.1:8545 logger=EVM.11155111.Client.RPC version=unset@unset wsuri=ws://127.0.0.1:8546 
[DEBUG] RPC dial: evmclient.Client#dial                    client/rpc_client.go:152         client=(primary)eth-primary-rpc-0:ws://127.0.0.1:8546:http://127.0.0.1:8545 clientName=eth-primary-rpc-0 clientTier=primary evmChainID=11155111 httpuri=ws://127.0.0.1:8546 logger=EVM.11155111.Client.RPC version=unset@unset 
[INFO]  Successfully redialled and verified RPC node (primary)Sepolia:ws://127.0.0.1:8546:http://127.0.0.1:8545. Node was offline for 1.003383459s client/node_lifecycle.go:393     chainID=11155111 logger=EVM.11155111.Node.Lifecycle.Unreachable node=(primary)Sepolia:ws://127.0.0.1:8546:http://127.0.0.1:8545 nodeName=Sepolia nodeOrder=100 nodeState=Dialed nodeTier=primary version=unset@unset 
[INFO]  RPC Node is online                                 client/node_fsm.go:133           chainID=11155111 logger=EVM.11155111.Node.Lifecycle node=(primary)Sepolia:ws://127.0.0.1:8546:http://127.0.0.1:8545 nodeName=Sepolia nodeOrder=100 nodeState=Alive nodeTier=primary version=unset@unset 
[DEBUG] RPC call: evmclient.Client#EthSubscribe            client/rpc_client.go:352         args=["newHeads"] client=(primary)eth-primary-rpc-0:ws://127.0.0.1:8546:http://127.0.0.1:8545 clientName=eth-primary-rpc-0 clientTier=primary evmChainID=11155111 logger=EVM.11155111.Client.RPC requestID=50bee0a1-ab24-4548-a001-f654eee6cf06 version=unset@unset 
[DEBUG] Head liveness checking enabled                     client/node_lifecycle.go:115     chainID=11155111 logger=EVM.11155111.Node.Lifecycle.Alive noNewHeadsTimeoutThreshold=180 node=(primary)Sepolia:ws://127.0.0.1:8546:http://127.0.0.1:8545 nodeName=Sepolia nodeOrder=100 nodeState=Alive nodeTier=primary pollFailureThreshold=5 pollInterval=10 version=unset@unset 
[DEBUG] Polling enabled                                    client/node_lifecycle.go:125     chainID=11155111 logger=EVM.11155111.Node.Lifecycle.Alive noNewHeadsTimeoutThreshold=180 node=(primary)Sepolia:ws://127.0.0.1:8546:http://127.0.0.1:8545 nodeName=Sepolia nodeOrder=100 nodeTier=primary pollFailureThreshold=5 pollInterval=10 version=unset@unset 
[DEBUG] Polling liveness checking enabled                  client/node_lifecycle.go:133     chainID=11155111 logger=EVM.11155111.Node.Lifecycle.Alive noNewHeadsTimeoutThreshold=180 node=(primary)Sepolia:ws://127.0.0.1:8546:http://127.0.0.1:8545 nodeName=Sepolia nodeOrder=100 nodeTier=primary pollFailureThreshold=5 pollInterval=10 version=unset@unset

Collaborator

ah interesting. Good catch!

@@ -471,21 +474,24 @@ func TestUnit_NodeLifecycle_outOfSyncLoop(t *testing.T) {
return node.State() == nodeStateUnreachable
})
})
t.Run("if fail to get chainID, transitions to invalidChainID", func(t *testing.T) {
t.Run("if fail to get chainID, transitions to unreachable", func(t *testing.T) {
Collaborator

Why this change in logic?

Collaborator Author

This makes it possible to unify the reconnection behaviour, reuse it across multiple state transitions, and add additional verification steps more easily.

Before the change, error handling for a chainID verification error in the outOfSync loop looked like this:

if err := n.verify(n.nodeCtx); err != nil {
    lggr.Errorw(fmt.Sprintf("Failed to verify out-of-sync RPC node: %v", err), "err", err)
    n.declareInvalidChainID()
    return
}

It was not consistent with the rest of the loops and produced a seemingly redundant state transition, outOfSync -> invalidChainID -> unreachable, instead of outOfSync -> unreachable.

Contributor

@prashantkumar1982 prashantkumar1982 left a comment

Looks pretty good.
Just a couple comments.

case <-time.After(zombieNodeCheckInterval(n.noNewHeadsThreshold)):
    if n.nLiveNodes != nil {
        if l, _, _ := n.nLiveNodes(); l < 1 {
            lggr.Critical("RPC endpoint is still syncing, but there are no other available nodes. This RPC node will be forcibly moved back into the live pool in a degraded state")
Contributor

Is this helpful?
Do we know if a syncing node can operate even in a degraded state?
For example, if it is not caught up to the latest heads, then it can't serve heads, check for receipts, or verify transactions sent to it.
Isn't it better to just let it sync first?

return state
}

isSyncing, err := n.rpc.IsSyncing(ctx)
Contributor

Can we put an additional config setting here?
NodeIsSyncingEnabled = false [default]

This would allow us to skip IsSyncing() calls for chains/clients that don't reliably implement this functionality, or that have bugs around this behavior.
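
A rough sketch of how such a toggle might gate the check (the flag name NodeIsSyncingEnabled comes from this comment; the config accessor and surrounding code are assumptions, not the actual implementation):

// Hypothetical gating, following the suggestion above: only call eth_syncing
// when the (assumed) NodeIsSyncingEnabled flag is turned on.
if n.nodePoolCfg.NodeIsSyncingEnabled() { // assumed config accessor
    isSyncing, err := n.rpc.IsSyncing(ctx)
    if err != nil {
        lggr.Errorw("Failed to check syncing status", "err", err, "nodeState", n.State())
        return nodeStateUnreachable
    }
    if isSyncing {
        return nodeStateSyncing // assumed state name
    }
}
return nodeStateAlive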

…e/BCI-2450-multinode-syncing-state' of github.com:smartcontractkit/chainlink into feature/BCI-2450-multinode-syncing-state
@dhaidashenko dhaidashenko requested a review from a team as a code owner February 9, 2024 18:27
@@ -25,6 +26,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
echo "Baz=Val" >> median.env
CL_MEDIAN_ENV="median.env"
```
>>>>>>> develop
Contributor

is this an artifact left over from merge?

Collaborator Author

yep, thanks

return node.State() == nodeStateAlive
})
})
}
Contributor

nice job on the tests


// Every node state should have a human-readable String() representation rather
// than falling back to the raw numeric value.
func TestNodeState_String(t *testing.T) {
    for _, ns := range allNodeStates {
        assert.NotContains(t, ns.String(), strconv.FormatInt(int64(ns), 10), "Expected node state to have readable name")
Contributor

what is this testing?

Collaborator Author

Added a description to the test.

return node.State() == nodeStateAlive
})
})
t.Run("on successful verification without isSyncing becomes alive", func(t *testing.T) {
Contributor

nit: This same description is repeated multiple times for different test cases

Collaborator Author

@dhaidashenko dhaidashenko Feb 13, 2024

Yes, but each of them is located in a different test function, and they cover similar transitions: from unreachable to alive, from invalidChainID to alive, etc.

@@ -483,14 +484,6 @@ func (n *node[CHAIN_ID, HEAD, RPC]) syncingLoop() {
lggr.Infow(fmt.Sprintf("Successfully verified RPC node. Node was syncing for %s", time.Since(syncingAt)), "nodeState", n.State())
n.declareAlive()
return
case <-time.After(zombieNodeCheckInterval(n.noNewHeadsThreshold)):
Contributor

What was the reason for removing this?
Are we now not forcing the use of a syncing node if that's the only node?

Collaborator Author

Following your suggestion, I ran some tests, and it turns out that an RPC node is not able to properly serve requests while it's syncing, and marking such a node as Alive leads to a lot of error noise.

samsondav
samsondav previously approved these changes Feb 21, 2024
@prashantkumar1982 prashantkumar1982 added this pull request to the merge queue Feb 27, 2024
Merged via the queue into develop with commit 17287df Feb 27, 2024
97 checks passed
@prashantkumar1982 prashantkumar1982 deleted the feature/BCI-2450-multinode-syncing-state branch February 27, 2024 20:53