BCI-2450: Health check for MultiNode to ensure that node is not in syncing state #11765
Conversation
I see that you haven't updated any README files. Would it make sense to do so?
-// Pure verify: does not mutate node "state" field.
-func (n *node[CHAIN_ID, HEAD, RPC]) verify(callerCtx context.Context) (err error) {
+// Pure verifyChainID: does not mutate node "state" field.
+func (n *node[CHAIN_ID, HEAD, RPC]) verifyChainID(callerCtx context.Context, lggr logger.Logger) nodeState {
Why does this return a state now?
My initial thought, from the name of the method, would be an error return.
@samsondav
This way we unify error handling and logging across multiple callers, and unify the signature with other methods that perform a similar action: verifying a new connection.
Previously verify was called directly in multiple places: outOfSyncLoop, unreachableLoop, invalidChainIDLoop and start. The new syncingLoop also needs to check that the chainID matches.
All the callers had to log the error and do the corresponding state transition.
With the new syncing check we would end up with something like this:
if err := n.rpc.Dial(n.nodeCtx); err != nil {
	lggr.Errorw("Failed to dial out-of-sync RPC node", "nodeState", n.State())
	n.declareUnreachable()
	return
}
if err := n.verify(n.nodeCtx); errors.Is(err, errInvalidChainID) {
	n.lfcLog.Errorw("Verify failed: Node has the wrong chain ID", "err", err)
	n.declareInvalidChainID()
	return
} else if err != nil {
	n.lfcLog.Errorw(fmt.Sprintf("Verify failed: %v", err), "err", err)
	n.declareUnreachable()
	return
}
if isSyncing, err := n.verifyIsSyncing(n.nodeCtx); err != nil {
	n.lfcLog.Errorw(fmt.Sprintf("Verify is syncing failed: %v", err), "err", err)
	n.declareUnreachable()
	return
} else if isSyncing {
	n.lfcLog.Errorw("Node is still syncing")
	n.declareIsSyncing()
	return
}
Instead, we can do this:
state := n.createVerifiedConn(n.nodeCtx, lggr)
if state != nodeStateAlive {
	n.declareState(state)
	return
}
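For concreteness, here is a hypothetical sketch of the two helpers named in the snippet above, createVerifiedConn and declareState, built only from the signatures already shown (the actual implementations in the PR may differ, and nodeStateSyncing is an assumed name for the new state):

// Sketch: dial, verify chain ID, then check syncing status; each failure
// maps to the nodeState the caller should transition into.
func (n *node[CHAIN_ID, HEAD, RPC]) createVerifiedConn(ctx context.Context, lggr logger.Logger) nodeState {
	if err := n.rpc.Dial(ctx); err != nil {
		lggr.Errorw("Dial failed: Node is unreachable", "err", err, "nodeState", n.State())
		return nodeStateUnreachable
	}
	if state := n.verifyChainID(ctx, lggr); state != nodeStateAlive {
		return state
	}
	isSyncing, err := n.rpc.IsSyncing(ctx)
	if err != nil {
		lggr.Errorw("Unexpected error while verifying RPC node synchronization status", "err", err, "nodeState", n.State())
		return nodeStateUnreachable
	}
	if isSyncing {
		return nodeStateSyncing
	}
	return nodeStateAlive
}

// Sketch: dispatch a verification outcome to the matching transition, so
// callers no longer have to log and branch individually.
func (n *node[CHAIN_ID, HEAD, RPC]) declareState(state nodeState) {
	switch state {
	case nodeStateInvalidChainID:
		n.declareInvalidChainID()
	case nodeStateUnreachable:
		n.declareUnreachable()
	case nodeStateSyncing:
		n.declareIsSyncing()
	case nodeStateAlive:
		n.declareAlive()
	default:
		panic(fmt.Sprintf("unsupported state: %s", state))
	}
}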
@poopoothegorilla yes, the naming is not perfect, but I do not have better options. What do you suggest?
Yeah, I don't think the name is that big of a deal... I guess one option could be to return both the nodeState and the error; that way the caller could handle the logging and you might not have to pass the lggr through all the calls.
It's been a while since I read the PR, sorry.
yeah this makes sense
// Need to redial since invalid chain ID nodes are automatically disconnected
state := n.createVerifiedConn(n.nodeCtx, lggr)
if state != nodeStateInvalidChainID {
	n.declareState(state)
	return
}
Why was this added? We never had a bug with it before.
On transitionToInvalidChainID we call n.disconnectAll(), which closes the ws connection with the RPC, but we never reconnect. As a result we perform the following transitions: invalidChainID -> alive -> unreachable -> alive.
I was able to reproduce the issue locally by increasing the chainIDRedialBackoff. Here are the logs:
[ERROR] Failed to verify RPC node; remote endpoint returned the wrong chain ID client/node_lifecycle.go:433 chainID=11155111 err=rpc ChainID doesn't match local chain ID: RPC ID=5, local ID=11155111, node name=Sepolia: invalid chain id logger=EVM.11155111.Node.Lifecycle.InvalidChainID node=(primary)Sepolia:ws://127.0.0.1:8546:http://127.0.0.1:8545 nodeName=Sepolia nodeOrder=100 nodeTier=primary
Note: at this point we restart the RPC with the proper chainID
[WARN] Failed to subscribe to heads on chain headtracker/head_listener.go:190 chainID=11155111 err=Client#SubscribeNewHead: no live nodes available logger=EVM.11155111.HeadTracker.HeadListener version=unset@unset
[DEBUG] Subscribing to new heads on chain 11155111 headtracker/head_listener.go:180 logger=EVM.11155111.HeadTracker.HeadListener version=unset@unset
[ERROR] Failed to verify RPC node; remote endpoint returned the wrong chain ID client/node_lifecycle.go:433 chainID=11155111 err=rpc ChainID doesn't match local chain ID: RPC ID=5, local ID=11155111, node name=Sepolia: invalid chain id logger=EVM.11155111.Node.Lifecycle.InvalidChainID node=(primary)Sepolia:ws://127.0.0.1:8546:http://127.0.0.1:8545 nodeName=Sepolia nodeOrder=100 nodeTier=primary
[WARN] Failed to subscribe to heads on chain headtracker/head_listener.go:190 chainID=11155111 err=Client#SubscribeNewHead: no live nodes available logger=EVM.11155111.HeadTracker.HeadListener version=unset@unset
[DEBUG] Subscribing to new heads on chain 11155111 headtracker/head_listener.go:180 logger=EVM.11155111.HeadTracker.HeadListener version=unset@unset
[INFO] Successfully verified RPC node. Node was offline for 45.008292458s client/node_lifecycle.go:440 chainID=11155111 logger=EVM.11155111.Node.Lifecycle.InvalidChainID node=(primary)Sepolia:ws://127.0.0.1:8546:http://127.0.0.1:8545 nodeName=Sepolia nodeOrder=100 nodeState=InvalidChainID nodeTier=primary version=unset@unset
[INFO] RPC Node is online client/node_fsm.go:133 chainID=11155111 logger=EVM.11155111.Node.Lifecycle node=(primary)Sepolia:ws://127.0.0.1:8546:http://127.0.0.1:8545 nodeName=Sepolia nodeOrder=100 nodeState=Alive nodeTier=primary version=unset@unset
[DEBUG] RPC call: evmclient.Client#EthSubscribe client/rpc_client.go:352 args=["newHeads"] client=(primary)eth-primary-rpc-0:ws://127.0.0.1:8546:http://127.0.0.1:8545 clientName=eth-primary-rpc-0 clientTier=primary evmChainID=11155111 logger=EVM.11155111.Client.RPC requestID=cf22320f-8d5f-4288-848f-292d4a829a75 version=unset@unset
[DEBUG] evmclient.Client#EthSubscribe RPC call failure client/rpc_client.go:212 args=["newHeads"] callName=EthSubscribe client=(primary)eth-primary-rpc-0:ws://127.0.0.1:8546:http://127.0.0.1:8545 clientName=eth-primary-rpc-0 clientTier=primary duration=0.000009458 err=client is closed evmChainID=11155111 logger=EVM.11155111.Client.RPC requestID=cf22320f-8d5f-4288-848f-292d4a829a75 rpcDomain=127.0.0.1:8545 version=unset@unset
[ERROR] Initial subscribe for heads failed client/node_lifecycle.go:103 chainID=11155111 logger=EVM.11155111.Node.Lifecycle.Alive noNewHeadsTimeoutThreshold=180 node=(primary)Sepolia:ws://127.0.0.1:8546:http://127.0.0.1:8545 nodeName=Sepolia nodeOrder=100 nodeState=Alive nodeTier=primary pollFailureThreshold=5 pollInterval=10
[ERROR] RPC Node is unreachable client/node_fsm.go:211 chainID=11155111 logger=EVM.11155111.Node.Lifecycle node=(primary)Sepolia:ws://127.0.0.1:8546:http://127.0.0.1:8545 nodeName=Sepolia nodeOrder=100 nodeState=Unreachable nodeTier=primary
[DEBUG] Trying to revive unreachable RPC node client/node_lifecycle.go:362 chainID=11155111 logger=EVM.11155111.Node.Lifecycle.Unreachable node=(primary)Sepolia:ws://127.0.0.1:8546:http://127.0.0.1:8545 nodeName=Sepolia nodeOrder=100 nodeState=Unreachable nodeTier=primary version=unset@unset
[DEBUG] RPC dial: evmclient.Client#dial client/rpc_client.go:124 client=(primary)eth-primary-rpc-0:ws://127.0.0.1:8546:http://127.0.0.1:8545 clientName=eth-primary-rpc-0 clientTier=primary evmChainID=11155111 httpuri=http://127.0.0.1:8545 logger=EVM.11155111.Client.RPC version=unset@unset wsuri=ws://127.0.0.1:8546
[DEBUG] RPC dial: evmclient.Client#dial client/rpc_client.go:152 client=(primary)eth-primary-rpc-0:ws://127.0.0.1:8546:http://127.0.0.1:8545 clientName=eth-primary-rpc-0 clientTier=primary evmChainID=11155111 httpuri=ws://127.0.0.1:8546 logger=EVM.11155111.Client.RPC version=unset@unset
[INFO] Successfully redialled and verified RPC node (primary)Sepolia:ws://127.0.0.1:8546:http://127.0.0.1:8545. Node was offline for 1.003383459s client/node_lifecycle.go:393 chainID=11155111 logger=EVM.11155111.Node.Lifecycle.Unreachable node=(primary)Sepolia:ws://127.0.0.1:8546:http://127.0.0.1:8545 nodeName=Sepolia nodeOrder=100 nodeState=Dialed nodeTier=primary version=unset@unset
[INFO] RPC Node is online client/node_fsm.go:133 chainID=11155111 logger=EVM.11155111.Node.Lifecycle node=(primary)Sepolia:ws://127.0.0.1:8546:http://127.0.0.1:8545 nodeName=Sepolia nodeOrder=100 nodeState=Alive nodeTier=primary version=unset@unset
[DEBUG] RPC call: evmclient.Client#EthSubscribe client/rpc_client.go:352 args=["newHeads"] client=(primary)eth-primary-rpc-0:ws://127.0.0.1:8546:http://127.0.0.1:8545 clientName=eth-primary-rpc-0 clientTier=primary evmChainID=11155111 logger=EVM.11155111.Client.RPC requestID=50bee0a1-ab24-4548-a001-f654eee6cf06 version=unset@unset
[DEBUG] Head liveness checking enabled client/node_lifecycle.go:115 chainID=11155111 logger=EVM.11155111.Node.Lifecycle.Alive noNewHeadsTimeoutThreshold=180 node=(primary)Sepolia:ws://127.0.0.1:8546:http://127.0.0.1:8545 nodeName=Sepolia nodeOrder=100 nodeState=Alive nodeTier=primary pollFailureThreshold=5 pollInterval=10 version=unset@unset
[DEBUG] Polling enabled client/node_lifecycle.go:125 chainID=11155111 logger=EVM.11155111.Node.Lifecycle.Alive noNewHeadsTimeoutThreshold=180 node=(primary)Sepolia:ws://127.0.0.1:8546:http://127.0.0.1:8545 nodeName=Sepolia nodeOrder=100 nodeTier=primary pollFailureThreshold=5 pollInterval=10 version=unset@unset
[DEBUG] Polling liveness checking enabled client/node_lifecycle.go:133 chainID=11155111 logger=EVM.11155111.Node.Lifecycle.Alive noNewHeadsTimeoutThreshold=180 node=(primary)Sepolia:ws://127.0.0.1:8546:http://127.0.0.1:8545 nodeName=Sepolia nodeOrder=100 nodeTier=primary pollFailureThreshold=5 pollInterval=10 version=unset@unset
ah interesting. Good catch!
@@ -471,21 +474,24 @@ func TestUnit_NodeLifecycle_outOfSyncLoop(t *testing.T) {
		return node.State() == nodeStateUnreachable
	})
})
-t.Run("if fail to get chainID, transitions to invalidChainID", func(t *testing.T) {
+t.Run("if fail to get chainID, transitions to unreachable", func(t *testing.T) {
Why this change in logic?
This way it's possible to unify the reconnection behaviour and reuse it across multiple state transitions, and it becomes easier to add additional verification steps.
Before the change, error handling for a chainID verification error in the outOfSync loop looked like this:
if err := n.verify(n.nodeCtx); err != nil {
	lggr.Errorw(fmt.Sprintf("Failed to verify out-of-sync RPC node: %v", err), "err", err)
	n.declareInvalidChainID()
	return
}
It was not consistent with the rest of the loops and produced a state transition sequence that seems redundant: outOfSync -> invalidChainID -> unreachable instead of outOfSync -> unreachable.
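With the unified helper, that same spot collapses to the pattern shown earlier, and a chainID fetch failure maps straight to unreachable (a sketch under the same naming assumptions as above):

// Sketch: any non-alive verification outcome becomes a single direct
// transition, avoiding the redundant outOfSync -> invalidChainID ->
// unreachable detour.
state := n.createVerifiedConn(n.nodeCtx, lggr)
if state != nodeStateAlive {
	n.declareState(state)
	return
}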
Co-authored-by: Sam <[email protected]>
Looks pretty good.
Just a couple comments.
common/client/node_lifecycle.go
Outdated
case <-time.After(zombieNodeCheckInterval(n.noNewHeadsThreshold)):
	if n.nLiveNodes != nil {
		if l, _, _ := n.nLiveNodes(); l < 1 {
			lggr.Critical("RPC endpoint is still syncing, but there are no other available nodes. This RPC node will be forcibly moved back into the live pool in a degraded state")
Is this helpful?
Do we know if a syncing node can operate even in a degraded state?
For example, if it is not caught up to the latest heads, then it can't serve heads, check for receipts, or verify transactions sent to it.
Isn't it better to just let it sync first?
common/client/node.go
Outdated
	return state
}

isSyncing, err := n.rpc.IsSyncing(ctx)
Can we put an additional config setting here?
NodeIsSyncingEnabled = false [default]
This will allow us to skip IsSyncing() calls for chains/clients that don't reliably implement this functionality, or have bugs around this behavior.
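A hypothetical sketch of how such a gate could wrap the syncing check, assuming a NodeIsSyncingEnabled accessor on the node pool config (both the accessor name and its placement are illustrative):

// Hypothetical: skip the syncing check entirely when disabled in config,
// so chains whose clients implement eth_syncing unreliably are unaffected.
if !n.nodePoolCfg.NodeIsSyncingEnabled() {
	return nodeStateAlive
}
isSyncing, err := n.rpc.IsSyncing(ctx)
if err != nil {
	lggr.Errorw("Unexpected error while verifying RPC node synchronization status", "err", err, "nodeState", n.State())
	return nodeStateUnreachable
}
if isSyncing {
	return nodeStateSyncing
}
return nodeStateAlive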
…e/BCI-2450-multinode-syncing-state' of github.com:smartcontractkit/chainlink into feature/BCI-2450-multinode-syncing-state
# Conflicts:
#	docs/CHANGELOG.md
docs/CHANGELOG.md
Outdated
@@ -25,6 +26,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
echo "Baz=Val" >> median.env
CL_MEDIAN_ENV="median.env"
```
>>>>>>> develop |
Is this an artifact left over from the merge?
yep, thanks
	return node.State() == nodeStateAlive
})
})
}
nice job on the tests
common/client/node_fsm_test.go
Outdated
func TestNodeState_String(t *testing.T) {
	for _, ns := range allNodeStates {
		assert.NotContains(t, ns.String(), strconv.FormatInt(int64(ns), 10), "Expected node state to have readable name")
what is this testing?
Added a description to the test.
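For context, a plausible reading of the assertion: every state in allNodeStates must render a human-readable name, i.e. String() must not fall back to the raw integer. A hypothetical illustration of the failure mode it guards against (the real String() implementation may differ):

// Hypothetical: a state missing from the String() switch would fall back
// to its numeric value, so ns.String() would contain
// strconv.FormatInt(int64(ns), 10) and the assertion would fail.
func (n nodeState) String() string {
	switch n {
	case nodeStateAlive:
		return "Alive"
	// ... a newly added state missing here falls through ...
	default:
		return fmt.Sprintf("nodeState(%d)", n)
	}
}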
	return node.State() == nodeStateAlive
})
})
t.Run("on successful verification without isSyncing becomes alive", func(t *testing.T) {
nit: This same description is repeated multiple times for different test cases
Yes, but each of them is located in a different test function, and they cover similar transitions: from unreachable to alive, from invalidChainID to alive, etc.
common/client/node_lifecycle.go
Outdated
@@ -483,14 +484,6 @@ func (n *node[CHAIN_ID, HEAD, RPC]) syncingLoop() {
	lggr.Infow(fmt.Sprintf("Successfully verified RPC node. Node was syncing for %s", time.Since(syncingAt)), "nodeState", n.State())
	n.declareAlive()
	return
case <-time.After(zombieNodeCheckInterval(n.noNewHeadsThreshold)):
What was the reason for removing this?
Are we now not forcing the use of a syncing node when it's the only node?
Following your suggestion, I've run some tests, and it turns out that the RPC is not able to properly serve requests while it's syncing, and marking such a node as Alive leads to a lot of error noise.
This came up as a follow-up to testing on canary by QA teams.
If an RPC node is newly set up and still syncing the blockchain, it may not be ready to take client requests. Currently, our MultiNode client marks this node as OutOfSync and logs error statements, which is confusing for some NOPs.
Changes:
- Added a new node state syncing to differentiate it from the outOfSync state.
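A minimal sketch of what the new syncingLoop could look like, assembled from the snippets quoted above (the recheck interval, logger scoping, and exact log wording are assumptions):

// Assumed polling interval; the PR may use a backoff helper instead.
const syncRecheckInterval = 5 * time.Second

// Sketch: poll the RPC until it reports that syncing has finished, then
// declare the node alive; unexpected errors send it to unreachable.
func (n *node[CHAIN_ID, HEAD, RPC]) syncingLoop() {
	syncingAt := time.Now()
	lggr := n.lfcLog // the PR may scope this further, e.g. Named("Syncing")
	for {
		select {
		case <-n.nodeCtx.Done():
			return
		case <-time.After(syncRecheckInterval):
			isSyncing, err := n.rpc.IsSyncing(n.nodeCtx)
			if err != nil {
				lggr.Errorw("Unexpected error while verifying RPC node synchronization status", "err", err, "nodeState", n.State())
				n.declareUnreachable()
				return
			}
			if isSyncing {
				lggr.Errorw("Verification failed: Node is syncing", "nodeState", n.State())
				continue
			}
			lggr.Infow(fmt.Sprintf("Successfully verified RPC node. Node was syncing for %s", time.Since(syncingAt)), "nodeState", n.State())
			n.declareAlive()
			return
		}
	}
}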