CCIP-1704 More robust checks of RMN/Chain every phase #583

mateusz-sekara · 2024-03-06T09:40:54Z

A more robust way of verifying whether CCIP should halt processing. It is based on four checks:

1. Source chain is healthy (verified by checking whether the source LogPoller saw a finality violation)
2. Dest chain is healthy (verified by checking whether the destination LogPoller saw a finality violation)
3. CommitStore is not down (verified by checking that CommitStore is not down and the destination RMN is not cursed)
4. Source chain is not cursed (verified by checking that the source RMN is not cursed)

Whenever any of the above checks fails, the chain is considered unhealthy and CCIP should stop
processing messages. Additionally, when the chain is unhealthy, this information is considered "sticky"
and is cached for 30 minutes. This may lead to some false positives, but in this case we want to be extra cautious and avoid executing any reorged messages.

Health checks are now verified in every OCR2 phase, and during Observation and ShouldTransmit we enforce reading data from the chain. Additionally, to reduce the number of RPC calls, we cache the RMN curse state for 20 seconds.

roman-kashitsyn · 2024-03-07T14:48:42Z

core/services/ocr2/plugins/ccip/internal/cache/chain_health.go

+		c.lggr.Criticalw(
+			"Source or destination chain is unhealthy",
+			"sourceChainHealthy", sourceChainHealthy,
+			"destChainHealthy", destChainHealthy,
+		)


If we actually detect finality violations or RMN curse statuses, will we produce a critical error on every OCR round? If so, is there a way to limit the error frequency?

This would be the only log produced by the plugin, because it won't be able to do any work, so it should not increase the volume of logs. I'm not concerned about that.

connorwstein · 2024-03-07T15:35:29Z

core/services/ocr2/plugins/ccip/cciptypes/commitstore.go

@@ -18,6 +18,9 @@ type CommitStoreReader interface {
 	// Returned Commit Reports have to be sorted by Interval.Min/Interval.Max in ascending order.
 	GetAcceptedCommitReportsGteTimestamp(ctx context.Context, ts time.Time, confirmations int) ([]CommitStoreReportWithTxMeta, error)

+	// IsDestChainHealthy returns true if the destination chain is healthy.
+	IsDestChainHealthy(ctx context.Context) (bool, error)
+
 	IsDown(ctx context.Context) (bool, error)


Can we merge IsDown and IsDestChainHealthy into just a Healthy()? I think it's cleaner to just abstract the reader healthiness from the higher levels.

Yeah, I thought about that, but decided to keep them independent because:
IsDown(ctx context.Context) (bool, error) requires an RPC call, which is cached on the Healthcheck side
IsDestChainHealthy(ctx context.Context) (bool, error) is very cheap, because it only checks a single field in LogPoller: no RPC, no DB

From the interface readability standpoint it's definitely better to have them merged; from the performance point of view it's better to have them separated.

I would suggest keeping them separated for now and maybe revisiting the merge later (maybe after load tests).

connorwstein · 2024-03-07T15:38:39Z

core/services/ocr2/plugins/ccip/internal/ccipdata/v1_5_0/onramp.go

@@ -177,6 +177,13 @@ func (o *OnRamp) RouterAddress() (cciptypes.Address, error) {
 	return ccipcalc.EvmAddrToGeneric(config.Router), nil
 }

+func (o *OnRamp) IsSourceChainHealthy(context.Context) (bool, error) {


side question - curious what changed in the onramp in 1.5 that forced a new impl here?

I think it was just copy-pasted from 1.2. Not sure what has changed; I'm just making them consistent.

connorwstein · 2024-03-07T15:43:42Z

core/services/ocr2/plugins/ccip/internal/cache/chain_health.go

+//go:generate mockery --quiet --name ChainHealthcheck --filename chain_health_mock.go --case=underscore
+type ChainHealthcheck interface {
+	// IsHealthy checks if the chain is healthy and returns true if it is, false otherwise
+	IsHealthy(ctx context.Context) (bool, error)


maybe simpler as a single function IsHealthy(ctx context.Context, force bool) (bool, error)

Hmm, I thought that having separate methods is less error-prone, since it's harder to accidentally pass false instead of true. I don't have a strong opinion here; I can change it to the signature you suggested.
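The two API shapes under discussion can be compared side by side in a small sketch. Everything here is hypothetical: `cachedCheck`, `IsHealthyFlag`, and `IsHealthyFresh` are illustrative names (the second method's real name is not shown in the diff above), and the context parameter and error return are omitted for brevity.

```go
package main

import "fmt"

// cachedCheck is a hypothetical healthcheck that caches its last result.
type cachedCheck struct {
	fetch   func() bool // assumed on-chain read
	cached  bool
	primed  bool
	fetches int
}

// refresh reads the value through fetch and stores it in the cache.
func (c *cachedCheck) refresh() bool {
	c.cached, c.primed = c.fetch(), true
	c.fetches++
	return c.cached
}

// Shape A: one method, behaviour selected by a boolean argument.
func (c *cachedCheck) IsHealthyFlag(force bool) bool {
	if force || !c.primed {
		return c.refresh()
	}
	return c.cached
}

// Shape B: two methods, so the call site states the intent by name.
func (c *cachedCheck) IsHealthy() bool      { return c.IsHealthyFlag(false) }
func (c *cachedCheck) IsHealthyFresh() bool { return c.IsHealthyFlag(true) }

func main() {
	c := &cachedCheck{fetch: func() bool { return true }}
	fmt.Println(c.IsHealthy(), c.fetches)      // first call populates the cache
	fmt.Println(c.IsHealthy(), c.fetches)      // served from the cache
	fmt.Println(c.IsHealthyFresh(), c.fetches) // forces another read
}
```

With shape A a reviewer has to know what `true` means at each call site; with shape B the method name carries that meaning, which is the error-proneness argument made in the reply.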

connorwstein · 2024-03-07T15:49:01Z

core/services/ocr2/plugins/ccip/internal/cache/chain_health.go

+		isSourceCursed    bool
+	)
+
+	eg.Go(func() error {


do we need the granularity here or is just a general "reader is unhealthy" sufficient?

What do you mean? I'm running these in parallel, because both of these calls are blocking

Related to the earlier comment on combining IsDown and Source/DestHealthy. If we have to keep them separate for perf, that's fine.

Co-authored-by: Roman Kashitsyn <[email protected]>

mateusz-sekara changed the title ~~Checking source curse every phase~~ CCIP-1704 Checking source curse every phase Mar 6, 2024

mateusz-sekara force-pushed the checking-source-curse-every-phase branch from 3643d1d to cd73f9b Compare March 6, 2024 12:49

mateusz-sekara force-pushed the checking-source-curse-every-phase branch from af34ca8 to f2a5cef Compare March 6, 2024 14:30

mateusz-sekara force-pushed the checking-source-curse-every-phase branch from dc89d0c to e64ee14 Compare March 7, 2024 10:37

mateusz-sekara force-pushed the checking-source-curse-every-phase branch from e64ee14 to b706f64 Compare March 7, 2024 10:40

mateusz-sekara force-pushed the checking-source-curse-every-phase branch from 539f1a0 to af33aa8 Compare March 7, 2024 10:50

mateusz-sekara force-pushed the checking-source-curse-every-phase branch from bcd27e6 to 413fe1e Compare March 7, 2024 11:06

mateusz-sekara requested a review from gtklocker March 7, 2024 11:07

mateusz-sekara marked this pull request as ready for review March 7, 2024 11:07

mateusz-sekara requested a review from a team as a code owner March 7, 2024 11:07

mateusz-sekara force-pushed the checking-source-curse-every-phase branch from 413fe1e to 8e057c2 Compare March 7, 2024 11:09

mateusz-sekara force-pushed the checking-source-curse-every-phase branch from 8e057c2 to d95865a Compare March 7, 2024 11:37

mateusz-sekara force-pushed the checking-source-curse-every-phase branch from d95865a to 3a8f100 Compare March 7, 2024 11:37

mateusz-sekara requested a review from connorwstein March 7, 2024 12:39

mateusz-sekara changed the title ~~CCIP-1704 Checking source curse every phase~~ CCIP-1704 More robust checks of RMN/Chain every phase Mar 7, 2024

This was referenced Mar 7, 2024

CCIP-1704 Healthcheck metrics #592

Merged

CCIP-1730 Implementing Healthy function in LogPoller #584

Merged

roman-kashitsyn approved these changes Mar 7, 2024

connorwstein reviewed Mar 7, 2024

mateusz-sekara requested a review from connorwstein March 7, 2024 16:46

mateusz-sekara mentioned this pull request Mar 7, 2024

CCIP-1730 Implementing Healthy function in LogPoller #593

Closed

mateusz-sekara and others added 11 commits March 7, 2024 20:41

Using cache for keeping information about chain state

9b8f3e3

Minor refactoring

f23468b

Adding IsChainHealthy checks

7f4b301

Docs

0056d23

Minor fixes

b91f223

Update core/services/ocr2/plugins/ccip/internal/cache/chain_health.go

3c5613b

Co-authored-by: Roman Kashitsyn <[email protected]>

Update core/services/ocr2/plugins/ccip/internal/cache/chain_health.go

a1998b2

Co-authored-by: Roman Kashitsyn <[email protected]>

Return errors whenever chain healthcheck is down

e065e36

Adding more context to logs

31f0480

Post review fixes

e2c9e95

Post rebase fixes

2415780

mateusz-sekara force-pushed the checking-source-curse-every-phase branch from af389fe to 2415780 Compare March 7, 2024 19:43

mateusz-sekara force-pushed the checking-source-curse-every-phase branch from d3f23d4 to 2415780 Compare March 7, 2024 19:53

connorwstein approved these changes Mar 7, 2024

mateusz-sekara merged commit 34330f1 into ccip-develop Mar 7, 2024
112 of 135 checks passed

mateusz-sekara deleted the checking-source-curse-every-phase branch March 7, 2024 20:18
