Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CCIP-1704 More robust checks of RMN/Chain every phase #583

Merged
merged 11 commits into from
Mar 7, 2024

Conversation

mateusz-sekara
Copy link
Contributor

@mateusz-sekara mateusz-sekara commented Mar 6, 2024

More robust way of verifying if CCIP should halt processing, it's based on four items:

  1. Source chain is healthy (this is verified by checking if source LogPoller saw finality violation)
  2. Dest chain is healthy (this is verified by checking if destination LogPoller saw finality violation)
  3. CommitStore is down (this is verified by checking if CommitStore is down and destination RMN is not cursed)
  4. Source chain is cursed (this is verified by checking if source RMN is not cursed)

Whenever any of the above checks fail, the chain is considered unhealthy and the CCIP should stop
processing messages. Additionally, when the chain is unhealthy, this information is considered "sticky"
and is cached for 30 minutes. This may lead to some false-positives, but in this case, we want to be extra cautious and avoid executing any reorged messages.

Health checks are now verified in every OCR2 phase and during Observation and ShouldTransmit we enforce reading data from chain. Additionally, to reduce the number of calls to the RPC, we cache RMN curse state for 20 seconds.

@mateusz-sekara mateusz-sekara changed the title Checking source curse every phase CCIP-1704 Checking source curse every phase Mar 6, 2024
@mateusz-sekara mateusz-sekara force-pushed the checking-source-curse-every-phase branch from 3643d1d to cd73f9b Compare March 6, 2024 12:49
@mateusz-sekara mateusz-sekara force-pushed the checking-source-curse-every-phase branch from af34ca8 to f2a5cef Compare March 6, 2024 14:30
@mateusz-sekara mateusz-sekara force-pushed the checking-source-curse-every-phase branch from dc89d0c to e64ee14 Compare March 7, 2024 10:37
@mateusz-sekara mateusz-sekara force-pushed the checking-source-curse-every-phase branch from e64ee14 to b706f64 Compare March 7, 2024 10:40
@mateusz-sekara mateusz-sekara force-pushed the checking-source-curse-every-phase branch from 539f1a0 to af33aa8 Compare March 7, 2024 10:50
@mateusz-sekara mateusz-sekara force-pushed the checking-source-curse-every-phase branch from bcd27e6 to 413fe1e Compare March 7, 2024 11:06
@mateusz-sekara mateusz-sekara requested a review from gtklocker March 7, 2024 11:07
@mateusz-sekara mateusz-sekara marked this pull request as ready for review March 7, 2024 11:07
@mateusz-sekara mateusz-sekara requested a review from a team as a code owner March 7, 2024 11:07
@mateusz-sekara mateusz-sekara force-pushed the checking-source-curse-every-phase branch from 413fe1e to 8e057c2 Compare March 7, 2024 11:09
@mateusz-sekara mateusz-sekara force-pushed the checking-source-curse-every-phase branch from 8e057c2 to d95865a Compare March 7, 2024 11:37
@mateusz-sekara mateusz-sekara force-pushed the checking-source-curse-every-phase branch from d95865a to 3a8f100 Compare March 7, 2024 11:37
@mateusz-sekara mateusz-sekara changed the title CCIP-1704 Checking source curse every phase CCIP-1704 More robust checks of RMN/Chain every phase Mar 7, 2024
Comment on lines 145 to 140
c.lggr.Criticalw(
"Source or destination chain is unhealthy",
"sourceChainHealthy", sourceChainHealthy,
"destChainHealthy", destChainHealthy,
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we actually detect finality violations or RMN curse statuses, will we produce a critical error on every OCR round? If so, is there a way to limit the error frequency?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would be the only log produced by the plugin because it won't be able to work so it should not increase the volume of logs, I'm not concerned about that.

@@ -18,6 +18,9 @@ type CommitStoreReader interface {
// Returned Commit Reports have to be sorted by Interval.Min/Interval.Max in ascending order.
GetAcceptedCommitReportsGteTimestamp(ctx context.Context, ts time.Time, confirmations int) ([]CommitStoreReportWithTxMeta, error)

// IsDestChainHealthy returns true if the destination chain is healthy.
IsDestChainHealthy(ctx context.Context) (bool, error)

IsDown(ctx context.Context) (bool, error)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can merge IsDown and IsDestChainHealthy into just a Healthy()? I think its cleaner to just abstract the reader healthiness from the higher levels

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I thought about that, but decided to have these independently because
IsDown(ctx context.Context) (bool, error) requires RPC call which is cached on the Healthcheck side
IsDestChainHealthy(ctx context.Context) (bool, error) is very cheap, because it checks only a single field in LogPoller - no RPC, no db,

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the interface readability level it's definitely better to have them merged, from the performance point of views it's better to have them separated

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest keeping them separated for now and maybe verifying merge possibilities later (maybe after load tests)

@@ -177,6 +177,13 @@ func (o *OnRamp) RouterAddress() (cciptypes.Address, error) {
return ccipcalc.EvmAddrToGeneric(config.Router), nil
}

func (o *OnRamp) IsSourceChainHealthy(context.Context) (bool, error) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

side question - curious what changed in the onramp in 1.5 that forced a new impl here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it was just copy pasted based on 1.2, not sure what has changed, just making them consistent

//go:generate mockery --quiet --name ChainHealthcheck --filename chain_health_mock.go --case=underscore
type ChainHealthcheck interface {
// IsHealthy checks if the chain is healthy and returns true if it is, false otherwise
IsHealthy(ctx context.Context) (bool, error)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe simpler as as single function IsHealthy(ctx context.Context, force bool) (bool, error)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I thought that having separate methods is less error-prone, so it's harder to accidentally pass false instead of true. Don't have a strong opinion here, I can change that to signature you suggested

isSourceCursed bool
)

eg.Go(func() error {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need the granularity here or is just a general "reader is unhealthy" sufficient?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean? I'm running these in parallel, because both of these calls are blocking

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

related to the earlier comment on combining IsDown and Source/DestHealthy. If we have to keep them separate for perf thats fine

@mateusz-sekara mateusz-sekara merged commit 34330f1 into ccip-develop Mar 7, 2024
112 of 135 checks passed
@mateusz-sekara mateusz-sekara deleted the checking-source-curse-every-phase branch March 7, 2024 20:18
asoliman92 pushed a commit that referenced this pull request Jul 31, 2024
More robust way of verifying if CCIP should halt processing, it's based
on four items:
1. Source chain is healthy (this is verified by checking if source
LogPoller saw finality violation)
2. Dest chain is healthy (this is verified by checking if destination
LogPoller saw finality violation)
3. CommitStore is down (this is verified by checking if CommitStore is
down and destination RMN is not cursed)
4. Source chain is cursed (this is verified by checking if source RMN is
not cursed)

Whenever any of the above checks fail, the chain is considered unhealthy
and the CCIP should stop
processing messages. Additionally, when the chain is unhealthy, this
information is considered "sticky"
and is cached for 30 minutes. This may lead to some false-positives, but
in this case, we want to be extra cautious and avoid executing any
reorged messages.

Health checks are now verified in every OCR2 phase and during
Observation and ShouldTransmit we enforce reading data from chain.
Additionally, to reduce the number of calls to the RPC, we cache RMN
curse state for 20 seconds.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants