-
Notifications
You must be signed in to change notification settings - Fork 56
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CCIP-1704 More robust checks of RMN/Chain every phase #583
CCIP-1704 More robust checks of RMN/Chain every phase #583
Conversation
3643d1d
to
cd73f9b
Compare
af34ca8
to
f2a5cef
Compare
dc89d0c
to
e64ee14
Compare
e64ee14
to
b706f64
Compare
539f1a0
to
af33aa8
Compare
bcd27e6
to
413fe1e
Compare
413fe1e
to
8e057c2
Compare
8e057c2
to
d95865a
Compare
d95865a
to
3a8f100
Compare
c.lggr.Criticalw( | ||
"Source or destination chain is unhealthy", | ||
"sourceChainHealthy", sourceChainHealthy, | ||
"destChainHealthy", destChainHealthy, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we actually detect finality violations or RMN curse statuses, will we produce a critical error on every OCR round? If so, is there a way to limit the error frequency?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would be the only log produced by the plugin because it won't be able to work so it should not increase the volume of logs, I'm not concerned about that.
@@ -18,6 +18,9 @@ type CommitStoreReader interface { | |||
// Returned Commit Reports have to be sorted by Interval.Min/Interval.Max in ascending order. | |||
GetAcceptedCommitReportsGteTimestamp(ctx context.Context, ts time.Time, confirmations int) ([]CommitStoreReportWithTxMeta, error) | |||
|
|||
// IsDestChainHealthy returns true if the destination chain is healthy. | |||
IsDestChainHealthy(ctx context.Context) (bool, error) | |||
|
|||
IsDown(ctx context.Context) (bool, error) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can merge IsDown and IsDestChainHealthy into just a Healthy()? I think its cleaner to just abstract the reader healthiness from the higher levels
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I thought about that, but decided to have these independently because
IsDown(ctx context.Context) (bool, error)
requires RPC call which is cached on the Healthcheck side
IsDestChainHealthy(ctx context.Context) (bool, error)
is very cheap, because it checks only a single field in LogPoller - no RPC, no db,
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From the interface readability level it's definitely better to have them merged, from the performance point of views it's better to have them separated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would suggest keeping them separated for now and maybe verifying merge possibilities later (maybe after load tests)
@@ -177,6 +177,13 @@ func (o *OnRamp) RouterAddress() (cciptypes.Address, error) { | |||
return ccipcalc.EvmAddrToGeneric(config.Router), nil | |||
} | |||
|
|||
func (o *OnRamp) IsSourceChainHealthy(context.Context) (bool, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
side question - curious what changed in the onramp in 1.5 that forced a new impl here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it was just copy pasted based on 1.2, not sure what has changed, just making them consistent
//go:generate mockery --quiet --name ChainHealthcheck --filename chain_health_mock.go --case=underscore | ||
type ChainHealthcheck interface { | ||
// IsHealthy checks if the chain is healthy and returns true if it is, false otherwise | ||
IsHealthy(ctx context.Context) (bool, error) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe simpler as as single function IsHealthy(ctx context.Context, force bool) (bool, error)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I thought that having separate methods is less error-prone, so it's harder to accidentally pass false instead of true. Don't have a strong opinion here, I can change that to signature you suggested
isSourceCursed bool | ||
) | ||
|
||
eg.Go(func() error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need the granularity here or is just a general "reader is unhealthy" sufficient?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you mean? I'm running these in parallel, because both of these calls are blocking
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
related to the earlier comment on combining IsDown and Source/DestHealthy. If we have to keep them separate for perf thats fine
Co-authored-by: Roman Kashitsyn <[email protected]>
Co-authored-by: Roman Kashitsyn <[email protected]>
af389fe
to
2415780
Compare
d3f23d4
to
2415780
Compare
More robust way of verifying if CCIP should halt processing, it's based on four items: 1. Source chain is healthy (this is verified by checking if source LogPoller saw finality violation) 2. Dest chain is healthy (this is verified by checking if destination LogPoller saw finality violation) 3. CommitStore is down (this is verified by checking if CommitStore is down and destination RMN is not cursed) 4. Source chain is cursed (this is verified by checking if source RMN is not cursed) Whenever any of the above checks fail, the chain is considered unhealthy and the CCIP should stop processing messages. Additionally, when the chain is unhealthy, this information is considered "sticky" and is cached for 30 minutes. This may lead to some false-positives, but in this case, we want to be extra cautious and avoid executing any reorged messages. Health checks are now verified in every OCR2 phase and during Observation and ShouldTransmit we enforce reading data from chain. Additionally, to reduce the number of calls to the RPC, we cache RMN curse state for 20 seconds.
More robust way of verifying if CCIP should halt processing, it's based on four items:
Whenever any of the above checks fail, the chain is considered unhealthy and the CCIP should stop
processing messages. Additionally, when the chain is unhealthy, this information is considered "sticky"
and is cached for 30 minutes. This may lead to some false-positives, but in this case, we want to be extra cautious and avoid executing any reorged messages.
Health checks are now verified in every OCR2 phase and during Observation and ShouldTransmit we enforce reading data from chain. Additionally, to reduce the number of calls to the RPC, we cache RMN curse state for 20 seconds.