CCIP-1730 Implementing Healthy function in LogPoller #584
Conversation
Is it possible to go from finalityViolated == true to finalityViolated == false during normal operation, without restarting and changing finality depth manually?
Deleting blocks and logs from the database doesn't require a restart. Please see the test; this case is included there ;)
@@ -203,6 +213,13 @@ func (filter *Filter) Contains(other *Filter) bool {
	return true
}

func (lp *logPoller) Ready() error {
This PR looks good overall, but what's the difference between Ready and Healthy in this context? We already have a HealthReport method that calls lp.Healthy(), but since that's not implemented on the LP itself, it calls it on the StateMachine object from chainlink-common that is embedded in the LP struct.
The added "benefit" of implementing Healthy is that HealthReport will return a meaningful report now, since we return whether finality has been violated.
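
For illustration only, a minimal self-contained sketch of the override mechanic being discussed here; the type names, error message, and HealthReport body are hypothetical stand-ins, not the actual chainlink-common or logPoller code:

package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// stateMachine is a hypothetical stand-in for the embedded services.StateMachine:
// its default Healthy just mirrors Ready, i.e. "started or not".
type stateMachine struct{ started atomic.Bool }

func (s *stateMachine) Ready() error {
	if !s.started.Load() {
		return errors.New("not started")
	}
	return nil
}

func (s *stateMachine) Healthy() error { return s.Ready() }

// logPoller embeds the state machine, so it inherits Ready/Healthy by method
// promotion unless it defines its own.
type logPoller struct {
	stateMachine
	finalityViolated atomic.Bool
}

// Overriding Healthy shadows the promoted method, so a HealthReport built on top
// of it carries the finality signal instead of only the started/stopped state.
func (lp *logPoller) Healthy() error {
	if lp.finalityViolated.Load() {
		return errors.New("finality violated")
	}
	return nil
}

func (lp *logPoller) HealthReport() map[string]error {
	return map[string]error{"LogPoller": lp.Healthy()}
}

func main() {
	lp := &logPoller{}
	lp.started.Store(true)
	lp.finalityViolated.Store(true)
	fmt.Println(lp.HealthReport()) // map[LogPoller:finality violated]
}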
Keep in mind I'm not 100% sure how these functions are consumed outside of this particular context we want to consume them in now.
Oh, nice catch, I wasn't aware of the healthy function when going through the interfaces. @reductionista could you please advise which is better for this case?
I believe healthy/ready are inherited from k8s probes. If not ready k8s won't send traffic yet and if not healthy k8s will restart the node. We only expose the ready one to k8s directly
ccip/core/web/health_controller.go
Line 27 in 2f10153
// NOTE: We only implement the k8s readiness check, *not* the liveness check. Liveness checks are only recommended in cases
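
To illustrate the "only readiness is exposed to k8s" point, a hypothetical standalone endpoint sketch (not the actual health_controller implementation; the route, port, and types here are made up):

package main

import "net/http"

// readyChecker is a hypothetical stand-in for whatever aggregates service readiness.
type readyChecker interface {
	Ready() error
}

type alwaysReady struct{}

func (alwaysReady) Ready() error { return nil }

// registerHealthRoute wires up a readiness endpoint only; no liveness probe is
// exposed, so an unhealthy-but-running service does not get its pod restarted.
func registerHealthRoute(mux *http.ServeMux, c readyChecker) {
	mux.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
		if err := c.Ready(); err != nil {
			http.Error(w, err.Error(), http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})
}

func main() {
	mux := http.NewServeMux()
	registerHealthRoute(mux, alwaysReady{})
	_ = http.ListenAndServe(":6688", mux)
}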
It seems that Healthy() error is not present in LogPoller's interface; it probably has to be added first 🤔
It's in the services.StateMachine that's embedded in the logPoller struct - we have to override it to get useful behavior IIUC.
Ah, you mean the LogPoller interface - that's a good point. We should add it there if we want clients to call it.
Yeah, already added it. I'm just a bit worried about what @connorwstein said about that function being used for k8s tooling. If that's the case, then we don't want to restart the pod whenever LogPoller is marked as not healthy.
Who is the SME on this? Maybe someone from infra or maybe core?
@@ -924,6 +943,7 @@ func (lp *logPoller) findBlockAfterLCA(ctx context.Context, current *evmtypes.He
	lp.lggr.Criticalw("Reorg greater than finality depth detected", "finalityTag", lp.useFinalityTag, "current", current.Number, "latestFinalized", latestFinalizedBlockNumber)
	rerr := errors.New("Reorg greater than finality depth")
	lp.SvcErrBuffer.Append(rerr)
It would be nice if we could use this, but the semantics seem weird: it looks like the buffer is flushed whenever you read it, so it's not really usable for our use case.
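
To make the flush-on-read concern concrete, a hypothetical error buffer (not the actual SvcErrBuffer API) showing why a second health check would no longer see the violation:

package main

import (
	"errors"
	"fmt"
	"sync"
)

// errBuffer is a hypothetical stand-in for a flush-on-read error buffer.
type errBuffer struct {
	mu   sync.Mutex
	errs []error
}

func (b *errBuffer) Append(err error) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.errs = append(b.errs, err)
}

// Flush returns the accumulated errors and clears the buffer, so a second
// reader (e.g. another health check) no longer sees the violation.
func (b *errBuffer) Flush() []error {
	b.mu.Lock()
	defer b.mu.Unlock()
	out := b.errs
	b.errs = nil
	return out
}

func main() {
	var buf errBuffer
	buf.Append(errors.New("Reorg greater than finality depth"))
	fmt.Println(len(buf.Flush())) // 1: the first reader sees the error
	fmt.Println(len(buf.Flush())) // 0: the signal is gone on the next read
}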
@@ -758,9 +775,11 @@ func (lp *logPoller) getCurrentBlockMaybeHandleReorg(ctx context.Context, curren
		// We return an error here which will cause us to restart polling from lastBlockSaved + 1
		return nil, err2
	}
	lp.finalityViolated.Store(false)
I think there's still an edge case here: say a finality violation is detected and you wipe the db. We'll hit the "first poll ever on new chain" return case here and leave the bool true. I think you want a defer where, on any non-error return from this function, we set it to false.
So you suggest marking it as false at the end of the PollAndSave tick? It would definitely be easier if errors were bubbled up, but it's still doable.
Imagining something like this (could be a follow-up):
func (lp *logPoller) getCurrentBlockMaybeHandleReorg(ctx context.Context, currentBlockNumber int64, currentBlock *evmtypes.Head) (head *evmtypes.Head, err error) {
	defer func() {
		// Note: be careful about shadowing err in the body; the deferred check
		// only sees the named return value. Clear the flag on any non-error return.
		if err == nil {
			lp.finalityViolated.Store(false)
		}
	}()
	// ... rest of the function unchanged
Required for finality violation checks #583