Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CCIP-1730 Implementing Healthy function in LogPoller #584

Merged
merged 7 commits into from
Mar 7, 2024

Conversation

mateusz-sekara
Copy link
Contributor

@mateusz-sekara mateusz-sekara commented Mar 6, 2024

Required for finality violation checks #583

@mateusz-sekara mateusz-sekara force-pushed the logpoller-reorg-check branch from 6166a8a to 0a1aca4 Compare March 7, 2024 08:12
@mateusz-sekara mateusz-sekara changed the title CCIP-1730 PoC exposing isHealthy function in LogPoller CCIP-1730 Exposing isHealthy function in LogPoller Mar 7, 2024
@mateusz-sekara mateusz-sekara force-pushed the logpoller-reorg-check branch from 0a1aca4 to 2841718 Compare March 7, 2024 08:13
@mateusz-sekara mateusz-sekara marked this pull request as ready for review March 7, 2024 08:13
@mateusz-sekara mateusz-sekara requested a review from a team as a code owner March 7, 2024 08:13
@mateusz-sekara mateusz-sekara changed the title CCIP-1730 Exposing isHealthy function in LogPoller CCIP-1730 Implementing ready function in LogPoller Mar 7, 2024
Copy link
Contributor

@makramkd makramkd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to go from finalityViolated == true to finalityViolated == false during normal operation without restarting and changing finality depth manually?

@mateusz-sekara
Copy link
Contributor Author

Is it possible to go from finalityViolated == true to finalityViolated == false during normal operation without restarting and changing finality depth manually?

Deleting blocks and logs from the database doesn't require a restart. Please see the test, this case is included there ;)

@mateusz-sekara mateusz-sekara requested review from connorwstein, reductionista and makramkd and removed request for reductionista March 7, 2024 12:57
@@ -203,6 +213,13 @@ func (filter *Filter) Contains(other *Filter) bool {
return true
}

func (lp *logPoller) Ready() error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR looks good overall but what's the difference between Ready and Healthy in this context? We already have a HealthReport method that calls lp.Healthy() but since that's not implemented on the LP itself it calls it on the StateMachine object from chainlink-common which is embedded in the LP struct.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The added "benefit" of implementing Healthy is that HealthReport will return a meaningful report now since we return whether finality has been violated.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keep in mind I'm not 100% sure how these functions are consumed outside of this particular context we want to consume them in now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh nice catch, I wasn't aware of healthy function when going through the interfaces. @reductionista could you please advise which is better for this case?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe healthy/ready are inherited from k8s probes. If not ready k8s won't send traffic yet and if not healthy k8s will restart the node. We only expose the ready one to k8s directly

// NOTE: We only implement the k8s readiness check, *not* the liveness check. Liveness checks are only recommended in cases
and have a separate health report. Healthy definitely makes more sense here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that Healthy() error is not present in LogPoller's interface, probably it has to be added first 🤔

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's in the services.StateMachine that's embedded in the logPoller struct - we have to override it to get useful behavior IIUC.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah you mean the LogPoller interface - that's a good point. We should add it there if we want clients to call it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, already added it, I'm just a bit worried about what @connorwstein said about that function being used for k8s tooling. If that's the case, then we don't want to restart pod whenever LogPoller is marked as not healthy.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Who is the SME on this? Maybe someone from infra or maybe core?

@@ -924,6 +943,7 @@ func (lp *logPoller) findBlockAfterLCA(ctx context.Context, current *evmtypes.He
lp.lggr.Criticalw("Reorg greater than finality depth detected", "finalityTag", lp.useFinalityTag, "current", current.Number, "latestFinalized", latestFinalizedBlockNumber)
rerr := errors.New("Reorg greater than finality depth")
lp.SvcErrBuffer.Append(rerr)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice if we can use this but the semantics seem weird, it seems like the buffer is flushed whenever you read it, so it's not really super usable for our use case.

@@ -203,6 +213,13 @@ func (filter *Filter) Contains(other *Filter) bool {
return true
}

func (lp *logPoller) Ready() error {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe healthy/ready are inherited from k8s probes. If not ready k8s won't send traffic yet and if not healthy k8s will restart the node. We only expose the ready one to k8s directly

// NOTE: We only implement the k8s readiness check, *not* the liveness check. Liveness checks are only recommended in cases
and have a separate health report. Healthy definitely makes more sense here

@@ -758,9 +775,11 @@ func (lp *logPoller) getCurrentBlockMaybeHandleReorg(ctx context.Context, curren
// We return an error here which will cause us to restart polling from lastBlockSaved + 1
return nil, err2
}
lp.finalityViolated.Store(false)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think still an edge case here: say a finality violation is detected and you wipe the db. We'll hit the "first poll ever on new chain" return case here and leave the bool true. I think you want a defer where any non-error return value from this function we set to false

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So you suggest marking it as false at the end of PollAndSave tick? It would be definitely easier if errors were bubbled up, but still doable

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

imagining something like this, can be a follow up

func (lp *logPoller) getCurrentBlockMaybeHandleReorg(ctx context.Context, currentBlockNumber int64, currentBlock *evmtypes.Head) (*evmtypes.Head, err error) {
	defer func() {
                // Note careful about shadowing etc
		if err != nil {
			lp.finalityViolated.Store(false)
		}
	}()

@mateusz-sekara mateusz-sekara changed the title CCIP-1730 Implementing ready function in LogPoller CCIP-1730 Implementing Healthy function in LogPoller Mar 7, 2024
@@ -758,9 +775,11 @@ func (lp *logPoller) getCurrentBlockMaybeHandleReorg(ctx context.Context, curren
// We return an error here which will cause us to restart polling from lastBlockSaved + 1
return nil, err2
}
lp.finalityViolated.Store(false)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

imagining something like this, can be a follow up

func (lp *logPoller) getCurrentBlockMaybeHandleReorg(ctx context.Context, currentBlockNumber int64, currentBlock *evmtypes.Head) (*evmtypes.Head, err error) {
	defer func() {
                // Note careful about shadowing etc
		if err != nil {
			lp.finalityViolated.Store(false)
		}
	}()

@mateusz-sekara mateusz-sekara merged commit 9f70ea4 into ccip-develop Mar 7, 2024
134 of 135 checks passed
@mateusz-sekara mateusz-sekara deleted the logpoller-reorg-check branch March 7, 2024 19:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants