Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement ZK Overflow Status Checker and Idempotency Key Management #13496

Closed
wants to merge 6 commits into from

Conversation

0xnogo
Copy link
Collaborator

@0xnogo 0xnogo commented Jun 11, 2024

Motivation

  • Improve handling of ZK overflow chains in CCIP
  • Implement a mechanism to query transaction statuses using the Transaction Manager (TXM): hence avoiding having a shared state between Transmission and Execution Plugin

Solution

  • Status Checker Implementation:
    • Implemented CheckMessageStatus to retrieve transaction statuses.
    • Query TXM to get all the transactions sent for a specific messageId.
  • The method will be used by the Transmission component to obtain the next retry count and construct the IdempotencyKey.
  • The Execution plugin will use this method to retrieve all transaction statuses and snooze a message if any status is Fatal (indicating ZK overflow).
  • Transmission changes are backward compatible with non-ZK messages. A unique IdempotencyKey is set for each batch, ensuring the TXM submits the transaction without a no-op. This change does not impact other products as it checks if TxMeta contains the MessageIDs field (specific to CCIP) and if the status checker is set. Returning a nil IdempotencyKey maintains the existing behavior.

Notes

@0xnogo 0xnogo changed the title status checker Implement ZK Overflow Status Checker and Idempotency Key Management Jun 11, 2024
@0xnogo 0xnogo requested a review from a team June 11, 2024 09:20
@0xnogo 0xnogo marked this pull request as ready for review June 11, 2024 11:36
@0xnogo 0xnogo requested a review from a team as a code owner June 11, 2024 11:36
Comment on lines +14 to +22
// MockTxManager is a mock implementation of TxManager
type MockTxManager struct {
mock.Mock
}

func (m *MockTxManager) GetTransactionStatus(ctx context.Context, transactionID string) (TransactionStatus, error) {
args := m.Called(ctx, transactionID)
return args.Get(0).(TransactionStatus), args.Error(1)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can generate this using mockery, avoid hand-writing mocks (at least in this repo)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is supposed to mock a TXM object. As the PR is not ready yet decided to mock my fake TxManager. I think we should use the TXM mock present in TXM mock here

transactionID := fmt.Sprintf("%s-%d", msgID, counter)
status, err := tsc.txManager.GetTransactionStatus(ctx, transactionID)
if err != nil {
if status == Unknown && strings.Contains(err.Error(), fmt.Sprintf("failed to find transaction with IdempotencyKey %s", transactionID)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If err != nil then status may not have a usable value, or at least, shouldn't. Lets make sure of the API contract here because this is kinda unidiomatic code

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://github.com/smartcontractkit/chainlink/pull/13040/files

func (b *Txm[CHAIN_ID, HEAD, ADDR, TX_HASH, BLOCK_HASH, R, SEQ, FEE]) GetTransactionStatus(ctx context.Context, transactionID string) (status commontypes.TransactionStatus, err error) {
	// Loads attempts and receipts in the transaction
	tx, err := b.txStore.FindTxWithIdempotencyKey(ctx, transactionID, b.chainID)
	if err != nil {
		return status, fmt.Errorf("failed to find transaction with IdempotencyKey %s: %w", transactionID, err)
	}
	// This check is required since a no-rows error returns nil err
	if tx == nil {
		return status, fmt.Errorf("failed to find transaction with IdempotencyKey %s", transactionID)
	}
	switch tx.State {
	case TxUnconfirmed, TxConfirmedMissingReceipt:
		// Return unconfirmed for ConfirmedMissingReceipt since a receipt is required to determine if it is finalized
		return commontypes.Unconfirmed, nil
	case TxConfirmed:
		// TODO: Check for finality and return finalized status
		// Return unconfirmed if tx receipt's block is newer than the latest finalized block
		return commontypes.Unconfirmed, nil
	case TxFatalError:
		// Use an ErrorClassifier to determine if the transaction is considered Fatal
		txErr := b.newErrorClassifier(tx.GetError())
		if txErr != nil && txErr.IsFatal() {
			return commontypes.Fatal, tx.GetError()
		}
		// Return failed for all other tx's marked as FatalError
		return commontypes.Failed, tx.GetError()
	default:
		// Unstarted and InProgress are classified as unknown since they are not supported by the ChainWriter interface
		return commontypes.Unknown, nil
	}
}

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Status will always return the default value Unknown in this case. But I agree that this check is useless. We could just rely on the error check.

counter++
}

return allStatuses, counter - 1, nil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need all statuses or just the latest is good enough? Need all right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make sense to put a cap (even if its like 1,000) on the # of statuses to return to avoid getting into an infinite loop here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably would want to skip that msg anyway if it was retried 1000s of times

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need all statuses or just the latest is good enough? Need all right?

Yes we would need all. The main scenario is a message (mi-0) submitted at round r0 and only gets resolved at r2. if we fetch the status mi-2 we won't know what the message is a zko has only mi-0 resolved

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make sense to put a cap (even if its like 1,000) on the # of statuses to return to avoid getting into an infinite loop here?

Good point. And yes, we should probably return an error and consider that something is wrong with the message. So returning an error here will signal the batching logic to snooze it.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rough math:
deltaRound of 30s and 1000 retries so that 1000*30/3600 = 8h.

Say the chain has an outage, we will allow up to 8 hours of outage before dropping the message. It is unlikely to happen but it is leaving us enough room

core/services/ocrcommon/transmitter.go Show resolved Hide resolved
@0xnogo
Copy link
Collaborator Author

0xnogo commented Jun 14, 2024

Closing this PR in favor of smartcontractkit/ccip#947

@0xnogo 0xnogo closed this Jun 14, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants