
[NONEVM-706][Solana] - Refactor TXM + Rebroadcast Expired Tx functionality #946

Open
wants to merge 56 commits into develop

Conversation

@Farber98 (Contributor) commented Nov 26, 2024

Description

Refactors the confirm and sendWithRetry funcs.

Adds a new toggleable TxExpirationRebroadcast routine:

  • Inside the confirm loop, after checking confirmations, we determine whether we need to rebroadcast expired transactions that were not yet processed. Rebroadcasting a transaction involves (see the sketch below this list):
    • Removing the previous tx and creating a new one with an updated blockhash
      • Consideration: the txid is maintained, in case it was set by the caller
    • Calling sendWithRetry directly, without enqueuing
      • Consideration: implicitly this may be "bumping" despite the bumping config being off, since building the rebroadcast tx fetches a new compute unit price that may be higher.
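
A minimal sketch of that flow, using illustrative types and stand-in helper signatures (the store, rebuildWithNewBlockhash, and sendWithRetry shapes here are assumptions, not the PR's actual code):

```go
package txmsketch

import (
	"context"
	"log"
)

// Illustrative types only; the real TXM keeps its own in-memory storage and client interface.
type pendingTx struct {
	id                   string
	lastValidBlockHeight uint64
	state                string // "broadcasted", "processed", "confirmed", "finalized"
}

type txStore struct{ txs map[string]*pendingTx }

// listExpiredBroadcasted mirrors the idea behind ListAllExpiredBroadcastedTxs:
// only txs still in Broadcasted state whose blockhash's lastValidBlockHeight is
// below the current block height count as expired.
func (s *txStore) listExpiredBroadcasted(currHeight uint64) []*pendingTx {
	var expired []*pendingTx
	for _, tx := range s.txs {
		if tx.state == "broadcasted" && tx.lastValidBlockHeight < currHeight {
			expired = append(expired, tx)
		}
	}
	return expired
}

// rebroadcastExpired sketches the flow from the description: drop the old
// attempt, rebuild with a fresh blockhash while keeping the same tx id, and
// send directly without re-enqueuing. Errors are logged and discarded.
func rebroadcastExpired(
	ctx context.Context,
	store *txStore,
	currHeight uint64,
	rebuildWithNewBlockhash func(id string) *pendingTx,
	sendWithRetry func(ctx context.Context, tx *pendingTx) error,
) {
	for _, old := range store.listExpiredBroadcasted(currHeight) {
		delete(store.txs, old.id)                // remove the previous attempt
		fresh := rebuildWithNewBlockhash(old.id) // same id, updated blockhash (and possibly a new compute unit price)
		if err := sendWithRetry(ctx, fresh); err != nil {
			log.Printf("rebroadcast of %s failed, moving on: %v", old.id, err)
			continue
		}
		store.txs[fresh.id] = fresh
	}
}
```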

https://smartcontract-it.atlassian.net/browse/NONEVM-706

prev branch: #928

Merge together with:

Soak Tests:

amit-momin
amit-momin previously approved these changes Dec 11, 2024
// rebroadcastExpiredTxs attempts to rebroadcast all transactions that are in broadcasted state and have expired.
// An expired tx is one whose blockhash's lastValidBlockHeight (last valid block number) is smaller than the current block height (block number).
// If any errors occur during a rebroadcast attempt, they are discarded and the function continues with the next transaction.
func (txm *Txm) rebroadcastExpiredTxs(ctx context.Context, client client.ReaderWriter) {
@aalu1418 (Collaborator) Dec 13, 2024

this feels like there's a potential race condition here

  • tx is broadcast to the network and is included just before blockhash expiration
  • the confirmer does not see it yet because it waits for the confirmed commitment level
  • TXM marks the base transaction as expired because the block threshold has passed
  • it deletes the saved tx id from the in-memory storage and rebroadcasts a tx with a new blockhash
  • this could go on forever, because the TXM can never see the successful tx confirm: the tx id is always deleted beforehand

@Farber98 (Contributor, Author) Dec 13, 2024

Wouldn't this be handled gracefully by the existing code?

The processConfirmations method is called before rebroadcastExpiredTxs and updates each signature's status. There we would detect whether we have transitioned from Broadcasted to Processed.

Then, inside rebroadcastExpiredTxs, we call ListAllExpiredBroadcastedTxs, which only considers transactions that are still in the Broadcasted state.

Contributor

I think we would avoid that scenario since we only call rebroadcastExpiredTxs after processing the latest confirmation status changes. Since this method only processes Broadcasted transactions, the confirmation logic would handle any last-minute state changes before we assess transactions for expiry. So in your scenario, we would notice that the transaction advanced to Processed, mark it accordingly in our in-memory storage, then move on to the expiry check. The expiry check would notice there aren't any expired transactions in the Broadcasted state and just move on.
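
A compact illustration of that ordering, with stubs standing in for the TXM's real methods (only the call order is the point here):

```go
package txmsketch

import "context"

// Stubs standing in for the TXM's real methods; only the ordering matters here.
func processConfirmations(ctx context.Context)  {} // advance each signature's status first
func rebroadcastExpiredTxs(ctx context.Context) {} // then only still-Broadcasted, expired txs are rebroadcast

// One tick of the confirm loop as described above: because confirmation
// processing runs first, a tx that already advanced to Processed is never
// picked up by the expiry check.
func confirmTick(ctx context.Context) {
	processConfirmations(ctx)
	rebroadcastExpiredTxs(ctx)
}
```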

Collaborator

what happens if a transaction skips the processed phase?
processed is just the view of a single RPC node - so if a transaction can go from processed to non-existent, then it might also be possible to go from non-existent to confirmed

it just feels odd to delete all previous attempts at a transaction and then recreate it, rather than just updating the transaction with a new blockhash

Contributor

We would advance the state from non-existent (internally what we consider Broadcasted) to Confirmed and also avoid rebroadcasting with a new hash in that case. Our state transition methods don't require us to move up one step at a time, in case of slow polling or other reasons; we can even jump straight to Finalized. We only delete all previous attempts/sigs if we can confirm that the tx has a non-existent state and its blockhash has expired. I was thinking it was better to clean up those old attempts/sigs than to keep them around and waste cycles on status checks, since they're expired anyway.
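
A sketch of what "don't require us to move up one step at a time" could look like; the state names match the thread, but the transition helper itself is hypothetical:

```go
package txmsketch

type txState int

// State ordering used by this sketch; matches the states named in the thread.
const (
	stateBroadcasted txState = iota
	stateProcessed
	stateConfirmed
	stateFinalized
)

// applyObservedState jumps a tx straight to whatever state the RPC reports,
// so Broadcasted -> Confirmed (or even Finalized) is a legal transition and no
// rebroadcast happens for a tx that skipped the Processed phase.
func applyObservedState(current, observed txState) txState {
	if observed > current {
		return observed
	}
	return current // never move backwards on a stale or lagging RPC view
}
```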

Collaborator

ok - the chances of this race happening might be low

is there logging around a transaction signature that doesn't match any tx id? something that fires if an assumption is broken?

Contributor

If one of the state transition methods gets a sig that doesn't match an ID in the internal map, it should return an ErrSigDoesNotExist error. Depending on where the method was called, we have some logging to surface an assumption violation, but not in all cases: for example, for the confirmation expiration we can assume a different process may have cleaned up the signatures.
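
For what such logging could look like at a call site, a hedged sketch (ErrSigDoesNotExist is named in this thread; the transition callback and logger wiring are illustrative):

```go
package txmsketch

import (
	"errors"
	"log"
)

// ErrSigDoesNotExist is the sentinel named in the thread; defined here only so
// the sketch is self-contained.
var ErrSigDoesNotExist = errors.New("signature does not exist")

// surfaceTransitionError shows one way a caller could flag the broken
// assumption "every signature maps to a stored tx id", while tolerating the
// case where another routine (e.g. expiry cleanup) already removed the sigs.
func surfaceTransitionError(sig string, transition func(sig string) error) {
	err := transition(sig)
	switch {
	case err == nil:
		return
	case errors.Is(err, ErrSigDoesNotExist):
		log.Printf("signature %s does not match any stored tx id (possibly cleaned up by expiry): %v", sig, err)
	default:
		log.Printf("state transition failed for signature %s: %v", sig, err)
	}
}
```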

@@ -1186,3 +1209,432 @@ func addSigAndLimitToTx(t *testing.T, keystore SimpleKeystore, pubkey solana.Pub
require.NoError(t, fees.SetComputeUnitLimit(&txCopy, limit))
return &txCopy
}

func TestTxm_ExpirationRebroadcast(t *testing.T) {
Collaborator

this test only runs against mocked endpoints under ideal behavior - please add additional tests that run against a live validator

@Farber98 (Contributor, Author) Dec 13, 2024

Hey! We've found that testing this scenario without mocking is challenging because we don't have a way to force a transaction to expire in a live validator.

The only solution we've identified involves modifying some methods to simulate this behavior for the tests. However, this approach is not ideal, as it would require adding additional flows solely to make the test pass and to trigger a rebroadcast in the live validator tests.

Do you have any ideas on how to achieve this that we might be missing?

Collaborator

you could potentially introduce some sort of proxy that takes the transaction and returns a transaction signature, but doesn't allow any more transactions through until it knows the first transaction blockhash has expired
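
One possible shape for that proxy, sketched under assumed names (the rpcSender interface, SendTx signature, and signOnly helper are assumptions, not the repo's client.ReaderWriter): while the gate is closed it acknowledges sends without forwarding them, so the blockhash is guaranteed to expire and the rebroadcast path fires against the live validator.

```go
package txmsketch

import (
	"context"
	"sync/atomic"
)

// rpcSender stands in for whatever interface the TXM uses to submit
// transactions; the method name and signature are assumptions for this sketch.
type rpcSender interface {
	SendTx(ctx context.Context, tx []byte) (signature string, err error)
}

// gatedProxy sits between the TXM and a live validator. While the gate is
// closed it acknowledges sends with a locally derived signature but never
// forwards them, so their blockhashes are guaranteed to expire; once the test
// observes that the first blockhash has expired, it opens the gate and the
// rebroadcasts flow through to the real node.
type gatedProxy struct {
	inner    rpcSender
	signOnly func(tx []byte) string // derive the signature without submitting (hypothetical helper)
	open     atomic.Bool
}

// Open lets subsequent transactions through; the test calls this after the
// first transaction's blockhash has expired.
func (p *gatedProxy) Open() { p.open.Store(true) }

func (p *gatedProxy) SendTx(ctx context.Context, tx []byte) (string, error) {
	if !p.open.Load() {
		return p.signOnly(tx), nil // swallow the tx: the network never sees it
	}
	return p.inner.SendTx(ctx, tx)
}
```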

@Farber98 (Contributor, Author) Dec 14, 2024

Would it be okay to address this in a separate ticket under the Testing track? That way we can get this one in before the freeze and finish the re-org work that depends on this PR to complete the implementation track.

@aalu1418 (Collaborator) Dec 16, 2024

why does the freeze impact the timeline? afaik that's just for production environments, not development?

also this PR touches significant portions of the most sensitive part of a solana integration; test coverage as the changes are made is, imo, a must

Collaborator

just clarifying a few things:

  • there is a production code freeze, but that shouldn't impact development here
  • i'd like to see an attempt at a proper integration test, but it can be time-boxed to a day or two (if it still doesn't work, then a follow-up ticket is fine)

Contributor (Author)

Thanks for clarifying, I thought the freeze involved not merging new code into the chainlink repo, even if it was development 😅. I'll start working on these integration tests.
