-
Notifications
You must be signed in to change notification settings - Fork 44
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[NONEVM-706][Solana] - Refactor TXM + Rebroadcast Expired Tx functionality #946
base: develop
Are you sure you want to change the base?
Conversation
// rebroadcastExpiredTxs attempts to rebroadcast all transactions that are in broadcasted state and have expired. | ||
// An expired tx is one where it's blockhash lastValidBlockHeight (last valid block number) is smaller than the current block height (block number). | ||
// If any error occurs during rebroadcast attempt, they are discarded, and the function continues with the next transaction. | ||
func (txm *Txm) rebroadcastExpiredTxs(ctx context.Context, client client.ReaderWriter) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this feels like there's a potential race condition here
- tx is broadcast to the network and is included just before blockhash expiration
- confirmer does not see it yet because it waits for
confirmed
commitment level - TXM marks a base transaction as expired because the block threshold has passed
- deletes the saved tx id from the in-memory storage, and rebroadcasts a tx with a new blockhash
- this could go on forever because it can never confirm that a successful tx is confirmed because the tx.id is always deleted before
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't this be handled gracefully by the existing code?
The processConfirmations
method is called before rebroadcastExpiredTxs
updating each signature status. Here we would detect if we have transitioned from Broadcasted to Processed.
Then inside rebroadcastExpiredTxs
we call ListAllExpiredBroadcastedTxs
, where we only consider the ones which have Broadcasted
state
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we would avoid that scenario since we only call rebroadcastExpiredTxs
after processing the latest confirmation status changes. Since this method only processes Broadcasted
transaction, the confirmation logic would handle any last minute state changes before we assess transaction for expiry. So in your scenario, we would notice that the transaction advanced to Processed
, mark it accordingly in our in-memory storage, then move on to the expiry check. The expiry check would notice there aren't any transactions in Broadcasted
state that are expired and just move on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what happens if a transaction skips the processed
phase?
processed
is just the view of a single RPC node - so if a transaction can go from processed
to non-existent, then it might also be possible to go from non-existent to confirmed
it just feels odd to delete all previous attempts at a transaction and then recreate - rather than just updating the transaction with a new blockhash
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We would advance the state from non-existent (internally what we consider Broadcasted
) to Confirmed
and also avoid rebroadcasting with a new hash in that case. Our state transition methods don't require us to move one step up at a time in case of slow polling or other reasons. We can even jump straight to Finalized
. We only delete all previous attempts/sigs if we can confirm that the tx has a non-existent state and its blockhash has expired. I was thinking it was better to clean up those old attempts/sigs than keep them around and waste cycles on status checks on them since they're expired anyways.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok - the chances of this race happening might be low
is there logging around a transaction signature that doesn't match any tx id? something that fires if an assumption is broken?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If one of the state transition methods gets a sig that doesn't match to an ID in the internal map, it should return a ErrSigDoesNotExist
error. Depending where the method was called, we have some logging to surface an assumption violation. But not in all cases like the confirmation expiration where we can maybe assume a different process cleaned up the signatures.
@@ -1186,3 +1209,432 @@ func addSigAndLimitToTx(t *testing.T, keystore SimpleKeystore, pubkey solana.Pub | |||
require.NoError(t, fees.SetComputeUnitLimit(&txCopy, limit)) | |||
return &txCopy | |||
} | |||
|
|||
func TestTxm_ExpirationRebroadcast(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this tests only runs against a mocked endpoints under ideal behavior - please add additional tests that run against a live validator
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey! We've found that testing this scenario without mocking is challenging because we don't have a way to force a transaction to expire in a live validator.
The only solution we've identified involves modifying some methods to simulate this behavior for the tests. However, this approach is not ideal, as it would require adding additional flows solely to make the test pass and to trigger a rebroadcast in the live validator tests.
Do you have any ideas on how to achieve this that we might be missing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you could potentially introduce some sort of proxy that takes the transaction and returns a transaction signature, but doesn't allow any more transactions through until it knows the first transaction blockhash has expired
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be okay to address this in a separate ticket under the Testing track? So we are able to get this one in before the freeze and finish the re-org work that depends on this PR to complete the implementation track
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why does the freeze impact the timeline? afaik that's just for production environments, not development?
also this PR touches significant portions of the most sensitive part of a solana integration, test coverage as the changes are made imo is a must
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just clarifying a few things:
- there is a production code freeze, but that shouldn't impact development here
- i'd like to see an attempt at a proper integration tests, but it can be time-boxed to a day or two (if it still doesn't work, then a follow-up ticket is fine)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for clarifying, thought the freeze involved not merging new code in chainlink repo
, even if it was development 😅. I'll start working on these integration tests
409fd1c
Quality Gate passedIssues Measures |
Description
Refactor
confirm
andsendWithRetry
funcsAdding a new
TxExpirationRebroadcast
toggable routine:confirm
loop, after checking confirmations, we will determine if we need to rebroadcast expired transactions that where not processed yet. Rebroadcasting a transaction involves:sendWithRetry
directly without enqueuinghttps://smartcontract-it.atlassian.net/browse/NONEVM-706
prev branch: #928
Merge together with:
Soak Tests: