fix flakyness errors #1927

Merged: 8 commits into main on May 24, 2024

Conversation

@tudor-malene (Collaborator) commented on May 21, 2024:

Why this change is needed

Our simulation tests are flaky.

What changes were made as part of this PR

  • the main change is a fix to a fork-creation bug which caused batches not to be marked as canonical properly
  • another fix is for a very subtle corner case where two consecutive forks cancel each other out
  • there are two more unrelated fixes

PR checks pre-merging

Please indicate below by ticking the checkbox that you have read and performed the required PR checks

  • PR checks reviewed and performed

@@ -778,12 +778,13 @@ func checkTotalTransactions(t *testing.T, client rpc.Client, nodeIdx int) {
// Checks that we can retrieve the latest batches
func checkForLatestBatches(t *testing.T, client rpc.Client, nodeIdx int) {
	var latestBatches common.BatchListingResponseDeprecated
-	pagination := common.QueryPagination{Offset: uint64(0), Size: uint(5)}
+	pagination := common.QueryPagination{Offset: uint64(0), Size: uint(20)}
tudor-malene (author):

there is a flakiness issue here: this sometimes returned less than a full page

@@ -37,6 +38,9 @@ func AddBatch(dbtx *dbTransaction, statements *SQLStatements, batch *common.ExtB
		extBatch, // ext_batch
	)
	if err != nil {
+		if strings.Contains(strings.ToLower(err.Error()), "unique") {
tudor-malene (author):

this fixes the bug where the wrong type of error was returned when the insert failed on the unique constraint
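
For context, a minimal sketch of the error-mapping pattern this change introduces, under stated assumptions: the insertBatch helper and its signature are illustrative (not the real AddBatch), and the errutil import path is an assumption; only the strings.Contains check mirrors the diff above.

package db // illustrative placement, not the real package

import (
	"database/sql"
	"strings"

	"github.com/ten-protocol/go-ten/go/common/errutil" // assumed import path
)

// insertBatch is an illustrative helper, not the real AddBatch: it maps the
// driver-specific unique-constraint message onto the logical ErrAlreadyExists
// so that callers can react to duplicate inserts.
func insertBatch(dbtx *sql.Tx, query string, args ...any) error {
	if _, err := dbtx.Exec(query, args...); err != nil {
		// different drivers word unique-constraint violations differently, so match loosely
		if strings.Contains(strings.ToLower(err.Error()), "unique") {
			return errutil.ErrAlreadyExists
		}
		return err
	}
	return nil
}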

	// as a lot of the hardcodes were giving way too little and choking the gas payments
-	allocObsWallets := big.NewInt(0).Mul(big.NewInt(100), big.NewInt(gethparams.Ether))
+	allocObsWallets := big.NewInt(0).Mul(big.NewInt(1000), big.NewInt(gethparams.Ether))
tudor-malene (author):

I saw failures where the accounts were running out of gas

	if len(ncp) > 0 {
		ncp = ncp[0 : len(ncp)-1]
	}
	b, cp, ncp, err := internalLCA(ctx, newCanonical, oldCanonical, resolver, []common.L1BlockHash{}, []common.L1BlockHash{})
tudor-malene (author):

this was not returning the new head block on the canonical path, which had consequences when setting the "is_canonical" flag after a fork was cancelled out
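
To make the invariant behind this fix concrete, here is a toy sketch (the header type and lca function are hypothetical, not the repo's internalLCA): walking the two heads back to their common ancestor must put the new head itself on the returned canonical path, otherwise nothing re-flags that branch as canonical when a fork cancels itself out.

// toy types for illustration only
type header struct {
	hash   string
	parent string
	height int
}

// lca walks the new and old heads back to their common ancestor, collecting the
// hashes on each side. Note that the new head itself lands in `canonical`.
func lca(newHead, oldHead *header, byHash map[string]*header) (fork *header, canonical, nonCanonical []string) {
	for newHead.hash != oldHead.hash {
		if newHead.height >= oldHead.height {
			canonical = append(canonical, newHead.hash) // includes the new head block
			newHead = byHash[newHead.parent]
		} else {
			nonCanonical = append(nonCanonical, oldHead.hash)
			oldHead = byHash[oldHead.parent]
		}
	}
	return newHead, canonical, nonCanonical
}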

@@ -144,7 +144,7 @@ func (executor *batchExecutor) ComputeBatch(ctx context.Context, context *BatchE
	}

	// These variables will be used to create the new batch
-	parent, err := executor.storage.FetchBatch(ctx, context.ParentPtr)
+	parentBatch, err := executor.storage.FetchBatch(ctx, context.ParentPtr)
tudor-malene (author):

this is just a rename for clarity

@@ -100,12 +100,12 @@ func (br *batchRegistry) OnBatchExecuted(batch *core.Batch, receipts types.Recei
}

func (br *batchRegistry) HasGenesisBatch() (bool, error) {
-	return br.headBatchSeq != nil, nil
+	return br.HeadBatchSeq() != nil, nil
tudor-malene (author):

going through the accessor function instead of reading the variable directly; this made it easier to check whether the cached value is correct
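
The value of going through the accessor is that it gives a single choke point for the cached head sequence. A rough sketch of the pattern, assuming a sync.RWMutex guard and a *big.Int sequence number (the field names here are illustrative, not the real batchRegistry):

// sketch only: headBatchSeq and headBatchSeqMutex are assumed field names
func (br *batchRegistry) HeadBatchSeq() *big.Int {
	br.headBatchSeqMutex.RLock()
	defer br.headBatchSeqMutex.RUnlock()
	// one place to add logging or an assertion that compares the cached value
	// against storage when debugging flakiness
	return br.headBatchSeq
}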

@@ -100,29 +100,27 @@ func (bp *l1BlockProcessor) HealthCheck() (bool, error) {
func (bp *l1BlockProcessor) tryAndInsertBlock(ctx context.Context, br *common.BlockAndReceipts) (*BlockIngestionType, error) {
	block := br.Block

	_, err := bp.storage.FetchBlock(ctx, block.Hash())
tudor-malene (author) commented on May 23, 2024:

there is a legitimate reason why a block might be resubmitted to the enclave: when it has become the "head" again.
Before, we just ignored it and waited for the next block to confirm the (re)fork.
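
A rough sketch of the behaviour described here, with a hypothetical helper name (ingestKnownBlockAsHead is not a real function in the repo); only the FetchBlock check comes from the diff above:

// inside tryAndInsertBlock, illustrative only
_, err := bp.storage.FetchBlock(ctx, block.Hash())
if err == nil {
	// the block is already stored; the host re-submitting it means it has become
	// the head again, so re-mark this branch as canonical now instead of waiting
	// for the next block to confirm the (re)fork
	return bp.ingestKnownBlockAsHead(ctx, block) // hypothetical helper
}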

Collaborator (reviewer):

I'm not sure this is legitimate? Which block is canonical is determined more than anything by the blocks getting built on top of it; I think we could safely ignore any 'flickering' of head L1 blocks (and I'm not sure it happens in Ethereum anyway).

To me it feels safe for the enclave to only update what is canonical L1 when it receives a new L1 block. WDYT?

tudor-malene (author):

The host submits only canonical blocks into the enclave. That's the contract, afaik.
So this is basically the first block it has to send when it goes back to an old branch.

Collaborator (reviewer):

Hmm, yeah, good point. So it doesn't fully reprocess, I guess; it just updates its head batch and maybe handles any fallout from that.

But then it must have been broken before, so I guess we've never had a fork on Sepolia...

@BedrockSquirrel (Collaborator) commented on May 23, 2024:

Oh ok, so the guardian deals with that currently, so there wasn't a bug. The host only sends canonical blocks. If the enclave has already seen it then it moves on to the next one.

if strings.Contains(err.Error(), errutil.ErrBlockAlreadyProcessed.Error()) {
	// we have already processed this block, let's try the next canonical block
	// this is most common when we are returning to a previous fork and the enclave has already seen some of the blocks on it
	// note: logging this because we don't expect it to happen often and would like visibility on that.
	g.logger.Info("L1 block already processed by enclave, trying the next block", "block", block.Hash())
	nextHeight := big.NewInt(0).Add(block.Number(), big.NewInt(1))
	nextCanonicalBlock, err := g.sl.L1Repo().FetchBlockByHeight(nextHeight)
	...

My worry is:

  1. we introduce some confusion into the contract, because now we don't reject replaying a re-canonical block but we also don't require it afaik.

  2. doing the 'ingest' twice on the same block might be a route for bugs in the future; it was quite easy to reason about when we could say the enclave would never process a block with the same hash twice.

tudor-malene (author):

What happened before was that the next block built on top of the "previous head / previously non-canonical / current head" block would be regarded as a "fork" with a "canonical" back-chain, and that would update the batches accordingly.
It was just doing that with a delay.

Collaborator (reviewer):

Yeah fair enough. I think we should get rid of that block of logic from the guardian then and agree from now on that the enclave will expect blocks to be resubmitted if they become canonical again.

tudor-malene (author):

I think that's the behaviour of the Ethereum RPC.
But the good thing is that the enclave does not depend on that; it can handle either case, as long as it has all the blocks.

@@ -439,6 +439,17 @@ func (g *Guardian) submitL1Block(block *common.L1Block, isLatest bool) (bool, er
		}
		return g.submitL1Block(nextCanonicalBlock, isLatest)
	}
+	// todo - Matt - is this the right way to do this?
+	// the issue is that sometimes a block was skipped and never resent
+	if strings.Contains(err.Error(), errutil.ErrBlockAncestorNotFound.Error()) {
Collaborator (reviewer):

This shouldn't be necessary; the L1 catch-up logic handles this: when it calls L1Repo().FetchNextBlock(enclaveHead), the L1 repo finds the LCA of the L1 canonical head and the enclave's reported head.

tudor-malene (author):

removed.

@@ -176,7 +204,7 @@ func ReadCurrentHeadBatch(ctx context.Context, db *sql.DB) (*core.Batch, error)
}

func ReadBatchesByBlock(ctx context.Context, db *sql.DB, hash common.L1BlockHash) ([]*core.Batch, error) {
-	return fetchBatches(ctx, db, " join block l1b on b.l1_proof=l1b.id where l1b.hash=? order by b.sequence", hash.Bytes())
+	return fetchBatches(ctx, db, " where l1_proof_hash=? order by b.sequence", hash.Bytes())
tudor-malene (author):

the join was not necessary

@@ -31,23 +33,7 @@ func WriteBlock(ctx context.Context, dbtx *sql.Tx, b *types.Header) error {
	return err
}

func UpdateCanonicalBlocks(ctx context.Context, dbtx *sql.Tx, canonical []common.L1BlockHash, nonCanonical []common.L1BlockHash, logger gethlog.Logger) error {
tudor-malene (author):

the logic here was just moved to the storage

if err1 := dbtx.Rollback(); err1 != nil {
	return err1
}
if errors.Is(err, errutil.ErrAlreadyExists) {
tudor-malene (author):

this returns the "ErrAlreadyExists", which is part of the logical API of this method.
So that the client can handle when a batch is inserted again
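
On the caller side, the logical error makes duplicate inserts easy to handle. A hedged sketch, assuming a storage.StoreBatch method with this shape (the names are illustrative); it needs the standard errors and fmt packages plus the repo's errutil:

if err := storage.StoreBatch(ctx, batch); err != nil {
	if errors.Is(err, errutil.ErrAlreadyExists) {
		// the batch was already persisted (e.g. re-submitted during catch-up), so it is safe to ignore
		return nil
	}
	return fmt.Errorf("could not store batch: %w", err)
}
return nil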

@@ -36,7 +36,7 @@ func TestInMemoryMonteCarloSimulation(t *testing.T) {
	IsInMem: true,
	L1TenData: &params.L1TenData{},
	ReceiptTimeout: 5 * time.Second,
-	StoppingDelay: 4 * time.Second,
+	StoppingDelay: 15 * time.Second,
tudor-malene (author):

It turns out there was not enough time for the non-p2p node to catch up under some circumstances.

@@ -26,8 +26,8 @@ func TestInMemoryMonteCarloSimulation(t *testing.T) {
simParams := params.SimParams{
	NumberOfNodes: numberOfNodes,
	// todo (#718) - try reducing this back to 50 milliseconds once faster-finality model is optimised
-	AvgBlockDuration: 250 * time.Millisecond,
	SimulationTime: 30 * time.Second,
+	AvgBlockDuration: 180 * time.Millisecond,
tudor-malene (author):

the lower this is, the more races should surface

@BedrockSquirrel (Collaborator) left a review:

LGTM I think - the duplicateBatches part is the only bit I struggle to wrap my head around, maybe @StefanIliev545 wants to look that over if he's more familiar.

@tudor-malene merged commit b97d0b9 into main on May 24, 2024 (2 checks passed)
@tudor-malene deleted the tudor/deflake branch on May 24, 2024 at 13:59