fix flakyness errors #1927

Merged: 8 commits into main on May 24, 2024

Conversation

@tudor-malene (Collaborator) commented on May 21, 2024:

Why this change is needed

Our simulation tests are flaky.

What changes were made as part of this PR

  • the main change is a fix to a fork-creation bug which caused batches not to be marked as canonical properly
  • another fix is for a very subtle corner case where two consecutive forks cancel each other out
  • there are two more unrelated fixes

PR checks pre-merging

Please indicate below by ticking the checkbox that you have read and performed the required PR checks

  • PR checks reviewed and performed

@@ -778,12 +778,13 @@ func checkTotalTransactions(t *testing.T, client rpc.Client, nodeIdx int) {
// Checks that we can retrieve the latest batches
func checkForLatestBatches(t *testing.T, client rpc.Client, nodeIdx int) {
	var latestBatches common.BatchListingResponseDeprecated
-	pagination := common.QueryPagination{Offset: uint64(0), Size: uint(5)}
+	pagination := common.QueryPagination{Offset: uint64(0), Size: uint(20)}
tudor-malene (author):

there is a flakiness issue here: this sometimes returned less than a full page

@@ -37,6 +38,9 @@ func AddBatch(dbtx *dbTransaction, statements *SQLStatements, batch *common.ExtB
		extBatch, // ext_batch
	)
	if err != nil {
+		if strings.Contains(strings.ToLower(err.Error()), "unique") {
tudor-malene (author):

this fixes the bug where the wrong type of error was returned when the insert failed on the unique constraint
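
For context, a minimal sketch of the error-mapping pattern this change introduces, under stated assumptions: the insertBatch helper and its signature are illustrative (not the real AddBatch), and the errutil import path is an assumption; only the strings.Contains check mirrors the diff above.

package db // illustrative placement, not the real package

import (
	"database/sql"
	"strings"

	"github.com/ten-protocol/go-ten/go/common/errutil" // assumed import path
)

// insertBatch is an illustrative helper, not the real AddBatch: it maps the
// driver-specific unique-constraint message onto the logical ErrAlreadyExists
// so that callers can react to duplicate inserts.
func insertBatch(dbtx *sql.Tx, query string, args ...any) error {
	if _, err := dbtx.Exec(query, args...); err != nil {
		// different drivers word unique-constraint violations differently, so match loosely
		if strings.Contains(strings.ToLower(err.Error()), "unique") {
			return errutil.ErrAlreadyExists
		}
		return err
	}
	return nil
}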

	// as a lot of the hardcodes were giving way too little and choking the gas payments
-	allocObsWallets := big.NewInt(0).Mul(big.NewInt(100), big.NewInt(gethparams.Ether))
+	allocObsWallets := big.NewInt(0).Mul(big.NewInt(1000), big.NewInt(gethparams.Ether))
tudor-malene (author):

I saw failures where the accounts were running out of gas

	if len(ncp) > 0 {
		ncp = ncp[0 : len(ncp)-1]
	}
	b, cp, ncp, err := internalLCA(ctx, newCanonical, oldCanonical, resolver, []common.L1BlockHash{}, []common.L1BlockHash{})
tudor-malene (author):

this was not returning the new head block on the canonical path, which had consequences when setting the "is_canonical" flag after a fork was cancelled out
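
To make the invariant behind this fix concrete, here is a toy sketch (the header type and lca function are hypothetical, not the repo's internalLCA): walking the two heads back to their common ancestor must put the new head itself on the returned canonical path, otherwise nothing re-flags that branch as canonical when a fork cancels itself out.

// toy types for illustration only
type header struct {
	hash   string
	parent string
	height int
}

// lca walks the new and old heads back to their common ancestor, collecting the
// hashes on each side. Note that the new head itself lands in `canonical`.
func lca(newHead, oldHead *header, byHash map[string]*header) (fork *header, canonical, nonCanonical []string) {
	for newHead.hash != oldHead.hash {
		if newHead.height >= oldHead.height {
			canonical = append(canonical, newHead.hash) // includes the new head block
			newHead = byHash[newHead.parent]
		} else {
			nonCanonical = append(nonCanonical, oldHead.hash)
			oldHead = byHash[oldHead.parent]
		}
	}
	return newHead, canonical, nonCanonical
}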

@@ -144,7 +144,7 @@ func (executor *batchExecutor) ComputeBatch(ctx context.Context, context *BatchE
	}

	// These variables will be used to create the new batch
-	parent, err := executor.storage.FetchBatch(ctx, context.ParentPtr)
+	parentBatch, err := executor.storage.FetchBatch(ctx, context.ParentPtr)
tudor-malene (author):

this is just a rename for clarity

@@ -100,12 +100,12 @@ func (br *batchRegistry) OnBatchExecuted(batch *core.Batch, receipts types.Recei
}

func (br *batchRegistry) HasGenesisBatch() (bool, error) {
-	return br.headBatchSeq != nil, nil
+	return br.HeadBatchSeq() != nil, nil
tudor-malene (author):

going through the accessor function instead of reading the variable directly; this made it easier to check whether the cached value is correct
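
The value of going through the accessor is that it gives a single choke point for the cached head sequence. A rough sketch of the pattern, assuming a sync.RWMutex guard and a *big.Int sequence number (the field names here are illustrative, not the real batchRegistry):

// sketch only: headBatchSeq and headBatchSeqMutex are assumed field names
func (br *batchRegistry) HeadBatchSeq() *big.Int {
	br.headBatchSeqMutex.RLock()
	defer br.headBatchSeqMutex.RUnlock()
	// one place to add logging or an assertion that compares the cached value
	// against storage when debugging flakiness
	return br.headBatchSeq
}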

@@ -100,29 +100,27 @@ func (bp *l1BlockProcessor) HealthCheck() (bool, error) {
func (bp *l1BlockProcessor) tryAndInsertBlock(ctx context.Context, br *common.BlockAndReceipts) (*BlockIngestionType, error) {
	block := br.Block

	_, err := bp.storage.FetchBlock(ctx, block.Hash())
tudor-malene (author) commented on May 23, 2024:

there is a legitimate reason why a block might be resubmitted to the enclave: when it has become the "head" again.
Before, we just ignored it and waited for the next block to confirm the (re)fork.
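
A rough sketch of the behaviour described here, with a hypothetical helper name (ingestKnownBlockAsHead is not a real function in the repo); only the FetchBlock check comes from the diff above:

// inside tryAndInsertBlock, illustrative only
_, err := bp.storage.FetchBlock(ctx, block.Hash())
if err == nil {
	// the block is already stored; the host re-submitting it means it has become
	// the head again, so re-mark this branch as canonical now instead of waiting
	// for the next block to confirm the (re)fork
	return bp.ingestKnownBlockAsHead(ctx, block) // hypothetical helper
}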

Collaborator (reviewer):

I'm not sure this is legitimate? Which block is canonical is determined more than anything by the blocks getting built on top of it; I think we could safely ignore any 'flickering' of head L1 blocks (and I'm not sure it happens in Ethereum anyway).

To me it feels safe for the enclave to only update what is canonical L1 when it receives a new L1 block. WDYT?

tudor-malene (author):

The host submits only canonical blocks into the enclave. That's the contract, afaik.
So this is basically the first block it has to send when it goes back to an old branch.

Collaborator (reviewer):

Hmm, yeah, good point. So it doesn't fully reprocess, I guess; it just updates its head batch and maybe handles any fallout from that.

But then it must have been broken before, so I guess we've never had a fork on Sepolia...

@BedrockSquirrel (Collaborator) commented on May 23, 2024:

Oh ok, so the guardian deals with that currently, so there wasn't a bug. The host only sends canonical blocks. If the enclave has already seen it then it moves on to the next one.

if strings.Contains(err.Error(), errutil.ErrBlockAlreadyProcessed.Error()) {
	// we have already processed this block, let's try the next canonical block
	// this is most common when we are returning to a previous fork and the enclave has already seen some of the blocks on it
	// note: logging this because we don't expect it to happen often and would like visibility on that.
	g.logger.Info("L1 block already processed by enclave, trying the next block", "block", block.Hash())
	nextHeight := big.NewInt(0).Add(block.Number(), big.NewInt(1))
	nextCanonicalBlock, err := g.sl.L1Repo().FetchBlockByHeight(nextHeight)
	...

My worry is:

  1. we introduce some confusion into the contract, because now we don't reject replaying a re-canonical block but we also don't require it afaik.

  2. doing the 'ingest' twice on the same block might be a route for bugs in the future; it was quite easy to reason about when we could say the enclave would never process a block with the same hash twice.

tudor-malene (author):

What happened before was that the next block built on top of the "previous head / previously non-canonical / current head" block would be regarded as a "fork" with a "canonical" back-chain, and that would update the batches accordingly.
It was just doing that with a delay.

Collaborator (reviewer):

Yeah fair enough. I think we should get rid of that block of logic from the guardian then and agree from now on that the enclave will expect blocks to be resubmitted if they become canonical again.

tudor-malene (author):

I think that's the behaviour of the Ethereum RPC.
But the good thing is that the enclave does not depend on that; it can handle either case, as long as it has all the blocks.

@@ -439,6 +439,17 @@ func (g *Guardian) submitL1Block(block *common.L1Block, isLatest bool) (bool, er
		}
		return g.submitL1Block(nextCanonicalBlock, isLatest)
	}
+	// todo - Matt - is this the right way to do this?
+	// the issue is that sometimes a block was skipped and never resent
+	if strings.Contains(err.Error(), errutil.ErrBlockAncestorNotFound.Error()) {
Collaborator (reviewer):

This shouldn't be necessary; the L1 catch-up logic handles this: when it calls L1Repo().FetchNextBlock(enclaveHead), the L1 repo finds the LCA of the L1 canonical head and the enclave's reported head.

tudor-malene (author):

removed.

@@ -176,7 +204,7 @@ func ReadCurrentHeadBatch(ctx context.Context, db *sql.DB) (*core.Batch, error)
}

func ReadBatchesByBlock(ctx context.Context, db *sql.DB, hash common.L1BlockHash) ([]*core.Batch, error) {
-	return fetchBatches(ctx, db, " join block l1b on b.l1_proof=l1b.id where l1b.hash=? order by b.sequence", hash.Bytes())
+	return fetchBatches(ctx, db, " where l1_proof_hash=? order by b.sequence", hash.Bytes())
tudor-malene (author):

the join was not necessary

@@ -31,23 +33,7 @@ func WriteBlock(ctx context.Context, dbtx *sql.Tx, b *types.Header) error {
	return err
}

func UpdateCanonicalBlocks(ctx context.Context, dbtx *sql.Tx, canonical []common.L1BlockHash, nonCanonical []common.L1BlockHash, logger gethlog.Logger) error {
tudor-malene (author):

the logic here was just moved to the storage

if err1 := dbtx.Rollback(); err1 != nil {
	return err1
}
if errors.Is(err, errutil.ErrAlreadyExists) {
tudor-malene (author):

this returns the "ErrAlreadyExists", which is part of the logical API of this method.
So that the client can handle when a batch is inserted again
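
On the caller side, the logical error makes duplicate inserts easy to handle. A hedged sketch, assuming a storage.StoreBatch method with this shape (the names are illustrative); it needs the standard errors and fmt packages plus the repo's errutil:

if err := storage.StoreBatch(ctx, batch); err != nil {
	if errors.Is(err, errutil.ErrAlreadyExists) {
		// the batch was already persisted (e.g. re-submitted during catch-up), so it is safe to ignore
		return nil
	}
	return fmt.Errorf("could not store batch: %w", err)
}
return nil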

@@ -36,7 +36,7 @@ func TestInMemoryMonteCarloSimulation(t *testing.T) {
	IsInMem: true,
	L1TenData: &params.L1TenData{},
	ReceiptTimeout: 5 * time.Second,
-	StoppingDelay: 4 * time.Second,
+	StoppingDelay: 15 * time.Second,
tudor-malene (author):

It turns out there was not enough time for the non-p2p node to catch up under some circumstances.

@@ -26,8 +26,8 @@ func TestInMemoryMonteCarloSimulation(t *testing.T) {
simParams := params.SimParams{
	NumberOfNodes: numberOfNodes,
	// todo (#718) - try reducing this back to 50 milliseconds once faster-finality model is optimised
-	AvgBlockDuration: 250 * time.Millisecond,
	SimulationTime: 30 * time.Second,
+	AvgBlockDuration: 180 * time.Millisecond,
tudor-malene (author):

the lower this is, the more races should surface

@BedrockSquirrel (Collaborator) left a review:

LGTM I think - the duplicateBatches part is the only bit I struggle to wrap my head around, maybe @StefanIliev545 wants to look that over if he's more familiar.

@tudor-malene merged commit b97d0b9 into main on May 24, 2024 (2 checks passed)
@tudor-malene deleted the tudor/deflake branch on May 24, 2024 at 13:59