fix flakyness errors #1927
Conversation
@@ -778,12 +778,13 @@ func checkTotalTransactions(t *testing.T, client rpc.Client, nodeIdx int) {
 // Checks that we can retrieve the latest batches
 func checkForLatestBatches(t *testing.T, client rpc.Client, nodeIdx int) {
 	var latestBatches common.BatchListingResponseDeprecated
-	pagination := common.QueryPagination{Offset: uint64(0), Size: uint(5)}
+	pagination := common.QueryPagination{Offset: uint64(0), Size: uint(20)}
There is flakiness here where this was sometimes returning less than a full page.
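For context, a minimal illustrative sketch only (not the real test): `checkLatestBatches` and its `list` callback are hypothetical stand-ins for the RPC call, and `batchListing` is a simplified stand-in for `common.BatchListingResponseDeprecated`. The point is that a listing may legitimately contain fewer items than the requested page size, so the check should not assume a full page.

```go
package main

import "fmt"

// QueryPagination mirrors the shape of common.QueryPagination in the diff above.
type QueryPagination struct {
	Offset uint64
	Size   uint
}

// batchListing is a simplified stand-in for the real listing response type.
type batchListing struct {
	BatchesData []uint64 // batch heights, newest first
}

// checkLatestBatches asks for a page of recent batches and asserts only that
// something came back, rather than assuming the page is full.
func checkLatestBatches(list func(QueryPagination) (*batchListing, error)) error {
	pagination := QueryPagination{Offset: uint64(0), Size: uint(20)}
	resp, err := list(pagination)
	if err != nil {
		return err
	}
	if len(resp.BatchesData) == 0 {
		return fmt.Errorf("expected at least one batch, got an empty listing")
	}
	return nil
}

func main() {
	// a fake listing with fewer batches than the page size, which is legitimate
	err := checkLatestBatches(func(QueryPagination) (*batchListing, error) {
		return &batchListing{BatchesData: []uint64{42, 41, 40}}, nil
	})
	fmt.Println("check result:", err)
}
```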
@@ -37,6 +38,9 @@ func AddBatch(dbtx *dbTransaction, statements *SQLStatements, batch *common.ExtB
 		extBatch, // ext_batch
 	)
 	if err != nil {
+		if strings.Contains(strings.ToLower(err.Error()), "unique") {
This fixes the bug where the wrong type of error was returned when the insert failed on the unique constraint.
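A self-contained sketch of the error mapping in the diff above; `classifyInsertError` is a hypothetical helper, and the local `ErrAlreadyExists` only stands in for `errutil.ErrAlreadyExists` so the example compiles on its own.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// ErrAlreadyExists stands in for errutil.ErrAlreadyExists.
var ErrAlreadyExists = errors.New("already exists")

// classifyInsertError turns a driver-specific unique-constraint failure into the
// logical "already exists" error, instead of surfacing a generic SQL error.
func classifyInsertError(err error) error {
	if err == nil {
		return nil
	}
	// Drivers word constraint violations differently (e.g. "UNIQUE constraint failed"),
	// so match case-insensitively on the word "unique".
	if strings.Contains(strings.ToLower(err.Error()), "unique") {
		return ErrAlreadyExists
	}
	return fmt.Errorf("could not insert batch: %w", err)
}

func main() {
	dbErr := errors.New("UNIQUE constraint failed: batch.sequence")
	err := classifyInsertError(dbErr)
	fmt.Println(errors.Is(err, ErrAlreadyExists)) // true: callers can treat the re-insert as benign
}
```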
 	// as a lot of the hardcodes were giving way too little and choking the gas payments
-	allocObsWallets := big.NewInt(0).Mul(big.NewInt(100), big.NewInt(gethparams.Ether))
+	allocObsWallets := big.NewInt(0).Mul(big.NewInt(1000), big.NewInt(gethparams.Ether))
I saw failures where the accounts were running out of gas
 	if len(ncp) > 0 {
 		ncp = ncp[0 : len(ncp)-1]
 	}
 	b, cp, ncp, err := internalLCA(ctx, newCanonical, oldCanonical, resolver, []common.L1BlockHash{}, []common.L1BlockHash{})
This was not returning the new head block on the canonical path, which had some consequences when setting the "is_canonical" flag when a fork was cancelled out.
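To make the consequence concrete, here is a deliberately simplified fork walk, not the real `internalLCA`: plain strings instead of `common.L1BlockHash`, and a child-to-parent map instead of the storage resolver. The detail the comment is about is that the canonical path it returns starts at the new head, which is what the batch storage needs in order to flip `is_canonical` correctly when a fork is cancelled out.

```go
package main

import "fmt"

type chain map[string]string // block hash -> parent hash

// forkPaths returns the canonical path (from newHead down to, but excluding, the
// common ancestor) and the non-canonical path (the abandoned branch from oldHead
// down to the same ancestor). Note that newHead itself is on the canonical path.
func forkPaths(newHead, oldHead string, parents chain) (canonical, nonCanonical []string) {
	// distance of every ancestor of newHead from newHead (0 = newHead itself)
	depth := map[string]int{}
	for h, d := newHead, 0; h != ""; h, d = parents[h], d+1 {
		depth[h] = d
	}
	// walk the old branch until it meets an ancestor of newHead (the LCA)
	lcaDepth := 0
	for h := oldHead; h != ""; h = parents[h] {
		if d, ok := depth[h]; ok {
			lcaDepth = d
			break
		}
		nonCanonical = append(nonCanonical, h)
	}
	// everything strictly above the LCA on the new branch is canonical
	for h, d := newHead, 0; d < lcaDepth; h, d = parents[h], d+1 {
		canonical = append(canonical, h)
	}
	return canonical, nonCanonical
}

func main() {
	// genesis <- a <- b (old head)
	//               \- c (new head)
	parents := chain{"a": "genesis", "b": "a", "c": "a"}
	can, non := forkPaths("c", "b", parents)
	fmt.Println(can, non) // [c] [b]: "c" gets is_canonical=true, "b" gets false
}
```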
@@ -144,7 +144,7 @@ func (executor *batchExecutor) ComputeBatch(ctx context.Context, context *BatchE
 	}

 	// These variables will be used to create the new batch
-	parent, err := executor.storage.FetchBatch(ctx, context.ParentPtr)
+	parentBatch, err := executor.storage.FetchBatch(ctx, context.ParentPtr)
this is just a rename for clarity
@@ -100,12 +100,12 @@ func (br *batchRegistry) OnBatchExecuted(batch *core.Batch, receipts types.Recei
 	}

 func (br *batchRegistry) HasGenesisBatch() (bool, error) {
-	return br.headBatchSeq != nil, nil
+	return br.HeadBatchSeq() != nil, nil
Going through the function instead of accessing the variable directly.
It was useful for checking whether that cached variable is correct.
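A small, hypothetical sketch of the accessor pattern being referred to (names and the consistency check are illustrative, not the real `batchRegistry`): routing every read through `HeadBatchSeq()` gives a single place to verify the cached value against storage while debugging.

```go
package main

import "fmt"

// store is a hypothetical lookup for the persisted head sequence number.
type store interface {
	HeadSeq() *uint64
}

type batchRegistry struct {
	headBatchSeq *uint64
	storage      store
	debugChecks  bool
}

// HeadBatchSeq is the single access point for the cached value, so a sanity
// check against storage can be added in one place when chasing cache bugs.
func (br *batchRegistry) HeadBatchSeq() *uint64 {
	if br.debugChecks {
		if stored := br.storage.HeadSeq(); stored != nil && br.headBatchSeq != nil && *stored != *br.headBatchSeq {
			panic(fmt.Sprintf("cached head batch seq %d out of sync with storage %d", *br.headBatchSeq, *stored))
		}
	}
	return br.headBatchSeq
}

// HasGenesisBatch goes through the accessor rather than reading the field.
func (br *batchRegistry) HasGenesisBatch() (bool, error) {
	return br.HeadBatchSeq() != nil, nil
}

type memStore struct{ head *uint64 }

func (m memStore) HeadSeq() *uint64 { return m.head }

func main() {
	seq := uint64(12)
	br := &batchRegistry{headBatchSeq: &seq, storage: memStore{head: &seq}, debugChecks: true}
	ok, _ := br.HasGenesisBatch()
	fmt.Println("has genesis batch:", ok) // true
}
```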
@@ -100,29 +100,27 @@ func (bp *l1BlockProcessor) HealthCheck() (bool, error) {
 func (bp *l1BlockProcessor) tryAndInsertBlock(ctx context.Context, br *common.BlockAndReceipts) (*BlockIngestionType, error) {
 	block := br.Block

 	_, err := bp.storage.FetchBlock(ctx, block.Hash())
There is a legitimate reason why a block might be resubmitted to the enclave: when it has become the "head" again.
Before, we were just ignoring it and waiting for the next block to confirm the (re)fork.
I'm not sure this is legitimate? Which block is canonical is determined more than anything by the blocks getting built on top of it; I think we could safely ignore any 'flickering' of head L1 blocks (and I'm not sure it happens in Ethereum anyway).
To me it feels safe for the enclave to only update what is canonical on L1 when it receives a new L1 block. WDYT?
The host submits only canonical blocks into the enclave. That's the contract, afaik.
So this is basically the first block it has to send when it goes back to an old branch
Hmm yeah good point. So it doesn't fully reprocess I guess, it just updates its head batch and maybe handles any fallout from that.
But then it must have been broken before so I guess we've never had a fork on sepolia...
Oh ok, so the guardian deals with that currently, so there wasn't a bug. The host only sends canonical blocks. If the enclave has already seen a block, it moves on to the next one.
if strings.Contains(err.Error(), errutil.ErrBlockAlreadyProcessed.Error()) {
// we have already processed this block, let's try the next canonical block
// this is most common when we are returning to a previous fork and the enclave has already seen some of the blocks on it
// note: logging this because we don't expect it to happen often and would like visibility on that.
g.logger.Info("L1 block already processed by enclave, trying the next block", "block", block.Hash())
nextHeight := big.NewInt(0).Add(block.Number(), big.NewInt(1))
nextCanonicalBlock, err := g.sl.L1Repo().FetchBlockByHeight(nextHeight)
...
My worries are:
- we introduce some confusion into the contract, because now we don't reject replaying a re-canonical block, but we also don't require it, afaik.
- doing the 'ingest' twice on the same block might be a route for bugs in the future; it was quite easy to reason about when we could say the enclave would never process a block with the same hash twice.
What happened before was that the next block built on top of the "previous head / previously non-canonical / current head" block would be regarded as a "fork" with a "canonical" back-chain.
That would update the batches accordingly.
It was just doing that with a delay.
Yeah fair enough. I think we should get rid of that block of logic from the guardian then and agree from now on that the enclave will expect blocks to be resubmitted if they become canonical again.
I think that's the behaviour of the Ethereum RPC.
But the good thing is that the enclave does not depend on that. It can handle either case, as long as it has all the blocks.
go/host/enclave/guardian.go (Outdated)
@@ -439,6 +439,17 @@ func (g *Guardian) submitL1Block(block *common.L1Block, isLatest bool) (bool, er
 	}
 	return g.submitL1Block(nextCanonicalBlock, isLatest)
 }
+	// todo - Matt - is this the right way to do this?
+	// the issue is that sometimes a block was skipped and never resent
+	if strings.Contains(err.Error(), errutil.ErrBlockAncestorNotFound.Error()) {
This shouldn't be necessary; the L1 catch-up logic handles this when it calls L1Repo().FetchNextBlock(enclaveHead).
The L1 repo finds the LCA of the L1 canonical head and the enclave's reported head.
removed.
@@ -176,7 +204,7 @@ func ReadCurrentHeadBatch(ctx context.Context, db *sql.DB) (*core.Batch, error)
 	}

 func ReadBatchesByBlock(ctx context.Context, db *sql.DB, hash common.L1BlockHash) ([]*core.Batch, error) {
-	return fetchBatches(ctx, db, " join block l1b on b.l1_proof=l1b.id where l1b.hash=? order by b.sequence", hash.Bytes())
+	return fetchBatches(ctx, db, " where l1_proof_hash=? order by b.sequence", hash.Bytes())
the join was not necessary
@@ -31,23 +33,7 @@ func WriteBlock(ctx context.Context, dbtx *sql.Tx, b *types.Header) error {
 	return err
 }

 func UpdateCanonicalBlocks(ctx context.Context, dbtx *sql.Tx, canonical []common.L1BlockHash, nonCanonical []common.L1BlockHash, logger gethlog.Logger) error {
the logic here was just moved to the storage
 	if err1 := dbtx.Rollback(); err1 != nil {
 		return err1
 	}
 	if errors.Is(err, errutil.ErrAlreadyExists) {
This returns "ErrAlreadyExists", which is part of the logical API of this method, so that the client can handle the case where a batch is inserted again.
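For illustration, a hedged sketch of what that buys the caller; `storeBatch` and `persistBatch` are hypothetical stand-ins for the real storage call and its caller, and the local `ErrAlreadyExists` stands in for `errutil.ErrAlreadyExists`. A duplicate insert can then be treated as a no-op rather than a failure.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrAlreadyExists stands in for errutil.ErrAlreadyExists.
var ErrAlreadyExists = errors.New("already exists")

// storeBatch pretends the batch with this sequence number was written in an earlier attempt.
func storeBatch(seq uint64) error {
	return fmt.Errorf("insert failed: %w", ErrAlreadyExists)
}

// persistBatch shows the caller-side handling: a duplicate is benign.
func persistBatch(seq uint64) error {
	if err := storeBatch(seq); err != nil {
		if errors.Is(err, ErrAlreadyExists) {
			// the batch is already stored; re-submission needs no further action
			return nil
		}
		return err
	}
	return nil
}

func main() {
	fmt.Println(persistBatch(7)) // <nil>: duplicate insert handled gracefully
}
```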
@@ -36,7 +36,7 @@ func TestInMemoryMonteCarloSimulation(t *testing.T) {
 	IsInMem: true,
 	L1TenData: &params.L1TenData{},
 	ReceiptTimeout: 5 * time.Second,
-	StoppingDelay: 4 * time.Second,
+	StoppingDelay: 15 * time.Second,
turns out there was not enough time for the non-p2p node to catch up under some circumstances.
@@ -26,8 +26,8 @@ func TestInMemoryMonteCarloSimulation(t *testing.T) {
 	simParams := params.SimParams{
 		NumberOfNodes: numberOfNodes,
 		// todo (#718) - try reducing this back to 50 milliseconds once faster-finality model is optimised
-		AvgBlockDuration: 250 * time.Millisecond,
 		SimulationTime: 30 * time.Second,
+		AvgBlockDuration: 180 * time.Millisecond,
the lower this is, the more races should surface
LGTM I think - the duplicateBatches part is the only bit I struggle to wrap my head around, maybe @StefanIliev545 wants to look that over if he's more familiar.
Why this change is needed
Our simulation tests are flaky.
What changes were made as part of this PR
PR checks pre-merging
Please indicate below by ticking the checkbox that you have read and performed the required PR checks.