Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Poll the main node for batch to vote on (BFT-496) #161

Merged
merged 9 commits into from
Jul 31, 2024

Conversation

aakoshh
Copy link
Contributor

@aakoshh aakoshh commented Jul 26, 2024

What ❔

Poll the main node for which batch height to vote on.

  • Return an Option rather than a Vec of BatchQC from BatchVotes::find_quorum: we decided not to implement voting on multiple heights for now, which makes it just confusing that a vector is returned.
  • Create an AttestationStatusClient trait with a method to poll the next BatchNumber to vote on, which can be injected as a dyn dependency
  • Create an AttestationStatusWatch that contains the next BatchNumber to vote on; this is updated by a single background task and listened to by multiple subscribers
  • Create an AttestationStatusRunner which is supposed to be started along with the BatchStore in zksync-era and polls the client; the point of this is to push out the poll interval configuration to the edge and make the AttestationStatusWatch the integration point
  • Change AttesterRunner to wait for the AttestationStatusWatch to change, then the payload to appear, then cast the vote; it doesn't need to worry about reverts because the node will restart if that happens
  • Change run_batch_qc_finder to wait for the AttestationStatusWatch to change, then wait for the next available QC, and save it; this might produce gaps which is normal on the external nodes, but undersireable on the main node - on the main node it only produces gaps if the attesters vote on higher heights despite the main node telling them not to, which if they are majority honest should not happen, and if they are not then what can we do anyway
  • Rename PersistentBatchStore::earliest_batch_number_to_sign to next_batch_to_attest in accordance with the new method on consensus_dal.
  • Initialise BatchNumber::min_batch_number to by asking the main node where to resume from This initialisation isn't necessary because 1) we decided that we'll only allow 1 vote per attester, so there is no attack vector of casting votes from batch 0 and 2) run_batch_qc_finder first waits for the API status to appear and then prunes all preceding data, so an older QC will not be saved. It was also undesirable to have to initialise from an API that might return nothing (or errors) for an unknown amount of time.

Poll interval

The AttesterStatusRunner polls the API at fixed intervals without any exponential backoff. I thought 5s would be a reasonable default value. With 60s batches this seems fine because the next batch to sign will be signalled as soon as the current batch QC is inserted into the database, which is well ahead of time of when the next batch and its commitment will be available. If we think we'll need to catch up with historical batch signatures it might be a bit slow. We generally still have to account for gossiping around the votes though, which in a large enough network could be on the order of several seconds as well.

Potential deadlock

There is a pathological sequence of events that could prevent the feature from starting on a fresh database with no prior batch QCs:

  1. Say we start from batch 0, and a QC is gathered, but saving it to the database (which is an async process) takes a long time on the main node.
  2. Say it takes so long that batch 1 and batch 2 are created, but none of them have a QC yet.
  3. Say we have two attesters A and B, and A polled the API when it showed batch 2, and the main node set its minimum batch number in the vote registry to batch 2 as well.
  4. Now the QC for batch 0 is saved to the database and the API indicates the next one to vote on is batch 1.
  5. Attester B casts its vote on batch 1 but it goes ignored by the main node vote registry because its looking for a quorum with at least batch number 2.
  6. Having missed an attestation the main node will never save a QC and never move on until it's restarted.

Note that even if the registry minimum batch number was adjusted down that might happen after the vote for batch 1 has already been discarded, and because new votes are only pushed once through gossip there is no guarantee that it will get it again.

The solution is coming in the form of fixing the starting point of gossip to be where the genesis starts, even if that means filling a potentially long history of batches in the beginning. See matter-labs/zksync-era#2539

Why ❔

We want to collect batch QCs without gaps, and for now put the main node in charge of what to vote on. The API to serve this information is in matter-labs/zksync-era#2480 and this PR is the follow up to make polling that information part of the signing process.

Base automatically changed from bft-498-batch-genesis to main July 29, 2024 13:48
@aakoshh aakoshh force-pushed the bft-496-poll-batch-to-sign branch 7 times, most recently from 51fdc19 to 9f52b8a Compare July 29, 2024 22:00
@aakoshh aakoshh force-pushed the bft-496-poll-batch-to-sign branch from 9f52b8a to 2821962 Compare July 29, 2024 22:36
@aakoshh aakoshh marked this pull request as ready for review July 29, 2024 22:39
@aakoshh aakoshh requested a review from pompon0 as a code owner July 29, 2024 22:39
@aakoshh aakoshh changed the title BFT-496: Return at most one quorum for now. BFT-496: Poll the main node for batch to vote on Jul 30, 2024
@aakoshh aakoshh force-pushed the bft-496-poll-batch-to-sign branch from eeb6b64 to b3d75bf Compare July 30, 2024 13:07
@aakoshh aakoshh force-pushed the bft-496-poll-batch-to-sign branch from f566938 to 47042e4 Compare July 30, 2024 13:43
@aakoshh aakoshh changed the title BFT-496: Poll the main node for batch to vote on feat: Poll the main node for batch to vote on (BFT-496) Jul 30, 2024
@aakoshh aakoshh force-pushed the bft-496-poll-batch-to-sign branch from 6b85c29 to 45f8d9b Compare July 30, 2024 15:04
Copy link
Collaborator

@pompon0 pompon0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! thx

@aakoshh aakoshh merged commit bce5e3e into main Jul 31, 2024
5 checks passed
@aakoshh aakoshh deleted the bft-496-poll-batch-to-sign branch July 31, 2024 11:54
aakoshh added a commit that referenced this pull request Jul 31, 2024
…FT-496) (#165)

## What ❔

`AttestationStatusWatch` must be initialised with a `BatchNumber`, it
cannot have `None` any more. `AttestationStatusRunner::new` was replaced
with the `AttestationStatusRunner::init` method which asynchronously
polls the API until the first value is returned, and then returns itself
along with the `AttestationStatusWatch` it created. This can then be
passed to the `Executor`, while the `AttestationStatusRunner::run` will
keep the status up to date in the background.

## Why ❔

In the review of #161
it was observed that the `Executor` can wait until this data is
available. In theory it is only unavailable if the main node API is
down, in which case an external node couldn't pull Genesis either and
would probably fail during startup, or if the Genesis itself is still
under construction in the database, which is a transient state under
which an external node as mentioned would not start, and apparently the
main node doesn't need the `Executor` to get over it. By removing `None`
as an option for `next_batch_to_attest` the state is easier to reason
about.
github-merge-queue bot pushed a commit to matter-labs/zksync-era that referenced this pull request Aug 1, 2024
## What ❔

Injects an `AttestationStatusWatch` into the `Executor` which on the
main node is backed by an `AttestationStatusRunner` polling the
`BatchStore::next_batch_to_attest` and external nodes call the
`MainNodeApi::fetch_attestation_status`.

TODO: 
- [x] Rebase after #2539
is merged
- [x] Update the `era-consensus` dependency once
matter-labs/era-consensus#161 is merged and
`0.1.0-rc.5` is published

### Test failure - pruning the main node

The following test never finished:
```shell
zk test rust -p zksync_node_consensus tests::test_with_pruning::case_0 --no-capture
```
A little extra logging revealed that the `AttesterStatusRunner` never
gets initialised, because the blocks get pruned earlier than it could
read the result, ie. `ConsensusDal::batch_of_block(genesis.first_block)`
probably returns `None` because it's not found, and never will because
it's pruned.

We discussed in #2480 that
the main node is not supposed to be pruned, which is why the SQL that
looks for what to attest on doesn't look for pruning records any more.
Yet this test prunes the main node, and if I just try to prune the
external node instead it panics because it doesn't have that block yet.

Either the SQL should take pruning into account after all, or we have to
figure out a way to wait in the test until the main node executor is
running, or change the test to prune the external node; I did the
latter.
 
## Why ❔

This is coordinating attesters to cast their votes on the batch which
the main node tells them to do to produce a quorum certificate for all
L1 batches without gaps.

## Checklist

<!-- Check your PR fulfills the following items. -->
<!-- For draft PRs check the boxes as you complete them. -->

- [x] PR title corresponds to the body of PR (we generate changelog
entries from PRs).
- [x] Tests for the changes have been added / updated.
- [x] Documentation comments have been added / updated.
- [x] Code has been formatted via `zk fmt` and `zk lint`.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants