Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BFT-475: Add BatchVotePublisher and expose it to the Executor #137

Merged
merged 4 commits into from
Jun 27, 2024

Conversation

aakoshh
Copy link
Contributor

@aakoshh aakoshh commented Jun 26, 2024

What ❔

Adds a BatchVotePublisher which will be used in the Executor to push L1 batch votes to the gossip layer.

Why ❔

So that we can insert the nodes attestations over L1 batches into vote collection, which automatically gets gossiped to connected peers in Network::run_stream.

@aakoshh aakoshh changed the title BFT-475: Add BatchVotesWatch to the Executor BFT-475: Add BatchVotePublisher and expose it to the Executor Jun 26, 2024
@aakoshh aakoshh marked this pull request as ready for review June 26, 2024 18:45
@aakoshh aakoshh requested a review from pompon0 as a code owner June 26, 2024 18:45
@aakoshh aakoshh force-pushed the bft-475-gossip-attestation branch from 39da8a6 to ecda871 Compare June 26, 2024 18:47
node/actors/executor/src/lib.rs Show resolved Hide resolved
@aakoshh aakoshh merged commit 1778814 into main Jun 27, 2024
5 checks passed
@aakoshh aakoshh deleted the bft-475-gossip-attestation branch June 27, 2024 09:42
github-merge-queue bot pushed a commit to matter-labs/zksync-era that referenced this pull request Jul 8, 2024
## What ❔

- [x] Add an `l1_batches_consensus` table to hold [L1 batch Quorum
Certificates](https://github.com/matter-labs/era-consensus/blob/177881457f392fca990dbb3df1695737d90fd0c7/node/libs/roles/src/attester/messages/batch.rs#L67)
from Attesters
- [x] Add attesters to the config
- [x] Implement methods in `PersistentBatchStore` 
  - [x] `persisted`
  - [x] `last_batch`
  - [x] `last_batch_qc`
  - [x] `get_batch`
  - [x] `get_batch_qc`
  - [x] `store_qc`
  - [ ] `queue_next_batch` - _not going to implement for now_
  - [ ] assign `SyncBatch::proof` - _not going to implement for now_
- [x] Add tests for all new methods in `ConsensusDal` and the
`PersistentBatchStore`

### Caveat

Implemented the updating of `persisted` with a loop that polls the
database for newly available `SyncBatch` records, even if they have no
proof. This inevitably triggers the gossiping of batch statuses and the
pulling of `SyncBatch` between peers. For this reason `queue_next_batch`
just drop the data, since we can't do anything with it without the proof
yet. Returning an error or panicking would stop the consensus tasks.

I ended up disabling the `persisted` by leaving its dummy implementation
in place because when enabled the full node tests keep going on forever,
printing the following logs in a loop:

```console
❯ RUST_LOG=info zk test rust test_full_nodes --no-capture
...
2024-07-03T14:22:57.882784Z  INFO in{addr=[::1]:53082}: zksync_consensus_network: 191: new connection
2024-07-03T14:22:57.883457Z  INFO in{addr=[::1]:53082}:gossip: zksync_consensus_network::gossip::runner: 383: peer = node:public:ed25519:068ffa0b3fedbbe5c2a6da3defd26e0d084248f12bfe98db85f7785b0b08b63e
2024-07-03T14:22:57.883764Z  INFO out{addr="[::1]:52998"}:gossip: zksync_consensus_network::gossip::runner: 416: peer = node:public:ed25519:7710ed90aad9f5859dfba06e13fb4e6fb0fe4d686f81f9d819464ad1fdc371bd
2024-07-03T14:22:57.886204Z  INFO in{addr=[::1]:53082}:gossip: zksync_consensus_network::rpc: 222: message too large: max = 10240B, got 13773B
2024-07-03T14:22:57.886280Z  INFO out{addr="[::1]:52998"}:gossip: zksync_consensus_network::rpc: 222: message too large: max = 10240B, got 13773B
2024-07-03T14:22:57.886633Z  INFO in{addr=[::1]:53082}:gossip: zksync_consensus_network::rpc: 222: canceled
...
2024-07-03T14:22:57.888143Z  INFO out{addr="[::1]:52998"}:gossip: zksync_consensus_network::rpc: 222: disconnected
...
2024-07-03T14:22:57.888390Z  INFO zksync_consensus_network: 216: [::1]:53082: gossip.run_inbound_stream(): push_batch_store_state.: end of stream
2024-07-03T14:22:57.888446Z  INFO zksync_consensus_network: 158: gossip.run_outbound_stream("[::1]:52998"): push_batch_store_state.: end of stream
```

So in the tests the message size exceeds the maximum. I think it's
[hardcoded
here](https://github.com/matter-labs/era-consensus/blob/decb988eb9e1a45fd5171d2cc540a360d9ca5f1f/node/actors/network/src/gossip/runner.rs#L109).
Since this functionality isn't expected to work, I think we can disable
it for now.

## Why ❔

The workflow of signing and submitting L1 batch certificates will be
like this:
1. Data is inserted into the `l1_batches` table. 
2. If the node is one of the Attesters it picks up the batch, signs and
sends it to the gossip layer via
matter-labs/era-consensus#137
3. The consensus collects votes about the L1 batch, and when the
threshold is reached it saves the quorum certificate into Postgres
4. The node monitors Main Node (later L1) for new batch QCs and upserts
them into the database (the QC can be different than what a particular
node inserted based on gossip). This way a node which has been down for
a period of time can backfill any QCs it missed. It is assumed that the
Main Node API only serves QCs that have no gaps following them, ie. they
are final - if it was L1 it wouldn't allow submissions with gaps, and
this simulates that semantic.
5. The last height that doesn't have any gaps following it is used as a
floor for what needs to be (re)signed and gossiped

This PR supports the above workflow up to step 3.

## Checklist

<!-- Check your PR fulfills the following items. -->
<!-- For draft PRs check the boxes as you complete them. -->

- [x] PR title corresponds to the body of PR (we generate changelog
entries from PRs).
- [x] Tests for the changes have been added / updated.
- [x] Documentation comments have been added / updated.
- [x] Code has been formatted via `zk fmt` and `zk lint`.
github-merge-queue bot pushed a commit to matter-labs/zksync-era that referenced this pull request Jul 8, 2024
## What ❔

- [x] Add an `l1_batches_consensus` table to hold [L1 batch Quorum
Certificates](https://github.com/matter-labs/era-consensus/blob/177881457f392fca990dbb3df1695737d90fd0c7/node/libs/roles/src/attester/messages/batch.rs#L67)
from Attesters
- [x] Add attesters to the config
- [x] Implement methods in `PersistentBatchStore` 
  - [x] `persisted`
  - [x] `last_batch`
  - [x] `last_batch_qc`
  - [x] `get_batch`
  - [x] `get_batch_qc`
  - [x] `store_qc`
  - [ ] `queue_next_batch` - _not going to implement for now_
  - [ ] assign `SyncBatch::proof` - _not going to implement for now_
- [x] Add tests for all new methods in `ConsensusDal` and the
`PersistentBatchStore`

### Caveat

Implemented the updating of `persisted` with a loop that polls the
database for newly available `SyncBatch` records, even if they have no
proof. This inevitably triggers the gossiping of batch statuses and the
pulling of `SyncBatch` between peers. For this reason `queue_next_batch`
just drop the data, since we can't do anything with it without the proof
yet. Returning an error or panicking would stop the consensus tasks.

I ended up disabling the `persisted` by leaving its dummy implementation
in place because when enabled the full node tests keep going on forever,
printing the following logs in a loop:

```console
❯ RUST_LOG=info zk test rust test_full_nodes --no-capture
...
2024-07-03T14:22:57.882784Z  INFO in{addr=[::1]:53082}: zksync_consensus_network: 191: new connection
2024-07-03T14:22:57.883457Z  INFO in{addr=[::1]:53082}:gossip: zksync_consensus_network::gossip::runner: 383: peer = node:public:ed25519:068ffa0b3fedbbe5c2a6da3defd26e0d084248f12bfe98db85f7785b0b08b63e
2024-07-03T14:22:57.883764Z  INFO out{addr="[::1]:52998"}:gossip: zksync_consensus_network::gossip::runner: 416: peer = node:public:ed25519:7710ed90aad9f5859dfba06e13fb4e6fb0fe4d686f81f9d819464ad1fdc371bd
2024-07-03T14:22:57.886204Z  INFO in{addr=[::1]:53082}:gossip: zksync_consensus_network::rpc: 222: message too large: max = 10240B, got 13773B
2024-07-03T14:22:57.886280Z  INFO out{addr="[::1]:52998"}:gossip: zksync_consensus_network::rpc: 222: message too large: max = 10240B, got 13773B
2024-07-03T14:22:57.886633Z  INFO in{addr=[::1]:53082}:gossip: zksync_consensus_network::rpc: 222: canceled
...
2024-07-03T14:22:57.888143Z  INFO out{addr="[::1]:52998"}:gossip: zksync_consensus_network::rpc: 222: disconnected
...
2024-07-03T14:22:57.888390Z  INFO zksync_consensus_network: 216: [::1]:53082: gossip.run_inbound_stream(): push_batch_store_state.: end of stream
2024-07-03T14:22:57.888446Z  INFO zksync_consensus_network: 158: gossip.run_outbound_stream("[::1]:52998"): push_batch_store_state.: end of stream
```

So in the tests the message size exceeds the maximum. I think it's
[hardcoded
here](https://github.com/matter-labs/era-consensus/blob/decb988eb9e1a45fd5171d2cc540a360d9ca5f1f/node/actors/network/src/gossip/runner.rs#L109).
Since this functionality isn't expected to work, I think we can disable
it for now.

## Why ❔

The workflow of signing and submitting L1 batch certificates will be
like this:
1. Data is inserted into the `l1_batches` table. 
2. If the node is one of the Attesters it picks up the batch, signs and
sends it to the gossip layer via
matter-labs/era-consensus#137
3. The consensus collects votes about the L1 batch, and when the
threshold is reached it saves the quorum certificate into Postgres
4. The node monitors Main Node (later L1) for new batch QCs and upserts
them into the database (the QC can be different than what a particular
node inserted based on gossip). This way a node which has been down for
a period of time can backfill any QCs it missed. It is assumed that the
Main Node API only serves QCs that have no gaps following them, ie. they
are final - if it was L1 it wouldn't allow submissions with gaps, and
this simulates that semantic.
5. The last height that doesn't have any gaps following it is used as a
floor for what needs to be (re)signed and gossiped

This PR supports the above workflow up to step 3.

## Checklist

<!-- Check your PR fulfills the following items. -->
<!-- For draft PRs check the boxes as you complete them. -->

- [x] PR title corresponds to the body of PR (we generate changelog
entries from PRs).
- [x] Tests for the changes have been added / updated.
- [x] Documentation comments have been added / updated.
- [x] Code has been formatted via `zk fmt` and `zk lint`.
github-merge-queue bot pushed a commit to matter-labs/zksync-era that referenced this pull request Jul 8, 2024
## What ❔

- [x] Add an `l1_batches_consensus` table to hold [L1 batch Quorum
Certificates](https://github.com/matter-labs/era-consensus/blob/177881457f392fca990dbb3df1695737d90fd0c7/node/libs/roles/src/attester/messages/batch.rs#L67)
from Attesters
- [x] Add attesters to the config
- [x] Implement methods in `PersistentBatchStore` 
  - [x] `persisted`
  - [x] `last_batch`
  - [x] `last_batch_qc`
  - [x] `get_batch`
  - [x] `get_batch_qc`
  - [x] `store_qc`
  - [ ] `queue_next_batch` - _not going to implement for now_
  - [ ] assign `SyncBatch::proof` - _not going to implement for now_
- [x] Add tests for all new methods in `ConsensusDal` and the
`PersistentBatchStore`

### Caveat

Implemented the updating of `persisted` with a loop that polls the
database for newly available `SyncBatch` records, even if they have no
proof. This inevitably triggers the gossiping of batch statuses and the
pulling of `SyncBatch` between peers. For this reason `queue_next_batch`
just drop the data, since we can't do anything with it without the proof
yet. Returning an error or panicking would stop the consensus tasks.

I ended up disabling the `persisted` by leaving its dummy implementation
in place because when enabled the full node tests keep going on forever,
printing the following logs in a loop:

```console
❯ RUST_LOG=info zk test rust test_full_nodes --no-capture
...
2024-07-03T14:22:57.882784Z  INFO in{addr=[::1]:53082}: zksync_consensus_network: 191: new connection
2024-07-03T14:22:57.883457Z  INFO in{addr=[::1]:53082}:gossip: zksync_consensus_network::gossip::runner: 383: peer = node:public:ed25519:068ffa0b3fedbbe5c2a6da3defd26e0d084248f12bfe98db85f7785b0b08b63e
2024-07-03T14:22:57.883764Z  INFO out{addr="[::1]:52998"}:gossip: zksync_consensus_network::gossip::runner: 416: peer = node:public:ed25519:7710ed90aad9f5859dfba06e13fb4e6fb0fe4d686f81f9d819464ad1fdc371bd
2024-07-03T14:22:57.886204Z  INFO in{addr=[::1]:53082}:gossip: zksync_consensus_network::rpc: 222: message too large: max = 10240B, got 13773B
2024-07-03T14:22:57.886280Z  INFO out{addr="[::1]:52998"}:gossip: zksync_consensus_network::rpc: 222: message too large: max = 10240B, got 13773B
2024-07-03T14:22:57.886633Z  INFO in{addr=[::1]:53082}:gossip: zksync_consensus_network::rpc: 222: canceled
...
2024-07-03T14:22:57.888143Z  INFO out{addr="[::1]:52998"}:gossip: zksync_consensus_network::rpc: 222: disconnected
...
2024-07-03T14:22:57.888390Z  INFO zksync_consensus_network: 216: [::1]:53082: gossip.run_inbound_stream(): push_batch_store_state.: end of stream
2024-07-03T14:22:57.888446Z  INFO zksync_consensus_network: 158: gossip.run_outbound_stream("[::1]:52998"): push_batch_store_state.: end of stream
```

So in the tests the message size exceeds the maximum. I think it's
[hardcoded
here](https://github.com/matter-labs/era-consensus/blob/decb988eb9e1a45fd5171d2cc540a360d9ca5f1f/node/actors/network/src/gossip/runner.rs#L109).
Since this functionality isn't expected to work, I think we can disable
it for now.

## Why ❔

The workflow of signing and submitting L1 batch certificates will be
like this:
1. Data is inserted into the `l1_batches` table. 
2. If the node is one of the Attesters it picks up the batch, signs and
sends it to the gossip layer via
matter-labs/era-consensus#137
3. The consensus collects votes about the L1 batch, and when the
threshold is reached it saves the quorum certificate into Postgres
4. The node monitors Main Node (later L1) for new batch QCs and upserts
them into the database (the QC can be different than what a particular
node inserted based on gossip). This way a node which has been down for
a period of time can backfill any QCs it missed. It is assumed that the
Main Node API only serves QCs that have no gaps following them, ie. they
are final - if it was L1 it wouldn't allow submissions with gaps, and
this simulates that semantic.
5. The last height that doesn't have any gaps following it is used as a
floor for what needs to be (re)signed and gossiped

This PR supports the above workflow up to step 3.

## Checklist

<!-- Check your PR fulfills the following items. -->
<!-- For draft PRs check the boxes as you complete them. -->

- [x] PR title corresponds to the body of PR (we generate changelog
entries from PRs).
- [x] Tests for the changes have been added / updated.
- [x] Documentation comments have been added / updated.
- [x] Code has been formatted via `zk fmt` and `zk lint`.
github-merge-queue bot pushed a commit to matter-labs/zksync-era that referenced this pull request Jul 8, 2024
## What ❔

- [x] Add an `l1_batches_consensus` table to hold [L1 batch Quorum
Certificates](https://github.com/matter-labs/era-consensus/blob/177881457f392fca990dbb3df1695737d90fd0c7/node/libs/roles/src/attester/messages/batch.rs#L67)
from Attesters
- [x] Add attesters to the config
- [x] Implement methods in `PersistentBatchStore` 
  - [x] `persisted`
  - [x] `last_batch`
  - [x] `last_batch_qc`
  - [x] `get_batch`
  - [x] `get_batch_qc`
  - [x] `store_qc`
  - [ ] `queue_next_batch` - _not going to implement for now_
  - [ ] assign `SyncBatch::proof` - _not going to implement for now_
- [x] Add tests for all new methods in `ConsensusDal` and the
`PersistentBatchStore`

### Caveat

Implemented the updating of `persisted` with a loop that polls the
database for newly available `SyncBatch` records, even if they have no
proof. This inevitably triggers the gossiping of batch statuses and the
pulling of `SyncBatch` between peers. For this reason `queue_next_batch`
just drop the data, since we can't do anything with it without the proof
yet. Returning an error or panicking would stop the consensus tasks.

I ended up disabling the `persisted` by leaving its dummy implementation
in place because when enabled the full node tests keep going on forever,
printing the following logs in a loop:

```console
❯ RUST_LOG=info zk test rust test_full_nodes --no-capture
...
2024-07-03T14:22:57.882784Z  INFO in{addr=[::1]:53082}: zksync_consensus_network: 191: new connection
2024-07-03T14:22:57.883457Z  INFO in{addr=[::1]:53082}:gossip: zksync_consensus_network::gossip::runner: 383: peer = node:public:ed25519:068ffa0b3fedbbe5c2a6da3defd26e0d084248f12bfe98db85f7785b0b08b63e
2024-07-03T14:22:57.883764Z  INFO out{addr="[::1]:52998"}:gossip: zksync_consensus_network::gossip::runner: 416: peer = node:public:ed25519:7710ed90aad9f5859dfba06e13fb4e6fb0fe4d686f81f9d819464ad1fdc371bd
2024-07-03T14:22:57.886204Z  INFO in{addr=[::1]:53082}:gossip: zksync_consensus_network::rpc: 222: message too large: max = 10240B, got 13773B
2024-07-03T14:22:57.886280Z  INFO out{addr="[::1]:52998"}:gossip: zksync_consensus_network::rpc: 222: message too large: max = 10240B, got 13773B
2024-07-03T14:22:57.886633Z  INFO in{addr=[::1]:53082}:gossip: zksync_consensus_network::rpc: 222: canceled
...
2024-07-03T14:22:57.888143Z  INFO out{addr="[::1]:52998"}:gossip: zksync_consensus_network::rpc: 222: disconnected
...
2024-07-03T14:22:57.888390Z  INFO zksync_consensus_network: 216: [::1]:53082: gossip.run_inbound_stream(): push_batch_store_state.: end of stream
2024-07-03T14:22:57.888446Z  INFO zksync_consensus_network: 158: gossip.run_outbound_stream("[::1]:52998"): push_batch_store_state.: end of stream
```

So in the tests the message size exceeds the maximum. I think it's
[hardcoded
here](https://github.com/matter-labs/era-consensus/blob/decb988eb9e1a45fd5171d2cc540a360d9ca5f1f/node/actors/network/src/gossip/runner.rs#L109).
Since this functionality isn't expected to work, I think we can disable
it for now.

## Why ❔

The workflow of signing and submitting L1 batch certificates will be
like this:
1. Data is inserted into the `l1_batches` table. 
2. If the node is one of the Attesters it picks up the batch, signs and
sends it to the gossip layer via
matter-labs/era-consensus#137
3. The consensus collects votes about the L1 batch, and when the
threshold is reached it saves the quorum certificate into Postgres
4. The node monitors Main Node (later L1) for new batch QCs and upserts
them into the database (the QC can be different than what a particular
node inserted based on gossip). This way a node which has been down for
a period of time can backfill any QCs it missed. It is assumed that the
Main Node API only serves QCs that have no gaps following them, ie. they
are final - if it was L1 it wouldn't allow submissions with gaps, and
this simulates that semantic.
5. The last height that doesn't have any gaps following it is used as a
floor for what needs to be (re)signed and gossiped

This PR supports the above workflow up to step 3.

## Checklist

<!-- Check your PR fulfills the following items. -->
<!-- For draft PRs check the boxes as you complete them. -->

- [x] PR title corresponds to the body of PR (we generate changelog
entries from PRs).
- [x] Tests for the changes have been added / updated.
- [x] Documentation comments have been added / updated.
- [x] Code has been formatted via `zk fmt` and `zk lint`.
github-merge-queue bot pushed a commit to matter-labs/zksync-era that referenced this pull request Jul 8, 2024
## What ❔

- [x] Add an `l1_batches_consensus` table to hold [L1 batch Quorum
Certificates](https://github.com/matter-labs/era-consensus/blob/177881457f392fca990dbb3df1695737d90fd0c7/node/libs/roles/src/attester/messages/batch.rs#L67)
from Attesters
- [x] Add attesters to the config
- [x] Implement methods in `PersistentBatchStore` 
  - [x] `persisted`
  - [x] `last_batch`
  - [x] `last_batch_qc`
  - [x] `get_batch`
  - [x] `get_batch_qc`
  - [x] `store_qc`
  - [ ] `queue_next_batch` - _not going to implement for now_
  - [ ] assign `SyncBatch::proof` - _not going to implement for now_
- [x] Add tests for all new methods in `ConsensusDal` and the
`PersistentBatchStore`

### Caveat

Implemented the updating of `persisted` with a loop that polls the
database for newly available `SyncBatch` records, even if they have no
proof. This inevitably triggers the gossiping of batch statuses and the
pulling of `SyncBatch` between peers. For this reason `queue_next_batch`
just drop the data, since we can't do anything with it without the proof
yet. Returning an error or panicking would stop the consensus tasks.

I ended up disabling the `persisted` by leaving its dummy implementation
in place because when enabled the full node tests keep going on forever,
printing the following logs in a loop:

```console
❯ RUST_LOG=info zk test rust test_full_nodes --no-capture
...
2024-07-03T14:22:57.882784Z  INFO in{addr=[::1]:53082}: zksync_consensus_network: 191: new connection
2024-07-03T14:22:57.883457Z  INFO in{addr=[::1]:53082}:gossip: zksync_consensus_network::gossip::runner: 383: peer = node:public:ed25519:068ffa0b3fedbbe5c2a6da3defd26e0d084248f12bfe98db85f7785b0b08b63e
2024-07-03T14:22:57.883764Z  INFO out{addr="[::1]:52998"}:gossip: zksync_consensus_network::gossip::runner: 416: peer = node:public:ed25519:7710ed90aad9f5859dfba06e13fb4e6fb0fe4d686f81f9d819464ad1fdc371bd
2024-07-03T14:22:57.886204Z  INFO in{addr=[::1]:53082}:gossip: zksync_consensus_network::rpc: 222: message too large: max = 10240B, got 13773B
2024-07-03T14:22:57.886280Z  INFO out{addr="[::1]:52998"}:gossip: zksync_consensus_network::rpc: 222: message too large: max = 10240B, got 13773B
2024-07-03T14:22:57.886633Z  INFO in{addr=[::1]:53082}:gossip: zksync_consensus_network::rpc: 222: canceled
...
2024-07-03T14:22:57.888143Z  INFO out{addr="[::1]:52998"}:gossip: zksync_consensus_network::rpc: 222: disconnected
...
2024-07-03T14:22:57.888390Z  INFO zksync_consensus_network: 216: [::1]:53082: gossip.run_inbound_stream(): push_batch_store_state.: end of stream
2024-07-03T14:22:57.888446Z  INFO zksync_consensus_network: 158: gossip.run_outbound_stream("[::1]:52998"): push_batch_store_state.: end of stream
```

So in the tests the message size exceeds the maximum. I think it's
[hardcoded
here](https://github.com/matter-labs/era-consensus/blob/decb988eb9e1a45fd5171d2cc540a360d9ca5f1f/node/actors/network/src/gossip/runner.rs#L109).
Since this functionality isn't expected to work, I think we can disable
it for now.

## Why ❔

The workflow of signing and submitting L1 batch certificates will be
like this:
1. Data is inserted into the `l1_batches` table. 
2. If the node is one of the Attesters it picks up the batch, signs and
sends it to the gossip layer via
matter-labs/era-consensus#137
3. The consensus collects votes about the L1 batch, and when the
threshold is reached it saves the quorum certificate into Postgres
4. The node monitors Main Node (later L1) for new batch QCs and upserts
them into the database (the QC can be different than what a particular
node inserted based on gossip). This way a node which has been down for
a period of time can backfill any QCs it missed. It is assumed that the
Main Node API only serves QCs that have no gaps following them, ie. they
are final - if it was L1 it wouldn't allow submissions with gaps, and
this simulates that semantic.
5. The last height that doesn't have any gaps following it is used as a
floor for what needs to be (re)signed and gossiped

This PR supports the above workflow up to step 3.

## Checklist

<!-- Check your PR fulfills the following items. -->
<!-- For draft PRs check the boxes as you complete them. -->

- [x] PR title corresponds to the body of PR (we generate changelog
entries from PRs).
- [x] Tests for the changes have been added / updated.
- [x] Documentation comments have been added / updated.
- [x] Code has been formatted via `zk fmt` and `zk lint`.

---------

Co-authored-by: Bruno França <[email protected]>
github-merge-queue bot pushed a commit to matter-labs/zksync-era that referenced this pull request Jul 9, 2024
## What ❔

- [x] Add an `l1_batches_consensus` table to hold [L1 batch Quorum
Certificates](https://github.com/matter-labs/era-consensus/blob/177881457f392fca990dbb3df1695737d90fd0c7/node/libs/roles/src/attester/messages/batch.rs#L67)
from Attesters
- [x] Add attesters to the config
- [x] Implement methods in `PersistentBatchStore` 
  - [x] `persisted`
  - [x] `last_batch`
  - [x] `last_batch_qc`
  - [x] `get_batch`
  - [x] `get_batch_qc`
  - [x] `store_qc`
  - [ ] `queue_next_batch` - _not going to implement for now_
  - [ ] assign `SyncBatch::proof` - _not going to implement for now_
- [x] Add tests for all new methods in `ConsensusDal` and the
`PersistentBatchStore`

### Caveat

Implemented the updating of `persisted` with a loop that polls the
database for newly available `SyncBatch` records, even if they have no
proof. This inevitably triggers the gossiping of batch statuses and the
pulling of `SyncBatch` between peers. For this reason `queue_next_batch`
just drop the data, since we can't do anything with it without the proof
yet. Returning an error or panicking would stop the consensus tasks.

I ended up disabling the `persisted` by leaving its dummy implementation
in place because when enabled the full node tests keep going on forever,
printing the following logs in a loop:

```console
❯ RUST_LOG=info zk test rust test_full_nodes --no-capture
...
2024-07-03T14:22:57.882784Z  INFO in{addr=[::1]:53082}: zksync_consensus_network: 191: new connection
2024-07-03T14:22:57.883457Z  INFO in{addr=[::1]:53082}:gossip: zksync_consensus_network::gossip::runner: 383: peer = node:public:ed25519:068ffa0b3fedbbe5c2a6da3defd26e0d084248f12bfe98db85f7785b0b08b63e
2024-07-03T14:22:57.883764Z  INFO out{addr="[::1]:52998"}:gossip: zksync_consensus_network::gossip::runner: 416: peer = node:public:ed25519:7710ed90aad9f5859dfba06e13fb4e6fb0fe4d686f81f9d819464ad1fdc371bd
2024-07-03T14:22:57.886204Z  INFO in{addr=[::1]:53082}:gossip: zksync_consensus_network::rpc: 222: message too large: max = 10240B, got 13773B
2024-07-03T14:22:57.886280Z  INFO out{addr="[::1]:52998"}:gossip: zksync_consensus_network::rpc: 222: message too large: max = 10240B, got 13773B
2024-07-03T14:22:57.886633Z  INFO in{addr=[::1]:53082}:gossip: zksync_consensus_network::rpc: 222: canceled
...
2024-07-03T14:22:57.888143Z  INFO out{addr="[::1]:52998"}:gossip: zksync_consensus_network::rpc: 222: disconnected
...
2024-07-03T14:22:57.888390Z  INFO zksync_consensus_network: 216: [::1]:53082: gossip.run_inbound_stream(): push_batch_store_state.: end of stream
2024-07-03T14:22:57.888446Z  INFO zksync_consensus_network: 158: gossip.run_outbound_stream("[::1]:52998"): push_batch_store_state.: end of stream
```

So in the tests the message size exceeds the maximum. I think it's
[hardcoded
here](https://github.com/matter-labs/era-consensus/blob/decb988eb9e1a45fd5171d2cc540a360d9ca5f1f/node/actors/network/src/gossip/runner.rs#L109).
Since this functionality isn't expected to work, I think we can disable
it for now.

## Why ❔

The workflow of signing and submitting L1 batch certificates will be
like this:
1. Data is inserted into the `l1_batches` table. 
2. If the node is one of the Attesters it picks up the batch, signs and
sends it to the gossip layer via
matter-labs/era-consensus#137
3. The consensus collects votes about the L1 batch, and when the
threshold is reached it saves the quorum certificate into Postgres
4. The node monitors Main Node (later L1) for new batch QCs and upserts
them into the database (the QC can be different than what a particular
node inserted based on gossip). This way a node which has been down for
a period of time can backfill any QCs it missed. It is assumed that the
Main Node API only serves QCs that have no gaps following them, ie. they
are final - if it was L1 it wouldn't allow submissions with gaps, and
this simulates that semantic.
5. The last height that doesn't have any gaps following it is used as a
floor for what needs to be (re)signed and gossiped

This PR supports the above workflow up to step 3.

## Checklist

<!-- Check your PR fulfills the following items. -->
<!-- For draft PRs check the boxes as you complete them. -->

- [x] PR title corresponds to the body of PR (we generate changelog
entries from PRs).
- [x] Tests for the changes have been added / updated.
- [x] Documentation comments have been added / updated.
- [x] Code has been formatted via `zk fmt` and `zk lint`.

---------

Co-authored-by: Bruno França <[email protected]>
irnb pushed a commit to vianetwork/via-server that referenced this pull request Jul 12, 2024
## What ❔

- [x] Add an `l1_batches_consensus` table to hold [L1 batch Quorum
Certificates](https://github.com/matter-labs/era-consensus/blob/177881457f392fca990dbb3df1695737d90fd0c7/node/libs/roles/src/attester/messages/batch.rs#L67)
from Attesters
- [x] Add attesters to the config
- [x] Implement methods in `PersistentBatchStore` 
  - [x] `persisted`
  - [x] `last_batch`
  - [x] `last_batch_qc`
  - [x] `get_batch`
  - [x] `get_batch_qc`
  - [x] `store_qc`
  - [ ] `queue_next_batch` - _not going to implement for now_
  - [ ] assign `SyncBatch::proof` - _not going to implement for now_
- [x] Add tests for all new methods in `ConsensusDal` and the
`PersistentBatchStore`

### Caveat

Implemented the updating of `persisted` with a loop that polls the
database for newly available `SyncBatch` records, even if they have no
proof. This inevitably triggers the gossiping of batch statuses and the
pulling of `SyncBatch` between peers. For this reason `queue_next_batch`
just drop the data, since we can't do anything with it without the proof
yet. Returning an error or panicking would stop the consensus tasks.

I ended up disabling the `persisted` by leaving its dummy implementation
in place because when enabled the full node tests keep going on forever,
printing the following logs in a loop:

```console
❯ RUST_LOG=info zk test rust test_full_nodes --no-capture
...
2024-07-03T14:22:57.882784Z  INFO in{addr=[::1]:53082}: zksync_consensus_network: 191: new connection
2024-07-03T14:22:57.883457Z  INFO in{addr=[::1]:53082}:gossip: zksync_consensus_network::gossip::runner: 383: peer = node:public:ed25519:068ffa0b3fedbbe5c2a6da3defd26e0d084248f12bfe98db85f7785b0b08b63e
2024-07-03T14:22:57.883764Z  INFO out{addr="[::1]:52998"}:gossip: zksync_consensus_network::gossip::runner: 416: peer = node:public:ed25519:7710ed90aad9f5859dfba06e13fb4e6fb0fe4d686f81f9d819464ad1fdc371bd
2024-07-03T14:22:57.886204Z  INFO in{addr=[::1]:53082}:gossip: zksync_consensus_network::rpc: 222: message too large: max = 10240B, got 13773B
2024-07-03T14:22:57.886280Z  INFO out{addr="[::1]:52998"}:gossip: zksync_consensus_network::rpc: 222: message too large: max = 10240B, got 13773B
2024-07-03T14:22:57.886633Z  INFO in{addr=[::1]:53082}:gossip: zksync_consensus_network::rpc: 222: canceled
...
2024-07-03T14:22:57.888143Z  INFO out{addr="[::1]:52998"}:gossip: zksync_consensus_network::rpc: 222: disconnected
...
2024-07-03T14:22:57.888390Z  INFO zksync_consensus_network: 216: [::1]:53082: gossip.run_inbound_stream(): push_batch_store_state.: end of stream
2024-07-03T14:22:57.888446Z  INFO zksync_consensus_network: 158: gossip.run_outbound_stream("[::1]:52998"): push_batch_store_state.: end of stream
```

So in the tests the message size exceeds the maximum. I think it's
[hardcoded
here](https://github.com/matter-labs/era-consensus/blob/decb988eb9e1a45fd5171d2cc540a360d9ca5f1f/node/actors/network/src/gossip/runner.rs#L109).
Since this functionality isn't expected to work, I think we can disable
it for now.

## Why ❔

The workflow of signing and submitting L1 batch certificates will be
like this:
1. Data is inserted into the `l1_batches` table. 
2. If the node is one of the Attesters it picks up the batch, signs and
sends it to the gossip layer via
matter-labs/era-consensus#137
3. The consensus collects votes about the L1 batch, and when the
threshold is reached it saves the quorum certificate into Postgres
4. The node monitors Main Node (later L1) for new batch QCs and upserts
them into the database (the QC can be different than what a particular
node inserted based on gossip). This way a node which has been down for
a period of time can backfill any QCs it missed. It is assumed that the
Main Node API only serves QCs that have no gaps following them, ie. they
are final - if it was L1 it wouldn't allow submissions with gaps, and
this simulates that semantic.
5. The last height that doesn't have any gaps following it is used as a
floor for what needs to be (re)signed and gossiped

This PR supports the above workflow up to step 3.

## Checklist

<!-- Check your PR fulfills the following items. -->
<!-- For draft PRs check the boxes as you complete them. -->

- [x] PR title corresponds to the body of PR (we generate changelog
entries from PRs).
- [x] Tests for the changes have been added / updated.
- [x] Documentation comments have been added / updated.
- [x] Code has been formatted via `zk fmt` and `zk lint`.

---------

Co-authored-by: Bruno França <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants