Collation fetching fairness #4880

Merged
merged 158 commits into from
Dec 13, 2024


@tdimitrov tdimitrov commented Jun 26, 2024

Related to #1797

The problem

When fetching collations on the validator side of the collator protocol, we need to ensure that each parachain gets a fair share of core time, in proportion to its assignments in the claim queue. This means that the number of collations fetched per parachain should ideally be equal to (and definitely not bigger than) the number of claims for that parachain in the claim queue.

Why the current implementation is not good enough

The current implementation doesn't guarantee such fairness. For each relay parent there is a waiting_queue (PerRelayParent -> Collations -> waiting_queue) which holds any unfetched collations advertised to the validator. The collations are fetched on a first-in, first-out basis, which means that if two parachains share a core and one of them is more aggressive, it might starve the other. How? At each relay parent up to max_candidate_depth candidates are accepted (enforced in fn is_seconded_limit_reached), so if one of the parachains is quick enough to fill the queue with its advertisements, the validator will never fetch anything from the rest of the parachains even though they are scheduled. This doesn't mean that the aggressive parachain will occupy all the core time (the runtime guarantees that it won't), but it will prevent the other parachains sharing the same core from having collations backed.
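The starvation can be illustrated with a small sketch. It is not the subsystem's code: `fetch_fifo` is a hypothetical helper, paras are plain `char`s, and `limit` stands in for the per relay parent cap (max_candidate_depth in the text above).

```rust
use std::collections::VecDeque;

/// Pops advertisements in arrival order until `limit` have been fetched.
/// `limit` is a stand-in for the per relay parent cap (an assumption of this sketch).
fn fetch_fifo(waiting_queue: &mut VecDeque<char>, limit: usize) -> Vec<char> {
    let mut fetched = Vec::new();
    while fetched.len() < limit {
        match waiting_queue.pop_front() {
            Some(para) => fetched.push(para),
            None => break,
        }
    }
    fetched
}
```

With a queue of A, A, A, B and a limit of 3, every fetch goes to para A and the advertisement for B is left in the queue, even if B has a claim on the core.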

How to fix it

The solution I am proposing is to limit fetches and advertisements based on the state of the claim queue. At each relay parent the claim queue for the core assigned to the validator is fetched. For each parachain a fetch limit is calculated, equal to the number of entries for that parachain in the claim queue. Advertisements are not fetched for a parachain which has used up its claims in the claim queue. This solves the problem of aggressive parachains advertising too many collations.
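A minimal sketch of the per-parachain fetch limit, assuming the claim queue is given as a slice of para identifiers (plain `char`s here). `fetch_limits` and `can_fetch` are hypothetical names used only for illustration.

```rust
use std::collections::HashMap;

/// Counts the claim queue entries per para: this is the fetch limit for each para.
fn fetch_limits(claim_queue: &[char]) -> HashMap<char, usize> {
    let mut limits = HashMap::new();
    for &para in claim_queue {
        *limits.entry(para).or_insert(0) += 1;
    }
    limits
}

/// Returns true if `para` has fetched fewer collations than it has claims.
/// A para with no claims at all has a limit of zero.
fn can_fetch(limits: &HashMap<char, usize>, fetched: &HashMap<char, usize>, para: char) -> bool {
    let limit = limits.get(&para).copied().unwrap_or(0);
    let done = fetched.get(&para).copied().unwrap_or(0);
    done < limit
}
```

For a claim queue of A, B, A the limits are 2 for A and 1 for B, so a third advertisement from A is not fetched no matter how early it arrives.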

The second part is in the collation fetching logic. The validator will keep track of which collations it has fetched so far. When a new collation needs to be fetched, instead of popping the first entry from the waiting_queue, the validator examines the claim queue and looks for the earliest claim which doesn't have a corresponding fetch. This way the validator will always try to prioritise the most urgent entries.
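The selection step could look roughly like this. It is a sketch under stated assumptions: `next_para_to_fetch` is a hypothetical function, `fetched` holds the number of collations already fetched per para, and `waiting` holds the number of unfetched advertisements per para.

```rust
use std::collections::HashMap;

/// Walks the claim queue from the front and returns the para of the earliest claim
/// that is not covered by a previous fetch and has an advertisement waiting.
fn next_para_to_fetch(
    claim_queue: &[char],
    fetched: &HashMap<char, usize>,
    waiting: &HashMap<char, usize>,
) -> Option<char> {
    // Work on a copy so that each earlier fetch covers exactly one claim.
    let mut remaining = fetched.clone();
    for &para in claim_queue {
        if let Some(count) = remaining.get_mut(&para) {
            if *count > 0 {
                // This claim already has a corresponding fetch.
                *count -= 1;
                continue;
            }
        }
        if waiting.get(&para).copied().unwrap_or(0) > 0 {
            return Some(para);
        }
    }
    None
}
```

With a claim queue of A, B, A and one collation already fetched for A, the next fetch goes to B if B has advertised anything, and only otherwise to the second claim of A.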

How is the 'fair share of coretime' for each parachain determined?

Thanks to async backing we can accept more than one candidate per relay parent (with some constraints). We also have the claim queue, which gives us a hint about which parachain will be scheduled next on each core. So thanks to the claim queue we can determine the maximum number of claims per parachain.

For example, if the claim queue is [A A A] at relay parent X, we know that at relay parent X we can accept up to three candidates for parachain A. There are two things to consider though:

  1. If we accept more than one candidate at relay parent X we are claiming the slot of a future relay parent. So accepting two candidates for relay parent X means that we are claiming the slot at rp X+1 or rp X+2.
  2. At the same time the slot at relay parent X could have been claimed by a previous relay parent (or parents). This means that we need to accept fewer candidates at X, or even none.

There are a few cases worth considering:

  1. Slot claimed by previous relay parent.
    CQ @ rp X: [A A A]
    Advertisements at X-1 for para A: 2
    Advertisements at X-2 for para A: 2
    Outcome: at rp X we can accept only 1 advertisement since our slots were already claimed.
  2. Slot in our claim queue already claimed at future relay parent
    CQ @ rp X: [A A A]
    Advertisements at X+1 for para A: 1
    Advertisements at X+2 for para A: 1
    Outcome: at rp X we can accept only 1 advertisement since the slots in our relay parents were already claimed.
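One way to read the two cases above is as a slot model. This is an assumption of the sketch, not a description of the code in this PR: relay parents are plain block numbers, there is a single para, every relay parent has a claim queue of the same length, and each advertisement claims the earliest free slot in the window that starts at its own relay parent. `claim_earliest` and `free_slots_at` are hypothetical names.

```rust
use std::collections::BTreeSet;

/// Claims the earliest free slot in the window `rp..rp + cq_len`.
/// Returns false if every slot in the window is already claimed.
fn claim_earliest(claimed: &mut BTreeSet<u32>, rp: u32, cq_len: u32) -> bool {
    for slot in rp..rp + cq_len {
        if claimed.insert(slot) {
            return true;
        }
    }
    false
}

/// Number of slots still free in the window of `rp`, given `(relay parent, count)`
/// pairs of advertisements accepted at other relay parents.
fn free_slots_at(rp: u32, cq_len: u32, ads: &[(u32, u32)]) -> u32 {
    let mut claimed = BTreeSet::new();
    let mut ordered = ads.to_vec();
    ordered.sort(); // older relay parents claim their slots first
    for (at, count) in ordered {
        for _ in 0..count {
            claim_earliest(&mut claimed, at, cq_len);
        }
    }
    (rp..rp + cq_len).filter(|slot| !claimed.contains(slot)).count() as u32
}
```

Taking X = 10 and a claim queue of length 3, two advertisements at each of X-2 and X-1 leave one free slot at X, and one advertisement at each of X+1 and X+2 also leaves one free slot, which matches both outcomes listed above.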

The situation becomes more complicated with multiple leaves (forks). Imagine we have got a fork at rp X:

CQ @ rp X: [A A A]
(rp X) -> (rp X+1) -> rp(X+2)
         \-> (rp X+1')

Now when we examine the claim queue at rp X we need to consider both forks. Accepting a candidate at X means that we should have a slot for it in BOTH leaves. If, for example, three candidates are accepted at rp X+1', we can't accept any candidates at rp X because there will be no slot for them in one of the leaves.
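The fork rule described here can be sketched as taking the worst case over all leaves. The function name `free_slots_across_leaves` is hypothetical, and the input is taken as given: for each leaf, the number of slots in the claim queue at rp X that are already claimed along the path to that leaf.

```rust
/// Returns how many more candidates can be accepted at rp X so that each of them
/// still has a slot on every leaf. `claimed_per_leaf[i]` is the number of slots of
/// the claim queue at rp X already claimed along leaf `i` (an input of this sketch).
fn free_slots_across_leaves(cq_len: usize, claimed_per_leaf: &[usize]) -> usize {
    // The leaf with the most claims decides, because a candidate must fit on all leaves.
    let worst = claimed_per_leaf.iter().copied().max().unwrap_or(0);
    cq_len.saturating_sub(worst)
}
```

For the fork above, a claim queue of length 3 with no claims on one leaf and three on the other gives zero free slots at rp X. The review later in this thread mentions a switch from "space in all paths" to "space in at least one path"; this sketch follows the rule as written in the description.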

How the claims are counted

There are two solutions for counting the claims at relay parent X:

  1. Keep a state for the claim queue (number of claims and which of them are claimed) and look it up when accepting a collation. With this approach we need to keep the state up to date with each new advertisement and each new leaf update.
  2. Calculate the state of the claim queue on the fly. This way we rebuild the state of the claim queue at each advertisement.

Solution 1 is hard to implement with forks. There are too many variants to keep track of (a different state for each leaf) and at the same time we might never need to use them. So I decided to go with solution 2: building the claim queue state on the fly.

To achieve this I've extended View from backing_implicit_view to keep track of the outer leaves. I've also added a method which accepts a relay parent and returns all paths from an outer leaf to it. Let's call it paths_to_relay_parent.
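The idea behind such a lookup can be sketched on a plain parent map. This is not the View API: the standalone function below, its signature and the `u32` block identifiers are all assumptions made for illustration, and the parent map is assumed to contain no cycles.

```rust
use std::collections::HashMap;

/// Stand-in for a block hash (an assumption of this sketch).
type Hash = u32;

/// For every leaf that has `target` in its ancestry, returns the path from `target`
/// to that leaf, oldest block first. `parents` maps a block to its parent.
fn paths_to_relay_parent(
    parents: &HashMap<Hash, Hash>,
    leaves: &[Hash],
    target: Hash,
) -> Vec<Vec<Hash>> {
    let mut paths = Vec::new();
    for &leaf in leaves {
        let mut path = vec![leaf];
        let mut current = leaf;
        // Walk towards the root until we reach `target` or run out of known ancestors.
        while current != target {
            match parents.get(&current) {
                Some(&parent) => {
                    path.push(parent);
                    current = parent;
                }
                None => break,
            }
        }
        if current == target {
            path.reverse();
            paths.push(path);
        }
    }
    paths
}
```

For the fork example with blocks X = 10, X+1 = 11, X+2 = 12 and X+1' = 21, the leaves are 12 and 21. Asking for paths to 10 yields [10, 11, 12] and [10, 21]; asking for paths to 11 yields only [11, 12].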

So how does the counting work for relay parent X? First we examine the number of seconded and pending advertisements (more on pending in a second) from relay parent X to relay parent X-N (inclusive), where N is the length of the claim queue. Then we use paths_to_relay_parent to obtain all paths from outer leaves to relay parent X. We calculate the claims at relay parents X+1 to X+N (inclusive) for each leaf and take the maximum value. This way we guarantee that the candidate at rp X can be included in each leaf. This is the state of the claim queue which we use to decide whether we can fetch one more advertisement at rp X.

What is a pending advertisement

I mentioned that we count seconded and pending advertisements at relay parent X. A pending advertisement is:

  1. An advertisement which is being fetched right now.
  2. An advertisement pending validation at the backing subsystem.
  3. An advertisement blocked for seconding by backing because we don't know one of its parent heads.

Any of these is considered a 'pending fetch' and a slot for it is kept. All of them are already tracked in State.
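As an illustration, the rule "seconded and pending advertisements hold a slot" can be written as a small enum. The type and variant names below are hypothetical and do not mirror how State tracks these cases.

```rust
/// Lifecycle of an advertisement in this sketch (hypothetical names).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum AdvertisementState {
    /// Advertised but not fetched yet: holds no slot.
    Waiting,
    /// Being fetched right now.
    Fetching,
    /// Fetched and pending validation at the backing subsystem.
    PendingValidation,
    /// Blocked for seconding because one of its parent heads is unknown.
    BlockedOnParent,
    /// Seconded.
    Seconded,
}

/// Pending and seconded advertisements keep a slot; waiting ones do not.
fn holds_slot(state: AdvertisementState) -> bool {
    !matches!(state, AdvertisementState::Waiting)
}

/// Number of slots held by the given advertisements.
fn claimed_slots(states: &[AdvertisementState]) -> usize {
    states.iter().filter(|state| holds_slot(**state)).count()
}
```

Counting the three pending states together with seconded ones means a slot is reserved as soon as a fetch starts, so a second fetch cannot take the same slot while the first is still in flight.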

@tdimitrov tdimitrov added the T8-polkadot This PR/Issue is related to/affects the Polkadot network. label Jun 26, 2024
@tdimitrov tdimitrov force-pushed the tsv-collator-proto-fairness branch from c7f24aa to 0f28aa8 Compare June 28, 2024 08:19
Contributor

@Overkillus Overkillus left a comment

Very nice and clear PR description and a much welcome refactor, GJ

I added some comments and questions, but nothing major.


Slot in our claim queue already claimed at future relay parent
CQ @ rp X: [A A A]
Advertisements at X+1 for para A: 1
Advertisements at X+2 for para A: 1
Outcome: at rp X we can accept only 1 advertisement since the slots in our relay parents were already claimed.

For my own curiosity, when would this happen? What order of collator adverts leads to this scenario?

(X->X+2 means collation anchored at rp X but claiming the slot at X+2)

  • Collator A builds collations for X->X, X->X+1, X->X+2
  • Collator A in time sends the collation X->X to next Collator B
  • Collator A then crashes and does not send X->X+1,X->X+2 to Collator B
  • Because he crashed, Collator A also does not send his collations to validators
  • Collator B since he has not seen X->X+1 or X->X+2, builds his own X+1->X+1, but is lazy so he does not make X+1->X+2
  • Collator B sends X+1->X+1 to Collator C and to validators
  • Validators fetch X+1->X+1 even though they have not seen any collations for slot X
  • Collator C also has not seen X->X+2 but he has seen X+1->X+1 so he builds X+2->X+2
  • Collator C sends X+2->X+2 to validators
  • Validators fetch X+2->X+2
  • Collator A awakens from his crash and starts sending his collations X->X, X->X+1, X->X+2 to everyone
  • Other collators see it but it is too late already, they built their own for X+1 and X+2
  • Validators finally receive X->X but discard X->X+1 and X->X+2 since they already fetched collations X+1->X+1 and X+2->X+2

Is there a simpler scenario? 🤔

self.future_blocks.push_back(ClaimInfo {
    hash: None,
    claim: Some(*expected_claim),
    claim_queue_len: 1,
Contributor

Why do we set claim_queue_len: 1 for future blocks? How should this value be interpreted?

For instance ClaimQueueState:
block_state: A
future_blocks: B, C

I'd expect A to have CQ len 3, B 2 and C 1.

Contributor Author

Generally speaking you are right but I think it's unneeded complexity.

For simplicity I use the same type (ClaimInfo) for block_state and future_blocks. Ideally they should be different and claim_queue_len should not exist in future_blocks. But since I am chaining block_state and future_blocks together I went for the same type. So my choice was either to have an Option or to set it to 1 (because there is one claim we know at this spot). Option doesn't bring much benefit besides lots of boilerplate code so I went for setting it to 1.

This is okay because we can't build on an unknown relay parent so this value won't be examined in practice.

Contributor Author

Added a comment about why 1.

@paritytech-workflow-stopper

All GitHub workflows were cancelled due to failure one of the required jobs.
Failed workflow url: https://github.com/paritytech/polkadot-sdk/actions/runs/12116652607
Failed job name: fmt

@tdimitrov
Contributor Author

Is there a simpler scenario? 🤔

What about a collator that builds two collations which, due to weird network conditions, are delivered to the validator in the wrong order?

Or two different collators build two candidates at different relay parents and again due to weird network conditions the validator gets them out of order?

@tdimitrov
Contributor Author

bot fmt

@command-bot

command-bot bot commented Dec 2, 2024

@tdimitrov https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7849396 was started for your command "$PIPELINE_SCRIPTS_DIR/commands/fmt/fmt.sh". Check out https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/pipelines?page=1&scope=all&username=group_605_bot to know what else is being executed currently.

Comment bot cancel 1-21e90747-eefe-4641-8533-1f51c7667ce0 to cancel this command or bot cancel to cancel all commands in this pull request.

@command-bot

command-bot bot commented Dec 2, 2024

@tdimitrov Command "$PIPELINE_SCRIPTS_DIR/commands/fmt/fmt.sh" has finished. Result: https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7849396 has finished. If any artifacts were generated, you can download them from https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7849396/artifacts/download.

Contributor

@Overkillus Overkillus left a comment

Switched from "space in all paths" to "space in at least one path" and that was my last concern so LGTM!

@tdimitrov tdimitrov added this pull request to the merge queue Dec 13, 2024
github-merge-queue bot pushed a commit that referenced this pull request Dec 13, 2024
Co-authored-by: Maciej <[email protected]>
Co-authored-by: command-bot <>
Co-authored-by: Alin Dima <[email protected]>
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Dec 13, 2024
@tdimitrov tdimitrov enabled auto-merge December 13, 2024 08:27
@tdimitrov tdimitrov added this pull request to the merge queue Dec 13, 2024
Merged via the queue into master with commit 5153e2b Dec 13, 2024
195 of 200 checks passed
@tdimitrov tdimitrov deleted the tsv-collator-proto-fairness branch December 13, 2024 09:36
dudo50 pushed a commit to paraspell-research/polkadot-sdk that referenced this pull request Jan 4, 2025
github-merge-queue bot pushed a commit that referenced this pull request Jan 23, 2025
This is the right value after #4880, which corresponds to an allowedAncestryLen of 2 (which is the default)

Will fix #7105
Labels
T8-polkadot This PR/Issue is related to/affects the Polkadot network.
5 participants