
wen_restart: Fix the epoch_stakes used in calculation. #2376

Merged

Conversation

wen-coding

Previously we used the epoch_stakes of the local root; this is incorrect when we are crossing an Epoch boundary.

We now calculate epoch_stakes one Epoch ahead, and wen_restart can only handle up to 9 hours of outage, so the epoch_stakes_map of the local root should always contain all the epoch_stakes we will ever need. Any slot further away should be discarded.

  1. Use the epoch_stakes of the specific slot whenever we are doing a per-slot calculation.
  2. When calculating whether to exit AggregateLastVotedForkSlots, for every Epoch containing at least one slot with >42% stake (this includes the Epoch the local root is in, which has >66%), check whether 80% is active, and only exit if all such Epochs have >80% active (sketched below). This is probably a bit restrictive, but it should work in practice, because epoch_stakes shouldn't change much from one Epoch to the next, so the same set of active validators can quickly satisfy >80% active.
  3. When calculating the active set for HeaviestFork, use the Epoch our selected HeaviestForkSlot is in. If enough people picked a different HeaviestForkSlot, we won't be able to proceed anyway; we only need to make sure enough people picked the same HeaviestForkSlot and that they are all online to complete the rest of the restart with me.
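
A minimal sketch of the exit check in item 2, assuming illustrative names (EpochInfo, epoch_info_vec, can_exit, EPOCH_ACTIVE_THRESHOLD); the actual identifiers in this PR may differ:

const EPOCH_ACTIVE_THRESHOLD: f64 = 0.80;

struct EpochInfo {
    epoch: u64,
    active_stake: u64, // stake of restart participants, counted in this epoch's stakes
    total_stake: u64,  // total stake of this epoch
    considered_during_exit: bool, // epoch has at least one slot with >42% stake
}

// Exit AggregateLastVotedForkSlots only if every considered Epoch is >80% active.
fn can_exit(epoch_info_vec: &[EpochInfo]) -> bool {
    epoch_info_vec
        .iter()
        .filter(|info| info.considered_during_exit)
        .all(|info| info.active_stake as f64 / info.total_stake as f64 > EPOCH_ACTIVE_THRESHOLD)
}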

@wen-coding wen-coding self-assigned this Jul 31, 2024
@wen-coding wen-coding linked an issue Jul 31, 2024 that may be closed by this pull request
Comment on lines 8 to 11
pub struct EpochStakesCache {
    epoch_stakes_map: HashMap<Epoch, EpochStakes>,
    epoch_schedule: EpochSchedule,
}
@carllin carllin Aug 1, 2024

I think if we're going to build this module, this functionality should:

  1. live in solana runtime
  2. just replace the epoch_stakes: HashMap<Epoch, EpochStakes> in the bank
  3. be a separate PR

@wen-coding (Author)

This class used to do much more, but now that we only need epoch_schedule and epoch_stakes of root_bank, and we don't need to switch to another bank, I think it makes sense to just remove this class.

It might make sense to add node_id_to_stake to EpochStakes; let me figure out how to create EpochStakes in tests, since it's so tangled with other classes.

@wen-coding (Author)

Added node_id_to_stake to EpochStakes; does it look better now?


carllin commented Aug 3, 2024

This is probably a bit restrictive, but it should work in practice, because epoch_stakes shouldn't change much from one Epoch to the next, so the same set of active validators can quickly satisfy >80% active

This doesn't have to be true; I think stake can change by as much as 25% between epochs.

It seems to make more sense to repair slots which had 42% in any of the epochs, because it's possible the slot was optimistically confirmed in the older epoch and there are no slots optimistically confirmed in the next epoch.

wen-coding commented Aug 3, 2024 via email


carllin commented Aug 3, 2024

Seems to make more sense to repair slots which had 42% in any of the epochs, because it's possible the slot was optimistically confirmed in the older epoch and there are no slots optimistically confirmed in the next epoch.


carllin commented Aug 3, 2024

Hmm...

  1. this would be a lot simpler if the handoff fix were guaranteed, I think; then we know that any OC slot in the new epoch descends from an OC slot in the previous epoch
  2. Have to take a look at the handoff fix; I think we might be able to steal some of that logic here to guarantee a safe transition. Essentially, validators from the previous epoch that unstaked must participate in the wen restart.


wen-coding commented Aug 3, 2024 via email

@wen-coding (Author)

Hmm...

  1. this would be a lot simpler if the handoff fix were guaranteed, I think; then we know that any OC slot in the new epoch descends from an OC slot in the previous epoch
  2. Have to take a look at the handoff fix; I think we might be able to steal some of that logic here to guarantee a safe transition. Essentially, validators from the previous epoch that unstaked must participate in the wen restart.

Hmm, even with the handoff fix, I think you still want >80% stake in the new Epoch so that we won't be stuck in WaitForSuperMajority if the picked slot is in the new Epoch?

I think the only optimization we have over the current fix is deciding "on the fly" where the heaviest fork slot would be. If it's in the new Epoch, then we don't need >80% stake for the old Epoch (but with the handoff fix we might get stuck if we don't have >80% for the old Epoch).

If the heaviest fork slot is in the old Epoch, then we don't have to wait for >80% stake in the new Epoch. But:

  1. it's rare enough that we have an outage when crossing an Epoch boundary + there is no root in the new Epoch + there is some slot with >42% stake in the new Epoch
  2. if we don't have >80% stake in the new Epoch, then we will start to get stuck really soon

wen-restart/src/last_voted_fork_slots_aggregate.rs (outdated, resolved)
wen-restart/src/last_voted_fork_slots_aggregate.rs (outdated, resolved)
Comment on lines 209 to 211
// the new epoch and update the old one.
assert!(self.epoch_info_vec.first().unwrap().epoch == current_epoch);
self.epoch_info_vec.truncate(1);
@carllin

This seems impossible: slots_to_repair is a BTreeSet, so let highest_repair_slot = self.slots_to_repair.last(); is monotonically increasing, and the epoch must always be greater. I think we can panic here.

@wen-coding (Author)

It is possible, because I do self.slots_to_repair.remove(slot); above.

The logic is like this: we allow any user to replace the RestartLastVotedForkSlots they sent out (maybe they screwed up the ledger directory and tried again). So if some validator X sent out wrong Gossip messages last time claiming it voted up to slot 1000, but now replaces them with only slot 500, then we have to subtract its stake from slots 500~1000, which may cause the highest slot in slots_to_repair to go backwards.
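
A rough sketch of that subtraction path, with simplified names (retract_slots is hypothetical; the real code lives in last_voted_fork_slots_aggregate.rs):

use std::collections::{BTreeSet, HashMap};

// When a validator replaces its RestartLastVotedForkSlots, slots it no longer
// claims lose that validator's stake and may drop below the repair threshold,
// so the highest slot in slots_to_repair (a BTreeSet) can move backwards.
fn retract_slots(
    retracted_slots: &[u64],
    sender_stake: u64,
    threshold_stake: u64,
    slots_stake_map: &mut HashMap<u64, u64>,
    slots_to_repair: &mut BTreeSet<u64>,
) {
    for slot in retracted_slots {
        if let Some(entry) = slots_stake_map.get_mut(slot) {
            *entry = entry.saturating_sub(sender_stake);
            if *entry < threshold_stake {
                slots_to_repair.remove(slot);
            }
        }
    }
}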

@carllin carllin Aug 6, 2024

This would be a good comment to add for that case.

The comment "Somehow wandering..." suggests this is an error/unforeseen case.

@carllin

I'm thinking now we should enforce, for each validator, that the RestartLastVotedForkSlots cannot go down, only monotonically increase.

There should be no reason somebody posts a lower last voted fork other than corruption/malice.

In gossip we can enforce that we take the largest one instead of the latest one by timestamp.

This would also simplify this code.
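
A rough sketch of the proposed rule (should_accept_update and the slot arguments are assumed names, not actual code from this PR):

fn should_accept_update(prev_highest_slot: Option<u64>, new_highest_slot: u64) -> bool {
    // Accept a replacement RestartLastVotedForkSlots only if its highest
    // claimed slot does not go down.
    prev_highest_slot.map_or(true, |prev| new_highest_slot >= prev)
}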

@wen-coding (Author)

Yeah, I thought about that at the beginning. But I was afraid honest validator operators couldn't recover from their mistakes. Let's say someone has a hot standby set up: one node has a bad state voting into the far future, the other one is good. If the operator makes the mistake of restarting the bad one first, then whatever he does, he can't revert back to the good one any more.

@wen-coding (Author)

Per our conversation, I will send out another PR to forbid any update to RestartLastVotedForkSlots/HeaviestFork, since that has little to do with epoch_stakes.

wen-restart/src/wen_restart.rs (resolved)
uint64 epoch = 1;
uint64 total_active_stake = 2;
uint64 total_stake = 3;
bool considered_during_exit = 4;
@carllin

Hmm, do we need this extra flag? Can we just ignore LastVotedForkSlots that are not in the current or next epoch?

@wen-coding (Author)

We need this extra flag because root_epoch + 1 is always in epoch_info_vec now, but we don't always consider root_epoch + 1 for the exit criteria. For example, we won't consider it at all if no one has ever voted in the next epoch.

We also need this flag in the proto file because the proto file is a snapshot we can restart from, so it needs to be recorded correctly.

@carllin

For logging purposes and a cleaner implementation, instead of the considered_during_exit flag we could track two fields for epoch N and N+1: voted_percent and voted_for_this_epoch_percent.

voted_percent is the % of the epoch's stake that has voted for either epoch.
voted_for_this_epoch_percent is the % of the epoch's stake that has voted for a slot in this epoch.

This might be useful for debugging as well if we log this information.

We also need this flag in the proto file because the proto file is a snapshot we can restart from, so it needs to be recorded correctly.

This seems extractable from the bank on restart
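
Illustrative shape of the two-field proposal above (struct and field names assumed):

struct LastVotedForkSlotsEpochInfo {
    epoch: u64,
    total_stake: u64,
    // % of this epoch's stake that has voted for a slot in either epoch.
    voted_percent: f64,
    // % of this epoch's stake that has voted for a slot in this epoch.
    voted_for_this_epoch_percent: f64,
}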

@wen-coding (Author)

Changed to voted_percent and voted_for_this_epoch_percent.

@@ -260,7 +262,7 @@ pub(crate) fn aggregate_restart_last_voted_fork_slots(
         // Because all operations on the aggregate are called from this single thread, we can
         // fetch all results separately without worrying about them being out of sync. We can
         // also use returned iterator without the vector changing underneath us.
-        let active_percent = last_voted_fork_slots_aggregate.active_percent();
+        let active_percent = last_voted_fork_slots_aggregate.min_active_percent();
@carllin

With this, is it possible that some validators only see the LastVotedForkSlots updates from the old epoch before they see any LastVotedForkSlots from the next epoch?

So they may return thinking 80% of the validators have responded, but not wait for the 80% from the next epoch.

Seems like this should be rare, probably a non-issue.

@wen-coding (Author)

If there is no limit on the percentage of stake that can change at each Epoch, it's possible. But it's only an issue if:

  1. We have some slot OC'ed in the next Epoch (so 67% of the new Epoch stake confirmed it)
  2. When 80% of the old Epoch stake returned LastVotedForkSlots, we didn't have any slot in the new Epoch with 42% of the new Epoch stake

Even if this happens, the new Epoch's stakes were already generated at the beginning of the old Epoch, and the newly restarted cluster would very soon need to cross the Epoch boundary, at which point it would discover it doesn't have the majority of the new Epoch's stake needed to proceed.

Given the current stake distribution, I think it would be rare.

        });
        total_active_stake as f64 / total_stake as f64 * 100.0
...
    fn update_epoch_info_considered_during_exit(&mut self) {
        let highest_repair_slot_epoch = self
@carllin

Right now one malicious peer voting with a much higher slot could cause the entire cluster to consider the next epoch's stakes.

Is there any benefit to only considering the next epoch's stakes under a stricter condition? Let's say we have 100 stake in the first epoch:

  1. In the next epoch, the previous epoch's validators should have at least 91 stake (assuming worst case 9% new stake and 9% old stake cooldown).
  2. If the next epoch has OC'd, 67 of the next epoch's stake should have voted, of which at least 67 + 91 - 100 = 58 stake is from the previous epoch.

Thus we could say we ignore the next epoch until at least 58% of the previous epoch's stake has voted for the next epoch. Not sure if this provides more security.
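
Writing out the arithmetic above (assuming the worst-case 9% stake change per epoch, with stake normalized to 100):

\begin{aligned}
\text{old-epoch validators' stake in the new epoch} &\ge 100 - 9 = 91 \\
\text{new-epoch stake that voted for an OC'd slot} &\ge 67 \\
\text{overlap coming from old-epoch validators} &\ge 67 + 91 - 100 = 58
\end{aligned}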

@wen-coding (Author)

One malicious peer voting shouldn't be a problem; we only consider an epoch if some slot has >42% of the stake, and that is a lot of validators already.

This 42% limit is just very convenient because we use it for repair as well. We could change it to 58% or 38% with different tradeoffs in mind. I think your proposal provides a similar safety guarantee to the current requirement (>=1 slot with >42% epoch stake), but when you say 58% you didn't count the stake outside the restart. If we have 80% of the old epoch stake joining the restart, then we have at least (80-9)/(100+9) = 65% of the new epoch stake, so 65% - 20% = 45% of the old Epoch stake could lead to us waiting for the new Epoch. We could change the >=1-slot-with->42%-stake limit to >45%, but I don't think it makes much difference in reality.

@wen-coding (Author)

Hmm, now I think the limit should be: at least 1 slot with >=33% stake in the new Epoch.

Let's say some slot X is OC'ed in the new epoch: it has >=67% stake in the new epoch, and we can have at most 9% churn, so it should have >=58% stake in the old epoch. Say we have >=80% of the old Epoch in the restart, and 5% can lie; then any slot OC'ed in the new epoch should have at least 58 - (100-80) - 5 = 33% stake when the old Epoch has >=80% stake in the restart.

That said, if we are this close to the new epoch we might very soon need 80% of the new Epoch stake to be online for the cluster to continue functioning.

I'm now thinking, in the spirit of guaranteeing that if the clock has ticked into the new Epoch for enough people then we require 80% of the new Epoch's stake to be online, how about waiting for the new Epoch as soon as you have 1/3 of the stake voting into the new Epoch? We would never have >=1/3 malicious validators, right?
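
Writing out that bound (assumptions as stated above: 9% churn, >=80% of the old Epoch in the restart, up to 5% lying):

\begin{aligned}
\text{old-epoch stake behind an OC'd new-epoch slot} &\ge 67 - 9 = 58 \\
\text{of that, visible inside the restart} &\ge 58 - (100 - 80) - 5 = 33
\end{aligned}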

@wen-coding (Author)

Changed to check that >1/3 voted for at least 1 slot in the new Epoch.

If enough people have their clocks close to the new Epoch boundary, we probably should wait for >80% stake so we can reach consensus in the new Epoch.

@wen-coding wen-coding requested a review from carllin August 19, 2024 19:52
// If at least 1/3 of the stake has voted for a slot in next Epoch, we think
// the cluster's clock is in sync and everyone will enter the new Epoch soon.
// So we require that we have >80% stake in the new Epoch to exit.
const EPOCH_CONSIDERED_FOR_EXIT_THRESHOLD: f64 = 33.33;
@carllin

Should we keep this consistent as a threshold const EPOCH_CONSIDERED_FOR_EXIT_THRESHOLD: f64 = 1f64 / 3f64 instead of a percent? Seems more in line with all the other thresholds we have.

@wen-coding (Author)

Done.
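
For reference, the fractional form suggested above would look roughly like this (a sketch; the merged code may differ in comments or naming):

// If at least 1/3 of the stake has voted for a slot in the next Epoch, we think
// the cluster's clock is in sync and everyone will enter the new Epoch soon.
// So we require >80% stake in the new Epoch to exit.
const EPOCH_CONSIDERED_FOR_EXIT_THRESHOLD: f64 = 1f64 / 3f64;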

.insert(*from, last_vote_epoch)
!= Some(last_vote_epoch)
{
self.update_epoch_info();
@carllin carllin Aug 22, 2024

Do we need to iterate updates for the entire map here? Can we just update the entry for the from validator?

@wen-coding (Author)

We don't. But right now, while we allow updates, just updating the entry for the from validator would be a bit messy, because you may need to subtract stake from some epoch_info.

Can I clean up the update logic in the follow-up PR that forbids entry updates?

@carllin

Hmm, is it more complicated than removing the validator's stake from the epoch_info they last voted for and adding it to the new one?

@carllin carllin Aug 23, 2024

oh i see what you mean, when we prevent updates none of this matters, this is fine for now then

@wen-coding (Author)

Nah it's fine, I implemented it already. But it's going away.

}
}
let last_vote_epoch = root_bank
.get_epoch_and_slot_index(last_voted_fork_slots[0])
@carllin

Is there an assumption here that last_voted_fork_slots is sorted from largest to smallest? If so, it would be nice to have an assert somewhere checking this.

@carllin

also small nit: last_vote_epoch -> my_last_vote_epoch

@wen-coding (Author)

Good point. It should be sorted, but better to use max() here to be safe.

Renamed.
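
A minimal sketch of the safer lookup (the division here is a simplified stand-in for Bank::get_epoch_and_slot_index, ignoring epoch warmup; names assumed):

fn my_last_vote_epoch(my_last_voted_fork_slots: &[u64], slots_per_epoch: u64) -> Option<u64> {
    // Take the maximum voted slot explicitly instead of relying on sort order.
    my_last_voted_fork_slots
        .iter()
        .max()
        .map(|slot| slot / slots_per_epoch)
}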

         for slot in last_voted_fork_slots {
             if slot >= &root_slot {
-                slots_stake_map.insert(*slot, sender_stake);
+                if let Some(sender_stake) = root_bank.epoch_node_id_to_stake(root_epoch, my_pubkey)
@carllin

Should this stake be based on the epoch the slot is in, rather than the stake in the root_epoch?

The latest vote slot could be beyond the root epoch.

Also nit: last_voted_fork_slots -> my_last_voted_fork_slots

@wen-coding (Author)

Ah, good catch, changed. Also added a testcase.

Renamed.

Comment on lines 92 to 93
voted_percent: 0.0,
voted_for_this_epoch_percent: 0.0,
@carllin

Should voted_percent/voted_for_this_epoch_percent be updated for our own validator's vote?

@wen-coding (Author)

Done.

pub total_stake: u64,
// Percentage (0 to 100) of total stake of active peers in this epoch,
// whether or not they voted for a slot in this epoch.
pub voted_percent: f64,
@carllin

nit: voted_percent -> actively_voting_stake

@wen-coding (Author)

Done.

Comment on lines 30 to 33
pub voted_percent: f64,
// Percentage (0 to 100) of total stake of active peers which have voted for
// a slot in this epoch.
pub voted_for_this_epoch_percent: f64,
@carllin

I think it's better to track the actively voting stake, and convert to a percent when we need to by dividing by the total stake.

@wen-coding (Author)

Done.

Comment on lines 158 to 160
if self.update_and_check_if_message_already_saved(new_slots, new_slots_vec) {
return None;
}
@carllin

This can be moved before we construct the record object above

@wen-coding (Author)

It was written like this so we can avoid a new_slots.clone(). But maybe a clone isn't that bad; changed.


Comment on lines +183 to +205
+            let epoch = self.root_bank.epoch_schedule().get_epoch(*slot);
             let entry = self.slots_stake_map.get_mut(slot).unwrap();
-            *entry = entry.saturating_sub(sender_stake);
-            if *entry < threshold_stake {
-                self.slots_to_repair.remove(slot);
+            if let Some(sender_stake) = self.root_bank.epoch_node_id_to_stake(epoch, from) {
+                *entry = entry.saturating_sub(sender_stake);
+                let repair_threshold_stake = (self.root_bank.epoch_total_stake(epoch).unwrap()
+                    as f64
+                    * self.repair_threshold) as u64;
+                if *entry < repair_threshold_stake {
+                    self.slots_to_repair.remove(slot);
+                }
             }
         }
         for slot in new_slots_set.difference(&old_slots_set) {
+            let epoch = self.root_bank.epoch_schedule().get_epoch(*slot);
             let entry = self.slots_stake_map.entry(*slot).or_insert(0);
-            *entry = entry.saturating_add(sender_stake);
-            if *entry >= threshold_stake {
-                self.slots_to_repair.insert(*slot);
+            if let Some(sender_stake) = self.root_bank.epoch_node_id_to_stake(epoch, from) {
+                *entry = entry.saturating_add(sender_stake);
+                let repair_threshold_stake = (self.root_bank.epoch_total_stake(epoch).unwrap()
+                    as f64
+                    * self.repair_threshold) as u64;
+                if *entry >= repair_threshold_stake {
+                    self.slots_to_repair.insert(*slot);
+                }
@carllin

Can't wait for all of this to go away when we prevent updates :)

@wen-coding wen-coding merged commit 82aeef8 into anza-xyz:master Aug 23, 2024
51 checks passed
@wen-coding wen-coding deleted the wen_restart_use_appropriate_epoch_stakes branch August 23, 2024 21:20
ray-kast pushed a commit to abklabs/agave that referenced this pull request Nov 27, 2024
* wen_restart: Fix the epoch_stakes used in calculation.

* Fix a bad merge.

* Remove EpochStakesCache, it only caches epoch stakes from root_bank, better to just keep root_bank around.

* Split aggregate into smaller functions.

* Switch to node_id_to_stake which is simpler.

* Rename update_slots_stake_map and switch to epoch_total_stake().

* Remove unnecessary utility functions.

* Do not modify epoch_info_vec, just init it with two epochs we will consider.

* Switch to epoch_node_id_to_stake()

* Add test for divergence at Epoch boundary.

* Make linter happy.

* - wait for the new Epoch if > 1/3 of the validators voted for some slot in the new Epoch
  - switch to voted_percent and voted_for_this_epoch_percent

* Fix a bad merge.

* Fix a bad merge.

* Change constant format.

* Do not loop through the whole table.

* Address reviewer feedback.

* Address reviewer comments.
Successfully merging this pull request may close these issues:

Update the epoch stake in use in wen_restart