Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat!: account for time already elapsed when waiting after the commit #965

Merged
merged 23 commits into from
Apr 20, 2023

Conversation

evan-forbes
Copy link
Member

@evan-forbes evan-forbes commented Feb 7, 2023

Description

This PR uses dynamic timeout commits, meaning that we account for time that has already elapsed before we decide how long to wait after the commit is finalized. This effectively adds a "target round duration", where we don't wait the same amount of time for the timeout commit arbitrarily each round. Instead, we subtract the time elapsed between the round's start time, and the commit time.

Note that this will likely have yet unmeasured consequences in the application. For instance, if we somehow end up not waiting at all for remaining votes, then those votes won't get included in the commit, and validators that otherwise would have gotten counted as signing (and thus gotten rewards), will not be counted.

part of #939 and celestiaorg/celestia-app#1340

@evan-forbes evan-forbes added this to the Mainnet milestone Feb 7, 2023
@evan-forbes evan-forbes self-assigned this Feb 7, 2023
@evan-forbes
Copy link
Member Author

marking ready for review to begin accepting comments. while this change is small and seems to be working here and in the app, we should expect other consequences.

this has also only been tested in a contrived testnets, we should ideally test this in testground before including in arabica and mocha

@evan-forbes evan-forbes marked this pull request as ready for review February 7, 2023 15:00
config/config.go Outdated
TimeoutCommit: 1000 * time.Millisecond,
TargetRoundDuration: 3500 * time.Millisecond,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm open to better names for this

Copy link
Collaborator

@rootulp rootulp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No blocking feedback. IIUC it isn't possible for tendermint to guarantee a consistent block interval but this change tries to target a more consistent block time.

Relevant context: tendermint/tendermint#5911 and https://docs.tendermint.com/v0.34/tendermint-core/configuration.html#consensus-timeouts-explained

config/config.go Outdated Show resolved Hide resolved
config/config.go Show resolved Hide resolved
config/config.go Show resolved Hide resolved
config/config.go Outdated Show resolved Hide resolved
config/config.go Show resolved Hide resolved
config/config.go Show resolved Hide resolved
config/config.go Outdated
// from upstream tendermint to be dynamic, where we account for the amount of
// time already taken in a round.
func (cfg *ConsensusConfig) Commit(t time.Time, elapsed time.Duration) time.Time {
remaining := cfg.TargetRoundDuration - elapsed
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a reasonable, simple change to me. Would it make sense to:

  • test this in a live network or at least in testground
  • have informal review this as well
    before we merge this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes that's a good idea. We were trying to test this with testground this morning, but didn't have time to finish.

We might not want to include this change in the incentivized testnet, but we should push and test in mocha as soon as we can get data back from testground.

One side thought: after playing around with this, I noticed that if we increase the timeouts too long, then validators begin to miss producing blocks, even on the e2e test. After discussing with @cmwaters, when the timeouts are long, we think there is a significantly higher probability validators are progressing through consensus at different rates, and thus potentially missing rounds entirely. This would actually explain what we are seeing in mocha quite well. Having more consistent round times, potentially via this mechansim, but likely via similar arbitary waiting mechanisms to more closely follow the configured timeouts instead of moving to the next step in consensus after reaching 2/3s, could be effective in helping us come to consensus on the first block when we expect.

rootulp
rootulp previously approved these changes Feb 7, 2023
Copy link
Contributor

@cmwaters cmwaters left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've tried to think through this a bit and I think it all makes sense. If you really wanted regular block times, we should implement proposer based timestamps, but this should get us something in the range. We already have metrics for block times right? So would be good to track them.

config/config.go Outdated
@@ -932,7 +932,7 @@ type ConsensusConfig struct {
// height (this gives us a chance to receive some more precommits, even
// though we already have +2/3).
// NOTE: when modifying, make sure to update time_iota_ms genesis parameter
TimeoutCommit time.Duration `mapstructure:"timeout_commit"`
TargetRoundDuration time.Duration `mapstructure:"target_round_duration"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not really per round but per height because we only commit once per height. What this means of course is that any multiround height will likely be above 15 seconds and instantly jump to the next height meaning only 2/3 voting power will be included in the commit (i.e. there will be no time for straggler voters). We should be wary of this if we are rewarding validators on a per-height basis for voting involvement. Alternatively we could reward validators for involvement across 100 heights (instead of just one). Or we could introduce another parameter which states a minimum "wait for straggler votes" period.

Actually, It might be fine if there's also a non-negligible time taken in FinalizeBlock and PrepareProposal. If it takes 1-2 seconds, then that's another 1 to 2 seconds that more votes (above the 2/3 line) can be received in.

We also might not care to have commits with all signatures.

Conclusion: I think we should rename this variable to TargetHeightDuration and we should make sure we are measuring the percentage of voting power in commits.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not really per round but per height because we only commit once per height.

ahh I see! good catch. I wrongly assumed that since the start time is in RoundState it would get updated each round. Renamed as well

We should be wary of this if we are rewarding validators on a per-height basis for voting involvement.

I was worried about this as well (briefly mentioned in top comment) as iirc, there was somewhere in the sdk where we're rewarding validators per signature. I looked again though and couldn't find anything, so it might be obfuscated. Regardless, the number of signatures included should be something that we collect data on in testground, and might even want to force this to occur if it isn't happening naturally.

config/config.go Show resolved Hide resolved
config/toml.go Outdated
Comment on lines 461 to 463
# How long we wait after committing a block, before starting on the new
# height (this gives us a chance to receive some more precommits, even
# though we already have +2/3).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe extend the comments here

consensus/state.go Outdated Show resolved Hide resolved
@evan-forbes
Copy link
Member Author

I still need to debug why the e2e test is failing after using a simpler timeout, but we also shouldn't merge this in until we collect more data on this change's impact via testground, so I'm converting to a draft

@evan-forbes evan-forbes marked this pull request as draft February 15, 2023 02:22
DOCKER/docker-entrypoint.sh Outdated Show resolved Hide resolved
config/toml.go Outdated Show resolved Hide resolved
@@ -457,12 +457,9 @@ timeout_prevote = "1s"
timeout_prevote_delta = "500ms"
timeout_precommit = "1s"
timeout_precommit_delta = "500ms"
timeout_commit = "1s"
target_height_duration = "1s"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[optional] these are just docs so the value doesn't matter but proposal for it to match the 15s on line 354

Suggested change
target_height_duration = "1s"
target_height_duration = "15s"

config/config.go Outdated Show resolved Hide resolved
Comment on lines 703 to 705
cs.eventCollector.WritePoint("consensus", map[string]interface{}{
"round_data": []interface{}{rs.Height, rs.Round, rs.Step},
})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we still want to add this?

cmwaters
cmwaters previously approved these changes Mar 24, 2023
Copy link
Contributor

@cmwaters cmwaters left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

test/maverick/consensus/state.go Outdated Show resolved Hide resolved
test/maverick/consensus/state.go Outdated Show resolved Hide resolved
@evan-forbes
Copy link
Member Author

evan-forbes commented Apr 17, 2023

forgot to update this with tests that were run on testground also had 14-16s block times, without having any heights that required more than a single round of consensus, so I think we're clear to try this on a testnet.

I'll try to get a pretty graph and compare it to a control to see if the standard deviation of blocktimes is the same. afaict, testground tests are not surfacing the some of the same issues that we see on a larger scale testnet such as mocha, so I wouldn't be surprised if this still has issues there.

Copy link
Contributor

@cmwaters cmwaters left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@evan-forbes evan-forbes merged commit cc1bc3f into v0.34.x-celestia Apr 20, 2023
@evan-forbes evan-forbes deleted the evan/dynamic-timeout-commits branch April 20, 2023 11:48
evan-forbes added a commit that referenced this pull request Jul 11, 2023
evan-forbes added a commit that referenced this pull request Jul 11, 2023
cmwaters pushed a commit that referenced this pull request Jul 20, 2023
cmwaters pushed a commit that referenced this pull request Jul 24, 2023
cmwaters pushed a commit that referenced this pull request Jul 27, 2023
cmwaters pushed a commit that referenced this pull request Jul 27, 2023
@faddat faddat mentioned this pull request Feb 22, 2024
3 tasks
cmwaters added a commit that referenced this pull request Jun 7, 2024
…#965)

* feat!: consider time already elapsed when waiting after the commit

* chore: minor doc changes

* chore: try a different default config time

* chore: try increasing the config again

* fix: use appropriate default time

* fix: docs

* chore: simplify addition and rename

* chore: revert pointless go mod tidy change

* chore: consistent config

* chore: replace config comments

* chore: fix a few remaining round -> height name changes

* fix: lingering compile errors from merge

* fix: silly bug

* fix: use correct next start time

* dynamic block time modifications (#983)

* chore: remove event collector

* Update test/maverick/consensus/state.go

Co-authored-by: Callum Waters <[email protected]>

* Update test/maverick/consensus/state.go

Co-authored-by: Callum Waters <[email protected]>

* chore: formatting

---------

Co-authored-by: Callum Waters <[email protected]>
cmwaters added a commit that referenced this pull request Jun 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants