-
Notifications
You must be signed in to change notification settings - Fork 16
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Finalization stalling issues 2024-07 #1104
Comments
Explore-logs-2024-07-26 10_41_41.txt |
Current plan for a workaround is to delay the finality for more than 2 blocks. Currently we are planning to delay the finality after by |
Added the test to reproduce the issue logic - humanode-network/polkadot-sdk@master...humanode-network:polkadot-sdk:grandpa-voting-and-babe-reorg-race-condition. It shows the root cause that we face. |
We have analyzed the code, and it looks like the issue is with those two PRs:
We think that the fix that is added at paritytech/substrate#13364 does not apply after the code added at paritytech/substrate#13289 returns an error. We'll investigate further. We are still on course to apply the workaround mentioned above as a hotfix, but the long term solution is also in sight. |
To elaborate more on paritytech/substrate#13364:
|
The test is succeeding? |
Yep, we've implemented it in the way to verify our thoughts about the happened case. |
We met the same error
|
I have only looked into this superficially, but I think @MOZGIII comment on the |
The same thoughts we discussed with @MOZGIII |
@bkchr @andresilva should we expect any hotfix for this implemented at the parity side or rather work on this ourselves? |
@bkchr @andresilva let me create PR of this test into our fork and describe it in more details |
I think this should fix it. |
paritytech/polkadot-sdk#5153 looks good, thx! This would still be risky to ship for us as it means the whole layer of logic dedicated to the grandpa-issued reorg was not used for a while now and may contain bugs or just be broken cause of being stale... We'll think about how to test this. So far we have thought about replicating the scenario of having 1k+ validators on our load-test facilities - but in our prior load tests we haven't seen the finality stalls like we see in the live chain; that might be due to not running for enough time, or due to other conditions - so it unlikely we'll have an accurate test for this in the load testing environment. Any thoughts on how to trigger this bug in the live-like environment? |
I think besides any significant changes to make sure the issue happens (e.g. changing vote logic to make sure validators vote on abandoned forks explicitly), the easiest way would be to make forks more likely to happen and hence make it more likely for the issue to surface. I think changing https://github.com/humanode-network/humanode/blob/master/crates/humanode-runtime/src/constants.rs#L49 to This shouldn't be changed on any live chain though, since these parameters can't be changed dynamically. You'd need to bootstrap a new test chain. |
Great idea! We will also change the block time from 6s to 1s for more stress... Pretty sure it is fine, but just to double check - is it safe to ship f5bccf0 ? |
Yes, don't see any problem with that (and should make the issue less likely to happen). For the test above you can also remove the |
Great! To summarize, we'll test your fix with the following:
UPD: implemented at #1106 |
We did not push the fix just yet, only the bigger finalization delay But we see this: The issue seems to be that once the prevote for a round goes out it then can't progress and the grandpa stalls. @andresilva @bkchr let me know if you need access to our infra |
Maybe some peers are not voting in the round because of the bug (and some do - like the one we have above), and thus the whole consensus stalls, as if the amount of online nodes is less that quorum... Only 39% of the nodes have applied the finality delay update according to https://telemetry.humanode.io |
Yes, you'd need >66% of the nodes on the new version for it to effectively take effect. |
We are backporting the fix to our substrate fork; we are still working on reproducing the issue. @dmitrylavrenov will send an update when we have news on that front. |
Status update Still working on implementing our own Current plan is to run different load tests this week trying to reproduce faced issue. |
Why not zombienet? |
Will prepare the detailed answer a little bit later. |
Status update We were able to prepare a such config of our Usually it requires less than 1 hour. reproduced-grandpa-issue-logs-1.txt Next step is preparing an image containing the fix and run tests again. |
On zombienet: in short, we tired it, and had issues with it. The practical issues that we had was it rolled out too slow in our k8s testing environment, and took forever to deploy 1000 nodes (which we ran for testing some time ago). We also did look under the hood of zombienet, and found it substantially overengineering to our taste: supporting multiple execution environments (besides k8s) is just not necessary for a tool like this, in our collective opinion. Ultimately, we decided to go agains fixing zombienet issues and make our own thing. It is very basic so far, but much more convenient for us to use, and with it we were actually able to reproduce the finality stalls. Not from the get go, and we'd probably be able alter a zombienet setup in a similar way - but with zombienet it would be unlikely that we'd be able to integrate our k8s-specific setup into the upstream. Compared to zombienet, our solution does not have the evens model just yet, and doesn't parse the logs - but we don't need it, although it will be not too difficult to add if we decide to. We'll can do a showcase if you're interested, as it is in a private repo for now. We'll make it public soon (the soon tm kind of soon - no eta). UPD: our initial motivation for trying zombienet was to run a test cluster with a lot of nodes ( |
I can add some important items from practical usage:
|
Status update @bkchr @andresilva we have great news! |
@MOZGIII I think we can close the issue and open the new one to backport the fix and ship to our mainnet? |
Thanks for the thorough testing 👌 |
Yes. The investigation is now finished, and we were able to identify the bug, theorize and implement a fix, reproduce the bug reliably in a controlled environment, apply the fix and ensure the bug does not reproduce under the same conditions anymore, and that there are no other surprising effects from applying the fix. We are now working on releasing the fix on the Humanode network. Thanks everyone for taking part in this work! |
This is a tracking issue for the recurring finalization stalling incidents.
So far, here are the pointers we have.
Commits that are affecting the finalization directly:
Should not the the cause though, as the issue is with the inability of grandpa to follow a reorg, which is cause by pruned blocks. Or, rather, it could be the cause, but this one is not really where the bug we'd like to fix is.
could this be the reason?no, this one works as intendedNotable historical commits that might've affected this incident:
uncles
related code paritytech/substrate#13216 (actually looks somewhat interesting)Probably irrelevant:
BlockId
removal:runtime-api
refactor paritytech/substrate#13255Related things from the future:
The thoughts we have so far is that the issue is caused by the broken block pruning (which is meaningful when an unfortunate best block selection happens).(UPD: the blocks are not really pruned, we can still request them from the API; this means this is a selection issue) Grandpa tries to finalize the block that is pruned and not the part of the best chain anymore; upon receiving the precommits it fails to recognize them because the block can't be resolved (it's pruned). This is quite odd, since the nodes operate with--blocks-pruning archive
, so it must mean they haven't seen the block in question as is it no never to be pruned if seen - but in reality we see a the reorg happening from the block in question to a new one, so the old block is definitely a known one.We are currently on substrate 0.9.40.
Mainnet encountered this at least three times lately:
The text was updated successfully, but these errors were encountered: