Parachain Blocktimes Increasing #6910
Comments
First of all, there hasn't been a release in that timeframe, so I don't think it is something from a new release. Will try to have a look tomorrow. Just clicking through the relay chain blocks, it seems that there are blocks where no parachain block is included, so I think that might be the reason you are seeing this. So basically all parachain blocks are sometimes skipping a relay chain block between being backed and being included. The availability bitmaps from those blocks seem to indicate that the 2/3 quorum is not reached, so I think they are correctly not included in the first relay chain block after being backed. So we need to understand why the availability bitmaps of some validators are sometimes 0. One possible reason is that those validators are simply slow, either on network or on CPU. |
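As a rough illustration of the quorum check discussed here, the following Python sketch counts availability votes per core. It assumes each validator's bitfield is a list of booleans with one bit per core, and that a candidate becomes included once strictly more than 2/3 of the validators have set its bit. The names `availability_threshold` and `included_cores` and the data layout are hypothetical and are not the polkadot-sdk API.

```python
def availability_threshold(n_validators: int) -> int:
    """Smallest vote count that is strictly more than 2/3 of the validators (assumed rule)."""
    return (2 * n_validators) // 3 + 1


def included_cores(bitfields: list[list[bool]], n_cores: int) -> list[bool]:
    """For each core, report whether enough validators signalled availability."""
    threshold = availability_threshold(len(bitfields))
    # Count, per core, how many validators set the corresponding bit.
    votes = [sum(bf[core] for bf in bitfields) for core in range(n_cores)]
    return [v >= threshold for v in votes]


# Hypothetical example: 6 validators, 2 occupied cores.
healthy = [[True, True]] * 6
# Half of the validators submit an all-zero bitfield.
degraded = [[True, True]] * 3 + [[False, False]] * 3

print(included_cores(healthy, 2))
print(included_cores(degraded, 2))
```

With half of the validators reporting all zeros, neither core reaches the threshold, so inclusion would slip to a later relay chain block, which matches the pattern described in this comment.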
Yes, there is a clear trend here as seen here:
One idea is that this can happen if some networking subsystem channel is full. This would lead to a stall of the bridge, so bitfields are not processed by the block author in the required time window. I have been seeing bitfield distribution being pressured at times and its channel getting full on Kusama on our own validators. However, I don't see why this started suddenly without any change. |
Depending on the root cause, speculative availability as described here might help and would actually be rather quick to implement. |
Yeah, I don't think the load has increased. Could it just be that some under-spec validators got into the set? Currently I see 3 F-grade validators on Polkadot. |
Weird though that it seems to increase continuously over days. |
Are there more or bigger parablocks? Someone else bought some cores recently, right? |
Additional metrics that could help would be the average/sum of the PoV size as well as execution time (maybe harder to get) |
If that's the case then this will help and it should be coming in the next release: https://github.com/paritytech/polkadot-sdk/pull/5787/files. However, just looking at the metrics we have, it doesn't seem there is any increase in load, finalization seems to be stable, the number of parachain blocks included around the same numbers. So I tend to think there is something going wrong on some validators, so I'm going to try to build some statistics regarding the ones missing bitfields to see if we can pin-point which ones became slow. |
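One way such statistics could be gathered is sketched below in Python. It assumes the bitfields of each inspected relay chain block are already available as a mapping from validator index to a list of booleans; that data shape and the function name `zero_bitfield_counts` are hypothetical, and fetching the data from a node is out of scope here.

```python
from collections import Counter


def zero_bitfield_counts(blocks: list[dict[int, list[bool]]]) -> Counter:
    """Count, per validator index, how many blocks carried an all-zero bitfield."""
    counts: Counter = Counter()
    for bitfields in blocks:
        for validator, bits in bitfields.items():
            # An all-zero bitfield means the validator signalled no available chunk.
            if not any(bits):
                counts[validator] += 1
    return counts


# Hypothetical sample: two blocks, validator indices as keys.
sample_blocks = [
    {0: [True, False], 1: [False, False]},
    {0: [False, False], 1: [False, False], 2: [True, True]},
]

for validator, zeros in zero_bitfield_counts(sample_blocks).most_common():
    print(validator, zeros)
```

A validator that is consistently near the top of this ranking would be a candidate for closer inspection. An all-zero bitfield is not conclusive on its own, since it is also what every validator would send if no core was occupied.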
From these blocks that have no included parachain blocks, it seems there are a lot of validators where the bitfield looks like this, a big 0.
That tells us that the bitfields arrive at the block author, so I don't think there is a problem with the |
But more than a third of the validators would need to be slow on availability for this to cause issues. Just a few should be no problem at all. All 0 can also hardly be explained by slowness, there must be some greater malfunction on that node. |
Yes, there seem to be more than 1/3 with 0 on those blocks. I asked for logs from the validators on this channel: https://matrix.to/#/!NZrbtteFeqYKCUGQtr:matrix.parity.io/$17344363742DoDav:parity.io?via=parity.io&via=corepaper.org&via=matrix.org, hopefully that tells us more. |
Got some logs from one of the validators. One thing I notice is that we seem to have a lot of |
Interesting. Would need to check, but one good explanation for those all zeros would be that backing already failed on the fork that got revealed late. If nothing was occupying the core, then availability obviously will not run either. Another explanation: the block triggering the reorg was revealed so late that availability had no chance to run (seems less likely, given the perfect zero). But if that first theory were to hold, then all the validators would have sent all zeros. |
I think the reorg is a red herring. I looked at the logs going as far back as the 17th of November and it seems reorgs always happened at the same rate, so they don't correlate with the start of this problem. |
I finally pulled the data for the last ~11 days for both Frequency and Mythos. It is a dump of key information for each block from So here it is in case it helps anyone else, as this is all public data. Columns:
See CSVs in the Gist: https://gist.github.com/wilwade/0fbb9dccc4f8d2f20fe4035aab422f25 |
A short summary: the good news is that the network seems to have healed overnight.
Average block times: Mythos
What we know so far
Next
|
Monitoring on an ongoing basis, even once the issue is resolved would be great. |
I think there is a group that is still having trouble. In the Polkadot epoch from ~0900-1200 UTC today (2024-12-18), both Mythos and Frequency had the exact same total seconds of block delays. Here's the last 48 hours, and you can see the recovery noted by @alexggh, but then a dip from ~0900-1200 UTC: That timing lines up really closely with the epoch timing on Polkadot, so I'm thinking this might still be part of the issue or perhaps closer to the root cause? |
Root cause
Alright, I think I've got the full picture now. We've got this long-standing issue #5258 (all known versions except v1.17.0 are affected), where availability-distribution's CPU usage increases slowly but steadily, by around 1% of CPU usage per day. This is the last 6 months from a Polkadot validator affected by this issue (kudos to tugy | Amforc):
So, the longer a validator runs before a restart, the slower it gets. Of the validators I've been in contact with, one ran for 30 days and another for ~3 months. Therefore, once the active set has enough such validators (~1/3), parachain inclusion slows down as well and starts skipping blocks, because fewer than 2/3 informed the block author that they have their PoV chunk.
Solution
The issue has been accidentally fixed with #5875, which is included in the
Conclusion
I don't expect a repeat of this. There are already 184 validators on Polkadot that upgraded to the unaffected polkadot v1.17.0, with more probably to follow shortly.
Action items
|
What is happening?
The average parachain blocktime appears to be increasing slowly over the last 7-14 days.
I'm working on collecting better data, but here's an image of an internal chart for Frequency (6s blocktimes) that shows the number of blocks per 30 minutes over the last 7 days, and the trend line is clear.
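A chart like this can be reproduced from raw block timestamps. The Python sketch below groups blocks into 30-minute buckets; it assumes timestamps are Unix seconds, and the function name `blocks_per_bucket` is hypothetical. At a 6s target blocktime a full bucket would hold 1800 / 6 = 300 blocks, so anything consistently lower indicates lag.

```python
from collections import Counter


def blocks_per_bucket(timestamps_s: list[int], bucket_s: int = 1800) -> Counter:
    """Count blocks per time bucket, keyed by the bucket's start time in seconds."""
    return Counter((ts // bucket_s) * bucket_s for ts in timestamps_s)


# Hypothetical data: one hour of perfect 6s blocks, then one hour of 12s blocks.
on_time = list(range(0, 3600, 6))
delayed = list(range(3600, 7200, 12))

for start, count in sorted(blocks_per_bucket(on_time + delayed).items()):
    print(start, count)
```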
Is it just Frequency? Nope
While it was having trouble, I checked with a script the lag of Frequency vs the lag of Mythos (another 6s blocktime parachain), and over 1000 blocks, the lag matched. (Trying to download a dump of blocktimes from Frequency and Mythos and will share once collected)
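A lag comparison of this kind might look like the Python sketch below. This is not the script used for the check above; it assumes block timestamps in milliseconds are already collected for each chain, and the function names are hypothetical.

```python
def total_lag_seconds(timestamps_ms: list[int], target_s: float = 6.0) -> float:
    """Sum of how far each block interval exceeded the target blocktime."""
    lag = 0.0
    for prev, curr in zip(timestamps_ms, timestamps_ms[1:]):
        interval_s = (curr - prev) / 1000
        # Only intervals longer than the target count as lag.
        lag += max(0.0, interval_s - target_s)
    return lag


def average_block_time(timestamps_ms: list[int]) -> float:
    """Mean interval between consecutive blocks, in seconds."""
    return (timestamps_ms[-1] - timestamps_ms[0]) / 1000 / (len(timestamps_ms) - 1)


# Hypothetical timestamps: every second block takes 12s instead of 6s.
chain_a = [0, 6_000, 18_000, 24_000, 36_000]
chain_b = [500, 6_500, 18_500, 24_500, 36_500]

print(total_lag_seconds(chain_a), total_lag_seconds(chain_b))
print(average_block_time(chain_a))
```

If two unrelated parachains show the same total lag over the same window, the cause is more likely on the relay chain side than in either parachain.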
Additionally, other chains have seen the issue, and I hope they will post here as well.
There is also no clear correlation between volume and the lag or other issues (there could be one, but as other chains are seeing the same lag, volume is unlikely to explain the trend).
Ideas on what it might be?
The Turboflakes Validator dashboard doesn't show any issues, but that would be my first thought.
Additional Notes