Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parachain Blocktimes Increasing #6910

Open
wilwade opened this issue Dec 16, 2024 · 23 comments
Open

Parachain Blocktimes Increasing #6910

wilwade opened this issue Dec 16, 2024 · 23 comments

Comments

@wilwade
Copy link
Contributor

wilwade commented Dec 16, 2024

What is happening?

The average parachain blocktime appears to be increasing slowly over the last 7-14 days.

I'm working on collecting better data, but here's an image of an internal chart for Frequency (6s blocktimes) that shows the number of blocks in the last 7 days over 30 minutes and the trend line over the last 7 days is clear.

2024-12-16 10 34 52

Is it just Frequency? Nope

While it was having trouble, I checked with a script the lag of Frequency vs the lag of Mythos (another 6s blocktime parachain), and over 1000 blocks, the lag matched. (Trying to download a dump of blocktimes from Frequency and Mythos and will share once collected)

Additionally other chains have seen the issue and I hope will post here as well.

There is also no clear correlation between volume and the lag or other issues (which there could be, but as other chains are seeing it, unlikely to be the issue with the trend)

Ideas on what it might be?

The Turboflakes Validator dashboard doesn't show any issues, but that would be my first thought.

Additional Notes

  • There has been a consistent dip in blocktimes from 0930-1200 UTC for Frequency, but I've not yet been able to validate that it is seen on other chains yet. This has been ongoing for a while and been trying to track it down.
  • Frequency is on Polkadot SDK v1.13.0
@wilwade wilwade changed the title Parachain Blocktimes increasing Parachain Blocktimes Increasing Dec 16, 2024
@bkchr
Copy link
Member

bkchr commented Dec 16, 2024

@sandreim @alexggh can you please take a look.

@alexggh
Copy link
Contributor

alexggh commented Dec 16, 2024

@sandreim @alexggh can you please take a look.

First of all there hasn't been a release in that timeframe, so I don't think it is something from a new release.

Will try to have a look tomorrow, just clicking through the relay chain blocks it seems that there are blocks where no-block is included, so I think that might be the reason you are seeing this.

E.g:
https://polkadot.js.org/apps/?rpc=wss%3A%2F%2Fpolkadot.public.curie.radiumblock.co%2Fws#/explorer/query/0xb35995b7d16148c82c0dc0ad31ef6f1f4c2878d51f4a26dec409e689a36ba574

https://polkadot.js.org/apps/?rpc=wss%3A%2F%2Fpolkadot.public.curie.radiumblock.co%2Fws#/explorer/query/0xdf674f849703488ec923df212dce27155461441988466589bcc7118b65952cb8

https://polkadot.js.org/apps/?rpc=wss%3A%2F%2Fpolkadot.public.curie.radiumblock.co%2Fws#/explorer/query/0x81ee51ce255741c50b9f37adee230f9bdf208a88151c106f67cc8573ccb161cb

https://polkadot.js.org/apps/?rpc=wss%3A%2F%2Fpolkadot.public.curie.radiumblock.co%2Fws#/explorer/query/0x6b9639227545b15f3349c4f6d7b05c2bc575c4ba7a381b16565e71a5a868c3c6

So, basically all parachain blocks sometimes are skipping a relaychain block from backed to included.

The availability bitmaps from those blocks seems to indicate the 2/3 quorum is not reached so I think they are correctly not included in the first relay chain blocked after being backed.

So, we need to understand why some of the availability bitmaps for some of the validators are 0 sometimes, one possible reason for this would be that, those validators are simply slow either on network or on CPU.

@sandreim
Copy link
Contributor

sandreim commented Dec 16, 2024

Yes, there is a clear trend here as seen here:

Screenshot 2024-12-16 at 18 21 32

The availability bitmaps from those blocks seems to indicate the 2/3 quorum is not reached so I think they are correctly not included in the first relay chain blocked after being backed.

So, we need to understand why some of the availability bitmaps for some of the validators are 0 sometimes, one possible reason for this would be that, those validators are simply slow either on network or on CPU.

One idea is that this can happen if some networking subsystem channel is full, this would lead to a stall of the bridge, so bitfields are not processed by block author in the required time window.

I have been seeing bitfield distribution being pressured at times and it's channel getting full on Kusama on our own validators:

Screenshot 2024-12-16 at 18 28 22

However, I don't see why this started suddenly without any change.

@eskimor
Copy link
Member

eskimor commented Dec 16, 2024

Depending on the root cause, speculative availability as described here might help and would actually be rather quick to implement.

@eskimor
Copy link
Member

eskimor commented Dec 16, 2024

In the data I can't see any significant correlated increase, except for active accounts:

screenshot-2024-12-16-16-59-18

block weights and size don't seem to have increased at all in that timespan. 🤔

@sandreim
Copy link
Contributor

Yeah, I don't think the load has increased, could it be just some under spec validators got into the set, currently I see 3 F grade validators on Polkadot.

@eskimor
Copy link
Member

eskimor commented Dec 16, 2024

Frequency seems to be affected the most:

screenshot-2024-12-16-17-05-07

in the data I don't see anything that has changed with Frequency blocks though. Block size of phala (best chain) seems similar to Frequency.

@eskimor
Copy link
Member

eskimor commented Dec 16, 2024

Yeah, I don't think the load has increased, could it be just some under spec validators got into the set, currently I see 3 F grade validators on Polkadot.

Weird though that it seems to increase continuously over days.

@burdges
Copy link

burdges commented Dec 17, 2024

Are there more or bigger parablocks? Someone else bought some cores recently, right?

@crystalin
Copy link

Additional metrics that could help would be the average/sum of the PoV size as well as execution time (maybe harder to get)

@alexggh
Copy link
Contributor

alexggh commented Dec 17, 2024

I have been seeing bitfield distribution being pressured at times and it's channel getting full on Kusama on our own validators:

If that's the case then this will help and it should be coming in the next release: https://github.com/paritytech/polkadot-sdk/pull/5787/files.

However, just looking at the metrics we have, it doesn't seem there is any increase in load, finalization seems to be stable, the number of parachain blocks included around the same numbers. So I tend to think there is something going wrong on some validators, so I'm going to try to build some statistics regarding the ones missing bitfields to see if we can pin-point which ones became slow.

@alexggh
Copy link
Contributor

alexggh commented Dec 17, 2024

From this blocks that have no included parachain blocks:
https://polkadot.js.org/apps/?rpc=wss%3A%2F%2Fpolkadot-rpc.publicnode.com#/explorer/query/0x99549b6c4450220d44c6a3d4d2dc62505338cbf7fff70723c5d073437060bfbb
https://polkadot.js.org/apps/?rpc=wss%3A%2F%2Fpolkadot-rpc.publicnode.com#/explorer/query/0xd5e08346906a534140b596968fc972176e72466c2cabeb3f9afa94d0acbf9e20
https://polkadot.js.org/apps/?rpc=wss%3A%2F%2Fpolkadot-rpc.publicnode.com#/explorer/query/0xbfd891fa8cda1b0c7d1f85549b36db36cd2043dadfd0e3bba613a679d9288bc3
https://polkadot.js.org/apps/?rpc=wss%3A%2F%2Fpolkadot-rpc.publicnode.com#/explorer/query/0xf66c2a4601c61ebf58decc99fdf079f2bd9019c340d50cacfd10753eb2359d93

It seems there are a lot of validators where the bitfield looks like this, a big 0.

0b00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000

That tells us that the bitfield arrive on the block author, so I don't think there is a problem with the bitfield-distribution, because bitfields that don't arrive aren't noted on chain, but with the availability being slow on the said validators. I built the list here and for all the block it seems to be kind of the same validators, I'll ask on the validators channels to see if I can get some logs.

bad_validators.txt

@eskimor
Copy link
Member

eskimor commented Dec 17, 2024

But more than a third of the validators would need to be slow on availability for this to cause issues. Just a few should be no problem at all. All 0 can also hardly be explained by slowness, there must be some greater malfunction on that node.

@alexggh
Copy link
Contributor

alexggh commented Dec 17, 2024

But more than a third of the validators would need to be slow on availability for this to cause issues.

Yes, there seems to be more that 1/3 with 0 on those blocks, I asked on this channel here: https://matrix.to/#/!NZrbtteFeqYKCUGQtr:matrix.parity.io/$17344363742DoDav:parity.io?via=parity.io&via=corepaper.org&via=matrix.org logs from the validators, hopefully that tells us more.

@alexggh
Copy link
Contributor

alexggh commented Dec 17, 2024

polka-logs.txt

Got some logs from one of the validator, one thing I notice is that we seem to have a lot of Reorgs on the relaychain in 6 hours span we have 400 reorgs, that's a Re-org every other ~10 blocks, that also correlates with the blocks I've seen to have 0 included candidates, so it might be related or not.

@eskimor
Copy link
Member

eskimor commented Dec 17, 2024

Interesting. Would need to check, but one good explanation for those all zeros would be that backing already failed on that fork that got revealed late. If nothing was occupying the core, then availability obviously will also not run. Another explanation, that block triggering the reorg was revealed so late that availability had no chance to run. (Seems less likely, given the perfect zero)

But if that first theory were to hold, then all the validators would have sent all zeros.

@alexggh
Copy link
Contributor

alexggh commented Dec 17, 2024

I think the Re-org is a red herring, I looked at the logs going as far as 17th of November and it seems they always happened in the same percentages, so it doesn't correlate with the start of this problem.

@aaron2048
Copy link

aaron2048 commented Dec 17, 2024

FWIW, we see increasing block times on Moonbeam as well. What's especially concerning is it doesn't seem to have plateaued yet.

increasing_blocktimes

@wilwade
Copy link
Contributor Author

wilwade commented Dec 17, 2024

I finally pulled the data for the last ~11 days for both Frequency and Mythos

It is a dump of key information for each block from 2024-12-05T04:27:24.000Z to 2024-12-16T14:33:42.000Z. I hope to use it to try to better pinpoint when the lags are happening and when they are not across two chains that are close to the same tx volume, node count, collator count, etc...

So here it is in case it helps anyone else as this is all public data.

Columns:

  • blockNumber: The Parachain Block Number
  • extrinsicsCount: The number of extrinsics (signed and unsigned) in the block
  • povSize: Block PoV size, same as is shown on the Polkadot Dev UI
  • weight: Block Weight, same as is shown on the Polkadot Dev UI
  • timestamp: The value from the setTimestamp Extrinsic
  • datetime: The ISO version of Timestamp for humans
  • secondsSinceLastBlock: The number of seconds since the prior block rounded to the nearest second (block n timestamp - block n-1 timestamp)

See CSVs in the Gist: https://gist.github.com/wilwade/0fbb9dccc4f8d2f20fe4035aab422f25

@alexggh
Copy link
Contributor

alexggh commented Dec 18, 2024

A short summary, the good news is that the network seems to have healed overnight:

Average block times:

Screenshot 2024-12-18 at 09 50 20

Mythos

Screenshot 2024-12-18 at 09 49 55

Moonbeam
Screenshot 2024-12-18 at 09 50 07

What we know so far

  1. Starting 12th of December there seem to be a number of validators which had their availability bitfields with 0, in some relaychain blocks, that seems to point to some availabilty slowness on those validators for all parachains just on some blocks, before and after that block their availability bitfields seem alright.
  2. Validators of all versions seem to be affected, see the list here: https://matrix.to/#/!NZrbtteFeqYKCUGQtr:matrix.parity.io/$17344363742DoDav:parity.io?via=parity.io&via=corepaper.org&via=matrix.org.
  3. Restart of the node seems to help I tested with 3 validators(2 upgraded from 1.16.2 to 1.17) and 1(restarted on 1.15.2) all seemed to recover after restart.
  4. We just released polkadot v1.17, so I assumed that's the reason the network recovered and parachain block times look better now, because some validators upgraded.

Next

  • Root-cause what made some validators enter this state.
  • Monitor the network, to understand if this problem is coming back or was just an one-off
  • Determine if there are still validators that seem affected by this issue.

@aaron2048
Copy link

Next

  • Root-cause what made some validators enter this state.
  • Monitor the network, to understand if this problem is coming back or was just an one-off
  • Determine if there are still validators that seem affected by this issue.

Monitoring on an ongoing basis, even once the issue is resolved would be great.

@wilwade
Copy link
Contributor Author

wilwade commented Dec 18, 2024

I think there is a group that is still having trouble. In the Polkadot epoch from ~0900-1200 UTC today (2024-12-18) both Mythos and Frequency had the exact same total seconds of block delays.

Here's the last 48 hours, and you can see the recovery noted by @alexggh, but then a dip from ~0900-1200 UTC:

image

That timing lines up really close to the epoch timing on Polkadot, so I'm thinking this might still be part of the issue or perhaps closer to the root cause?

@alexggh
Copy link
Contributor

alexggh commented Dec 19, 2024

Root cause

Alright, I think I've got the full picture now, we've got this long standing issue #5258 (all known versions except v1.17.0 are affected), where availability-distribution's cpu usage is increasing slowly but steady with around 1% of cpu usage per-day. So the longer a validator runs availability-distribution slows down, to a point where it can't fetch its PoV chunks and populate the bitfields in time(1.5s since the block was imported).

This is the last 6 months from a polkadot validator affected by this issue(kudos to tugy | Amforc)
tuggy

So, the longer a validator rans before a restart the slower it gets, the validators I've been in contacted ran for 30 days and another for ~3 months.

Therefore, once the active has enough of such validators(~1/3) the parachain inclusion slows down as well and it starts skipping blocks, because less than 2/3 informed the block author that they have their PoV chunk.

Solution

The issue has been accidentally fixed with #5875, which is included in the v1.17.0 release, all validators that upgrade are not missing bitfields anymore and their CPU usage is not increasing, also confirmed on versi by running the version before and after the commit.

Conclusion

I don't expect a repeat of this, there are already 184 validators on polkadot that upgraded to the unaffected polkadot v1.17.0, with more to follow probably shortly.

Action items

  • Notify validators on the validator channels, that if their node ran for more than 30 days without restart, they might need to restart it every now and then or simply upgrade to the latest version which is unaffected by this issue.
  • Find a home for a dashboard, that can aggregate and show validators that are not populating their availability bitfields from the onchain data, so that we can pinpoint when things go awry.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

8 participants