
CPS-???? | Block Delay Centralisation #943

Open · wants to merge 15 commits into base: master

Conversation


@TerminadaDrep TerminadaDrep commented Dec 3, 2024

Abstract

An underlying assumption in the design of Cardano's Ouroboros protocol is that the probability of a stake pool being permitted to update the ledger is proportional to the relative stake of that pool. However, the current implementation does not properly realise this design goal.

Minor network delays endured by some participants cause them to face a greater number of fork battles. The result is that more geographically decentralised participants do not obtain participation that is proportional to their relative stake. This is both a fairness and security issue.


(rendered latest version)

@rphair rphair changed the title Terminada-CPS-block-delay-centralisation CPS-???? | Block Delay Centralisation Dec 3, 2024
@rphair rphair left a comment


@TerminadaDrep thanks for getting this into the open after putting so much work into this issue & leading so much constructive discussion about it over the last couple of years. I'm marking this Triage to introduce it at the CIP meeting in a week's time (https://hackmd.io/@cip-editors/102) & you would be very welcome to attend if possible & field some initial questions.

@rphair rphair added State: Triage Applied to new PR after editor cleanup on GitHub, pending CIP meeting introduction. Category: Consensus Proposals belonging to the `Consensus` category. labels Dec 3, 2024

This might seem like a minor problem, but the effect is significant. If the majority of the network resides in the USA and Europe, with close connectivity and less than 1 second propagation delays, then those participant pools will see 5% of their blocks suffering "fork battles", which occur only when another pool is awarded the exact same slot (i.e. a "slot battle"). They will lose half of these battles on average, causing 2.5% of their blocks to get dropped, or "orphaned".

However, for a pool that happens to reside on the other side of the world, where network delays might be just over 1 second, this pool will suffer "fork battles" not only with pools awarded the same slot, but also the slot before and the slot after. In other words, this geographically decentralised pool will suffer 3 times the number of slot battles, amounting to 15% of its blocks, and resulting in 7.5% of its blocks getting dropped. The numbers are even worse for a pool suffering 2 second network delays, because it will suffer 5 times the number of "fork battles" and see 12.5% of its blocks "orphaned". This not only results in an unfair reduction in rewards, but also a reduction of the same magnitude in its contribution to the ledger.
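For readers who want to check these figures, here is a minimal back-of-envelope sketch (not part of the CPS itself). It assumes Cardano's active-slot coefficient f = 0.05, independent leader election per slot, a 50% chance of losing each battle, and the same linear approximation used in the text above.

```python
# Back-of-envelope reproduction of the fork-battle figures quoted above.
# Assumptions: active-slot coefficient f = 0.05, independent leader election per
# slot, 50% chance of losing each battle, and the linear approximation
# (contested slots x f) used in the text; the exact binomial rate is slightly lower.

F = 0.05  # mainnet active-slot coefficient

def fork_battle_rates(delay_slots: int, loss_prob: float = 0.5):
    """Approximate share of a pool's blocks that face a fork battle / get orphaned
    when its blocks reach other producers only after `delay_slots` whole slots."""
    contested = 2 * delay_slots + 1            # same slot, plus delay_slots on each side
    battle_rate = contested * F
    return battle_rate, battle_rate * loss_prob

for d in (0, 1, 2):
    battles, orphans = fork_battle_rates(d)
    print(f"~{d}s of extra delay: fork battles ~{battles:.0%}, orphaned ~{orphans:.1%}")
```

Under these assumptions the script prints 5% / 2.5% for no extra delay, 15% / 7.5% for just over one slot of delay, and 25% / 12.5% for two slots, matching the figures above.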
Contributor

This CPS does not provide data showing that 1s is not enough for blocks to propagate anywhere in the world with the required hardware, connection and configuration. Without such data, it's impossible to determine whether a change is needed or not, and what solution would work.

@rphair rphair Dec 3, 2024

I often get 1000ms ping times just crossing India's border: though AWS Mumbai is probably exempt from such delays, which I imagine result from "great firewall" type packet inspection from the newly founded & somewhat ill-equipped surveillance state here. (p.s. this has become an additional reason why our pool is administered in India but all nodes are in Europe: which supports the author's premise fairly well.)

Author

This block was produced by my BP earlier today. It was full at 86.57kB in size, containing 64 transactions, and 66.17kB of scripts: https://cexplorer.io/block/c740f9ce8b25410ddb938ff8c42e12738c18b7fd040ae5224c53fb45f04b3ba0

These are the delays (from beginning of the slot) before each of my own relays included this block in their chains:

  • Relay 1 (ARM on same LAN) → Delayed=0.263s
  • Relay 2 (AMD on adjacent LAN) → Delayed=0.243s
  • Relay 3 (ARM approx 5 Km away) → Delayed=0.288s
  • Relay 4 (AMD Contabo vps in USA) → Delayed=2.682s
  • Relay 5 (ARM Netcup vps in USA) → Delayed=1.523s

The average propagation delay by nodes pinging the data to pooltool was 1.67 seconds: https://pooltool.io/realtime/11169975


@TerminadaDrep Could you add the above delay metrics to the CPS? I think having empirical data would help strengthen the case of this CPS. Also, could you indicate whether your BP is locally controlled or in a vps? I'm guessing it is locally controlled.

@SmaugPool SmaugPool Dec 3, 2024

This doesn't say which country the BP is in.
Also, I'm not sure we should target low-spec VPS nodes; is that an aim of Cardano? Or even VPSs at all?

Good nodes require good control of the hardware and software, which VPSs don't really offer. Some providers in this list are particularly known for poor performance, and virtualization adds overhead.

Moreover, configuration optimization can help with latency (tracing, mempool size, TCP slow start, congestion control, etc.), so more details are needed.

Overall I believe we cannot conclude that 1s is not enough from just this data point.

As a counter-example, here is a SMAUG full 86.37 KB block with 97 transactions, propagated on average in 0.46s:
https://pooltool.io/realtime/11147794

My furthest relay in Singapore received it in 550ms.

And most of my blocks propagate quicker than that.

Contributor

I'm not saying intercontinental propagation is irrelevant here; it is measurable and constrained by the physical speed of light. So I also disagree with one of the points in the CPS, namely its expectation of future transmission and latency improvements.

Author

It is interesting that the delay is clearly in the Australian part of the internet. Perhaps the Aussie national broadband network (NBN) was more congested than usual at this time.

Certainly this block had worse than usual propagation delays.

Author

I added some more examples, which include pooltool data for a couple of pools in Japan.

I also added a new "Arguments against" item No. 5 to discuss the extra infrastructure costs a BP in Australia must bear to try to reduce the disadvantage that is inherent in the current Ouroboros implementation.


So I also disagree with one of the points in the CPS, namely its expectation of future transmission and latency improvements.

@gufmar I'm not an expert on networks, but I really don't think we should be relying on improvements to network throughput here, due to the rebound effect. Demand will very likely increase to use up any extra "slack" (e.g. Leios, block size increases, and even non-blockchain demand). If network capacity doubles but so does demand, we can easily find ourselves here again.

Contributor

Yes, sorry if I was unclear. I wanted to say I disagree with the argument made at https://github.com/cardano-foundation/CIPs/blob/e7bf9b4c103f3841f2d8364e78905c1183ee9526/CPS-XXXX/README.md#arguments-against-correcting-this-unfairness

because I don't expect we will see significant improvements in network latency, which is the main limiting factor here (the TCP ACK back-and-forth packets), not so much the throughput.



Even the high-quality infrastructure of a first-world country like Australia is not enough to reliably overcome this problem, due to its geographical location. But is it reasonable to expect all block producers across the world to receive blocks in under one second whenever the internet becomes congested, or if the block size is increased following parameter changes? Unfortunately, the penalty for a block producer that cannot sustain this remarkable feat of sub-one-second block receipt and propagation is 3 times as many "fork battles", resulting in 7.5% "orphaned" blocks rather than 2.5%.
Contributor

Some data is needed to prove Australia's case, to be able to reproduce it, and to evaluate working solutions.

Contributor

For example, this AWS datacenter-to-datacenter round-trip latency map does not seem to be enough to prove the point:
[image: AWS datacenter-to-datacenter round-trip latency map]

@fallen-icarus fallen-icarus Dec 3, 2024

I agree with data helping, but I want to point out that using AWS as the benchmark doesn't seem appropriate since the goal is to not have AWS control most of the block producers.

P.S. I'm not suggesting you mean to use AWS as the benchmark, I just felt like this point should be made explicit.

@rphair rphair Dec 3, 2024

Also, we can't be sure these hop times aren't between AWS back-end networks, and therefore don't include the time spent for unaffiliated traffic to enter & exit backbone networks or cross the "last mile" of retail Internet services.

@SmaugPool SmaugPool Dec 3, 2024

I agree with data helping, but I want to point out that using AWS as the benchmark doesn't seem appropriate since the goal is to not have AWS control most of the block producers.

P.S. I'm not suggesting you mean to use AWS as the benchmark, I just felt like this point should be made explicit.

The point was that if AWS datacenter-to-datacenter round-trip latency were already more than 1s between two points in the world, it would have been enough to prove the CPS's point, because it is close to the best case connectivity-wise (independently of the centralization issue). But that's not the case, so more data is needed. I didn't mean to say anything else; you are interpreting. So I think it was appropriate to show that more data is indeed needed. See my original quote:

For example, this AWS datacenter-to-datacenter round-trip latency map does not seem to be enough to prove the point:

Contributor

Unfortunately not this week, because I'm involved in other things, but I can assure you we have plenty of latency and propagation data: not only general latency but actual mainnet block propagation times as transmitted and received via the Ouroboros mini-protocols.
We have 2.5 years of history, covering a range of normal and extraordinary network situations, for small and large blocks, and with anywhere from no script execution units up to the maximum.
The previously shown Gantt chart here is just one visualization of one block. I'm happy to invite people to a workshop call where we go through some of these data points, computed averages for predefined edge cases, etc.

@happystaking

I can only speak for my own pool, but when I compare a high-spec relay (A) to a low-spec relay (B), the average time from slot begin to adoption on relays A and B is 240ms and 560ms respectively.

Blocks are forged on node C, which is on the same physical machine as relay A. Relay B is 5500km (~100ms) away, so that time should be subtracted from the 560ms to take network delay out of the equation (i.e. to compare as if it were on the same LAN with no hops).

This (perhaps unscientific) method leads me to believe that having the right hardware does more to improve propagation times than being close to all other nodes. Moreover, pinging halfway around the world should take no more than 300ms in an optimal situation. That means you should easily have enough time to bypass any countries or areas with deep packet inspection (which cause high network delays) and still stay under 1 second.
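A minimal sketch of that subtraction, using the figures above. The ~100ms one-way figure is the commenter's estimate for ~5500km, and the resulting split between network and local (hardware/software) time is a back-of-envelope assumption rather than a separate measurement.

```python
# Rough decomposition of block-adoption delay into network vs local components,
# using the figures quoted above. The ~100 ms one-way latency for ~5500 km is an
# estimate, so the derived numbers are only indicative.

relay_a_adoption_ms = 240   # high-spec relay, same machine as the block producer
relay_b_adoption_ms = 560   # low-spec relay, ~5500 km away
one_way_latency_ms = 100    # assumed network delay to relay B

relay_b_local_ms = relay_b_adoption_ms - one_way_latency_ms   # ~460 ms spent locally
hardware_gap_ms = relay_b_local_ms - relay_a_adoption_ms      # ~220 ms attributable to specs

print(f"Relay B local processing ~{relay_b_local_ms} ms, "
      f"~{hardware_gap_ms} ms more than relay A")
```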

@TerminadaDrep
Author

Coincidentally, the last two TERM blocks in a row each happened to have another leader scheduled for the very next slot.

  1. A full block for which pooltool reported an average propagation time of 0.87s --> Despite the average reported propagation being less than 1 second, the next producer IOGP did not receive it in time and created a fork. Unfortunately IOGP's block had the lower VRF, so TERM lost the "fork battle" and got its block orphaned.
  2. A small block for which pooltool reported an average propagation time of 0.62s --> This one was fortunately received by the next producer TLK in time to produce its block at the next slot, so there was no fork and TERM's block did contribute to the chain.

@TerminadaDrep
Author

Another important consideration is that it is possible to maliciously game these forks.

The block VRF only depends on the following inputs:

  • Epoch nonce
  • Slot number
  • Pool private key

Therefore the block VRF is known ahead of time.

A malicious operator can run a modified version of cardano-node that inspects the previous block's VRF, compares it to its own value, and deliberately causes a fork whenever it knows it will win the "fork battle". This would allow a malicious group of pools to deliberately "orphan" the blocks of other competitors in order to earn a higher percentage of the reward pot and gain more control over consensus.
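As an illustration of how simple that gaming logic would be, here is a schematic sketch of the decision rule described above. The function and parameter names are illustrative, not cardano-node internals, and it assumes (as in the examples in this thread) that the tie-break favours the block with the lower leader VRF output.

```python
# Schematic sketch of the fork-gaming decision rule described above.
# Illustrative only: these are not cardano-node APIs, and the tie-break rule
# (lower leader VRF wins) is taken from the examples discussed in this thread.

def should_fork(own_vrf: int, previous_block_vrf: int) -> bool:
    """Decide whether a malicious leader should ignore the previous block and
    build on its parent instead, deliberately creating a fork it expects to win.

    Because the leader VRF depends only on the epoch nonce, the slot number and
    the pool's private key, `own_vrf` is known before the slot even arrives."""
    return own_vrf < previous_block_vrf  # fork only when the battle is known to be won
```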

@kiwipool

kiwipool commented Dec 7, 2024

Hi

Here's some data from our onsite (our own site), NZ-based baremetal operation for comparison purposes. With the exception of perhaps the 46S stakepool based in Invercargill, we are likely the most remote stakepool in the ecosystem. We are never going to win any low-latency awards from our location.

We run high-specification, enterprise-grade servers connected via gigabit fibre for our NZ relays and NZ primary BPs. In addition to NZ-based baremetal, we operate cloud relays spread around the world with reputable providers on decent hardware. Our 'Plan B' cloud-based failover system is also very high specification and produces significantly lower latency numbers than our NZ-based baremetal. We choose to run our primary system on baremetal in NZ for philosophical, rather than performance, reasons.

See the below summarized pooltool data for the last 50 epochs (E476-525) for KIWI
ID: 60397646d7d1ad6fe2ddccfe7efc9cba61f6d3d94d29e8f41de73240

Slots: 1,699
Height Battles: 16
Height Battle Wins: 4
Height Battle Losses: 12
Height Battles as % of Slots: 0.94%
Height Battle Wins as % of Slots: 0.24%
Height Battle Losses as % of Slots: 0.71%
Height Battle Wins as % of Height Battles: 25.00%
Height Battle Losses as % of Height Battles: 75.00%
Average Height Battles per Epoch: 0.32
Average Height Battle Wins per Epoch: 0.08
Average Height Battle Losses per Epoch: 0.24

EDIT:
Combined Height + Slot Battle Loss as % of Slots: 3.06%

Anecdotally, we appear to be experiencing more height battles in the dynamic-p2p era.

Hopefully this is useful for comparison purposes.

Matticus
🥝Kiwipool Staking

@TerminadaDrep
Author

Is it reasonable to expect that geographically remote pools must use high-QoS, guaranteed-priority fibre plans, whereas the majority in the USA/EU can use the "normal" internet? Or is Ouroboros expected to function fairly with everyone using the "normal" internet?

@rphair rphair left a comment

@TerminadaDrep we discussed this at today's CIP meeting on Discord. The consensus was that

  1. the title & scope need to be more specifically defined as "fairness" relative to geographical limitations according to the observations that you've pointed out;
  2. the CPS must be targeted towards specific goals to alleviate those discrepancies.

Here are the recommendations that came up:

• The current title Block Delay Centralisation is ambiguous and maybe not accurate (since the delays are at the perimeter, not the centre, of the network topology). We should agree upon a CPS title that frames "fairness" as something currently difficult to achieve for nodes on the perimeter of the Cardano network.

• Given that a huge component of node propagation delays derive from the speed of light and a currently unchangeable TCP/IP stack, we need the CPS to be written so CIPs that fix whatever is fixable in Cardano can be attached to your problem statement: otherwise it might as well be an issue in the node repository issue queue (see further below).

• Other than altering the consensus mechanism, the only optimisations Cardano can therefore make are to address network inefficiencies. @colll78 indicated that perhaps a 20-fold improvement in network efficiency is forthcoming, and I believe this CPS should make it possible to link any such improvements on the core roadmap to the symptoms of the problem you identify.

• Extending the slot duration to accommodate network propagation times was seen as a "brute force" solution that would have dramatic effects on the Consensus protocol, and therefore less promising than identifying points of research for improvements in network performance (hence also the category change highlighted below).

• Therefore the reviewers concluded the CPS should focus on what needs to be investigated, and then remediated, to improve fairness of Cardano PoS performance for nodes at the topological boundaries of the network.

My own question was (since I know you've been working steadily on getting visibility for these issues for a couple of years): do you have any Consensus or Node repository issues about these so far? If so, I think the issues + any responses would help to fill out the proposal along these lines... and should be linked in the CPS and/or the discussion here.

@rphair rphair added Category: Network Proposals belonging to the `Network` category. and removed Category: Consensus Proposals belonging to the `Consensus` category. labels Dec 10, 2024
@rphair rphair added State: Unconfirmed Triaged at meeting but not confirmed (or assigned CIP number) yet. and removed State: Triage Applied to new PR after editor cleanup on GitHub, pending CIP meeting introduction. labels Dec 10, 2024
@fallen-icarus

Extending the slot duration to accommodate network propagation times was seen as a "brute force" solution that would have dramatic effects on the Consensus protocol, and therefore less promising than identifying points of research for improvements in network performance (hence also the category change highlighted below).

Apologies for missing the meeting, but increasing the slot duration should still be seriously considered despite any required changes. According to the discussion in the consensus working group on Discord, the designers did discuss 1-second vs 2-second slot lengths, but they didn't really have any hard evidence to prefer 2 seconds, so they went with 1 second since it caused fewer slot battles.

Now that we have actual geographical data, there may be hard evidence to prefer 2 seconds. I'm not saying other avenues shouldn't be explored, but the above wording makes it seem like the slot duration option was downplayed too much in the meeting.
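To make the trade-off concrete, here is a small extension of the earlier back-of-envelope model comparing 1-second and 2-second slots. It assumes the average block interval stays at 20 seconds (so the active-slot coefficient scales with slot length) and a 50% loss rate per battle; these are illustrative assumptions, not figures from the meeting or the CPS.

```python
# Back-of-envelope comparison of 1s vs 2s slot lengths, holding the average block
# interval at 20s (so f scales with slot length). Illustrative assumptions only.

BLOCK_INTERVAL_S = 20

def rates(slot_len_s: float, delay_s: float, loss_prob: float = 0.5):
    f = slot_len_s / BLOCK_INTERVAL_S          # active-slot coefficient at this slot length
    delay_slots = int(delay_s // slot_len_s)   # whole slots the block arrives late
    battle_rate = (2 * delay_slots + 1) * f    # same linear approximation as before
    return battle_rate, battle_rate * loss_prob

for slot_len in (1, 2):
    for delay_s in (0.5, 1.5):
        battles, orphans = rates(slot_len, delay_s)
        print(f"slot={slot_len}s, delay={delay_s}s: "
              f"fork battles ~{battles:.0%}, orphaned ~{orphans:.1%}")
```

Under these assumptions, longer slots raise the baseline same-slot battle rate for every pool, but stop a 1-2 second propagation delay from spilling into neighbouring slots, which is the fairness trade-off being weighed here.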

@rphair

rphair commented Dec 11, 2024

cross-referencing a vital post from @karknu on network options, limitations & potential workarounds here: https://forum.cardano.org/t/problem-with-increasing-blocksize-or-processing-requirements/140044/7
