Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

return the round of a unit with the unit data when the unit is finalized #376

Merged
merged 1 commit into from
Mar 18, 2024

Conversation

joschisan
Copy link
Contributor

@joschisan joschisan commented Nov 25, 2023

Fedimint uses Aleph BFT to order a stream of transactions with unpredictable gaps between while ordering a transaction is latency critical. Hence we need to switch sessions before the exponential slowdown starts and can not wait for a fixed number of transactions to achieve consensus on when to trigger the next session switch. Currently nodes submit batches of transactions which may be empty and we trigger a session switch every fixed number of (possibly empty) batches. In a byzantine fault tolerant setting with 3f+1 nodes we can assume that we order at least f batches in every round, however, if all nodes are correct and online we will likely order 3f+1 batches per round, hence we switch the session about three times faster than necessary to prevent running into the exponential slowdown. The times between session switches therefore depend on how many peers are offline / malicious.

If we were to return the round index of an ordered unit with the unit data we can simply trigger the session switch if the round index matches the offset for the exponential slowdown. This gives us more predictable session times, a cleaner way to trigger the session switch and allows us to increase session length by a factor of three without increasing our theoretical bound on RAM consumption per session. Since we opted not to overlay sessions and exchange signatures over the broadcast for simplicity we have a 3x latency spike every session switch - increasing session times 3x this way would therefore by quite useful to us.

Copy link

Please make sure the following happened

  • Appropriate tests created
  • Infrastructure updated accordingly
  • Updated existing documentation
  • New documentation created
  • Version bumped if breaking changes

@joschisan joschisan force-pushed the return_round branch 4 times, most recently from e71bca8 to 5267b5b Compare November 25, 2023 10:26
@ggawryal
Copy link
Contributor

ggawryal commented Dec 8, 2023

Hi @joschisan
Thanks for the PR. While it is concise, we are a little concerned about its changes in AlephBFT API - the way they expose some internal data is quite serious API commitment and it is not something we'd like to support. Soon we plan to add performance reports of nodes, which will include these (among many different) information via other, in our opinion, more useful and reliable interface.
We believe these reports should expose all the data you need, however we are not entirely sure what other changes you would like to be incorporated in AlephBFT in the future. We'd like to better understand your use case, so if you shared some more plans, we could discuss what common points we have, and exchange some comments on how this ideally could be done. Alternatively, if you would rather expose these AlephBFT internals directly, maintaining an own fork might also be an option to consider.

@joschisan
Copy link
Contributor Author

joschisan commented Jan 23, 2024

Hi @ggawryal, sorry for the delay I got sidetracked and then this pr slipped my mind. When I assume you say use case do I assume correctly that you refer to the Fedimint project in general - if so there are actually no further changes besides this bugfix: #399. AlephBFT in its current form fits into our use case quite well, which is why I opted to upstream with the last change as well. So if you have questions on the integration into Fedimint that go beyond what I explained above we can always arrange a call on this (I lead the migration of the system from HBBFT to AlephBFT over the last year).

On this pr, I would seem to me that using the round index to trigger a session switch is a very clean way to do it and it furthermore seems to be required every time you have a continuously running system. I understand that exposing an implementation detail like this seems tacky, however at least this seems to be a stable one at least as I assume you plan on keep the DAG structure to order items. If there is anything unclear with the session switch I described above we can also discuss this in a call anytime.

@timorleph
Copy link
Contributor

We had quite a lot of discussion about this PR within the team and we came to the conclusion that we prefer a more minimalistic interface – using the internal implementation details is in general dangerous and switching between sessions should be done on a completely separate layer, using the ordered data. The round limit is only supposed to be there to avoid OOMs and clearly signal to devs that switching sessions is a necessary part of using ABFT. In fact, we would probably prefer to have an interface without the last addition of the unit creator to the output as well.

That being said, we would be open to having a secondary, less stable, interface for People Who Are Sure They Know What They Are Doing™. This one would probably expose the full structure of the batches and underlying DAG, so return batches of (creator, round, unit_hash, parent_hashes, data) (or rather some struct containing these ofc) after every ordering. It seems this would solve all of your problems, although we wouldn't officially recommend using this alternative interface. We don't currently have plans, designs, nor resources for starting such a project, but if you proposed a PR we would likely accept it.

@joschisan joschisan force-pushed the return_round branch 3 times, most recently from 0f8d418 to 73dc111 Compare March 12, 2024 13:52
@joschisan
Copy link
Contributor Author

@timorleph I have implemented a version of a secondary interface. So far it only exposes creator and round, while unit_hash and parent_hashes could be returned this way as well. Please let me know if this implementation fits what you had in mind.

Copy link
Contributor

@fixxxedpoint fixxxedpoint left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@fixxxedpoint fixxxedpoint merged commit 6aa0e29 into Cardinal-Cryptography:main Mar 18, 2024
9 checks passed
@joschisan
Copy link
Contributor Author

Thanks for merging this so quickly!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants