RDC-robust credit initialization #73

Open
bgelb-openai opened this issue Oct 24, 2024 · 2 comments
Labels
enhancement New feature or request

Comments

@bgelb-openai
Collaborator

It would be useful to define a credit exchange protocol for credit/valid style channels that cross a reset domain boundary.

The baseline credit_stall-based protocol is a good start, but it really only handles the initial reset release:

  • sender drives a "stall" signal to the receiver when it is in reset
  • if receiver comes out of reset first, it won't release credits to the sender until the sender deasserts its stall signal
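
The baseline behavior in the two bullets above can be sketched as a small cycle-based Python model. This is only an illustration under assumptions, not the actual RTL: the class names `Sender` and `Receiver`, the `simulate` helper, the event names, and the one-credit-per-cycle release rate are all hypothetical. Only the `credit_stall` signal name comes from the protocol described here.

```python
class Sender:
    def __init__(self):
        self.in_reset = True
        self.credits = 0

    @property
    def credit_stall(self):
        # Assumption: stall is asserted for as long as the sender is in reset.
        return self.in_reset

    def assert_reset(self):
        self.in_reset = True
        self.credits = 0  # the credit counter is lost on reset

    def release_reset(self):
        self.in_reset = False
        self.credits = 0

    def receive_credit(self, n):
        if not self.in_reset:
            self.credits += n


class Receiver:
    def __init__(self, depth):
        self.depth = depth
        self.in_reset = True
        self.unreleased = 0  # initial credits not yet handed to the sender

    def release_reset(self):
        self.in_reset = False
        self.unreleased = self.depth

    def step(self, credit_stall):
        """Return the number of credits released this cycle (0 or 1)."""
        if self.in_reset or credit_stall or self.unreleased == 0:
            return 0
        self.unreleased -= 1
        return 1


def simulate(depth, events, cycles):
    """Run the model; events maps a cycle number to a list of event names."""
    tx, rx = Sender(), Receiver(depth)
    actions = {
        "tx_release": tx.release_reset,
        "tx_reset": tx.assert_reset,
        "rx_release": rx.release_reset,
    }
    for cycle in range(cycles):
        for name in events.get(cycle, []):
            actions[name]()
        tx.receive_credit(rx.step(tx.credit_stall))
    return tx.credits


# Receiver leaves reset first and holds its credits until the stall drops.
first = simulate(4, {0: ["rx_release"], 5: ["tx_release"]}, 20)

# Sender re-enters reset after initialization; nothing re-issues the credits.
second = simulate(4, {0: ["rx_release", "tx_release"], 10: ["tx_reset"], 12: ["tx_release"]}, 30)
```

In this model `first` ends up as 4, while `second` ends up as 0: once the receiver has handed out its initial credits, a later sender reset leaves the sender with no credits and no way to get them back.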

What is needed is to handle arbitrary re-entry into reset of one or both sides of the channel. The desired behavior is:

  • data can get lost (unavoidable)
  • channel always reaches a consistent state where both sides have a correct view of credits (and the correct amount of credit) after either (or both) sides undergo any arbitrary reset
  • either side gets a positive notification when the other side goes through a reset, so that it can take any cleanup action (if necessary/desired)
  • data in flight at time of reset assertion will never be delivered at the destination after the re-initialization completes (i.e. any data after the notification can be assumed valid from the current reset epoch)

I haven't mapped this out in detail, but I think it should need something like a 3-4 state FSM on each side, with each side's state exported to its link partner.
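
To make the idea concrete, here is one possible shape for such an FSM as a behavioral Python sketch. It is not a worked-out design and not RTL: the four state names, the transition rules, and the `Side`, `next_state`, `step`, and `run` helpers are all hypothetical. It also assumes each side sees its partner's exported state cleanly with a one-cycle delay, which ignores the synchronization a real reset/clock domain crossing would need.

```python
from enum import Enum


class St(Enum):
    RESET = 0   # local reset asserted
    INIT = 1    # out of reset, local credit state cleared
    READY = 2   # has seen the partner in INIT or READY
    ACTIVE = 3  # credits and data may flow


def next_state(state, partner, in_reset):
    """Return (next_state, partner_reset_seen) for one side of the link."""
    if in_reset:
        return St.RESET, False
    if state is St.RESET:
        return St.INIT, False
    if state is St.INIT:
        # A partner still showing ACTIVE is stale: wait until it restarts too.
        ok = partner in (St.INIT, St.READY)
        return (St.READY if ok else St.INIT), False
    if state is St.READY:
        if partner is St.RESET:
            return St.INIT, True
        ok = partner in (St.READY, St.ACTIVE)
        return (St.ACTIVE if ok else St.READY), False
    # ACTIVE: a partner showing RESET or INIT has been through a reset.
    if partner in (St.RESET, St.INIT):
        return St.INIT, True
    return St.ACTIVE, False


class Side:
    def __init__(self):
        self.state = St.RESET
        self.in_reset = True
        self.notifications = 0  # count of observed partner resets


def step(a, b):
    """Advance both sides one cycle; each sees the other's previous state."""
    na, seen_a = next_state(a.state, b.state, a.in_reset)
    nb, seen_b = next_state(b.state, a.state, b.in_reset)
    a.state, b.state = na, nb
    a.notifications += int(seen_a)
    b.notifications += int(seen_b)


def run(a, b, cycles):
    for _ in range(cycles):
        step(a, b)


# Bring both sides up, then pulse reset on side a only.
a, b = Side(), Side()
a.in_reset = False
run(a, b, 5)            # a waits in INIT while b is still in reset
b.in_reset = False
run(a, b, 5)            # both walk INIT -> READY -> ACTIVE
a.in_reset = True
run(a, b, 1)            # a drops to RESET
a.in_reset = False
run(a, b, 5)            # b notices, both re-initialize
```

In a real implementation the credit counters would presumably be cleared outside ACTIVE, with the receiver re-issuing its initial credits on entering ACTIVE and both sides dropping data outside ACTIVE. That part is not modeled here. In this run, `b.notifications` ends at 1, `a.notifications` stays at 0, and both sides return to `St.ACTIVE`.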

@mgottscho
Contributor

Would the scope of this protocol just be for managing credits across the RDC, or would it be part of a larger reset architecture? It seems there is some desire for the latter, given the mention of other cleanup actions.

Additionally, since this would be for an RDC, is the assumption that each domain is completely independent of the other, i.e., domain A can enter and exit reset freely while domain B may always be out of reset?

@bgelb-openai
Collaborator Author

I would not want this common component to be too opinionated about a larger reset architecture. Surely it could provide useful capabilities to that end, but I don't think that is necessarily the point.

I think the desired behavior here is good from a resilience POV generally. If two independent components are connected by a credited channel, and one of them encounters an error/problem that can only be recovered from via a reset, it should generally be possible to reset just that component (perhaps in combination w/ some additional action to quiesce traffic at a higher level) and recover.

That is a good property that generally opens up a lot of options from a survivability POV.

@mgottscho mgottscho added the enhancement New feature or request label Oct 24, 2024