RDC-robust credit initialization #73

bgelb-openai · 2024-10-24T16:47:34Z

It would be useful to define a credit exchange protocol for credit/valid style channels that cross a reset domain boundary.

The baseline credit_stall-based protocol is a good start, but it really only handles the initial reset release:

- The sender drives a "stall" signal to the receiver while the sender is in reset.
- If the receiver comes out of reset first, it won't release credits to the sender until the sender deasserts its stall signal.
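The two rules above can be sketched as a small cycle-based behavioral model. This is only an illustration under stated assumptions, not the actual RTL: the class names, the buffer depth of 4, the one-credit-per-cycle release rate, and the `simulate` helper are all hypothetical, and synchronizer latency across the boundary is ignored.

```python
from dataclasses import dataclass


@dataclass
class Sender:
    in_reset: bool = True
    credits: int = 0  # credits currently held by the sender

    @property
    def stall(self) -> bool:
        # Stall is asserted for as long as the sender is in reset.
        return self.in_reset

    def step(self, credit_return: int) -> None:
        if self.in_reset:
            self.credits = 0  # reset clears the counter; returned credits are dropped
        else:
            self.credits += credit_return


@dataclass
class Receiver:
    depth: int = 4       # hypothetical buffer depth
    in_reset: bool = True
    unreleased: int = 0  # credits not yet handed to the sender

    def step(self, sender_stall: bool) -> int:
        if self.in_reset:
            self.unreleased = self.depth
            return 0
        if sender_stall or self.unreleased == 0:
            return 0  # hold credits while the sender is still in reset
        self.unreleased -= 1
        return 1  # release one credit per cycle (assumption)


def simulate(sender_reset, receiver_reset, cycles=20):
    """sender_reset / receiver_reset: collections of cycle numbers with reset asserted."""
    sender, receiver = Sender(), Receiver()
    for t in range(cycles):
        sender.in_reset = t in sender_reset
        receiver.in_reset = t in receiver_reset
        sender.step(receiver.step(sender.stall))
    return sender.credits, receiver.unreleased
```

In this model, `simulate(range(6), range(2))` and `simulate(range(2), range(6))` both end with all four credits at the sender, regardless of which side leaves reset first. `simulate({0, 1, 12}, {0, 1})`, where the sender re-enters reset for one cycle after initialization, ends with `(0, 0)`: the sender's credits are gone and the receiver has none left to release, which is the gap this issue describes.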

What is needed is a way to handle arbitrary re-entry into reset by one or both sides of the channel. The desired behavior is:

- Data can get lost (unavoidable).
- The channel always reaches a consistent state in which both sides have a correct view of credits (and the correct amount of credit) after either side, or both, undergoes any arbitrary reset.
- Each side gets a positive notification when the other side goes through a reset, so that it can take any cleanup action (if necessary or desired).
- Data in flight at the time of reset assertion is never delivered at the destination after re-initialization completes (i.e., any data arriving after the notification can be assumed valid for the current reset epoch).

I haven't mapped this out in detail, but I think it would need roughly a 3-4 state FSM on each side, with each side's state exported to its link partner.
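As a rough illustration of what such a per-side FSM might look like, here is one possible four-state arrangement. Nothing in it comes from the thread beyond the idea of exporting state to the link partner: the state names, the transition rules, and the `partner_reset_seen` counter are all assumptions, and the model ignores synchronizer latency and the data path itself (a real design would also flush credits and drop in-flight data while outside `ACTIVE`).

```python
from enum import Enum


class State(Enum):
    RESET = 0   # local reset asserted
    INIT = 1    # out of reset, credits cleared, waiting for partner to leave ACTIVE
    READY = 2   # partner seen in INIT/READY, waiting for it to reach READY/ACTIVE
    ACTIVE = 3  # credits and data may flow


class LinkFsm:
    def __init__(self):
        self.state = State.RESET
        self.partner_reset_seen = 0  # number of "partner was reset" notifications

    def next_state(self, in_reset, partner):
        s = self.state
        if in_reset:
            return State.RESET
        if s is State.RESET:
            return State.INIT
        if s is State.INIT and partner in (State.INIT, State.READY):
            return State.READY
        if s is State.READY:
            if partner is State.RESET:
                return State.INIT  # partner reset again before the link came up
            if partner in (State.READY, State.ACTIVE):
                return State.ACTIVE
        if s is State.ACTIVE and partner in (State.RESET, State.INIT):
            return State.INIT  # partner went through reset: re-initialize
        return s

    def commit(self, new_state):
        if self.state is State.ACTIVE and new_state is State.INIT:
            self.partner_reset_seen += 1  # positive notification to local logic
        self.state = new_state


def simulate(a_reset, b_reset, cycles=20):
    """a_reset / b_reset: collections of cycle numbers with reset asserted."""
    a, b = LinkFsm(), LinkFsm()
    trace = []
    for t in range(cycles):
        # Each side samples the partner's state from the previous cycle.
        next_a = a.next_state(t in a_reset, b.state)
        next_b = b.next_state(t in b_reset, a.state)
        a.commit(next_a)
        b.commit(next_b)
        trace.append((a.state.name, b.state.name))
    return a, b, trace
```

The design choice in this sketch is that a side leaving reset waits in `INIT` until it sees its partner outside `ACTIVE`, so even a one-cycle reset pulse cannot go unnoticed by the partner. For example, `simulate({0, 1, 10}, {0, 1, 2, 3})` brings both sides to `ACTIVE`, then side A's single-cycle reset at cycle 10 forces side B back through `INIT` with exactly one notification before both return to `ACTIVE`.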

mgottscho · 2024-10-24T17:17:01Z

Would the scope of this protocol just be for managing credits across the RDC, or would it be part of a larger reset architecture? It seems there is some desire for the latter, given the mention of other cleanup actions.

Additionally, since this would be for an RDC, is the assumption that each domain is completely independent of the other, i.e., domain A can enter and exit reset freely while domain B may always be out of reset?

bgelb-openai · 2024-10-24T18:32:14Z

I would not want this common component to be too opinionated about a larger reset architecture. Surely it could provide useful capabilities to that end, but I don't think that is necessarily the point.

I think the desired behavior here is good from a resilience POV generally. If two independent components are connected by a credited channel and one of them encounters an error or problem that can only be recovered from via a reset, it should generally be possible to reset just that component (perhaps in combination with some additional action to quiesce traffic at a higher level) and recover.

That is a good property that generally opens up a lot of options from a survivability POV.

mgottscho added the enhancement New feature or request label Oct 24, 2024
