Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use real-time sockets in federated execution #274

Merged
merged 11 commits into from
Sep 28, 2023
Merged

Conversation

erlingrj
Copy link
Collaborator

@erlingrj erlingrj commented Sep 18, 2023

Got a bug report from @lhstrh which I was able to reproduce in the following trivial federated program:

target C {
    timeout: 100 msec
}
reactor Src {
    timer t (0, 50 msec)
    output out:int
    reaction(t) -> out {=
        lf_set(out, 42);
    =}
}
reactor Sink {
    input in:int 
    reaction(in) {=
        lf_print("Got %d at %li", in->value, lf_time_physical_elapsed());
    =}
}
federated reactor {
    src = new Src()
    sink = new Sink()

    src.out -> sink.in
}

I observed a strange ~40msec latency between Src and Sink. I tracked it down be a communication latency between Src and the RTI. After alot of digging I have found out that there is something called Nagle algorithm, enabled by default, which bundles short TCP messages together to avoid too much network traffic. So it was delaying our messages. In this draft I have just added socket options for disabling with TCP_NODELAY. It is still a draft because we can discuss which sockets this should apply for, maybe all? After we decide, lets put creation of sockets in net_util.c so we dont litter federate.c with these low-level socket option details.

@lhstrh this PR should fix your issue. You should recompile and reinstall the RTI also based on this PR.

@erlingrj
Copy link
Collaborator Author

erlingrj commented Sep 18, 2023

We should also investigate whether delayed ACKs can be problematic for us: https://en.wikipedia.org/wiki/Nagle%27s_algorithm#Interaction_with_delayed_ACK

@erlingrj
Copy link
Collaborator Author

The DistributedStopDecentralized.lf test now fails. This means either 1) I made a mistake here, 2) We shouldnt use TCP_NODELAY for federat-to-federate sockets or 3) Race condition in federated infrastructure related to stop requests which surface with this change

@edwardalee
Copy link
Contributor

edwardalee commented Sep 21, 2023

Using the debugger, I just found a serious bug in federate.c where a local variable "result" was shadowed. I've pushed a fix to this branch.

Copy link
Contributor

@edwardalee edwardalee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change looks great to me. Let's see if my bug fix enables the tests to pass.

@lhstrh
Copy link
Member

lhstrh commented Sep 22, 2023

Using the debugger, I just found a serious bug in federate.c where a local variable "result" was shadowed. I've pushed a fix to this branch.

I think the bug was in Erling's change, not the original code, right?

@edwardalee
Copy link
Contributor

Yes, the bug was not in the original code (I initially thought it might have been, but nothing would have worked).

core/federated/net_util.c Outdated Show resolved Hide resolved
@erlingrj
Copy link
Collaborator Author

After Christian's fix of the error reporting system, I will now investigate this test failure and fix it so we can merge it soon.

@erlingrj erlingrj changed the title Draft: Use real-time sockets in federated execution Use real-time sockets in federated execution Sep 26, 2023
@erlingrj
Copy link
Collaborator Author

@lhstrh, I got a test failure on a decentralized federated program on macOS. No debug output. Unable to reproduce in Ubuntu. Went away when re-running the tests. I have taken a close look at this PR and only reason I can come up with is either:

  1. There is a race condition in decentralized federated which sometimes is triggered now because we have less delays
  2. Some unrelated CI issue is at fault.

@lhstrh
Copy link
Member

lhstrh commented Sep 27, 2023

My bet would be on there being something fishy with the coordination... I don't think we can blame CI for this one (nor the changes in this PR).

@erlingrj
Copy link
Collaborator Author

Then I will merge this and we will have to look at federated race conditions at another time

@erlingrj erlingrj merged commit eb96e25 into main Sep 28, 2023
28 checks passed
@lhstrh lhstrh added the enhancement Enhancement of existing feature label Jan 23, 2024
@lhstrh lhstrh deleted the real-time-sockets branch February 1, 2024 20:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Enhancement of existing feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants