Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bugfix in the federated launch script to let failures be reported correctly #1835

Closed
wants to merge 1 commit into from

Conversation

petervdonovan
Copy link
Collaborator

This fixes another way tests can pass when they should be failing.

The behavior which apparently is currently in master is that if all federates terminate then the test passes, even if some federates terminated with nonzero exit code.

@petervdonovan
Copy link
Collaborator Author

petervdonovan commented Jun 9, 2023

The following are now failing for Python in master:

  • src/federated/DistributedCountDecentralizedLate.lf (added to this list 7/18 after running tests again to account for flakiness)
  • src/federated/DistributedLoopedAction.lf
  • src/federated/DistributedLoopedPhysicalAction.lf
  • src/federated/DistributedNoReact.lf (added 7/18)
  • src/federated/DistributedStop.lf
  • src/federated/DistributedStopZero.lf
  • src/federated/PingPongDistributed.lf (added 7/18)

We will have to either move them to failing or fix them before this gets merged.

Fortunately it looks like the C and TypeScript tests are still passing. We're lucky that it was only four Python tests that are failing.

@petervdonovan
Copy link
Collaborator Author

Actually, this is a nondeterministic segmentation fault (sometimes it happens, usually it doesn't), so probably it affects other Python federated tests as well. It happens during garbage collection; here is the stack trace:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  gc_reset_refs (
    refs=<error reading variable: Cannot access memory at address 0x10>, g=0x0)
    at ../Modules/gcmodule.c:105
105     ../Modules/gcmodule.c: No such file or directory.
[Current thread is 1 (Thread 0x7f78704c4000 (LWP 45104))]
(gdb) bt
#0  gc_reset_refs (
    refs=<error reading variable: Cannot access memory at address 0x10>, g=0x0)
    at ../Modules/gcmodule.c:105
#1  update_refs (containers=containers@entry=0x565479446e30)
    at ../Modules/gcmodule.c:427
#2  deduce_unreachable (base=base@entry=0x565479446e30, 
    unreachable=unreachable@entry=0x7ffe0006e530) at ../Modules/gcmodule.c:1104
#3  0x00005654786b9625 in gc_collect_main (tstate=0x565479462f40, generation=2, 
    n_collected=0x7ffe0006e5f8, n_uncollectable=0x7ffe0006e5f0, nofail=0)
    at ../Modules/gcmodule.c:1239
#4  0x00005654787c0f50 in gc_collect_with_callback (tstate=0x565479462f40, 
    generation=2) at ../Modules/gcmodule.c:1413
#5  0x00005654787f44be in PyGC_Collect () at ../Modules/gcmodule.c:2099
#6  0x00005654787f1efe in Py_FinalizeEx () at ../Python/pylifecycle.c:1781
#7  0x00005654787e1af3 in Py_RunMain () at ../Modules/main.c:668
#8  0x00005654787b7bcd in Py_BytesMain (argc=<optimized out>, argv=<optimized out>)
--Type <RET> for more, q to quit, c to continue without paging--
    at ../Modules/main.c:720
#9  0x00007f7870229d90 in __libc_start_call_main (
    main=main@entry=0x5654787b7b90 <main>, argc=argc@entry=4, argv=argv@entry=0x7ffe0006e8d8)
    at ../sysdeps/nptl/libc_start_call_main.h:58
#10 0x00007f7870229e40 in __libc_start_main_impl (main=0x5654787b7b90 <main>, argc=4, 
    argv=0x7ffe0006e8d8, init=<optimized out>, fini=<optimized out>, 
    rtld_fini=<optimized out>, stack_end=0x7ffe0006e8c8) at ../csu/libc-start.c:392
#11 0x00005654787b7ac5 in _start ()

Could this be related to the problem that @jackykwok2024 is working on?

@lhstrh
Copy link
Member

lhstrh commented Jun 10, 2023

I would be in favor of moving these tests to failing and creating an issue documenting this. I agree that it seems that @jackykwok2024 would be the right person to look into this...

@jackyk02
Copy link
Collaborator

jackyk02 commented Jun 10, 2023

Hi @petervdonovan, thanks for the thorough description you've provided. From my understanding, the nondeterministic segmentation fault seems to be caused by the unordered main reactions occurring in Python's federated execution. This does correlate with the memory leak issue I'm currently examining. I will look into it next week.

@petervdonovan
Copy link
Collaborator Author

The reason why this has not been merged yet with the failing tests moved into failing is that I am worried that other tests will also prove to be flaky. This means that I am basically allowing all Python tests to be more or less disabled in master (because that is the status quo) so that we do not have to be bothered with more test failures.

@lhstrh
Copy link
Member

lhstrh commented Jun 14, 2023

I see. I guess the question is: is a fix for the memory leak within reach? And if not, what is the status, @jackykwok2024? Can we help?

@edwardalee
Copy link
Collaborator

I think this is also related to the fact that the Python programs segfault with Python 3.11 (issue #1458). I started looking into this, and put some notes in #1458, including some instructions on how to use the VS Code debugger with mixed Python and C. In my setup, I can set breakpoints in our C code and see variable values, etc.

@petervdonovan petervdonovan changed the title Fix the federated launch script (again) Fix the federated launch script Jun 14, 2023
@lhstrh lhstrh changed the title Fix the federated launch script Bug fix in the federated launch script to let failures be reported correctly Jul 11, 2023
@lhstrh lhstrh changed the title Bug fix in the federated launch script to let failures be reported correctly Bugfix in the federated launch script to let failures be reported correctly Jul 11, 2023
@petervdonovan petervdonovan changed the base branch from master to remove-unordered-reactions July 18, 2023 18:26
@petervdonovan petervdonovan force-pushed the remove-unordered-reactions branch 2 times, most recently from b90d63d to 054d466 Compare July 20, 2023 22:05
This fixes another way tests can pass when they should be failing.
@petervdonovan petervdonovan force-pushed the fix-launch-script-again branch from 35911dd to 351707b Compare July 21, 2023 04:17
Base automatically changed from remove-unordered-reactions to master July 27, 2023 22:52
@petervdonovan
Copy link
Collaborator Author

petervdonovan commented Jul 31, 2023

This change has already made it into master by way of the remove-unordered-reactions PR, so we can close this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants