Advice for how best to report a hang potentially caused by YJIT #557

mikebaldry · 2024-05-25T16:42:48Z

Apologies, as this isn't a bug report itself; I wasn't sure of a better avenue to ask this question.

I've been trying to upgrade to Ruby 3.3.1 and enable YJIT, but my test suite hangs in seemingly random places. According to the nrdebug.rb output, the Ruby code is stopped on a call to super, and nothing in the base class method it calls would cause any kind of blocking. The C thread has a stack trace stuck here:

#0  0x000075b990bd5021 in ?? ()
#1  0x0000000000000001 in ?? ()
#2  0x000075b944892458 in ?? ()
#3  0x000075b98dbfd100 in ?? ()
#4  0x000075b98df0c708 in jit_exec_exception (ec=<optimized out>) at vm.c:503
#5  vm_exec_loop (result=<optimized out>, tag=<optimized out>, state=<optimized out>, ec=<optimized out>) at vm.c:2512
#6  rb_vm_exec (ec=0x5c6485e72540) at vm.c:2489

I guess those first 4 are YJIT compiled code? How would I dig in and find out what it's doing?

I am able to reproduce it locally, though not able to make a small test case without some guidance. All I know is, if I don't call RubyVM::YJIT.enable on app boot, the hanging stops.
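To compare runs with and without YJIT without editing the boot code each time, the call can be put behind a switch. This is only a sketch: the helper name and the ENABLE_YJIT variable are hypothetical, and it assumes a Ruby where RubyVM::YJIT.enable is available (3.3 or later).

```ruby
# Hypothetical helper: enable YJIT only when explicitly requested, so the
# same suite can be run both ways. ENABLE_YJIT is an assumed variable name.
def maybe_enable_yjit(env = ENV)
  return false unless env["ENABLE_YJIT"] == "1"
  # Guard for Rubies or builds where YJIT cannot be enabled at run time.
  return false unless defined?(RubyVM::YJIT) && RubyVM::YJIT.respond_to?(:enable)

  RubyVM::YJIT.enable
  RubyVM::YJIT.enabled?
end
```

Calling maybe_enable_yjit at app boot and running the suite once with ENABLE_YJIT=1 and once without should show whether the hang follows the switch.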

Can you suggest how best to investigate the issue to narrow it down enough to create a small reproduction, or what information might be valuable to you in tracking it down?


maximecb · 2024-05-27T15:10:48Z

Hi Mike,

If your test suite is small, it might be possible for you to dump what is getting compiled, which could give useful insights, with the --yjit-dump-insns command-line option. Not always practical if you have a lot of code though.

Another thing that could help in tracing the problem: if you're able to run Ruby master and it doesn't have this bug, that would tell us it's a bug we've fixed in Ruby 3.4.0. From there, doing a git bisect could help identify where the bug was fixed.

@XrXr and @k0kubun may have more insight.

mikebaldry · 2024-05-27T15:17:05Z

Thanks, the suite is unfortunately about 19k tests, horribly inefficient and takes over 7 minutes to run across 75 CircleCI nodes. I'll have a check against master and see if it still happens!

maximecb · 2024-05-27T15:20:40Z

@mikebaldry thanks for your patience and help in tracking down the problem 🙏 :)

@XrXr do we have a way to get debug symbols for YJIT? Otherwise what's the best way to know which Ruby method it's hanging inside of?

XrXr · 2024-05-27T15:29:01Z

> do we have a way to get debug symbols for YJIT? Otherwise what's the best way to know which Ruby method it's hanging inside of?

We don't generate debug symbols. Since it looks like Mike is able to attach GDB to the hung process, you can try calling rb_backtrace() in the GDB session. You can also go by what you see from nrdebug.rb. Once you know where it hangs, you can progressively reduce the code that runs: limit the run to one test, repeat that test, remove code in the super call target or its caller, and so on.
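The "repeat the test" step can be automated with a small wrapper that reruns one command and reports the first run that exceeds a time limit. This is a rough sketch, not something from the thread: the helper name, the attempt count, and the timeout are all assumptions, and it kills the suspected hang rather than leaving it alive for GDB.

```ruby
# Run `cmd` (an argv array) up to `attempts` times. Returns the number of the
# first attempt that ran longer than `timeout` seconds (a suspected hang),
# or nil if every attempt finished in time.
def first_hanging_attempt(cmd, attempts: 20, timeout: 60)
  (1..attempts).each do |n|
    pid = Process.spawn(*cmd)
    deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
    finished = false

    # Poll without blocking until the child exits or the deadline passes.
    until finished || Process.clock_gettime(Process::CLOCK_MONOTONIC) > deadline
      finished = !Process.wait(pid, Process::WNOHANG).nil?
      sleep 0.05 unless finished
    end
    next if finished

    # Timed out: kill and reap the child, then report which attempt it was.
    Process.kill("KILL", pid)
    Process.wait(pid)
    return n
  end
  nil
end
```

For example, first_hanging_attempt(["bundle", "exec", "rspec", "spec/some_spec.rb"]) would rerun a single (hypothetical) spec file. To inspect the hung process instead, remove the kill and attach GDB to the printed pid.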

maximecb · 2024-05-30T18:42:03Z

@mikebaldry Ruby 3.3.2 was just released. Any chance you could give it a try and let us know if the problem still happens? https://www.ruby-lang.org/en/downloads/releases/

mikebaldry · 2024-05-30T18:45:07Z

> @mikebaldry Ruby 3.3.2 was just released. Any chance you could give it a try and let us know if the problem still happens? https://www.ruby-lang.org/en/downloads/releases/

Sorry, I did try with master and was having a whole host of problems even getting the bundle to install without CI running out of memory and getting killed. Then I got sidetracked with work things!

I've already got our CI set up with the tests running 3.2.2 but without YJIT enabled so in the morning I'll have a shot with it enabled and see if it happens again.

mikebaldry · 2024-05-31T08:17:37Z

So the issue did still happen, but I decided to start removing things that I felt were low level enough that they might cause some kind of conflict. The first thing I disabled was Coverband (which calls Coverage.start), and the issue has not happened again. So I think it could be something around the coverage code, though it didn't appear in the stack trace anywhere as far as I could see.
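A possible starting point for a small reproduction is a script that combines the three ingredients mentioned so far: Coverage started first, a super call, and YJIT enabled. This is only a scaffold under assumptions. The classes are hypothetical, the Coverage mode is a guess (the thread does not say which options Coverband passes), and it is not known to reproduce the hang.

```ruby
require "coverage"
require "tempfile"

# Hypothetical classes standing in for the real base class and subclass.
REPRO_SOURCE = <<~RUBY
  class ReproBase
    def call(x)
      x + 1
    end
  end

  class ReproChild < ReproBase
    def call(x)
      super(x) * 2
    end
  end
RUBY

def run_repro(iterations: 100_000)
  file = Tempfile.new(["yjit_coverage_repro", ".rb"])
  file.write(REPRO_SOURCE)
  file.close

  # Coverage only measures files loaded after it has been started,
  # so the classes are loaded from a file afterwards.
  Coverage.start(lines: true) unless Coverage.running?
  load file.path

  # Enable YJIT when this Ruby supports turning it on at run time.
  if defined?(RubyVM::YJIT) && RubyVM::YJIT.respond_to?(:enable)
    RubyVM::YJIT.enable
  end

  # Call through super often enough for the methods to be compiled.
  child = ReproChild.new
  iterations.times.sum { |i| child.call(i) % 7 }
ensure
  file&.unlink
end
```

If run_repro returns normally, the next step would be to move real code from the hanging test into REPRO_SOURCE a piece at a time.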

maximecb added the YJIT-bug label May 27, 2024
