Advice for how best to report a hang potentially caused by YJIT #557

Open
mikebaldry opened this issue May 25, 2024 · 7 comments

@mikebaldry

Apologies, as this isn't a bug report itself; I wasn't sure of a better avenue to ask this question.

I've been trying to upgrade to Ruby 3.3.1 and enable YJIT, but my test suite is hanging in seemingly random places. Looking at nrdebug.rb output, the Ruby code is stopped on a call to super, and nothing in the base class's implementation would cause any kind of blocking. The C thread has a stack trace stuck here:

#0  0x000075b990bd5021 in ?? ()
#1  0x0000000000000001 in ?? ()
#2  0x000075b944892458 in ?? ()
#3  0x000075b98dbfd100 in ?? ()
#4  0x000075b98df0c708 in jit_exec_exception (ec=<optimized out>) at vm.c:503
#5  vm_exec_loop (result=<optimized out>, tag=<optimized out>, state=<optimized out>, ec=<optimized out>) at vm.c:2512
#6  rb_vm_exec (ec=0x5c6485e72540) at vm.c:2489

I guess those first four frames are YJIT-compiled code? How would I dig in and find out what it's doing?

I am able to reproduce it locally, though not able to make a small test case without some guidance. All I know is, if I don't call RubyVM::YJIT.enable on app boot, the hanging stops.

Can you suggest how best to investigate the issue to narrow it down enough to create a small reproduction, or what information might be valuable to you in tracking it down?
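
One way to make the "hang stops when YJIT is not enabled" comparison repeatable is to gate the call behind an environment variable, so the same boot code can run in both modes. This is only a minimal sketch: `ENABLE_YJIT` and `maybe_enable_yjit` are hypothetical names chosen for illustration, not part of the app or of Ruby.

```ruby
# Sketch: enable YJIT at boot only when explicitly requested.
# ENABLE_YJIT is a hypothetical variable name chosen for this example.
def maybe_enable_yjit(env = ENV)
  return false unless env["ENABLE_YJIT"] == "1"
  # RubyVM::YJIT.enable exists from Ruby 3.3; guard for builds without it.
  return false unless defined?(RubyVM::YJIT) && RubyVM::YJIT.respond_to?(:enable)

  RubyVM::YJIT.enable
  RubyVM::YJIT.enabled?
end
```

Calling `maybe_enable_yjit` from the boot file in place of the direct `RubyVM::YJIT.enable` call would let CI flip the setting per job without a code change.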

@maximecb

Hi Mike,

If your test suite is small, it might be possible for you to dump what is getting compiled, which could give useful insights, with the --yjit-dump-insns command-line option. Not always practical if you have a lot of code though.

Another thing that could help in tracing the problem: if you're able to run Ruby master and it doesn't have this bug, that would tell us it's a bug we've already fixed for Ruby 3.4.0. From there, doing a git bisect could help identify where the bug was fixed.

@XrXr and @k0kubun may have more insight.

@mikebaldry
Author

Thanks, the suite is unfortunately about 19k tests, horribly inefficient and takes over 7 minutes to run across 75 CircleCI nodes. I'll have a check against master and see if it still happens!

@maximecb

maximecb commented May 27, 2024

@mikebaldry thanks for your patience and help in tracking down the problem 🙏 :)

@XrXr do we have a way to get debug symbols for YJIT? Otherwise what's the best way to know which Ruby method it's hanging inside of?

@XrXr

XrXr commented May 27, 2024

> do we have a way to get debug symbols for YJIT? Otherwise what's the best way to know which Ruby method it's hanging inside of?

We don't generate debug symbols. Since it looks like Mike is able to attach GDB to the hung process, you can try calling rb_backtrace() in the GDB session. You can also go by what you see from nrdebug.rb. Once you know where it hangs, you can try progressively reducing the code that runs: limit the run to one test, repeat that test, remove code in the super call target or its caller, and so on.
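
The "repeat the test" step can be scripted from a separate process, so that a hung run is detected by a deadline rather than by someone watching CI. The following is a rough sketch under assumptions: `first_hanging_attempt` is a hypothetical helper, and treating "exceeded the timeout" as "hung" is a heuristic that depends on choosing a timeout well above the normal run time.

```ruby
require "rbconfig"

# Runs `cmd` (an argv array) up to `attempts` times. Returns the number of the
# first attempt that is still running after `timeout` seconds (a suspected
# hang), or nil if every attempt finished in time.
def first_hanging_attempt(cmd, attempts:, timeout:)
  1.upto(attempts) do |attempt|
    pid = Process.spawn(*cmd)
    deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
    finished = false

    until finished
      # With WNOHANG, Process.wait returns nil while the child is still running.
      finished = !Process.wait(pid, Process::WNOHANG).nil?
      break if finished || Process.clock_gettime(Process::CLOCK_MONOTONIC) > deadline
      sleep 0.05
    end
    next if finished

    # Suspected hang: kill and reap the child, then report the attempt number.
    Process.kill("KILL", pid)
    Process.wait(pid)
    return attempt
  end
  nil
end

# Example (the command is illustrative only):
#   first_hanging_attempt([RbConfig.ruby, "-e", "sleep 1"], attempts: 20, timeout: 60)
```

To inspect a hung run instead of discarding it, the kill could be replaced with printing the pid and exiting, leaving the process available for a GDB session.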

@maximecb

@mikebaldry Ruby 3.3.2 was just released. Any chance you could give it a try and let us know if the problem still happens? https://www.ruby-lang.org/en/downloads/releases/

@mikebaldry
Author

> @mikebaldry Ruby 3.3.2 was just released. Any chance you could give it a try and let us know if the problem still happens? https://www.ruby-lang.org/en/downloads/releases/

Sorry, I did try with master and was having a whole host of problems even getting the bundle to install without CI running out of memory and getting killed. Then I got sidetracked with work things!

I've already got our CI set up with the tests running 3.2.2 but without YJIT enabled, so in the morning I'll have a shot with it enabled and see if it happens again.

@mikebaldry
Author

So the issue did still happen, but I decided to start removing things that I felt were low-level enough that they might cause some kind of conflict. The first thing I disabled was Coverband (which calls Coverage.start), and the issue has not happened again, so I think it could be something around the coverage code, though it didn't appear anywhere in the stack trace as far as I could see.
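
To check whether the hang really needs both YJIT and coverage, one option is to run the same test under every on/off combination of the two. The sketch below only generates the combinations; the variable names `ENABLE_YJIT` and `ENABLE_COVERAGE` are hypothetical, and the app's boot code would have to read them for the toggles to have any effect.

```ruby
# Hypothetical toggle names; nothing in Ruby or Coverband reads these by default.
TOGGLE_NAMES = %w[ENABLE_YJIT ENABLE_COVERAGE].freeze

# Returns one environment hash per on/off combination of the given toggles,
# e.g. four hashes for two toggles.
def toggle_matrix(names = TOGGLE_NAMES)
  %w[0 1].repeated_permutation(names.size).map { |values| names.zip(values).to_h }
end

# Example (the test command is an assumption about the project setup):
#   toggle_matrix.each { |env| system(env, "bundle", "exec", "rspec", "spec/some_spec.rb") }
```

If only the combination with both toggles set to "1" hangs, that would point at an interaction between the two rather than at either one alone.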
