rr --wait stuck not reaping zombie process #3882
Comments
This is the ptrace-stop that the thread is blocked in, but it's probably not very surprising
OK - I think I kind of see what's going on. I managed to make the problem happen again with
I think I can conclude from this:
[1] the program being traced here actually has a handler for SIGSEGV, but the handler calls
If we see the PTRACE_EVENT_EXIT for a task while running a different task in unlimited-ticks mode in `Scheduler::reschedule`, it looks like nothing ever actually calls `handle_ptrace_exit_event` on it, and so nothing ever PTRACE_CONT's the task out of the exit-stop and into the zombie state. This seems to manifest itself as rr not reaping processes properly when they receive asynchronous core-dumping signals (e.g. SIGSEGV sent by `raise` or `kill`). Fix this issue by checking if there's a pending PTRACE_EVENT_EXIT to deal with on the task in `Scheduler::is_task_runnable`, and allowing the task to be executed if so. Fixes rr-debugger#3882
☝️ that PR seems to fix my problem, but I'll leave it running on my laptop overnight tonight to make sure :) EDIT: Confirmed, this does seem to fix my problem.
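For readers following along, the idea behind the fix can be modeled in a few lines of Python. This is only a toy sketch, not rr's actual C++ code: the `Task` class, the `blocked` flag and the `has_pending_exit_event` helper are hypothetical, and the two `*_ptrace_exit_event` fields are borrowed from the instance variables named in this issue on the assumption that "seen but not handled" means an exit event is pending.

```python
from dataclasses import dataclass


@dataclass
class Task:
    tid: int
    blocked: bool = False                    # hypothetical: task cannot run right now
    seen_ptrace_exit_event: bool = False     # PTRACE_EVENT_EXIT was observed
    handled_ptrace_exit_event: bool = False  # the exit event was processed


def has_pending_exit_event(task):
    # Assumption: an exit event that was seen but not handled is still pending.
    return task.seen_ptrace_exit_event and not task.handled_ptrace_exit_event


def is_task_runnable_before_fix(task):
    # Model of the old behaviour: a blocked task is never picked.
    return not task.blocked


def is_task_runnable(task):
    # Model of the fix: a pending exit event makes the task eligible to run,
    # so the scheduler gets a chance to handle the event and continue the task.
    if has_pending_exit_event(task):
        return True
    return not task.blocked


stuck = Task(tid=1681, blocked=True, seen_ptrace_exit_event=True)
print(is_task_runnable_before_fix(stuck), is_task_runnable(stuck))
```

In this model the thread sitting in its exit-stop is skipped forever by the old check, while the new check selects it so the exit event can be processed.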
I've got a very occasional issue where `rr --wait` is not properly reaping some child processes that have exited, and thus rr itself is stuck forever and not exiting.

The situation looks like this. There's an rr process, pid 1566, which has no traced children except for the defunct zombie process 1675.
Process 1675 has two threads. The thread-group-leader is in a zombie state, but the second thread is in a ptrace-stop:
I attached gdb to rr itself, to investigate why it got stuck into this state. rr is tracking both of these two threads still:
The thread-group-leader cannot be reaped because rr needs to reap the other threads first (`rr/src/RecordTask.cc`, lines 2146 to 2151 in f7067f1), so the question we need to investigate is: why is thread 1681 not yet reaped? We've seen its PTRACE_EVENT_EXIT; that's the tracing-stop we're currently in.
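As a cross-check, the ptrace event code can be pulled out of a raw wait status with a little bit arithmetic. The following is a minimal Python sketch applied to the status value 394623 quoted in this report; the helper name `decode_stop_status` is hypothetical, and the layout assumed is the usual Linux one (low byte 0x7f for a stop, stop signal in the next byte, ptrace event above bit 16).

```python
PTRACE_EVENT_EXIT = 6  # value from include/uapi/linux/ptrace.h
SIGTRAP = 5            # signal number on Linux


def decode_stop_status(status):
    """Split a raw wait status for a stopped tracee into its parts."""
    is_stopped = (status & 0xFF) == 0x7F  # low byte 0x7f marks a stop
    stop_signal = (status >> 8) & 0xFF    # signal reported for the stop
    ptrace_event = (status >> 16) & 0xFF  # PTRACE_EVENT_* code, 0 if none
    return is_stopped, stop_signal, ptrace_event


is_stopped, stop_signal, ptrace_event = decode_stop_status(394623)
print(is_stopped, stop_signal == SIGTRAP, ptrace_event == PTRACE_EVENT_EXIT)
```

394623 is 0x6057f, so the status decodes to a stop with SIGTRAP and event 6.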
`(394623 >> 16) & 0xFF == 6`, which is `PTRACE_EVENT_EXIT` (https://elixir.bootlin.com/linux/v6.11.6/source/include/uapi/linux/ptrace.h#L161).

If we look at some of its instance variables:
There are only two places that `seen_ptrace_exit_event_` can be set. The first is `rr/src/Task.cc`, line 483 in f7067f1, but in that case `handled_ptrace_exit_event_` would also be true, which it's not. The second is `rr/src/Task.cc`, line 2420 in f7067f1 (where `was_reaped_` is false too, which it is).

rr itself is blocked in this call to `waitid(-1)`:

This can't make any forward progress, because what needs to happen (I think) is that we need to continue thread 1681 so it leaves the ptrace-stop and makes it to a zombie state, which will generate another wait notification with the status and allow everything to be cleaned up.
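To make the dependency chain concrete, here is a toy Python model of the stuck state. It is a sketch under stated assumptions, not rr's logic: the state names and the `reap_all` function are hypothetical, a thread in its exit-stop is assumed to produce no further wait notification until it is continued, and the thread-group-leader is assumed to be reapable only after every other thread has been reaped.

```python
EXIT_STOP, ZOMBIE, REAPED = "exit-stop", "zombie", "reaped"


def reap_all(threads, leader, continue_exit_stops):
    """threads maps tid -> state. Returns True if every thread ends up reaped."""
    progress = True
    while progress:
        progress = False
        for tid, state in threads.items():
            if state == EXIT_STOP and continue_exit_stops:
                # Models continuing the thread out of its exit-stop.
                threads[tid] = ZOMBIE
                progress = True
            elif state == ZOMBIE:
                others_done = all(
                    s == REAPED for t, s in threads.items() if t != leader
                )
                # The leader has to wait until all other threads are reaped.
                if tid != leader or others_done:
                    threads[tid] = REAPED
                    progress = True
    return all(s == REAPED for s in threads.values())


print(reap_all({1675: ZOMBIE, 1681: EXIT_STOP}, leader=1675, continue_exit_stops=False))
print(reap_all({1675: ZOMBIE, 1681: EXIT_STOP}, leader=1675, continue_exit_stops=True))
```

Without the continue step the loop makes no progress at all, which mirrors the hang: the leader waits on thread 1681, and thread 1681 never leaves its exit-stop.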
I'm not entirely sure how we got into this state, but I suspect the fact that `waiting_for_ptrace_exit == true` is part of the story: that gets set when a core-dumping signal is sent to the process (`rr/src/RecordSession.cc`, line 1884 in f7067f1).
I'll try and debug this a bit further this weekend, but I wondered if this triggered any spidey-senses about what to look for here 🤔