[LibOS] `fork` in multi-threaded app may fail #1156
Comments
@dimakuv I can look into this. Thanks!
So I think this assumption simplifies the solution a lot: …
Hmm, I'm not sure this was my answer, or rather, not exactly this. Or maybe I explained it wrong; if so, I'm sorry. Problems: …
Unfortunately, I don't remember what idea I had back then to solve it, but the solution described above doesn't seem to work. To be honest, I remember only saying that if you do …
Yep, I understand that. But my proposal assumes that after `fork()` the child immediately does `execve()`. After all, the whole problem in this issue is that handles and their associated FDs may change state during the checkpointing process and cause crashes. So my proposal solves this in a drastic way: don't checkpoint handles and their associated FDs. If this works for the cases that we've encountered with some Java workloads, then it should be good enough. It's just like we emulate …
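For illustration, here is a minimal sketch of the fork-then-immediate-execve pattern that this assumption targets (this example is not from the thread, and the exec'd program is arbitrary):

```c
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    pid_t pid = fork();
    if (pid == 0) {
        /* Child: exec right away and never touch the parent's inherited FDs,
         * which is the assumption behind "don't checkpoint handles on fork". */
        execlp("/bin/true", "true", (char*)NULL);
        _exit(127); /* only reached if execlp() failed */
    }
    if (pid > 0)
        waitpid(pid, NULL, 0);
    return 0;
}
```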
I think I've seen a bunch of them; it's a popular mechanism. I think our own loader uses it, gdb uses it, and probably many more apps do.
This sounds like something which should be solvable in a clean way; we just need to spend more time thinking / designing it, IMO.
It's very different - at least from the POSIX level, semantically …
Based on some of the links that Dmitrii shared above, my understanding is that it's not safe behavior for applications unless they follow …
No, what our loader does was an example of descriptor inheritance being used; it's a normal thing to do on Linux. I gave it to show that we can't just drop most descriptors on `fork()`.
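For readers unfamiliar with the mechanism, here is a minimal sketch of descriptor inheritance across `fork()` + exec (illustrative only, not taken from the thread): the exec'd program reads from a descriptor it never opened itself.

```c
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    int fds[2];
    if (pipe(fds) < 0)
        return 1;

    pid_t pid = fork();
    if (pid == 0) {
        /* Child: wire the inherited pipe read end to stdin and exec `cat`;
         * `cat` then reads from a descriptor opened by the parent. */
        dup2(fds[0], STDIN_FILENO);
        close(fds[0]);
        close(fds[1]);
        execlp("cat", "cat", (char*)NULL);
        _exit(127); /* only reached if execlp() failed */
    }

    close(fds[0]);
    const char* msg = "inherited across exec\n";
    write(fds[1], msg, strlen(msg));
    close(fds[1]);
    waitpid(pid, NULL, 0);
    return 0;
}
```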
Because it breaks other apps which are using these mechanisms, and it will be hard to debug them, because it won't be obvious why they crashed.
After a bit of thinking, I agree with @mkow. Even supporting … Unfortunately, there seems to be no simple solution to this...
What about trying to get upstreams to use …?
But …
Use a patched glibc?
What do you mean? What would that patch do? Regardless of that, remember that we support unmodified apps which don't have to use our modified glibc (and may even be statically compiled).
Call a Gramine-specific …
Description of the problem
A `fork()` in a multithreaded application may fail on sending PAL handles, here: `gramine/libos/src/libos_checkpoint.c`, line 287 in 5fb3154.

This is because the checkpoint happens in two steps:

1. `migrate_func()`. Inside `migrate_func()`, the most important checkpointing part is `BEGIN_CP_FUNC(thread)` -> `BEGIN_CP_FUNC(handle_map)`, at which point the `handle_map->lock` is acquired, and this lock guards all the sub-checkpointing logic for all known-to-process handles (the lock is dropped when `migrate_func()` returns).
2. `send_handles_on_stream()`, which calls `PalSendHandle()` on each PAL handle collected in the first step. `PalSendHandle()` needs to peek inside the PAL handle object, and this is where our bug happens!

The bug in question happens in multithreaded apps, where:

- one thread performs a `fork()`,
- another thread performs a `close(<some-fd>)` in-between the two above steps.

The `close(<some-fd>)` closes the PAL handle and destroys it. Then the first thread starts step 2, and `PalSendHandle()` is given a garbage pointer (at which previously there was a PAL handle), and this function errors out. This is a classic use-after-free instance.
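A hypothetical reproducer shape (a sketch only, not a confirmed trigger of the bug; the names and timing are made up for illustration): one thread closes an FD while another thread is in the middle of `fork()`, so the `close()` can land between the two checkpointing steps described above.

```c
#include <pthread.h>
#include <sys/wait.h>
#include <unistd.h>

static int g_fds[2];

static void* closer(void* arg) {
    (void)arg;
    close(g_fds[0]); /* may race with the concurrent fork() below */
    return NULL;
}

int main(void) {
    if (pipe(g_fds) < 0)
        return 1;

    pthread_t t;
    pthread_create(&t, NULL, closer, NULL);

    /* Under Gramine, checkpointing of this fork() may observe a PAL handle
     * that the other thread is destroying at the same time. */
    pid_t pid = fork();
    if (pid == 0)
        _exit(0);

    pthread_join(t, NULL);
    if (pid > 0)
        waitpid(pid, NULL, 0);
    return 0;
}
```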
Additional materials:

Proposed solution is to utilize the RW global lock on checkpointing:

- The problematic workloads seem to do `fork` + immediate `execve`, so maybe we can assume this (that the child immediately does `execve` and doesn't need all the parent's state) in our workaround for this issue.
- In `create_process_and_send_checkpoint()`, somewhere right before `migrate_func()`, we grab this global lock in write (W) mode.
- When `create_process_and_send_checkpoint()` exits, we release the W lock.

I think this solution is the cleanest and provides a consistent view of the Gramine state (but not a consistent view of the app state! but here we can't do anything, it's a bug in the app then).
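To make the locking idea concrete, here is a rough sketch using a POSIX rwlock purely for illustration (Gramine has its own locking primitives, and the wrapper names below are made up):

```c
#include <pthread.h>

static pthread_rwlock_t g_checkpoint_lock = PTHREAD_RWLOCK_INITIALIZER;

/* Hypothetical: every path that destroys a handle/FD takes the lock in read
 * mode, so independent closes can still run in parallel with each other. */
void destroy_handle_locked(void) {
    pthread_rwlock_rdlock(&g_checkpoint_lock);
    /* ... close and destroy the handle and its PAL handle ... */
    pthread_rwlock_unlock(&g_checkpoint_lock);
}

/* Hypothetical: the checkpointing path holds the lock in write mode for its
 * whole duration, so no handle can disappear between migrate_func() and
 * send_handles_on_stream(). */
void checkpoint_and_send_locked(void) {
    pthread_rwlock_wrlock(&g_checkpoint_lock);
    /* ... migrate_func(); ... send_handles_on_stream(); ... */
    pthread_rwlock_unlock(&g_checkpoint_lock);
}
```

Note that this inverts the usual reader/writer roles: the frequent operation (closing FDs) acts as the "reader" of the checkpointing state, while the rare `fork` is the exclusive "writer".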
Steps to reproduce
I'm too lazy to write a small reproducer, but here is a cleaned-up log from the big application: