[LibOS] fork in multi-threaded app may fail #1156

Open
dimakuv opened this issue Feb 2, 2023 · 13 comments · May be fixed by #1601

dimakuv commented Feb 2, 2023

Description of the problem

A fork() in a multithreaded application may fail on sending PAL handles, here:

ret = PalSendHandle(stream, entries[i]->handle);

This is because the checkpoint happens in two steps:

  1. Calling migrate_func().
    • In migrate_func(), the most important checkpointing part is BEGIN_CP_FUNC(thread) -> BEGIN_CP_FUNC(handle_map), at which point handle_map->lock is acquired; this lock guards the sub-checkpointing logic for all handles known to the process.
    • This lock is what keeps the state of all checkpointed handles largely consistent (not completely, but at least the handles cannot be freed by other threads in the middle of migrate_func()).
  2. Calling send_handles_on_stream()
    • This performs PalSendHandle() on each PAL handle collected in the first step.
    • PalSendHandle() needs to peek inside the PAL handle object, and this is where our bug happens!

The bug in question happens in multithreaded apps, where:

  • One thread does fork()
  • Another thread does e.g. close(<some-fd>) in between the two steps above

The close(<some-fd>) closes the PAL handle and destroys it. Then the first thread starts step 2, and PalSendHandle() is handed a dangling pointer (where a PAL handle used to live), so the function errors out.

This is a classic use-after-free instance.
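
Schematically, the race looks like this (a condensed sketch; the names come from the issue text above, and the signatures are simplified stand-ins for the real ones):

```c
struct checkpoint_store { int dummy; };  /* stand-in for the real store */

/* Step 1: handle_map->lock is held only inside this call; raw PAL handle
 * pointers get recorded into the store. */
static int migrate_func(struct checkpoint_store* store) {
    (void)store;
    return 0;
}

/* Step 2: PalSendHandle() dereferences each recorded handle pointer. */
static int send_handles_on_stream(struct checkpoint_store* store) {
    (void)store;
    return 0;
}

int create_process_and_send_checkpoint(void) {
    struct checkpoint_store store = {0};
    int ret = migrate_func(&store);
    if (ret < 0)
        return ret;
    /* <-- race window: another thread may close() an FD here, destroying a
     *     PAL handle whose pointer is already recorded in the store */
    return send_handles_on_stream(&store);
}
```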

Additional materials:


The proposed solution is to use a global RW lock around checkpointing:

  1. Upon entering the Gramine codebase (in the syscall-enter path), every thread takes the R lock.
  2. Upon exiting the Gramine codebase (in the syscall-return path), the thread releases the R lock.
    • Here I'm a bit fuzzy on the details: what happens e.g. on signal handling? Do we always have a single return path where we can plug in? Could there be re-entrance into another syscall in the signal handler? Then we'd have nesting, and taking the R lock again would fail?
    • UPDATE: @mkow thinks that blocking syscalls and signals will make my proposal too complex. @mkow says that the only reasonable flow for such multithreaded cases is fork + immediate execve, so maybe we can assume this (that the child immediately does execve and doesn't need all the parent's state) in our workaround for this issue.
  3. In create_process_and_send_checkpoint(), right before migrate_func(), we:
    a. Release the R lock (because we are the same thread that grabbed it on syscall-enter).
    b. Grab the W lock. (Here the thread waits until all other threads are done with their syscalls.)
    c. On create_process_and_send_checkpoint() exit, release the W lock.

I think this solution is the cleanest and provides a consistent view of the Gramine state (though not a consistent view of the app state; there we can't do anything, that's a bug in the app).
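
A minimal sketch of this locking scheme, using a POSIX rwlock purely for illustration (Gramine has its own lock primitives; the hook names below are hypothetical):

```c
#include <pthread.h>

static pthread_rwlock_t g_checkpoint_lock = PTHREAD_RWLOCK_INITIALIZER;

/* step 1: every thread takes the R lock on syscall entry */
void on_syscall_enter(void) {
    pthread_rwlock_rdlock(&g_checkpoint_lock);
}

/* step 2: ...and releases it on syscall return */
void on_syscall_return(void) {
    pthread_rwlock_unlock(&g_checkpoint_lock);
}

int create_process_and_send_checkpoint(void) {
    /* steps 3a + 3b: trade our own R lock for the W lock; wrlock blocks
     * until all other threads have left their syscalls */
    pthread_rwlock_unlock(&g_checkpoint_lock);
    pthread_rwlock_wrlock(&g_checkpoint_lock);

    int ret = 0;
    /* ... migrate_func() and send_handles_on_stream() would run here, with
     * no other thread able to mutate or free handles ... */

    /* step 3c: release the W lock; re-take the R lock so the release on the
     * syscall-return path stays balanced (a detail the proposal leaves open) */
    pthread_rwlock_unlock(&g_checkpoint_lock);
    pthread_rwlock_rdlock(&g_checkpoint_lock);
    return ret;
}
```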

Steps to reproduce

I'm too lazy to write a small reproducer (an untested sketch of a possible one follows the log), but here is a cleaned-up log from the big application:

[P1:T5:app] trace: ---- socket(INET, SOCK_NONBLOCK|SOCK_CLOEXEC|DGRAM, 0) = 19
[P1:T5:app] trace: ---- epoll_ctl(9, ADD, 19, {.events=EPOLLIN, .data=0x13}) = 0x0

[P1:T1:app] trace: ---- clone(CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|[SIGCHLD], 0, 0, 0x1a8078e4b350, 0) ...
[P1:T1:app] debug: allocating checkpoint store (size = 67108864, reserve = 33554432)
[P1:T1:app] debug: complete checkpointing data
[P1:T1:app] debug: checkpoint of 83288 bytes created

debug: Closing PAL Handle ffc12bb0
[P1:T5:app] trace: ---- close(19) = 0x0
[P1:T5:app] trace: ---- epoll_ctl(9, DEL, 19, {.events=EPOLLIN, .data=0x13}) = -9

[P1:T1:app] debug: Sending 33 PAL Handles
[P1:T1:app] debug: Sending 0 of 33 PAL Handles
...
[P1:T1:app] debug: Sending 17 of 33 PAL Handles
debug: In PalSendHandle ffc12bb0 of type 4290846496
error: ALERT: handle_serialize failed -4, handle type is 4290846496
[P1:T1:app] error: error -22 while sending PAL handle number 17
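
For illustration, an untested sketch of what such a reproducer might look like: one thread churns FDs in a loop while the main thread repeatedly forks, widening the window between migrate_func() and send_handles_on_stream():

```c
#include <fcntl.h>
#include <pthread.h>
#include <sys/wait.h>
#include <unistd.h>

/* Permanently open and close FDs, so that with some luck a PAL handle gets
 * destroyed between checkpoint creation and handle sending. */
static void* churn_fds(void* arg) {
    (void)arg;
    for (;;) {
        int fd = open("/dev/null", O_RDONLY);
        if (fd >= 0)
            close(fd);
    }
    return NULL;
}

int main(void) {
    pthread_t th;
    pthread_create(&th, NULL, churn_fds, NULL);

    for (int i = 0; i < 1000; i++) {
        pid_t pid = fork();
        if (pid == 0)
            _exit(0);  /* child exits immediately; we only care about fork */
        if (pid > 0)
            waitpid(pid, NULL, 0);
    }
    return 0;
}
```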
@kailun-qin

@dimakuv I can look into this. Thanks!


dimakuv commented Mar 13, 2023

UPDATE: @mkow thinks that blocking syscalls and signals will make my proposal too complex. @mkow says that the only reasonable flow for such multithreaded cases is fork + immediate execve, so maybe we can assume this (that the child immediately does execve and doesn't need all the parent's state) in our workaround for this issue.

So I think this assumption simplifies the solution a lot:

  1. We don't need RW lock at all, and we don't need to modify anything on the syscall emulation path in Gramine.
  2. During fork emulation, we check whether this is a normal single-threaded program (in this case, we do what we did before in Gramine); but if this is a multi-threaded program, then we do a minimal version of checkpoint-and-restore (see the sketch after this list):
    a. We do not checkpoint any LibOS handles or FDs (well, maybe we'll need to checkpoint e.g. the stdout/stderr FDs? but I don't think so). This minimal logic alone already solves our initial bug.
    b. We do not checkpoint most of the VMAs (memory regions). Maybe we only checkpoint the global memory (BSS + data segments) and the current-thread's stack.
  3. The restore logic on the child doesn't need to be modified -- the child will consume this minimal checkpoint, restore itself to the state of the parent's thread (stack + some global vars + maybe environment variables + registers), and that should be more than enough to perform execve() immediately.
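
A hypothetical sketch of the branch this would add to fork emulation (all names below are made up for illustration, not actual Gramine APIs):

```c
/* Hypothetical helpers; not actual Gramine functions. */
long get_process_thread_count(void);   /* how many live threads we have */
long do_full_fork(void);               /* the existing full checkpoint path */
long do_minimal_fork_for_execve(void); /* the new minimal checkpoint path */

long emulate_fork(void) {
    if (get_process_thread_count() == 1) {
        /* Single-threaded: the usual full checkpoint-and-restore. */
        return do_full_fork();
    }
    /* Multi-threaded: skip LibOS handles/FDs and most VMAs; checkpoint only
     * globals (data + BSS) and the calling thread's stack and registers,
     * assuming the child will immediately execve(). */
    return do_minimal_fork_for_execve();
}
```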


mkow commented Mar 13, 2023

Hmm, I'm not sure this was my answer, or rather, not exactly this. Or maybe I explained it badly; if so, I'm sorry.

Problems:

  • "the only reasonable flow for such multithreaded cases is fork + immediate execve" - that's right, but there may be a few trivial syscalls in between, we'd need to check what most apps do.
  • "that child immediately does execve and doesn't need all the parent's state" / "We do not checkpoint any LibOS handles or FDs" - hmm, I don't agree with this, even after execve() some state is needed, e.g. descriptors.
  • "We do not checkpoint most of the VMAs" - it's rather hard to say which memory regions will be used by the child between fork and execve? And if we miss anything then it will crash.

Unfortunately I don't remember what idea I had back then to solve it, but the solution described above doesn't seem to work. To be honest, I only remember saying that if you do fork() in a multithreaded app then you do execve() right after; I don't remember drawing any conclusions / making suggestions on how to use that to solve this particular bug.


dimakuv commented Mar 14, 2023

  • "that child immediately does execve and doesn't need all the parent's state" / "We do not checkpoint any LibOS handles or FDs" - hmm, I don't agree with this, even after execve() some state is needed, e.g. descriptors.

Yep, I understand that. But my proposal assumes that after execve(), no application would really use any file descriptors (and other than 0/1/2, I haven't seen applications relying on such a behavior, e.g., continuing to use some pipes from the old image). I agree that this may lead to problems in corner cases, but my intuition says it should be fine to skip handles/FDs.

After all, the whole problem in this issue is that handles and their associated FDs may change state during the checkpointing process, and cause crashes. So my proposal solves this in a drastic way: don't checkpoint handles and their associated FDs. If this works for the cases that we've encountered with some Java workloads, then it should be good enough.

It's just like how we emulate vfork() via fork(), and it has worked absolutely fine for the last 5 years.


mkow commented Mar 14, 2023

I haven't seen applications relying on such a behavior

I think I've seen a bunch of them; it's a popular mechanism. I think our own loader uses it, gdb uses it, and probably many more apps do.

After all, the whole problem in this issue is that handles and their associated FDs may change state during the checkpointing process

This sounds like something which should be solvable in a clean way; we just need to spend more time thinking / designing it, IMO.

It's just like how we emulate vfork() via fork(), and it has worked absolutely fine for the last 5 years.

It's very different - at least at the POSIX level, fork() is semantically a superset of vfork(), just slower. So there shouldn't be any app abusing vfork()'s real semantics, and such emulation should be indistinguishable to the apps.


monavij commented Mar 15, 2023

Based on some of the links that Dmitrii shared above, my understanding is that it's not a safe behavior for applications unless they follow fork() with execv(). So if our loader is using it then we should fix it. Agree? But the solution that Dmitrii proposed above will get us covered for the issues we are seeing in most apps today, and we do need a solution soon. So why don't we implement the minimal solution now and revisit it later if needed?


mkow commented Mar 16, 2023

So if our loader is using it then we should fix it. Agree?

No, what our loader does was an example of descriptor inheritance being used; it's a normal thing to do on Linux. I gave it to show that we can't just drop most descriptors on fork(), even if it's followed by execve() right after.
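
For context, the pattern in question: a descriptor set up before execve() stays open across it, so the exec'd program uses it without knowing where it came from. A minimal example:

```c
#include <fcntl.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    int fd = open("out.txt", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return 1;

    pid_t pid = fork();
    if (pid == 0) {
        dup2(fd, STDOUT_FILENO);  /* fd 1 now refers to out.txt... */
        execlp("echo", "echo", "inherited across execve", (char*)NULL);
        _exit(127);               /* only reached if execlp() fails */
    }
    close(fd);
    waitpid(pid, NULL, 0);        /* echo's output lands in out.txt */
    return 0;
}
```

If fork emulation silently dropped descriptors beyond 0/1/2, the dup2() in the child would fail and the exec'd program would lose its output.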

So why don't we implement the minimal solution now and revisit it later if needed?

Because it breaks other apps which use these mechanisms, and it will be hard to debug them, because it won't be obvious why they crashed.


dimakuv commented Mar 16, 2023

After a bit of thinking, I agree with @mkow. Even supporting stdin/stdout/stderr in my proposal would require some hacks to checkpoint-and-restore exactly these FDs and their corresponding handles. And these FDs are definitely inherited by execve'd programs (e.g., in Bash-script scenarios). So I agree that we need proper checkpointing in this case anyway; then why not just make the whole mechanism correct from the beginning?

Unfortunately, there seems to be no simple solution to this...

@DemiMarie

What about trying to get upstreams to use posix_spawn(), and in the meantime failing the fork() call?


mkow commented May 5, 2023

But posix_spawn() is not a syscall, and Gramine works at the syscall layer. In glibc it's just a wrapper on top of clone and execve.
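
Roughly, glibc's posix_spawn() reduces to something like the following at the syscall level (heavily simplified; the real implementation uses clone with CLONE_VM|CLONE_VFORK and reports exec failures back to the parent), so Gramine would still observe a clone followed by an execve:

```c
#include <unistd.h>

/* Simplified userspace equivalent of posix_spawn(). */
pid_t spawn_like(const char* path, char* const argv[], char* const envp[]) {
    pid_t pid = fork();            /* Gramine sees a fork-style clone... */
    if (pid == 0) {
        execve(path, argv, envp);  /* ...immediately followed by execve */
        _exit(127);                /* only reached if execve() fails */
    }
    return pid;                    /* child pid, or -1 if fork failed */
}
```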

@DemiMarie

Use a patched glibc?


mkow commented May 5, 2023

What do you mean? What would that patch do?

Regardless, remember that we support unmodified apps, which don't have to use our modified glibc (and may even be statically linked).

@DemiMarie

What do you mean? What would that patch do?

Call a Gramine-specific posix_spawn implementation.

Regardless, remember that we support unmodified apps, which don't have to use our modified glibc (and may even be statically linked).

fork() in Gramine will never be fast, so bypassing it could be a very useful optimization.
