[LibOS] fork in multi-threaded app may fail #1156

Open
dimakuv opened this issue Feb 2, 2023 · 13 comments · May be fixed by #1601

dimakuv commented Feb 2, 2023

Description of the problem

A fork() in a multithreaded application may fail on sending PAL handles, here:

ret = PalSendHandle(stream, entries[i]->handle);

This is because the checkpoint happens in two steps:

  1. Calling migrate_func().
    • In migrate_func(), the most important checkpointing part is BEGIN_CP_FUNC(thread) -> BEGIN_CP_FUNC(handle_map), at which point handle_map->lock is acquired; this lock guards the sub-checkpointing logic for all handles known to the process.
    • This lock is what keeps the state of all checkpointed handles largely consistent (not completely, but at least the handles cannot be freed by other threads in the middle of migrate_func()).
  2. Calling send_handles_on_stream()
    • This performs PalSendHandle() on each PAL handle collected in the first step.
    • PalSendHandle() needs to peek inside the PAL handle object, and this is where our bug happens!

The bug in question happens in multithreaded apps, where:

  • One thread does fork()
  • Another thread does e.g. close(<some-fd>) in between the two steps above

The close(<some-fd>) closes the PAL handle and destroys it. Then the first thread starts step 2, and PalSendHandle() is handed a dangling pointer (where a PAL handle used to live), so the function errors out.

This is a classic use-after-free instance.
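
Schematically, the race looks like this (a condensed sketch; the names come from the issue text above, and the signatures are simplified stand-ins for the real ones):

```c
struct checkpoint_store { int dummy; };  /* stand-in for the real store */

/* Step 1: handle_map->lock is held only inside this call; raw PAL handle
 * pointers get recorded into the store. */
static int migrate_func(struct checkpoint_store* store) {
    (void)store;
    return 0;
}

/* Step 2: PalSendHandle() dereferences each recorded handle pointer. */
static int send_handles_on_stream(struct checkpoint_store* store) {
    (void)store;
    return 0;
}

int create_process_and_send_checkpoint(void) {
    struct checkpoint_store store = {0};
    int ret = migrate_func(&store);
    if (ret < 0)
        return ret;
    /* <-- race window: another thread may close() an FD here, destroying a
     *     PAL handle whose pointer is already recorded in the store */
    return send_handles_on_stream(&store);
}
```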

Additional materials:


The proposed solution is to use a global RW lock around checkpointing:

  1. Upon entering the Gramine codebase (in the syscall-enter path), every thread takes the R lock.
  2. Upon exiting the Gramine codebase (in the syscall-return path), the thread releases the R lock.
    • Here I'm a bit fuzzy on the details: what happens e.g. on signal handling? Do we always have a single return path where we can plug in? Could there be re-entrance into another syscall in the signal handler? Then we'd have nesting, and taking the R lock again would fail?
    • UPDATE: @mkow thinks that blocking syscalls and signals will make my proposal too complex. @mkow says that the only reasonable flow for such multithreaded cases is fork + immediate execve, so maybe we can assume this (that the child immediately does execve and doesn't need all the parent's state) in our workaround for this issue.
  3. In create_process_and_send_checkpoint(), right before migrate_func(), we:
    a. Release the R lock (because we are the same thread that grabbed it on syscall-enter).
    b. Grab the W lock. (Here the thread waits until all other threads are done with their syscalls.)
    c. On create_process_and_send_checkpoint() exit, release the W lock.

I think this solution is the cleanest and provides a consistent view of the Gramine state (though not a consistent view of the app state; there we can't do anything, that's a bug in the app).
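
A minimal sketch of this locking scheme, using a POSIX rwlock purely for illustration (Gramine has its own lock primitives; the hook names below are hypothetical):

```c
#include <pthread.h>

static pthread_rwlock_t g_checkpoint_lock = PTHREAD_RWLOCK_INITIALIZER;

/* step 1: every thread takes the R lock on syscall entry */
void on_syscall_enter(void) {
    pthread_rwlock_rdlock(&g_checkpoint_lock);
}

/* step 2: ...and releases it on syscall return */
void on_syscall_return(void) {
    pthread_rwlock_unlock(&g_checkpoint_lock);
}

int create_process_and_send_checkpoint(void) {
    /* steps 3a + 3b: trade our own R lock for the W lock; wrlock blocks
     * until all other threads have left their syscalls */
    pthread_rwlock_unlock(&g_checkpoint_lock);
    pthread_rwlock_wrlock(&g_checkpoint_lock);

    int ret = 0;
    /* ... migrate_func() and send_handles_on_stream() would run here, with
     * no other thread able to mutate or free handles ... */

    /* step 3c: release the W lock; re-take the R lock so the release on the
     * syscall-return path stays balanced (a detail the proposal leaves open) */
    pthread_rwlock_unlock(&g_checkpoint_lock);
    pthread_rwlock_rdlock(&g_checkpoint_lock);
    return ret;
}
```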

Steps to reproduce

I'm too lazy to write a small reproducer (an untested sketch of a possible one follows the log), but here is a cleaned-up log from the big application:

[P1:T5:app] trace: ---- socket(INET, SOCK_NONBLOCK|SOCK_CLOEXEC|DGRAM, 0) = 19
[P1:T5:app] trace: ---- epoll_ctl(9, ADD, 19, {.events=EPOLLIN, .data=0x13}) = 0x0

[P1:T1:app] trace: ---- clone(CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|[SIGCHLD], 0, 0, 0x1a8078e4b350, 0) ...
[P1:T1:app] debug: allocating checkpoint store (size = 67108864, reserve = 33554432)
[P1:T1:app] debug: complete checkpointing data
[P1:T1:app] debug: checkpoint of 83288 bytes created

debug: Closing PAL Handle ffc12bb0
[P1:T5:app] trace: ---- close(19) = 0x0
[P1:T5:app] trace: ---- epoll_ctl(9, DEL, 19, {.events=EPOLLIN, .data=0x13}) = -9

[P1:T1:app] debug: Sending 33 PAL Handles
[P1:T1:app] debug: Sending 0 of 33 PAL Handles
...
[P1:T1:app] debug: Sending 17 of 33 PAL Handles
debug: In PalSendHandle ffc12bb0 of type 4290846496
error: ALERT: handle_serialize failed -4, handle type is 4290846496
[P1:T1:app] error: error -22 while sending PAL handle number 17
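
For illustration, an untested sketch of what such a reproducer might look like: one thread churns FDs in a loop while the main thread repeatedly forks, widening the window between migrate_func() and send_handles_on_stream():

```c
#include <fcntl.h>
#include <pthread.h>
#include <sys/wait.h>
#include <unistd.h>

/* Permanently open and close FDs, so that with some luck a PAL handle gets
 * destroyed between checkpoint creation and handle sending. */
static void* churn_fds(void* arg) {
    (void)arg;
    for (;;) {
        int fd = open("/dev/null", O_RDONLY);
        if (fd >= 0)
            close(fd);
    }
    return NULL;
}

int main(void) {
    pthread_t th;
    pthread_create(&th, NULL, churn_fds, NULL);

    for (int i = 0; i < 1000; i++) {
        pid_t pid = fork();
        if (pid == 0)
            _exit(0);  /* child exits immediately; we only care about fork */
        if (pid > 0)
            waitpid(pid, NULL, 0);
    }
    return 0;
}
```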
@kailun-qin

@dimakuv I can look into this. Thanks!


dimakuv commented Mar 13, 2023

UPDATE: @mkow thinks that blocking syscalls and signals will make my proposal too complex. @mkow says that the only reasonable flow for such multithreaded cases is fork + immediate execve, so maybe we can assume this (that the child immediately does execve and doesn't need all the parent's state) in our workaround for this issue.

So I think this assumption simplifies the solution a lot:

  1. We don't need RW lock at all, and we don't need to modify anything on the syscall emulation path in Gramine.
  2. During fork emulation, we check whether this is a normal single-threaded program (in this case, we do what we did before in Gramine); but if this is a multi-threaded program, then we do a minimal version of checkpoint-and-restore (see the sketch after this list):
    a. We do not checkpoint any LibOS handles or FDs (well, maybe we'll need to checkpoint e.g. the stdout/stderr FDs? but I don't think so). This minimal logic alone already solves our initial bug.
    b. We do not checkpoint most of the VMAs (memory regions). Maybe we only checkpoint the global memory (BSS + data segments) and the current-thread's stack.
  3. The restore logic on the child doesn't need to be modified -- the child will consume this minimal checkpoint, restore itself to the state of the parent's thread (stack + some global vars + maybe environment variables + registers), and that should be more than enough to perform execve() immediately.
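
A hypothetical sketch of the branch this would add to fork emulation (all names below are made up for illustration, not actual Gramine APIs):

```c
/* Hypothetical helpers; not actual Gramine functions. */
long get_process_thread_count(void);   /* how many live threads we have */
long do_full_fork(void);               /* the existing full checkpoint path */
long do_minimal_fork_for_execve(void); /* the new minimal checkpoint path */

long emulate_fork(void) {
    if (get_process_thread_count() == 1) {
        /* Single-threaded: the usual full checkpoint-and-restore. */
        return do_full_fork();
    }
    /* Multi-threaded: skip LibOS handles/FDs and most VMAs; checkpoint only
     * globals (data + BSS) and the calling thread's stack and registers,
     * assuming the child will immediately execve(). */
    return do_minimal_fork_for_execve();
}
```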


mkow commented Mar 13, 2023

Hmm, I'm not sure this was my answer, or rather, not exactly this. Or maybe I explained it badly; if so, I'm sorry.

Problems:

  • "the only reasonable flow for such multithreaded cases is fork + immediate execve" - that's right, but there may be a few trivial syscalls in between, we'd need to check what most apps do.
  • "that child immediately does execve and doesn't need all the parent's state" / "We do not checkpoint any LibOS handles or FDs" - hmm, I don't agree with this, even after execve() some state is needed, e.g. descriptors.
  • "We do not checkpoint most of the VMAs" - it's rather hard to say which memory regions will be used by the child between fork and execve? And if we miss anything then it will crash.

Unfortunately I don't remember what idea I had back then to solve it, but the solution described above doesn't seem to work. To be honest, I only remember saying that if you do fork() in a multithreaded app then you do execve() right after; I don't remember drawing any conclusions / making suggestions on how to use that to solve this particular bug.


dimakuv commented Mar 14, 2023

  • "that child immediately does execve and doesn't need all the parent's state" / "We do not checkpoint any LibOS handles or FDs" - hmm, I don't agree with this, even after execve() some state is needed, e.g. descriptors.

Yep, I understand that. But my proposal assumes that after execve(), no application would really use any file descriptors (and other than 0/1/2, I haven't seen applications relying on such a behavior, e.g., continuing to use some pipes from the old image). I agree that this may lead to problems in corner cases, but my intuition says it should be fine to skip handles/FDs.

After all, the whole problem in this issue is that handles and their associated FDs may change state during the checkpointing process, and cause crashes. So my proposal solves this in a drastic way: don't checkpoint handles and their associated FDs. If this works for the cases that we've encountered with some Java workloads, then it should be good enough.

It's just like how we emulate vfork() via fork(), and it has worked absolutely fine for the last 5 years.


mkow commented Mar 14, 2023

I haven't seen applications relying on such a behavior

I think I've seen a bunch of them; it's a popular mechanism. I think our own loader uses it, gdb uses it, and probably many more apps do.

After all, the whole problem in this issue is that handles and their associated FDs may change state during the checkpointing process

This sounds like something which should be solvable in a clean way; we just need to spend more time thinking / designing it, IMO.

It's just like how we emulate vfork() via fork(), and it has worked absolutely fine for the last 5 years.

It's very different - at least at the POSIX level, fork() is semantically a superset of vfork(), just slower. So there shouldn't be any app abusing vfork()'s real semantics, and such emulation should be indistinguishable to the apps.


monavij commented Mar 15, 2023

Based on some of the links that Dmitrii shared above, my understanding is that it's not a safe behavior for applications unless they follow fork() with execv(). So if our loader is using it then we should fix it. Agree? But the solution that Dmitrii proposed above will get us covered for the issues we are seeing in most apps today, and we do need a solution soon. So why don't we implement the minimal solution now and revisit it later if needed?


mkow commented Mar 16, 2023

So if our loader is using it then we should fix it. Agree?

No, what our loader does was an example of descriptor inheritance being used; it's a normal thing to do on Linux. I gave it to show that we can't just drop most descriptors on fork(), even if it's followed by execve() right after.
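
For context, the pattern in question: a descriptor set up before execve() stays open across it, so the exec'd program uses it without knowing where it came from. A minimal example:

```c
#include <fcntl.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    int fd = open("out.txt", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return 1;

    pid_t pid = fork();
    if (pid == 0) {
        dup2(fd, STDOUT_FILENO);  /* fd 1 now refers to out.txt... */
        execlp("echo", "echo", "inherited across execve", (char*)NULL);
        _exit(127);               /* only reached if execlp() fails */
    }
    close(fd);
    waitpid(pid, NULL, 0);        /* echo's output lands in out.txt */
    return 0;
}
```

If fork emulation silently dropped descriptors beyond 0/1/2, the dup2() in the child would fail and the exec'd program would lose its output.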

So why don't we implement the minimal solution now and revisit it later if needed?

Because it breaks other apps which use these mechanisms, and it will be hard to debug them, because it won't be obvious why they crashed.


dimakuv commented Mar 16, 2023

After a bit of thinking, I agree with @mkow. Even supporting stdin/stdout/stderr in my proposal would require some hacks to checkpoint-and-restore exactly these FDs and their corresponding handles. And these FDs are definitely inherited by execve'd programs (e.g., in Bash-script scenarios). So I agree that we need proper checkpointing in this case anyway; then why not just make the whole mechanism correct from the beginning?

Unfortunately, there seems to be no simple solution to this...

@DemiMarie

What about trying to get upstreams to use posix_spawn(), and in the meantime failing the fork() call?


mkow commented May 5, 2023

But posix_spawn() is not a syscall, and Gramine works at the syscall layer. In glibc it's just a wrapper on top of clone and execve.
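
Roughly, glibc's posix_spawn() reduces to something like the following at the syscall level (heavily simplified; the real implementation uses clone with CLONE_VM|CLONE_VFORK and reports exec failures back to the parent), so Gramine would still observe a clone followed by an execve:

```c
#include <unistd.h>

/* Simplified userspace equivalent of posix_spawn(). */
pid_t spawn_like(const char* path, char* const argv[], char* const envp[]) {
    pid_t pid = fork();            /* Gramine sees a fork-style clone... */
    if (pid == 0) {
        execve(path, argv, envp);  /* ...immediately followed by execve */
        _exit(127);                /* only reached if execve() fails */
    }
    return pid;                    /* child pid, or -1 if fork failed */
}
```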

@DemiMarie

Use a patched glibc?


mkow commented May 5, 2023

What do you mean? What would that patch do?

Regardless, remember that we support unmodified apps, which don't have to use our modified glibc (and may even be statically linked).

@DemiMarie

What do you mean? What would that patch do?

Call a Gramine-specific posix_spawn implementation.

Regardless, remember that we support unmodified apps, which don't have to use our modified glibc (and may even be statically linked).

fork() in Gramine will never be fast, so bypassing it could be a very useful optimization.
