Memory fault when running Spark on Gramine #1227

Closed
JaewonHur opened this issue Mar 11, 2023 · 11 comments

Comments

@JaewonHur
Contributor

Description of the problem

Hi,
I'm currently trying to run Spark on Gramine, but whenever I run the application, it raises a memory fault.
The memory fault occurs non-deterministically (though quite often at a specific PC), and I suspect it may be related to futexes.

I tried debugging to find the root cause, but I could not find any clue.

In detail, the memory fault frequently occurs in the Java compiler thread, which synchronizes its accesses using CodeCacheLock.
I assume the bug is related to synchronization.
Could anyone help me find a clue?

By the way, the first few memory faults at address 0x00000000 are not bugs; they are also raised when running the JVM in the native environment.

gramine-spark-stdout.txt
gramine-spark-trace.txt

Steps to reproduce

gramine version: v1.4.0
java version: openjdk-11
spark version: v3.3.2

If you want a minimized test case, please let me know.

Expected results

no memory fault

Actual results

memory fault

Gramine commit hash

v1.4.0

@JaewonHur
Contributor Author

The following code snippet raises a memory fault in the JVM.

import java.io.IOException;

class Main {
    public static void main(String[] args) throws IOException {
        ProcessBuilder builder = new ProcessBuilder("/usr/bin/ls");
        Process process = builder.start();
    }
}

In my case, the memory fault was triggered within 10 trials.
It seems that builder.start() (and the corresponding vfork and the instructions that follow it) causes the memory fault.
After commenting out builder.start(), the memory fault was not triggered in 100 trials.

I suspect that something goes wrong when Gramine checkpoints the parent process while multiple threads are running.

You can run the attached test case as follows:

unzip testcase.zip
cd testcase
make
gramine-direct java Main

testcase.zip
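
Since the fault only shows up once in several runs, a small harness can make the trial counts above easier to reproduce. The following is a minimal sketch, not part of the attached test case: the class name RepeatRunner and the method firstFailingTrial are hypothetical, and it assumes that a faulting run ends with a non-zero exit code.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.List;

public class RepeatRunner {
    /**
     * Runs the command up to maxTrials times and returns the 1-based index
     * of the first trial that exits with a non-zero code, or -1 if every
     * trial exits with 0. Assumption: a faulting run exits non-zero.
     */
    public static int firstFailingTrial(List<String> command, int maxTrials) {
        for (int trial = 1; trial <= maxTrials; trial++) {
            try {
                Process p = new ProcessBuilder(command)
                        .redirectErrorStream(true)                        // merge stderr into stdout
                        .redirectOutput(ProcessBuilder.Redirect.DISCARD)  // drop the child's output
                        .start();
                if (p.waitFor() != 0) {
                    return trial;
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for child", e);
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        if (args.length == 0) {
            System.out.println("usage: java RepeatRunner <command> [args...]");
            return;
        }
        int trial = firstFailingTrial(List.of(args), 100);
        System.out.println(trial < 0
                ? "no failure in 100 trials"
                : "first failure on trial " + trial);
    }
}
```

For example, run it natively as `java RepeatRunner gramine-direct java Main` from the testcase directory.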

@dimakuv

dimakuv commented Mar 13, 2023

@JaewonHur This looks like a duplicate of issue #1156?

Please write back if it looks like you're hitting the above issue; I'll then mark this one as a "duplicate".

Currently no one (to my knowledge) is working on solving this bug in Gramine, but we'll try to find some resources for it.

@aneessahib
Contributor

@TejaswineeL - please check if this is similar to what you are debugging.

@JaewonHur
Contributor Author

JaewonHur commented Mar 13, 2023

@dimakuv I'm not sure, but it does not seem to be related to #1156.

The bug was not triggered after modifying the JVM to use only fork() instead of vfork().
I ran some tests while modifying the JVM and glibc to use fork or vfork, and it seems the wrapper around the fork system call (provided by glibc) helps mitigate the bug.
glibc does not provide any wrapper for vfork.

Could this be an issue with the different semantics of fork and vfork?
(Gramine handles both the same way.)
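
One way to test the fork-versus-vfork hypothesis without rebuilding the JVM might be OpenJDK's `jdk.lang.Process.launchMechanism` system property. As far as I know, it accepts FORK, VFORK, and POSIX_SPAWN on Linux and defaults to VFORK in JDK 11, but treat that as an assumption to verify against your JDK build. The sketch below is not from the attached test case; the class name LaunchMechanismCheck and the helper describe are hypothetical.

```java
import java.io.IOException;
import java.util.Locale;

public class LaunchMechanismCheck {
    // System property consulted by OpenJDK's process launcher (assumption: JDK 11 on Linux).
    static final String PROP = "jdk.lang.Process.launchMechanism";

    /** Returns a printable name for the configured mechanism. */
    public static String describe(String configured) {
        return configured == null
                ? "JDK default"
                : configured.toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        System.out.println("launch mechanism: " + describe(System.getProperty(PROP)));

        // Same kind of child process as in the reproducer, but here we also wait for it.
        Process process = new ProcessBuilder("ls").inheritIO().start();
        System.out.println("child exit code: " + process.waitFor());
    }
}
```

Passing `-Djdk.lang.Process.launchMechanism=FORK` on the `java` command line (in place of plain `java Main`) selects the mechanism before any process is started, which avoids depending on when the JDK reads the property.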

@TejaswineeL
Contributor

@JaewonHur
In gramine-spark-trace.txt, I observed the following checkpoint-related logs:
Line 151180: (libos_checkpoint.c:725:receive_checkpoint_and_restore) debug: restored memory from checkpoint
Line 151181: (libos_checkpoint.c:392:receive_handles_on_stream) debug: receiving 227 PAL handles
Line 151182: (libos_checkpoint.c:358:restore_checkpoint) debug: restoring checkpoint at 0x7fff76000000 rebased from 0x7fff76000000
Line 151184: (libos_checkpoint.c:379:restore_checkpoint) [P2:T30:java] debug: successfully restored checkpoint at 0x7fff76000000 - 0x7fff761299a0

So there does not seem to be anything wrong with checkpointing.
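
For anyone repeating this check on their own trace, the lookup above can be sketched as a small filter that prints matching lines with their line numbers. This is only an illustration; the class name TraceGrep and the method matching are hypothetical, and the search string is whatever you want to look for (for example libos_checkpoint.c).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class TraceGrep {
    /** Returns every line containing needle, prefixed with its 1-based line number. */
    public static List<String> matching(List<String> lines, String needle) {
        List<String> hits = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            if (lines.get(i).contains(needle)) {
                hits.add("Line " + (i + 1) + ": " + lines.get(i));
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.out.println("usage: java TraceGrep <trace-file> <text>");
            return;
        }
        // Reads the whole trace into memory; fine for a sketch, not for huge logs.
        for (String hit : matching(Files.readAllLines(Path.of(args[0])), args[1])) {
            System.out.println(hit);
        }
    }
}
```

For example: `java TraceGrep gramine-spark-trace.txt libos_checkpoint.c`.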

@JaewonHur
Contributor Author

Sorry for the confusion; it does not seem to be related to checkpointing.

@llly
Contributor

llly commented Mar 17, 2023

I couldn't reproduce this fault using testcase.zip, with either Gramine v1.4 (as in your log) or the latest master.

@dimakuv

dimakuv commented Mar 17, 2023

@llly Thanks for checking! So the test case just worked, without any issues?

@JaewonHur
Contributor Author

When Java was built without debug enabled, it took longer to trigger the bug (about 50 trials).
With debug enabled, however, the bug was triggered in about 10 trials.

stdout-log.txt
stderr-log.txt

@dimakuv

dimakuv commented Mar 20, 2023

@llly Could it be that you didn't run the test sufficiently many times to trigger this bug?

@dimakuv

dimakuv commented Sep 25, 2024

Closing this issue, as it is 1.5 years old and there have been no follow-ups.

@dimakuv dimakuv closed this as completed Sep 25, 2024

5 participants