Ubuntu 22.04 RDMAV_FORK_SAFE issue #81

Open
HazemAwadallah opened this issue Nov 6, 2024 · 0 comments

Running DLIO Benchmark on Ubuntu 22.04 with EFA fails with a fork() compatibility error, and the job crashes even when RDMAV_FORK_SAFE=1 is set. The issue does not occur on other OS versions, which suggests an Ubuntu 22.04-specific compatibility problem. The failure is intermittent and is most reproducible when running the unet3d workload.

root@sped:/home/sped# dmidecode -t system
# dmidecode 3.3
Getting SMBIOS data from sysfs.
SMBIOS 3.3.0 present.

Handle 0x0100, DMI type 1, 27 bytes
System Information
Manufacturer: Dell Inc.
Product Name: PowerEdge R7525
Version: Not Specified
Serial Number: 8HM86S3
UUID: 4c4c4544-0048-4d10-8038-b8c04f365333
Wake-up Type: Power Switch
SKU Number: SKU=08FF;ModelName=PowerEdge R7525
Family: PowerEdge

Handle 0x0C00, DMI type 12, 5 bytes
System Configuration Options
Option 1: NVRAM_CLR: Clear user settable NVRAM areas and set defaults
Option 2: PWRD_EN: Close to enable password

Handle 0x2000, DMI type 32, 11 bytes
System Boot Information
Status: No errors detected

root@sped:/home/sped# cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.4 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.4 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
root@sped:/home/sped# uname -snr
Linux sped 6.5.0-15-generic

Details:

Environment:

OS: Ubuntu 22.04
MPI Library: mpi4py with MPI.COMM_WORLD
Networking: Elastic Fabric Adapter (EFA) with libfabric
Benchmark: DLIO Benchmark (I/O benchmark for deep learning applications)

Error Message:

The program crashes with an error message advising the use of RDMAV_FORK_SAFE because fork() is not safe with the libfabric EFA provider, yet setting this variable to 1 does not prevent the crash on Ubuntu 22.04.
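
One thing that may be worth ruling out is whether the variable is present in each rank's environment before the fabric stack initializes. libibverbs reads RDMAV_FORK_SAFE when it is initialized, so a value set later in the process has no effect. The sketch below sets a default before mpi4py is imported and prints what the process actually sees. The helper name `ensure_fork_safe` is hypothetical and not part of DLIO or mpi4py.

```python
import os

FORK_SAFE_VAR = "RDMAV_FORK_SAFE"


def ensure_fork_safe(environ=None):
    """Set RDMAV_FORK_SAFE=1 unless a value is already present.

    Returns the value that is in effect afterwards. `environ` defaults to
    the real process environment; a plain dict can be passed for testing.
    """
    env = os.environ if environ is None else environ
    env.setdefault(FORK_SAFE_VAR, "1")
    return env[FORK_SAFE_VAR]


# Call this at the very top of the entry script, before the MPI bindings
# are imported, so that MPI initialization starts with the variable set.
effective = ensure_fork_safe()
print(f"{FORK_SAFE_VAR}={effective} (pid {os.getpid()})")

# Assumption: importing mpi4py.MPI is the point where MPI gets initialized
# in this application, so the import belongs after the call above.
# from mpi4py import MPI
```

If every rank prints the variable and the abort still occurs, the environment is probably not the problem and the library versions become the more likely suspect.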

Steps Taken:

Exported RDMAV_FORK_SAFE=1.
Passed RDMAV_FORK_SAFE directly to the MPI ranks through mpiexec.
Verified that the libfabric and mpi4py versions are compatible.
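
For reference, how a variable is forwarded through mpiexec depends on the MPI implementation, which the report does not name. A minimal sketch of the two common forms follows: Open MPI accepts `-x NAME=value`, while MPICH's Hydra launcher accepts `-genv NAME value`. The function name `launch_command` and the program arguments in the usage lines are hypothetical placeholders.

```python
def launch_command(flavor, nprocs, program):
    """Build an mpiexec argument list that forwards RDMAV_FORK_SAFE=1.

    flavor  -- "openmpi" or "mpich" (assumed launcher family)
    nprocs  -- number of ranks to start
    program -- list of program arguments to run on each rank
    """
    if flavor == "openmpi":
        # Open MPI: export a variable (with a value) to all ranks.
        env_args = ["-x", "RDMAV_FORK_SAFE=1"]
    elif flavor == "mpich":
        # MPICH / Hydra: set a variable globally for all ranks.
        env_args = ["-genv", "RDMAV_FORK_SAFE", "1"]
    else:
        raise ValueError(f"unknown MPI flavor: {flavor!r}")
    return ["mpiexec", "-n", str(nprocs), *env_args, *program]


# Hypothetical program; substitute the real benchmark entry point.
print(" ".join(launch_command("openmpi", 4, ["python", "my_benchmark.py"])))
print(" ".join(launch_command("mpich", 4, ["python", "my_benchmark.py"])))
```

Using the wrong form for the installed launcher can leave the variable unset on remote ranks, so it is worth confirming which implementation `mpiexec` belongs to.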

Possible Causes and Observations:

The issue seems specific to Ubuntu 22.04, possibly due to:
Kernel and glibc version incompatibilities.
Stricter memory management in Ubuntu 22.04’s libfabric or EFA library versions.
The profiling and data loading sections triggering unexpected fork() calls.
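
To test the last hypothesis, a minimal sketch along the following lines could record where fork() is requested from Python code. It uses the standard library's `os.register_at_fork`, whose hooks run for `os.fork()` and fork-based `multiprocessing`. A typical `subprocess` launch does not trigger them, so this covers only part of the possible call sites. The names `fork_sites` and `install_fork_tracer` are hypothetical.

```python
import os
import traceback

# Stack traces captured in the parent just before each fork() request.
fork_sites = []


def _record_fork_site():
    # Runs in the parent process immediately before the fork happens.
    fork_sites.append("".join(traceback.format_stack(limit=8)))


def install_fork_tracer():
    """Register a before-fork hook that records the calling stack."""
    os.register_at_fork(before=_record_fork_site)
```

Calling `install_fork_tracer()` early in the benchmark and printing `fork_sites` at the end of a run would show whether the profiler or the data loader is the one creating child processes.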

A process has executed an operation involving a call
to the fork() system call to create a child process.

As a result, the libfabric EFA provider is operating in
a condition that could result in memory corruption or
other system errors.

For the libfabric EFA provider to work safely when fork()
is called, you will need to set the following environment
variable:
RDMAV_FORK_SAFE

However, setting this environment variable can result in
signficant performance impact to your application due to
increased cost of memory registration.

You may want to check with your application vendor to see
if an application-level alternative (of not using fork)
exists.

Your job will now abort.
A process has executed an operation involving a call
to the fork() system call to create a child process.

As a result, the libfabric EFA provider is operating in
a condition that could result in memory corruption or
other system errors.

For the libfabric EFA provider to work safely when fork()
is called, you will need to set the following environment
variable:
RDMAV_FORK_SAFE

However, setting this environment variable can result in
signficant performance impact to your application due to
increased cost of memory registration.

You may want to check with your application vendor to see
if an application-level alternative (of not using fork)
exists.

Your job will now abort.
[INFO] 2024-10-26T02:59:47.952552 Starting block 1 [/home/sped/perf/mlperfv1/storage/dlio_benchmark/dlio_benchmark/utils/statscounter.py:264]
A process has executed an operation involving a call
to the fork() system call to create a child process.

As a result, the libfabric EFA provider is operating in
a condition that could result in memory corruption or
other system errors.

For the libfabric EFA provider to work safely when fork()
is called, you will need to set the following environment
variable:
RDMAV_FORK_SAFE

However, setting this environment variable can result in
signficant performance impact to your application due to
increased cost of memory registration.

You may want to check with your application vendor to see
if an application-level alternative (of not using fork)
exists.

Your job will now abort.
A process has executed an operation involving a call
to the fork() system call to create a child process.

As a result, the libfabric EFA provider is operating in
a condition that could result in memory corruption or
other system errors.

For the libfabric EFA provider to work safely when fork()
is called, you will need to set the following environment
variable:
RDMAV_FORK_SAFE

However, setting this environment variable can result in
signficant performance impact to your application due to
increased cost of memory registration.

You may want to check with your application vendor to see
if an application-level alternative (of not using fork)
exists.
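
The message's suggestion of an application-level alternative to fork() could be explored by starting worker processes with the "spawn" start method, which launches a fresh interpreter for each worker instead of forking a copy of the parent. The sketch below is an assumption about what such an alternative might look like, not something DLIO is known to do, and the function names are hypothetical. Whether the EFA provider still objects to how the child process is created has not been verified here.

```python
import multiprocessing as mp


def spawn_context():
    """Return a multiprocessing context that starts workers via 'spawn'."""
    return mp.get_context("spawn")


def map_in_spawned_workers(func, items, num_workers=2):
    """Apply func to items in a pool of spawned (not forked) workers.

    func must be defined at module top level so that it can be pickled
    and imported by the freshly started worker interpreters.
    """
    ctx = spawn_context()
    with ctx.Pool(processes=num_workers) as pool:
        return pool.map(func, items)
```

If the data loader in use is PyTorch's `DataLoader`, its `multiprocessing_context` parameter accepts such a context, which may be a less invasive way to try the same idea.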
