Running DLIO Benchmark on Ubuntu 22.04 with EFA results in a fork() compatibility error, causing crashes even when RDMAV_FORK_SAFE=1 is set. This issue does not occur on other OS versions, suggesting an Ubuntu 22.04-specific compatibility problem. The crash is intermittent and is most readily reproduced with the unet3d workload.
root@sped:/home/sped# dmidecode -t system
dmidecode 3.3
Getting SMBIOS data from sysfs.
SMBIOS 3.3.0 present.
Handle 0x0100, DMI type 1, 27 bytes
System Information
Manufacturer: Dell Inc.
Product Name: PowerEdge R7525
Version: Not Specified
Serial Number: 8HM86S3
UUID: 4c4c4544-0048-4d10-8038-b8c04f365333
Wake-up Type: Power Switch
SKU Number: SKU=08FF;ModelName=PowerEdge R7525
Family: PowerEdge
Handle 0x0C00, DMI type 12, 5 bytes
System Configuration Options
Option 1: NVRAM_CLR: Clear user settable NVRAM areas and set defaults
Option 2: PWRD_EN: Close to enable password
Handle 0x2000, DMI type 32, 11 bytes
System Boot Information
Status: No errors detected
root@sped:/home/sped# cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.4 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.4 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
root@sped:/home/sped# uname -snr
Linux sped 6.5.0-15-generic
Details:
Environment:
OS: Ubuntu 22.04
MPI Library: mpi4py with MPI.COMM_WORLD
Networking: Elastic Fabric Adapter (EFA) with libfabric
Benchmark: DLIO Benchmark (I/O benchmark for deep learning applications)
Error Message:
The program aborts with a libfabric EFA provider message that advises setting RDMAV_FORK_SAFE because of fork() compatibility issues. Setting RDMAV_FORK_SAFE=1 does not resolve the crash on Ubuntu 22.04.
Steps Taken:
Exported RDMAV_FORK_SAFE=1.
Tested with mpiexec to pass RDMAV_FORK_SAFE directly to MPI.
Verified the compatibility of libfabric and mpi4py versions.
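One way to double-check the first two steps is to confirm that every MPI rank actually sees the variable in its environment. The following is a minimal sketch, not part of DLIO Benchmark: the helper names (`local_report`, `ranks_missing_var`, `gather_reports`) are hypothetical, and it treats only the value `1` as "set" because that is the value used in this report. It falls back to a single-process check if mpi4py cannot be imported.

```python
import os
import socket

VAR = "RDMAV_FORK_SAFE"


def local_report(rank):
    """Describe what this process sees for the variable (hypothetical helper)."""
    return {"rank": rank, "host": socket.gethostname(), "value": os.environ.get(VAR)}


def ranks_missing_var(reports):
    """Return the ranks whose environment does not have the variable set to "1"."""
    return [r["rank"] for r in reports if r["value"] != "1"]


def gather_reports():
    """Collect one report per rank on rank 0; other ranks get an empty list."""
    try:
        from mpi4py import MPI
    except ImportError:
        # Assumption: without mpi4py we can only inspect the current process.
        return [local_report(0)]
    comm = MPI.COMM_WORLD
    gathered = comm.gather(local_report(comm.Get_rank()), root=0)
    return gathered if gathered is not None else []


if __name__ == "__main__":
    reports = gather_reports()
    if reports:
        for r in reports:
            print(f"rank {r['rank']} on {r['host']}: {VAR}={r['value']!r}")
        print("ranks without the variable:", ranks_missing_var(reports))
```

Launching this script with the same mpiexec command line as the benchmark shows whether the export reaches remote ranks. Depending on the launcher, the variable may need to be forwarded explicitly (for example `-x RDMAV_FORK_SAFE=1` with Open MPI or `-genv RDMAV_FORK_SAFE 1` with MPICH-style launchers).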
Possible Causes and Observations:
The issue seems specific to Ubuntu 22.04, possibly due to:
Kernel and glibc version incompatibilities.
Stricter memory management in Ubuntu 22.04’s libfabric or EFA library versions.
Profiling and data loading sections might be triggering unexpected fork() calls.
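To test the last hypothesis, the Python-level fork() calls can be traced back to their call sites. The sketch below is an assumption-laden diagnostic, not DLIO code: it uses the standard `os.register_at_fork` hook to record a stack trace just before each `os.fork()`, which also covers worker processes created by multiprocessing's "fork" start method. Forks made by `subprocess` or by native libraries do not trigger this hook, so an empty result does not rule them out. The names `fork_sites`, `_record_fork_site`, and `demo_fork` are hypothetical.

```python
import os
import traceback

# Stack traces captured in the parent just before each Python-level fork.
fork_sites = []


def _record_fork_site():
    # Called by the interpreter right before os.fork() runs.
    fork_sites.append("".join(traceback.format_stack(limit=8)))


os.register_at_fork(before=_record_fork_site)


def demo_fork():
    """Fork once and reap the child, purely to show the hook firing."""
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child exits immediately without running cleanup handlers
    os.waitpid(pid, 0)


if __name__ == "__main__":
    demo_fork()
    print(f"captured {len(fork_sites)} fork site(s)")
    print(fork_sites[-1])
```

In practice the hook registration would go near the top of the benchmark's entry point, before the data loader and profiler are initialized, and `fork_sites` would be printed or logged at the point of failure.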
A process has executed an operation involving a call
to the fork() system call to create a child process.
As a result, the libfabric EFA provider is operating in
a condition that could result in memory corruption or
other system errors.
For the libfabric EFA provider to work safely when fork()
is called, you will need to set the following environment
variable:
RDMAV_FORK_SAFE
However, setting this environment variable can result in
signficant performance impact to your application due to
increased cost of memory registration.
You may want to check with your application vendor to see
if an application-level alternative (of not using fork)
exists.
Your job will now abort.
A process has executed an operation involving a call
to the fork() system call to create a child process.
As a result, the libfabric EFA provider is operating in
a condition that could result in memory corruption or
other system errors.
For the libfabric EFA provider to work safely when fork()
is called, you will need to set the following environment
variable:
RDMAV_FORK_SAFE
However, setting this environment variable can result in
signficant performance impact to your application due to
increased cost of memory registration.
You may want to check with your application vendor to see
if an application-level alternative (of not using fork)
exists.
Your job will now abort.
[INFO] 2024-10-26T02:59:47.952552 Starting block 1 [/home/sped/perf/mlperfv1/storage/dlio_benchmark/dlio_benchmark/utils/statscounter.py:264]
A process has executed an operation involving a call
to the fork() system call to create a child process.
As a result, the libfabric EFA provider is operating in
a condition that could result in memory corruption or
other system errors.
For the libfabric EFA provider to work safely when fork()
is called, you will need to set the following environment
variable:
RDMAV_FORK_SAFE
However, setting this environment variable can result in
signficant performance impact to your application due to
increased cost of memory registration.
You may want to check with your application vendor to see
if an application-level alternative (of not using fork)
exists.
Your job will now abort.
A process has executed an operation involving a call
to the fork() system call to create a child process.
As a result, the libfabric EFA provider is operating in
a condition that could result in memory corruption or
other system errors.
For the libfabric EFA provider to work safely when fork()
is called, you will need to set the following environment
variable:
RDMAV_FORK_SAFE
However, setting this environment variable can result in
signficant performance impact to your application due to
increased cost of memory registration.
You may want to check with your application vendor to see
if an application-level alternative (of not using fork)
exists.