mount:
IP:/ifs on /mnt/1/ifs type nfs (rw,relatime,vers=3,rsize=131072,wsize=524288,namlen=255,hard,forcerdirplus,proto=rdma,nconnect=24,port=20049,timeo=600,retrans=2,sec=sys,mountaddr=IP,mountvers=3,mountproto=tcp,local_lock=none,addr=IP)
h# cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.1 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.1 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
h# ./benchmark.sh run --hosts HOST --workload resnet50 --accelerator-type h100 --num-accelerators 2 --results-dir resultsdir-$(date +"%d-%m-%Y") --param dataset.num_files_train=1200 --param dataset.data_folder=/mnt/1/ifs/data/rosnet50_05_04_2024_x02
[INFO] 2024-04-08T16:08:03.406678 Profiling DLIO /root/aan/storage/resultsdir-08-04-2024/trace-0-of-2.pfw [/root/aan/storage/dlio_benchmark/dlio_benchmark/utils/config.py:189]
[INFO] 2024-04-08T16:08:03.407010 Running DLIO with 2 process(es) [/root/aan/storage/dlio_benchmark/dlio_benchmark/main.py:98]
[INFO] 2024-04-08T16:08:03.635388 Max steps per epoch: 1876 = 1251 * 1200 / 400 / 2 (samples per file * num files / batch size / comm size) [/root/aan/storage/dlio_benchmark/dlio_benchmark/main.py:322]
[INFO] 2024-04-08T16:08:07.053113 Starting epoch 1: 1876 steps expected [/root/aan/storage/dlio_benchmark/dlio_benchmark/utils/statscounter.py:128]
[INFO] 2024-04-08T16:08:07.053269 Starting block 1 [/root/aan/storage/dlio_benchmark/dlio_benchmark/utils/statscounter.py:198]
A process has executed an operation involving a call
to the fork() system call to create a child process.
As a result, the libfabric EFA provider is operating in
a condition that could result in memory corruption or
other system errors.
For the libfabric EFA provider to work safely when fork()
is called, you will need to set the following environment
variable:
RDMAV_FORK_SAFE
However, setting this environment variable can result in
signficant performance impact to your application due to
increased cost of memory registration.
You may want to check with your application vendor to see
if an application-level alternative (of not using fork)
exists.
Your job will now abort.
Your job will now abort.
[DLIO_PROFILER ERROR]: signal caught 6
/usr/local/lib/python3.10/dist-packages/dlio_profiler_py.cpython-310-x86_64-linux-gnu.so(+0x30325) [0x7f8db87ca325]
/lib/x86_64-linux-gnu/libc.so.6(+0x42520) [0x7f8e0f8fd520]
/lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c) [0x7f8e0f9519fc]
/lib/x86_64-linux-gnu/libc.so.6(raise+0x16) [0x7f8e0f8fd476]
/lib/x86_64-linux-gnu/libc.so.6(abort+0xd3) [0x7f8e0f8e37f3]
/lib/x86_64-linux-gnu/libfabric.so.1(+0x76b4e) [0x7f8db8a76b4e]
/lib/x86_64-linux-gnu/libc.so.6(+0xeaf48) [0x7f8e0f9a5f48]
/lib/x86_64-linux-gnu/libc.so.6(__libc_fork+0x71) [0x7f8e0f9a5711]
python3(+0x287a6e) [0x56136209ba6e]
python3(+0x157a3e) [0x561361f6ba3e]
python3(_PyEval_EvalFrameDefault+0x614a) [0x561361f5ccfa]
python3(_PyFunction_Vectorcall+0x7c) [0x561361f6e9fc]
python3(_PyEval_EvalFrameDefault+0x8ac) [0x561361f5745c]
python3(_PyObject_FastCallDictTstate+0xc4) [0x561361f63c14]
python3(+0x164a64) [0x561361f78a64]
python3(_PyObject_MakeTpCall+0x1fc) [0x561361f64a1c]
python3(_PyEval_EvalFrameDefault+0x64e6) [0x561361f5d096]
python3(_PyFunction_Vectorcall+0x7c) [0x561361f6e9fc]
python3(_PyEval_EvalFrameDefault+0x614a) [0x561361f5ccfa]
python3(_PyFunction_Vectorcall+0x7c) [0x561361f6e9fc]
python3(_PyEval_EvalFrameDefault+0x8ac) [0x561361f5745c]
python3(_PyFunction_Vectorcall+0x7c) [0x561361f6e9fc]
python3(_PyEval_EvalFrameDefault+0x8ac) [0x561361f5745c]
python3(_PyObject_FastCallDictTstate+0xc4) [0x561361f63c14]
python3(+0x164b05) [0x561361f78b05]
python3(_PyObject_MakeTpCall+0x1fc) [0x561361f64a1c]
python3(_PyEval_EvalFrameDefault+0x64e6) [0x561361f5d096]
python3(+0x1687f1) [0x561361f7c7f1]
python3(_PyEval_EvalFrameDefault+0x614a) [0x561361f5ccfa]
python3(+0x1687f1) [0x561361f7c7f1]
python3(_PyEval_EvalFrameDefault+0x198c) [0x561361f5853c]
python3(_PyFunction_Vectorcall+0x7c) [0x561361f6e9fc]
python3(_PyEval_EvalFrameDefault+0x8ac) [0x561361f5745c]
python3(_PyFunction_Vectorcall+0x7c) [0x561361f6e9fc]
python3(_PyEval_EvalFrameDefault+0x8ac) [0x561361f5745c]
python3(_PyFunction_Vectorcall+0x7c) [0x561361f6e9fc]
python3(PyObject_Call+0x122) [0x561361f7d492]
python3(_PyEval_EvalFrameDefault+0x2a27) [0x561361f595d7]
python3(+0x1687f1) [0x561361f7c7f1]
python3(_PyEval_EvalFrameDefault+0x198c) [0x561361f5853c]
^C[mpiexec@mpl078d] Sending Ctrl-C to processes as requested
[mpiexec@mpl078d] Press Ctrl-C again to force abort
[DLIO_PROFILER ERROR]: signal caught 2
[DLIO_PROFILER ERROR]: signal caught 2
^CCtrl-C caught... cleaning up processes
With RDMAV_FORK_SAFE=1 set, the benchmark runs without the exception, but no load is generated.
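In case it helps narrow this down, here is a minimal sketch for checking that RDMAV_FORK_SAFE actually reaches a child process. The helper names `env_with_fork_safe` and `child_sees_fork_safe` are made up for illustration and are not part of DLIO or benchmark.sh. It only shows that a directly launched child inherits the variable; whether mpiexec forwards it to every rank on every host is an assumption that would need to be checked separately.

```python
import os
import subprocess
import sys


def env_with_fork_safe(base=None):
    """Return a copy of the environment with RDMAV_FORK_SAFE=1 added."""
    env = dict(os.environ if base is None else base)
    env["RDMAV_FORK_SAFE"] = "1"
    return env


def child_sees_fork_safe(env):
    """Start a child Python process and return the value it sees."""
    result = subprocess.run(
        [
            sys.executable,
            "-c",
            "import os; print(os.environ.get('RDMAV_FORK_SAFE', 'unset'))",
        ],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    # Prints the value the child process observed.
    print(child_sees_fork_safe(env_with_fork_safe()))
```

The same one-line check (printing the variable from inside the launched process) could be run under mpiexec to see what each rank observes.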