[DLIO_PROFILER ERROR]: signal caught 6 if benchmark is run with RDMA mount and data_loader: dali #62

Open
alexander272272 opened this issue Apr 8, 2024 · 0 comments

alexander272272 commented Apr 8, 2024

mount:
IP:/ifs on /mnt/1/ifs type nfs (rw,relatime,vers=3,rsize=131072,wsize=524288,namlen=255,hard,forcerdirplus,proto=rdma,nconnect=24,port=20049,timeo=600,retrans=2,sec=sys,mountaddr=IP,mountvers=3,mountproto=tcp,local_lock=none,addr=IP)
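
For anyone reproducing this, the mount options that matter here (`proto=rdma`, `port=20049`, `nconnect=24`) can be pulled out of the `mount` output programmatically. The following is a minimal sketch, not part of the benchmark; the helper name `parse_mount_options` is hypothetical, and it assumes the line has the usual `... type nfs (opt,opt=value,...)` shape shown above.

```python
def parse_mount_options(line):
    """Split the parenthesised option list of a `mount` output line into a dict.

    Flags without a value (e.g. `rw`, `hard`) map to True; `key=value`
    options map to the value as a string.
    """
    opts = line[line.index("(") + 1 : line.rindex(")")]
    parsed = {}
    for item in opts.split(","):
        key, sep, value = item.partition("=")
        parsed[key] = value if sep else True
    return parsed


# The mount line from this report, with addresses redacted as "IP".
MOUNT_LINE = (
    "IP:/ifs on /mnt/1/ifs type nfs (rw,relatime,vers=3,rsize=131072,"
    "wsize=524288,namlen=255,hard,forcerdirplus,proto=rdma,nconnect=24,"
    "port=20049,timeo=600,retrans=2,sec=sys,mountaddr=IP,mountvers=3,"
    "mountproto=tcp,local_lock=none,addr=IP)"
)

if __name__ == "__main__":
    options = parse_mount_options(MOUNT_LINE)
    # Print only the transport-related options relevant to this issue.
    for key in ("proto", "port", "nconnect", "mountproto"):
        print(key, "=", options[key])
```

Note that the data path uses RDMA (`proto=rdma`) while the mount protocol itself is TCP (`mountproto=tcp`).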

h# cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.1 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.1 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy

h# ./benchmark.sh run --hosts HOST --workload resnet50 --accelerator-type h100 --num-accelerators 2 --results-dir resultsdir-$(date +"%d-%m-%Y") --param dataset.num_files_train=1200 --param dataset.data_folder=/mnt/1/ifs/data/rosnet50_05_04_2024_x02
[INFO] 2024-04-08T16:08:03.406678 Profiling DLIO /root/aan/storage/resultsdir-08-04-2024/trace-0-of-2.pfw [/root/aan/storage/dlio_benchmark/dlio_benchmark/utils/config.py:189]
[INFO] 2024-04-08T16:08:03.407010 Running DLIO with 2 process(es) [/root/aan/storage/dlio_benchmark/dlio_benchmark/main.py:98]
[INFO] 2024-04-08T16:08:03.635388 Max steps per epoch: 1876 = 1251 * 1200 / 400 / 2 (samples per file * num files / batch size / comm size) [/root/aan/storage/dlio_benchmark/dlio_benchmark/main.py:322]
[INFO] 2024-04-08T16:08:07.053113 Starting epoch 1: 1876 steps expected [/root/aan/storage/dlio_benchmark/dlio_benchmark/utils/statscounter.py:128]
[INFO] 2024-04-08T16:08:07.053269 Starting block 1 [/root/aan/storage/dlio_benchmark/dlio_benchmark/utils/statscounter.py:198]
A process has executed an operation involving a call
to the fork() system call to create a child process.

As a result, the libfabric EFA provider is operating in
a condition that could result in memory corruption or
other system errors.

For the libfabric EFA provider to work safely when fork()
is called, you will need to set the following environment
variable:
RDMAV_FORK_SAFE

However, setting this environment variable can result in
signficant performance impact to your application due to
increased cost of memory registration.

You may want to check with your application vendor to see
if an application-level alternative (of not using fork)
exists.

Your job will now abort.

Your job will now abort.
[DLIO_PROFILER ERROR]: signal caught 6
/usr/local/lib/python3.10/dist-packages/dlio_profiler_py.cpython-310-x86_64-linux-gnu.so(+0x30325) [0x7f8db87ca325]
/lib/x86_64-linux-gnu/libc.so.6(+0x42520) [0x7f8e0f8fd520]
/lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c) [0x7f8e0f9519fc]
/lib/x86_64-linux-gnu/libc.so.6(raise+0x16) [0x7f8e0f8fd476]
/lib/x86_64-linux-gnu/libc.so.6(abort+0xd3) [0x7f8e0f8e37f3]
/lib/x86_64-linux-gnu/libfabric.so.1(+0x76b4e) [0x7f8db8a76b4e]
/lib/x86_64-linux-gnu/libc.so.6(+0xeaf48) [0x7f8e0f9a5f48]
/lib/x86_64-linux-gnu/libc.so.6(__libc_fork+0x71) [0x7f8e0f9a5711]
python3(+0x287a6e) [0x56136209ba6e]
python3(+0x157a3e) [0x561361f6ba3e]
python3(_PyEval_EvalFrameDefault+0x614a) [0x561361f5ccfa]
python3(_PyFunction_Vectorcall+0x7c) [0x561361f6e9fc]
python3(_PyEval_EvalFrameDefault+0x8ac) [0x561361f5745c]
python3(_PyObject_FastCallDictTstate+0xc4) [0x561361f63c14]
python3(+0x164a64) [0x561361f78a64]
python3(_PyObject_MakeTpCall+0x1fc) [0x561361f64a1c]
python3(_PyEval_EvalFrameDefault+0x64e6) [0x561361f5d096]
python3(_PyFunction_Vectorcall+0x7c) [0x561361f6e9fc]
python3(_PyEval_EvalFrameDefault+0x614a) [0x561361f5ccfa]
python3(_PyFunction_Vectorcall+0x7c) [0x561361f6e9fc]
python3(_PyEval_EvalFrameDefault+0x8ac) [0x561361f5745c]
python3(_PyFunction_Vectorcall+0x7c) [0x561361f6e9fc]
python3(_PyEval_EvalFrameDefault+0x8ac) [0x561361f5745c]
python3(_PyObject_FastCallDictTstate+0xc4) [0x561361f63c14]
python3(+0x164b05) [0x561361f78b05]
python3(_PyObject_MakeTpCall+0x1fc) [0x561361f64a1c]
python3(_PyEval_EvalFrameDefault+0x64e6) [0x561361f5d096]
python3(+0x1687f1) [0x561361f7c7f1]
python3(_PyEval_EvalFrameDefault+0x614a) [0x561361f5ccfa]
python3(+0x1687f1) [0x561361f7c7f1]
python3(_PyEval_EvalFrameDefault+0x198c) [0x561361f5853c]
python3(_PyFunction_Vectorcall+0x7c) [0x561361f6e9fc]
python3(_PyEval_EvalFrameDefault+0x8ac) [0x561361f5745c]
python3(_PyFunction_Vectorcall+0x7c) [0x561361f6e9fc]
python3(_PyEval_EvalFrameDefault+0x8ac) [0x561361f5745c]
python3(_PyFunction_Vectorcall+0x7c) [0x561361f6e9fc]
python3(PyObject_Call+0x122) [0x561361f7d492]
python3(_PyEval_EvalFrameDefault+0x2a27) [0x561361f595d7]
python3(+0x1687f1) [0x561361f7c7f1]
python3(_PyEval_EvalFrameDefault+0x198c) [0x561361f5853c]
^C[mpiexec@mpl078d] Sending Ctrl-C to processes as requested
[mpiexec@mpl078d] Press Ctrl-C again to force abort
[DLIO_PROFILER ERROR]: signal caught 2
[DLIO_PROFILER ERROR]: signal caught 2
^CCtrl-C caught... cleaning up processes

With RDMAV_FORK_SAFE=1 set, the benchmark runs without the exception, but no load is generated.
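
One thing that may be worth ruling out is whether the variable actually reaches the child processes. Below is a minimal sketch of such a check using only the Python standard library. The helper names `env_with_fork_safe` and `child_sees_fork_safe` are hypothetical and not part of DLIO or libfabric. Whether `benchmark.sh` and `mpiexec` forward the variable to every rank is not visible in the log above and would need to be checked separately.

```python
import os
import subprocess
import sys


def env_with_fork_safe(base=None):
    """Return a copy of the environment with RDMAV_FORK_SAFE=1 set."""
    env = dict(os.environ if base is None else base)
    env["RDMAV_FORK_SAFE"] = "1"
    return env


def child_sees_fork_safe(env):
    """Start a child Python process with `env` and return the value it
    sees for RDMAV_FORK_SAFE (empty string if the variable is unset)."""
    result = subprocess.run(
        [sys.executable, "-c",
         "import os; print(os.environ.get('RDMAV_FORK_SAFE', ''))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    # Report what a child process inherits when the variable is set in the parent.
    print("child sees RDMAV_FORK_SAFE =", child_sees_fork_safe(env_with_fork_safe()))
```

This only confirms inheritance for a locally spawned process. It says nothing about why no load is generated once the abort is avoided.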
