Hi, thanks for sharing this great open-source project! When using multiple GPUs for evaluation, I found that partition tasks sometimes fail due to occupied ports.
Prerequisite
I have searched Issues and Discussions but cannot get the expected help.
I ran the above scripts on 8 GPUs, and partition tasks sometimes fail due to occupied ports. For example, with --max-partition-size 4000:
01/19 03:44:14 - OpenCompass - INFO - Partitioned into 45 tasks.
launch OpenICLInfer[llama-2-7b-hf/triviaqa_0] on GPU 0
launch OpenICLInfer[llama-2-7b-hf/triviaqa_1] on GPU 1
launch OpenICLInfer[llama-2-7b-hf/triviaqa_2] on GPU 2
launch OpenICLInfer[llama-2-7b-hf/triviaqa_3] on GPU 3
launch OpenICLInfer[llama-2-7b-hf/triviaqa_4] on GPU 4
launch OpenICLInfer[llama-2-7b-hf/triviaqa_5] on GPU 5
launch OpenICLInfer[llama-2-7b-hf/triviaqa_6] on GPU 6
launch OpenICLInfer[llama-2-7b-hf/triviaqa_7] on GPU 7
01/19 03:44:28 - OpenCompass - WARNING - task OpenICLInfer[llama-2-7b-hf/triviaqa_6] fail, see
./outputs/llama-2-7b-hf/triviaqa/20240119_034414/logs/infer/llama-2-7b-hf/triviaqa_6.out
100%|██████████| 45/45 [27:16<00:00, 36.37s/it]
launch OpenICLInfer[llama-2-7b-hf/triviaqa_32] on GPU 6
launch OpenICLInfer[llama-2-7b-hf/triviaqa_33] on GPU 7
launch OpenICLInfer[llama-2-7b-hf/triviaqa_34] on GPU 5
launch OpenICLInfer[llama-2-7b-hf/triviaqa_35] on GPU 2
launch OpenICLInfer[llama-2-7b-hf/triviaqa_36] on GPU 4
launch OpenICLInfer[llama-2-7b-hf/triviaqa_37] on GPU 3
launch OpenICLInfer[llama-2-7b-hf/triviaqa_38] on GPU 1
launch OpenICLInfer[llama-2-7b-hf/triviaqa_39] on GPU 0
launch OpenICLInfer[llama-2-7b-hf/triviaqa_40] on GPU 6
launch OpenICLInfer[llama-2-7b-hf/triviaqa_41] on GPU 7
launch OpenICLInfer[llama-2-7b-hf/triviaqa_42] on GPU 5
launch OpenICLInfer[llama-2-7b-hf/triviaqa_43] on GPU 3
launch OpenICLInfer[llama-2-7b-hf/triviaqa_44] on GPU 4
launch OpenICLInfer[llama-2-7b-hf/triviaqa_24] on GPU 0
launch OpenICLInfer[llama-2-7b-hf/triviaqa_23] on GPU 1
launch OpenICLInfer[llama-2-7b-hf/triviaqa_11] on GPU 6
launch OpenICLInfer[llama-2-7b-hf/triviaqa_10] on GPU 2
launch OpenICLInfer[llama-2-7b-hf/triviaqa_25] on GPU 7
launch OpenICLInfer[llama-2-7b-hf/triviaqa_22] on GPU 4
launch OpenICLInfer[llama-2-7b-hf/triviaqa_13] on GPU 5
launch OpenICLInfer[llama-2-7b-hf/triviaqa_15] on GPU 3
launch OpenICLInfer[llama-2-7b-hf/triviaqa_12] on GPU 0
launch OpenICLInfer[llama-2-7b-hf/triviaqa_26] on GPU 6
launch OpenICLInfer[llama-2-7b-hf/triviaqa_28] on GPU 1
launch OpenICLInfer[llama-2-7b-hf/triviaqa_8] on GPU 2
launch OpenICLInfer[llama-2-7b-hf/triviaqa_20] on GPU 7
launch OpenICLInfer[llama-2-7b-hf/triviaqa_14] on GPU 4
launch OpenICLInfer[llama-2-7b-hf/triviaqa_31] on GPU 5
launch OpenICLInfer[llama-2-7b-hf/triviaqa_18] on GPU 3
launch OpenICLInfer[llama-2-7b-hf/triviaqa_27] on GPU 1
launch OpenICLInfer[llama-2-7b-hf/triviaqa_19] on GPU 0
launch OpenICLInfer[llama-2-7b-hf/triviaqa_21] on GPU 2
launch OpenICLInfer[llama-2-7b-hf/triviaqa_17] on GPU 6
launch OpenICLInfer[llama-2-7b-hf/triviaqa_9] on GPU 7
launch OpenICLInfer[llama-2-7b-hf/triviaqa_30] on GPU 4
launch OpenICLInfer[llama-2-7b-hf/triviaqa_16] on GPU 5
launch OpenICLInfer[llama-2-7b-hf/triviaqa_29] on GPU 3
01/19 04:11:31 - OpenCompass - ERROR - /home/user/opencompass/opencompass/runners/base.py - summarize - 63 - OpenICLInfer[llama-2-7b-hf/triviaqa_6] failed with code 1
01/19 04:11:31 - OpenCompass - INFO - Partitioned into 1 tasks.
100%|██████████| 1/1 [00:13<00:00, 13.94s/it]
launch OpenICLEval[llama-2-7b-hf/triviaqa] on CPU
dataset version metric mode llama-2-7b-hf
--------- --------- -------- ------ ---------------
The output shows that partition triviaqa_6 failed, and ./outputs/llama-2-7b-hf/triviaqa/20240119_034414/logs/infer/llama-2-7b-hf/triviaqa_6.out shows:
RuntimeError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:42265 (errno: 98 - Address already in use). The server socket has failed to bind to 0.0.0.0:42265 (errno: 98 - Address already in use).
which indicates the failure is caused by an occupied port (a minimal reproduction of this failure mode is sketched after the log below). However, with --max-partition-size 8000, everything works fine:
01/19 03:47:33 - OpenCompass - INFO - Partitioned into 23 tasks.
100%|██████████| 23/23 [24:31<00:00, 63.99s/it]
launch OpenICLInfer[llama-2-7b-hf/triviaqa_0] on GPU 0
launch OpenICLInfer[llama-2-7b-hf/triviaqa_1] on GPU 1
launch OpenICLInfer[llama-2-7b-hf/triviaqa_2] on GPU 2
launch OpenICLInfer[llama-2-7b-hf/triviaqa_3] on GPU 3
launch OpenICLInfer[llama-2-7b-hf/triviaqa_4] on GPU 4
launch OpenICLInfer[llama-2-7b-hf/triviaqa_5] on GPU 5
launch OpenICLInfer[llama-2-7b-hf/triviaqa_6] on GPU 6
launch OpenICLInfer[llama-2-7b-hf/triviaqa_7] on GPU 7
launch OpenICLInfer[llama-2-7b-hf/triviaqa_20] on GPU 4
launch OpenICLInfer[llama-2-7b-hf/triviaqa_18] on GPU 3
launch OpenICLInfer[llama-2-7b-hf/triviaqa_17] on GPU 0
launch OpenICLInfer[llama-2-7b-hf/triviaqa_21] on GPU 1
launch OpenICLInfer[llama-2-7b-hf/triviaqa_19] on GPU 2
launch OpenICLInfer[llama-2-7b-hf/triviaqa_9] on GPU 5
launch OpenICLInfer[llama-2-7b-hf/triviaqa_10] on GPU 6
launch OpenICLInfer[llama-2-7b-hf/triviaqa_11] on GPU 7
launch OpenICLInfer[llama-2-7b-hf/triviaqa_8] on GPU 4
launch OpenICLInfer[llama-2-7b-hf/triviaqa_12] on GPU 3
launch OpenICLInfer[llama-2-7b-hf/triviaqa_22] on GPU 0
launch OpenICLInfer[llama-2-7b-hf/triviaqa_14] on GPU 7
launch OpenICLInfer[llama-2-7b-hf/triviaqa_13] on GPU 1
launch OpenICLInfer[llama-2-7b-hf/triviaqa_16] on GPU 2
launch OpenICLInfer[llama-2-7b-hf/triviaqa_15] on GPU 5
01/19 04:12:05 - OpenCompass - INFO - Partitioned into 1 tasks.
100%|██████████| 1/1 [00:14<00:00, 14.82s/it]
launch OpenICLEval[llama-2-7b-hf/triviaqa] on CPU
dataset version metric mode llama-2-7b-hf
--------- --------- -------- ------ ---------------
triviaqa 2121ce score gen 52.45
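For reference, errno 98 is the generic Linux "Address already in use" error raised by bind(). The mechanism behind the RuntimeError above can be reproduced outside OpenCompass with plain Python sockets; the sketch below is only an illustration of that mechanism, not project code:

```python
import socket

# Hold a port (the OS picks a free one just for this demo).
first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
first.bind(("0.0.0.0", 0))
first.listen()
port = first.getsockname()[1]

# A second bind to the same port fails the way the task log does.
second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    second.bind(("0.0.0.0", port))
except OSError as err:
    print(err)  # [Errno 98] Address already in use (on Linux)
finally:
    second.close()
    first.close()
```

When many partition tasks are launched close together and each draws a port from the same fixed range, two tasks can occasionally pick the same number, which would explain why the failure appears only sometimes and more often with smaller partitions (more concurrent tasks).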
Other information
I think this is not a server- or dataset-related problem. I checked that there were no residual processes occupying ports on the server before running the scripts. Besides, I tried adjusting the range of available port numbers, but the same problem still occurred. Furthermore, I also tested the mmlu dataset with different values of --max-partition-size, and the same problem occurred from time to time.
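A minimal sketch of one possible mitigation, assuming the per-task launcher honors the standard MASTER_PORT environment variable (this is an illustration, not OpenCompass's actual port-selection code, and find_free_port is a hypothetical helper): instead of drawing from a fixed range, ask the OS for a currently unused port by binding to port 0.

```python
import os
import socket

def find_free_port() -> int:
    # Bind to port 0 so the kernel assigns a currently unused port, then
    # release it. Note the small race: another process could still grab
    # the port between release and the actual launch.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        return s.getsockname()[1]

# Hypothetical usage: export the port before spawning the per-GPU task so
# torch.distributed binds to a port that was free a moment ago.
os.environ["MASTER_PORT"] = str(find_free_port())
```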
Any solution or fix would be appreciated!
BTW, the mmlu accuracy I measured exactly matches the value listed on the website. However, the triviaqa accuracy I measured (52.4) is slightly lower than the reported 52.8. I'm using the default settings; is this level of difference normal? Thanks in advance!