
[Bug] Partition tasks sometimes fail due to occupied ports #819

Open
sdc17 opened this issue Jan 18, 2024 · 1 comment

sdc17 commented Jan 18, 2024

Hi, thanks for sharing this great open-source project! When using multiple GPUs for evaluation, I found that partition tasks sometimes fail due to occupied ports.

Prerequisite

Type

I'm evaluating with the officially supported tasks/models/datasets.

Environment

python -c "import opencompass.utils;import pprint;pprint.pprint(dict(opencompass.utils.collect_env()))"
{'CUDA available': False,
 'GCC': 'gcc (GCC) 5.4.0',
 'MMEngine': '0.10.2',
 'OpenCV': '4.9.0',
 'PyTorch': '2.1.2+cu121',
 'PyTorch compiling details': 'PyTorch built with:\n'
                              '  - GCC 9.3\n'
                              '  - C++ Version: 201703\n'
                              '  - Intel(R) oneAPI Math Kernel Library Version '
                              '2022.2-Product Build 20220804 for Intel(R) 64 '
                              'architecture applications\n'
                              '  - Intel(R) MKL-DNN v3.1.1 (Git Hash '
                              '64f6bcbcbab628e96f33a62c3e975f8535a7bde4)\n'
                              '  - OpenMP 201511 (a.k.a. OpenMP 4.5)\n'
                              '  - LAPACK is enabled (usually provided by '
                              'MKL)\n'
                              '  - NNPACK is enabled\n'
                              '  - CPU capability usage: AVX512\n'
                              '  - Build settings: BLAS_INFO=mkl, '
                              'BUILD_TYPE=Release, CUDA_VERSION=12.1, '
                              'CUDNN_VERSION=8.9.2, '
                              'CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, '
                              'CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 '
                              '-fabi-version=11 -fvisibility-inlines-hidden '
                              '-DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO '
                              '-DLIBKINETO_NOROCTRACER -DUSE_FBGEMM '
                              '-DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK '
                              '-DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE '
                              '-O2 -fPIC -Wall -Wextra -Werror=return-type '
                              '-Werror=non-virtual-dtor -Werror=bool-operation '
                              '-Wnarrowing -Wno-missing-field-initializers '
                              '-Wno-type-limits -Wno-array-bounds '
                              '-Wno-unknown-pragmas -Wno-unused-parameter '
                              '-Wno-unused-function -Wno-unused-result '
                              '-Wno-strict-overflow -Wno-strict-aliasing '
                              '-Wno-stringop-overflow -Wno-psabi '
                              '-Wno-error=pedantic -Wno-error=old-style-cast '
                              '-Wno-invalid-partial-specialization '
                              '-Wno-unused-private-field '
                              '-Wno-aligned-allocation-unavailable '
                              '-Wno-missing-braces -fdiagnostics-color=always '
                              '-faligned-new -Wno-unused-but-set-variable '
                              '-Wno-maybe-uninitialized -fno-math-errno '
                              '-fno-trapping-math -Werror=format '
                              '-Werror=cast-function-type '
                              '-Wno-stringop-overflow, LAPACK_INFO=mkl, '
                              'PERF_WITH_AVX=1, PERF_WITH_AVX2=1, '
                              'PERF_WITH_AVX512=1, '
                              'TORCH_DISABLE_GPU_ASSERTS=ON, '
                              'TORCH_VERSION=2.1.2, USE_CUDA=ON, USE_CUDNN=ON, '
                              'USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, '
                              'USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, '
                              'USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, '
                              'USE_OPENMP=ON, USE_ROCM=OFF, \n',
 'Python': '3.10.13 | packaged by conda-forge | (main, Dec 23 2023, 15:36:39) '
           '[GCC 12.3.0]',
 'TorchVision': '0.16.2+cu121',
 'numpy_random_seed': 2147483648,
 'opencompass': '0.2.1+61fe873',
 'sys.platform': 'linux'}

Reproduces the problem - code/configuration sample

from mmengine.config import read_base

with read_base():
    # from .datasets.mmlu.mmlu_ppl_ac766d import mmlu_datasets 
    from .datasets.triviaqa.triviaqa_gen_2121ce import triviaqa_datasets

    from .models.hf_llama.hf_llama2_7b import models
    from .summarizers.example import summarizer

datasets = sum([v for k, v in locals().items() if k.endswith("_datasets") or k == 'datasets'], [])
work_dir = './outputs/llama-2-7b-hf'

Reproduces the problem - command or script

python run.py configs/eval_hf_llama2.py --max-partition-size 2000 # failure occurs
python run.py configs/eval_hf_llama2.py --max-partition-size 4000 # failure occurs
python run.py configs/eval_hf_llama2.py --max-partition-size 8000 # succeeds

Reproduces the problem - error message

I ran the above scripts on 8 GPUs, and the partition tasks sometimes fail due to occupied ports. For example, with --max-partition-size 4000:

01/19 03:44:14 - OpenCompass - INFO - Partitioned into 45 tasks.
launch OpenICLInfer[llama-2-7b-hf/triviaqa_0] on GPU 0
launch OpenICLInfer[llama-2-7b-hf/triviaqa_1] on GPU 1
launch OpenICLInfer[llama-2-7b-hf/triviaqa_2] on GPU 2
launch OpenICLInfer[llama-2-7b-hf/triviaqa_3] on GPU 3
launch OpenICLInfer[llama-2-7b-hf/triviaqa_4] on GPU 4
launch OpenICLInfer[llama-2-7b-hf/triviaqa_5] on GPU 5
launch OpenICLInfer[llama-2-7b-hf/triviaqa_6] on GPU 6
launch OpenICLInfer[llama-2-7b-hf/triviaqa_7] on GPU 7
01/19 03:44:28 - OpenCompass - WARNING - task OpenICLInfer[llama-2-7b-hf/triviaqa_6] fail, see
./outputs/llama-2-7b-hf/triviaqa/20240119_034414/logs/infer/llama-2-7b-hf/triviaqa_6.out
100%|██████████| 45/45 [27:16<00:00, 36.37s/it]  
launch OpenICLInfer[llama-2-7b-hf/triviaqa_32] on GPU 6
launch OpenICLInfer[llama-2-7b-hf/triviaqa_33] on GPU 7
launch OpenICLInfer[llama-2-7b-hf/triviaqa_34] on GPU 5
launch OpenICLInfer[llama-2-7b-hf/triviaqa_35] on GPU 2
launch OpenICLInfer[llama-2-7b-hf/triviaqa_36] on GPU 4
launch OpenICLInfer[llama-2-7b-hf/triviaqa_37] on GPU 3
launch OpenICLInfer[llama-2-7b-hf/triviaqa_38] on GPU 1
launch OpenICLInfer[llama-2-7b-hf/triviaqa_39] on GPU 0
launch OpenICLInfer[llama-2-7b-hf/triviaqa_40] on GPU 6
launch OpenICLInfer[llama-2-7b-hf/triviaqa_41] on GPU 7
launch OpenICLInfer[llama-2-7b-hf/triviaqa_42] on GPU 5
launch OpenICLInfer[llama-2-7b-hf/triviaqa_43] on GPU 3
launch OpenICLInfer[llama-2-7b-hf/triviaqa_44] on GPU 4
launch OpenICLInfer[llama-2-7b-hf/triviaqa_24] on GPU 0
launch OpenICLInfer[llama-2-7b-hf/triviaqa_23] on GPU 1
launch OpenICLInfer[llama-2-7b-hf/triviaqa_11] on GPU 6
launch OpenICLInfer[llama-2-7b-hf/triviaqa_10] on GPU 2
launch OpenICLInfer[llama-2-7b-hf/triviaqa_25] on GPU 7
launch OpenICLInfer[llama-2-7b-hf/triviaqa_22] on GPU 4
launch OpenICLInfer[llama-2-7b-hf/triviaqa_13] on GPU 5
launch OpenICLInfer[llama-2-7b-hf/triviaqa_15] on GPU 3
launch OpenICLInfer[llama-2-7b-hf/triviaqa_12] on GPU 0
launch OpenICLInfer[llama-2-7b-hf/triviaqa_26] on GPU 6
launch OpenICLInfer[llama-2-7b-hf/triviaqa_28] on GPU 1
launch OpenICLInfer[llama-2-7b-hf/triviaqa_8] on GPU 2
launch OpenICLInfer[llama-2-7b-hf/triviaqa_20] on GPU 7
launch OpenICLInfer[llama-2-7b-hf/triviaqa_14] on GPU 4
launch OpenICLInfer[llama-2-7b-hf/triviaqa_31] on GPU 5
launch OpenICLInfer[llama-2-7b-hf/triviaqa_18] on GPU 3
launch OpenICLInfer[llama-2-7b-hf/triviaqa_27] on GPU 1
launch OpenICLInfer[llama-2-7b-hf/triviaqa_19] on GPU 0
launch OpenICLInfer[llama-2-7b-hf/triviaqa_21] on GPU 2
launch OpenICLInfer[llama-2-7b-hf/triviaqa_17] on GPU 6
launch OpenICLInfer[llama-2-7b-hf/triviaqa_9] on GPU 7
launch OpenICLInfer[llama-2-7b-hf/triviaqa_30] on GPU 4
launch OpenICLInfer[llama-2-7b-hf/triviaqa_16] on GPU 5
launch OpenICLInfer[llama-2-7b-hf/triviaqa_29] on GPU 3
01/19 04:11:31 - OpenCompass - ERROR - /home/user/opencompass/opencompass/runners/base.py - summarize - 63 - OpenICLInfer[llama-2-7b-hf/triviaqa_6] failed with code 1
01/19 04:11:31 - OpenCompass - INFO - Partitioned into 1 tasks.
100%|██████████| 1/1 [00:13<00:00, 13.94s/it]
launch OpenICLEval[llama-2-7b-hf/triviaqa] on CPU 
dataset    version    metric    mode    llama-2-7b-hf
---------  ---------  --------  ------  ---------------

The output shows that partition triviaqa_6 failed, and ./outputs/llama-2-7b-hf/triviaqa/20240119_034414/logs/infer/llama-2-7b-hf/triviaqa_6.out contains:

RuntimeError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:42265 (errno: 98 - Address already in use). The server socket has failed to bind to 0.0.0.0:42265 (errno: 98 - Address already in use).

which indicates that the problem is caused by an occupied port. However, with --max-partition-size 8000, everything is fine:

01/19 03:47:33 - OpenCompass - INFO - Partitioned into 23 tasks.
100%|██████████| 23/23 [24:31<00:00, 63.99s/it]  
launch OpenICLInfer[llama-2-7b-hf/triviaqa_0] on GPU 0
launch OpenICLInfer[llama-2-7b-hf/triviaqa_1] on GPU 1
launch OpenICLInfer[llama-2-7b-hf/triviaqa_2] on GPU 2
launch OpenICLInfer[llama-2-7b-hf/triviaqa_3] on GPU 3
launch OpenICLInfer[llama-2-7b-hf/triviaqa_4] on GPU 4
launch OpenICLInfer[llama-2-7b-hf/triviaqa_5] on GPU 5
launch OpenICLInfer[llama-2-7b-hf/triviaqa_6] on GPU 6
launch OpenICLInfer[llama-2-7b-hf/triviaqa_7] on GPU 7
launch OpenICLInfer[llama-2-7b-hf/triviaqa_20] on GPU 4
launch OpenICLInfer[llama-2-7b-hf/triviaqa_18] on GPU 3
launch OpenICLInfer[llama-2-7b-hf/triviaqa_17] on GPU 0
launch OpenICLInfer[llama-2-7b-hf/triviaqa_21] on GPU 1
launch OpenICLInfer[llama-2-7b-hf/triviaqa_19] on GPU 2
launch OpenICLInfer[llama-2-7b-hf/triviaqa_9] on GPU 5
launch OpenICLInfer[llama-2-7b-hf/triviaqa_10] on GPU 6
launch OpenICLInfer[llama-2-7b-hf/triviaqa_11] on GPU 7
launch OpenICLInfer[llama-2-7b-hf/triviaqa_8] on GPU 4
launch OpenICLInfer[llama-2-7b-hf/triviaqa_12] on GPU 3
launch OpenICLInfer[llama-2-7b-hf/triviaqa_22] on GPU 0
launch OpenICLInfer[llama-2-7b-hf/triviaqa_14] on GPU 7
launch OpenICLInfer[llama-2-7b-hf/triviaqa_13] on GPU 1
launch OpenICLInfer[llama-2-7b-hf/triviaqa_16] on GPU 2
launch OpenICLInfer[llama-2-7b-hf/triviaqa_15] on GPU 5
01/19 04:12:05 - OpenCompass - INFO - Partitioned into 1 tasks.
100%|██████████| 1/1 [00:14<00:00, 14.82s/it]
launch OpenICLEval[llama-2-7b-hf/triviaqa] on CPU 
dataset    version    metric    mode      llama-2-7b-hf
---------  ---------  --------  ------  ---------------
triviaqa   2121ce     score     gen               52.45

Other information

I don't think this is a server- or dataset-related problem. I checked that there were no residual processes occupying ports on the server before running the scripts. Besides, I tried adjusting the range of available port numbers, but the same problem still occurred. Furthermore, I also tested the mmlu dataset with different values of --max-partition-size, and the same problem occurred from time to time.
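
For what it's worth, errno 98 (EADDRINUSE) is exactly what two concurrently launched tasks get when they race for the same port, even with no stale process around. A minimal illustration with plain Python sockets (not OpenCompass code, just a sketch of the collision):

import socket

# Two sockets competing for the same port: the second bind fails with
# errno 98 (Address already in use), the same error triviaqa_6 hit above.
first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
first.bind(("0.0.0.0", 0))  # let the OS pick a currently free port
port = first.getsockname()[1]
first.listen()

second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    second.bind(("0.0.0.0", port))
except OSError as err:
    print(err)  # [Errno 98] Address already in use
finally:
    first.close()
    second.close()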

Any solution or fix would be appreciated!
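
One possible direction (just a sketch on my side, since I haven't checked how OpenCompass actually assigns the rendezvous port): have each task ask the OS for a currently free port right before launch and pass it through MASTER_PORT (or torchrun's --master_port), instead of reusing a fixed or pre-computed one:

import socket
from contextlib import closing

def find_free_port() -> int:
    # Ask the OS for a port that is free right now. There is still a small
    # race window before the task actually binds it, so retrying on
    # errno 98 would make this more robust.
    with closing(socket.socket(socket.AF_INET, socket.SOCK_STREAM)) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        sock.bind(("", 0))
        return sock.getsockname()[1]

# Hypothetical wiring (not existing OpenCompass code): the runner could do
# something like env['MASTER_PORT'] = str(find_free_port()) per task.
print(find_free_port())

Alternatively, catching the "Address already in use" error and retrying the failed task with a new random port would probably also work.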

BTW, the mmlu accuracy I measured exactly matches the value listed on the website. However, the triviaqa accuracy I measured (52.4) is slightly lower than the reported one (52.8). I'm using the default settings, and I'm wondering if this level of difference is normal? Thanks in advance!

@sdc17 sdc17 changed the title [Bug] Partition tasks sometimes fails due to occupied ports [Bug] Partition tasks sometimes fail due to occupied ports Jan 18, 2024
@bittersweet1999
Collaborator

Hi, does the port occupation occur regularly or randomly?
