Test suite hangs with -DGPU_SOLVE #130

Open
jeanlucf22 opened this issue Dec 19, 2022 · 4 comments

Comments

@jeanlucf22
Contributor

When I build superlu_dist with -DGPU_SOLVE in the C flags, the test suite appears to hang after printing
.. B to X redistribute time 0.0001
.. Setup L-solve time 0.0000
.. L-solve time 0.0003
.. L-solve time (MAX) 0.0003
.. Setup U-solve time 0.0000

Test time = 1500.04 sec

(It seems to reach the time limit of 1500 seconds.)

With -DGPU_SOLVE removed from the build, the test suite runs fine.

@liuyangzhuan
Collaborator

How many MPI ranks are you using?

@jeanlucf22
Contributor Author

1

@jeanlucf22
Contributor Author

I can reproduce the issue on Summit at OLCF.
SuperLU build:


#!/bin/bash
module load gcc/9.3.0
module load parmetis/4.0.3
module load metis
module load cuda
module load cmake
module load essl

rm -rf build
mkdir build
cd build

export PARMETIS_ROOT=$OLCF_PARMETIS_ROOT
export METIS_DIR=$OLCF_METIS_ROOT
export CUDA_BIN_PATH=$CUDA_PATH
export CUDAToolkit_ROOT=$CUDA_PATH
export CMAKE_PREFIX_PATH=${CMAKE_PREFIX_PATH}:${OPENMPI_ROOT}

cmake .. \
  -DCMAKE_BUILD_TYPE=Release \
  -DTPL_ENABLE_LAPACKLIB=on \
  -DCMAKE_C_FLAGS="-std=c99 -fPIC -DPRNTlevel=1 -DPROFlevel=1 -DGPU_SOLVE" \
  -DTPL_ENABLE_CUDALIB=on \
  -DCMAKE_C_COMPILER=mpicc \
  -DCMAKE_CXX_COMPILER=mpicxx \
  -DCMAKE_Fortran_COMPILER=mpifort \
  -DCMAKE_CUDA_COMPILER=${CUDA_BIN_PATH}/bin/nvcc \
  -D TPL_ENABLE_CUDALIB:BOOL=ON \
  -D CUDA_CUBLAS_LIBRARIES="${CUDA_BIN_PATH}/lib64/libcublas.so" \
  -D CMAKE_CUDA_ARCHITECTURES="70" \
  -D CMAKE_CUDA_HOST_COMPILER=mpicxx \
  -D CMAKE_CUDA_FLAGS:STRING="-ccbin mpicxx" \
  -DTPL_ENABLE_PARMETISLIB=on \
  -DTPL_PARMETIS_INCLUDE_DIRS="${PARMETIS_ROOT}/include;${METIS_DIR}/include" \
  -DTPL_PARMETIS_LIBRARIES="${PARMETIS_ROOT}/lib/libparmetis.so;${METIS_DIR}/lib/libmetis.so" \
  -DTPL_ENABLE_INTERNAL_BLASLIB=OFF \
  -DXSDK_ENABLE_Fortran=OFF \
  -DBUILD_SHARED_LIBS=on \
  -DCMAKE_INSTALL_PREFIX=.


Run script:

#!/bin/bash
#BSUB -q debug
#BSUB -P CSC304
#BSUB -J TestSuperLU
#BSUB -o TestSuperLU.o%J
#BSUB -W 0:45
#BSUB -nnodes 1
#BSUB -env "all"
export SUPERLU_ACC_OFFLOAD=1
export OMP_NUM_THREADS=1

cd build
make test


Result:

Test project /ccs/home/jeanluc/GIT/superlu_dist/build
Start 1: pdtest_1x1_1_2_8_20_SP
1/27 Test #1: pdtest_1x1_1_2_8_20_SP ...........***Timeout 1500.16 sec
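
Waiting 1500 seconds for each hung test makes the reproduction slow. A minimal sketch of a wrapper that runs a command under a shorter limit and reports a hang separately from an ordinary failure is shown here. The helper name run_with_limit is hypothetical and not part of superlu_dist; it assumes the GNU coreutils timeout command is available, which exits with status 124 when the limit is reached.

```shell
#!/bin/bash
# Hypothetical helper (not part of superlu_dist): run a command under a
# time limit in seconds and label the outcome.
# Assumes GNU coreutils `timeout`, which exits with 124 on expiry.
run_with_limit() {
  local limit="$1"
  shift
  local rc=0
  timeout "$limit" "$@" || rc=$?
  if [ "$rc" -eq 124 ]; then
    echo "HANG: '$*' exceeded ${limit}s"
  elif [ "$rc" -eq 0 ]; then
    echo "OK: '$*'"
  else
    echo "FAIL($rc): '$*'"
  fi
  return "$rc"
}
```

From the build directory this could be used as `run_with_limit 120 ctest -R pdtest_1x1_1_2_8_20_SP --output-on-failure`, where -R selects tests by regular expression. The 120-second limit is an arbitrary choice, and processes left behind by a hung test may still need to be cleaned up separately.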

@liuyangzhuan
Collaborator

liuyangzhuan commented Jan 25, 2023

Thanks for providing these helpful instructions; I can reproduce the issue now. The problem is that calling pdgssvx with nrhs=0 skips some of the setup for GPU solves, which causes a hang when it is called later with nrhs>0 and options->Fact=FACTORED. This commit should fix the problem:
1aa8e65

However, the GPU solve in the master branch supports only nmpi=1. You will still see failures reported by "make test" for the tests that run with mpirun -np >1. I recommend not enabling GPU solve for the smoke/regression tests.
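
One way to follow this recommendation is to keep -DGPU_SOLVE out of the C flags by default and add it only for a dedicated single-rank build. The sketch below assumes the base flags from the build script earlier in this thread; the helper name superlu_c_flags is hypothetical and not part of superlu_dist.

```shell
#!/bin/bash
# Base C flags taken from the build script earlier in this thread.
BASE_C_FLAGS="-std=c99 -fPIC -DPRNTlevel=1 -DPROFlevel=1"

# Hypothetical helper (not part of superlu_dist): print the C flags,
# appending -DGPU_SOLVE only when the first argument is "on".
# With no argument, GPU solve stays disabled.
superlu_c_flags() {
  local gpu_solve="${1:-off}"
  if [ "$gpu_solve" = "on" ]; then
    printf '%s -DGPU_SOLVE\n' "$BASE_C_FLAGS"
  else
    printf '%s\n' "$BASE_C_FLAGS"
  fi
}
```

The smoke/regression build would then pass `-DCMAKE_C_FLAGS="$(superlu_c_flags off)"` to cmake, and a separate single-rank GPU-solve build would pass `-DCMAKE_C_FLAGS="$(superlu_c_flags on)"`.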
