
[Bug]: strumpack test failed from E4S Testsuite #133

Closed
shahzebsiddiqui opened this issue Oct 3, 2022 · 9 comments
Labels
bug Something isn't working

Comments

@shahzebsiddiqui
Contributor

CDASH Build

https://my.cdash.org/test/63278714

Link to buildspec file

https://github.com/buildtesters/buildtest-nersc/blob/devel/buildspecs/e4s/E4S-Testsuite/perlmutter/22.05/strumpack.yml

Please describe the issue?

see E4S-Project/testsuite#37

Relevant log output

REFINEMENT it. 0	res =      442.368	rel.res =            1	bw.error =            1
REFINEMENT it. 1	res =  7.89021e-14	rel.res =  1.78363e-16	bw.error =   5.8175e-16
# DIRECT/GMRES solve:
#   - abs_tol = 1e-10, rel_tol = 1e-06, restart = 30, maxit = 5000
#   - number of Krylov iterations = 1
#   - solve time = 0.0597916
# COMPONENTWISE SCALED RESIDUAL = 5.68529e-16
# relative error = ||x-x_exact||_F/||x_exact||_F = 3.96446e-15
./run.sh: line 7: 87374 Segmentation fault      ./testPoisson2d 100 --sp_disable_gpu
Run failed
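
For reference, the failing command can be reproduced outside of buildtest by running the E4S Testsuite scripts for strumpack by hand. This is only a sketch: the checkout path, the spack load step, and the compile.sh script are assumptions based on the usual testsuite layout, not taken from this report.

    # load strumpack from the e4s-22.05 deployment, then run the testsuite driver scripts
    spack load strumpack
    git clone https://github.com/E4S-Project/testsuite.git
    cd testsuite/validation_tests/strumpack
    ./compile.sh && ./run.sh
    # or invoke the failing case directly, as run.sh does on line 7:
    ./testPoisson2d 100 --sp_disable_gpu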
@shahzebsiddiqui shahzebsiddiqui added the bug Something isn't working label Oct 3, 2022
@wspear
Collaborator

wspear commented Oct 3, 2022

@pghysels: We are unable to get simple strumpack tests to run on Perlmutter with either the public E4S 22.05 deployment or the in-progress 22.08 test deployment. Do you have any suggestions, or is there someone I should contact about the error above (./testPoisson2d 100 --sp_disable_gpu fails with a segmentation fault)?

Ideally we would switch this test to use the internal Spack test defined in the strumpack Spack package, but that test also segfaults:


wspear@nid001496:~/SPACK-SPACE/wspear/perlmutter/22.05/buildtest-demo/testsuite> spack test run strumpack
==> Spack test 7w7mizoso7kw5xc23wqeucn6n4pwv5jk
==> Testing package strumpack-6.3.1-weoyzdx
==> Error: TestFailure: 1 tests failed.


Command exited with status 139:
    '/usr/bin/srun' '-n' '1' 'test_sparse_mpi' '../examples/sparse/data/pde900.mtx'
# Running with:
# OMP_NUM_THREADS=1 mpirun -n 1 ./test_sparse_mpi ../examples/sparse/data/pde900.mtx 
# opening file '../examples/sparse/data/pde900.mtx'
# %%MatrixMarket matrix coordinate real general
# reading 900 by 900 matrix with 4,380 nnz's from ../examples/sparse/data/pde900.mtx
# Initializing STRUMPACK
# using 1 OpenMP thread(s)
# using 1 MPI processes
# matching job: maximum matching with row and column scaling
# matrix equilibration, r_cond = 1 , c_cond = 1 , type = N
# initial matrix:
#   - number of unknowns = 900
#   - number of nonzeros = 4,380
# nested dissection reordering:
#   - Metis reordering
#      - used METIS_NodeNDP (iso METIS_NodeND)
#      - supernodal tree from METIS_NodeNDP is used
#   - strategy parameter = 8
#   - number of separators = 113
#   - number of levels = 7
#   - nd time = 0.0181918
#   - matching time = 0.000250713
#   - symmetrization time = 8.36711e-05
# symbolic factorization:
#   - nr of dense Frontal matrices = 113
#   - symb-factor time = 0.0412207
# multifrontal factorization:
#   - estimated memory usage (exact solver) = 0.287984 MB
#   - minimum pivot, sqrt(eps)*|A|_1 = 2.40525e-08
#   - replacing of small pivots is not enabled
#   - factor time = 0.00139133
#   - factor nonzeros = 35,998
#   - factor memory = 0.287984 MB
REFINEMENT it. 0	res =      144.537	rel.res =            1	bw.error =            1
REFINEMENT it. 1	res =  2.25225e-14	rel.res =  1.55825e-16	bw.error =  3.28764e-16
# DIRECT/GMRES solve:
#   - abs_tol = 1e-10, rel_tol = 1e-06, restart = 30, maxit = 5000
#   - number of Krylov iterations = 1
#   - solve time = 0.000996874
# COMPONENTWISE SCALED RESIDUAL = 3.7821e-16
# RELATIVE ERROR = 5.90568e-16
# opening file '../examples/sparse/data/pde900.mtx'
# %%MatrixMarket matrix coordinate real general
# reading 900 by 900 matrix with 4,380 nnz's from ../examples/sparse/data/pde900.mtx
# Initializing STRUMPACK
# using 1 OpenMP thread(s)
# using 1 MPI processes
# matching job: maximum matching with row and column scaling
# matrix equilibration, r_cond = 1 , c_cond = 1 , type = N
# initial matrix:
#   - number of unknowns = 900
#   - number of nonzeros = 4,380
# nested dissection reordering:
#   - Metis reordering
#      - used METIS_NodeNDP (iso METIS_NodeND)
#      - supernodal tree from METIS_NodeNDP is used
#   - strategy parameter = 8
#   - number of separators = 113
#   - number of levels = 7
#   - nd time = 0.00495025
#   - matching time = 0.000205467
#   - symmetrization time = 4.8453e-05
# symbolic factorization:
#   - nr of dense Frontal matrices = 113
#   - symb-factor time = 0.000582224
# multifrontal factorization:
#   - estimated memory usage (exact solver) = 0.287984 MB
#   - minimum pivot, sqrt(eps)*|A|_1 = 2.40525e-08
#   - replacing of small pivots is not enabled
#   - factor time = 0.000894937
#   - factor nonzeros = 35,998
#   - factor memory = 0.287984 MB
REFINEMENT it. 0	res =      144.537	rel.res =            1	bw.error =            1
REFINEMENT it. 1	res =  2.25225e-14	rel.res =  1.55825e-16	bw.error =  3.28764e-16
# DIRECT/GMRES solve:
#   - abs_tol = 1e-10, rel_tol = 1e-06, restart = 30, maxit = 5000
#   - number of Krylov iterations = 1
#   - solve time = 0.000336459
# COMPONENTWISE SCALED RESIDUAL = 3.7821e-16
# RELATIVE ERROR = 5.90568e-16
srun: error: nid001496: task 0: Segmentation fault
srun: launch/slurm: _step_signal: Terminating StepId=3319201.0



1 error found in test log:
     244    # DIRECT/GMRES solve:
     245    #   - abs_tol = 1e-10, rel_tol = 1e-06, restart = 30, maxit = 5000
     246    #   - number of Krylov iterations = 1
     247    #   - solve time = 0.000336459
     248    # COMPONENTWISE SCALED RESIDUAL = 3.7821e-16
     249    # RELATIVE ERROR = 5.90568e-16
  >> 250    srun: error: nid001496: task 0: Segmentation fault
     251    srun: launch/slurm: _step_signal: Terminating StepId=3319201.0
     252    
     253      File "/global/common/software/spackecp/perlmutter/e4s-22.05/spack/bin/spack", line 98, in <module>



/global/common/software/spackecp/perlmutter/e4s-22.05/spack/lib/spack/spack/build_environment.py:1076, in _setup_pkg_and_run:
       1073        tb_string = traceback.format_exc()
       1074
       1075        # build up some context from the offending package so we can
  >>   1076        # show that, too.
       1077        package_context = get_package_context(tb)
       1078
       1079        logfile = None

See test log for details:
  /pscratch/sd/w/wspear/perlmutter/spack_user_cache/test/7w7mizoso7kw5xc23wqeucn6n4pwv5jk/strumpack-6.3.1-weoyzdx-test-out.txt

@pghysels
Contributor

pghysels commented Oct 3, 2022

That might be a problem with Cray BLAS, see here:
pghysels/STRUMPACK#70
https://bitbucket.org/icl/blaspp/issues/18/segfault-on-exit-with-cray-libsci

@shahzebsiddiqui
Contributor Author

So would it help to do module unload cray-libsci to work around this issue? I guess if this test keeps failing, then maybe we should find a different test to run. @pghysels and @wspear, can you please work on a fix for this so we can get this test running correctly?
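
(Trying that workaround by hand would look roughly like the sketch below; whether unloading cray-libsci actually avoids the exit-time segfault is exactly the open question here, so treat it as untested.)

    module unload cray-libsci
    ./run.sh   # or: ./testPoisson2d 100 --sp_disable_gpu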

@pghysels
Contributor

pghysels commented Oct 5, 2022

Lisa Claus from NERSC recommends replacing cray-libsci/21.08.1.2 with cray-libsci/22.06.1.3 on Perlmutter.
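
(On a Cray PE system that replacement is normally done through the module system; a sketch of what the recommendation amounts to, with module versions taken from this thread and their availability on a given PrgEnv assumed:)

    module swap cray-libsci/21.08.1.2 cray-libsci/22.06.1.3
    module list 2>&1 | grep libsci   # confirm which libsci version is now loaded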

@shahzebsiddiqui
Contributor Author

Yeah, I spoke to @lisaclaus about this. Unfortunately Perlmutter is down today and I think Lisa will be out for some time; can either of you test this out once Perlmutter is back?

@pghysels
Contributor

I cannot reproduce this error on Perlmutter with cray-libsci/21.08.1.2.

However, I had disabled CUDA because, once again, CMake doesn't find CUDA::cublas etc., since those libraries are in the math_libs folder. I still need to figure that out.
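
(One possible workaround, sketched below: point CMake's FindCUDAToolkit at the CUDA root and add the NVHPC math_libs tree, where cuBLAS/cuSOLVER live, to the prefix path. The paths and versions are illustrative and would need to be checked against the nvhpc install on Perlmutter.)

    cmake -S STRUMPACK -B build \
      -DSTRUMPACK_USE_CUDA=ON \
      -DCUDAToolkit_ROOT=/opt/nvidia/hpc_sdk/Linux_x86_64/22.5/cuda/11.7 \
      -DCMAKE_PREFIX_PATH=/opt/nvidia/hpc_sdk/Linux_x86_64/22.5/math_libs/11.7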

@shahzebsiddiqui
Contributor Author

E4S-Project/e4s#113

@shahzebsiddiqui
Contributor Author

It looks like we got this test running on Jan 11 (https://my.cdash.org/test/70656138?graph=status), but it then failed due to an error in the spack env activate command. We will run this again, and if it succeeds we can close this issue.
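
(Re-running this boils down to activating the e4s-22.05 Spack environment and rebuilding the buildspec with buildtest; the environment name below is a placeholder, only the buildspec path comes from this issue.)

    spack env activate e4s-22.05   # placeholder environment name
    buildtest build -b buildspecs/e4s/E4S-Testsuite/perlmutter/22.05/strumpack.yml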

@shahzebsiddiqui shahzebsiddiqui moved this to In Progress in Test Failures Jan 31, 2023
@shahzebsiddiqui
Contributor Author

@wspear it looks like the strumpack test passes, see https://my.cdash.org/test/71903136. If this looks fine, we can close this.

@github-project-automation github-project-automation bot moved this from In Progress to Done in Test Failures Feb 28, 2023