IcoswISC240_WOA23_performance_test failing on Chicoma #882
Open · altheaden opened this issue Jan 10, 2025 · 3 comments
Labels: bug (Something isn't working)

@altheaden (Collaborator)
IcoswISC240_WOA23_performance_test is failing on Chicoma.
Test log:

compass calling: compass.ocean.tests.global_ocean.performance_test.PerformanceTest.run()
  inherited from: compass.testcase.TestCase.run()
  in /users/althea/code/compass/main/compass/testcase.py

compass calling: compass.run.serial._run_test()
  in /users/althea/code/compass/main/compass/run/serial.py

Running steps:
  prognostic_ice_shelf_melt
  data_ice_shelf_melt

  * step: prognostic_ice_shelf_melt

compass calling: compass.ocean.tests.global_ocean.forward.ForwardStep.runtime_setup()
  in /users/althea/code/compass/main/compass/ocean/tests/global_ocean/forward.py

Warning: replacing namelist options in namelist.ocean
config_dt = 02:00:00
config_btr_dt = 00:06:00

compass calling: compass.ocean.tests.global_ocean.forward.ForwardStep.run()
  in /users/althea/code/compass/main/compass/ocean/tests/global_ocean/forward.py

Warning: replacing namelist options in namelist.ocean
config_pio_num_iotasks = 1
config_pio_stride = 36
Running: gpmetis graph.info 36
******************************************************************************
METIS 5.0 Copyright 1998-13, Regents of the University of Minnesota
 (HEAD: , Built on: Jan  8 2025, 16:43:49)
 size of idx_t: 64bits, real_t: 64bits, idx_t *: 64bits

Graph Information -----------------------------------------------------------
 Name: graph.info, #Vertices: 7301, #Edges: 21002, #Parts: 36

Options ---------------------------------------------------------------------
 ptype=kway, objtype=cut, ctype=shem, rtype=greedy, iptype=metisrb
 dbglvl=0, ufactor=1.030, no2hop=NO, minconn=NO, contig=NO, nooutput=NO
 seed=-1, niter=10, ncuts=1

Direct k-way Partitioning ---------------------------------------------------
 - Edgecut: 1446, communication volume: 1535.

 - Balance:
     constraint #0:  1.026 out of 0.005

 - Most overweight partition:
     pid: 25, actual: 208, desired: 202, ratio: 1.03.

 - Subdomain connectivity: max: 6, min: 2, avg: 4.33

 - Each partition is contiguous.

Timing Information ----------------------------------------------------------
  I/O:          		   0.004 sec
  Partitioning: 		   0.016 sec   (METIS time)
  Reporting:    		   0.001 sec

Memory Information ----------------------------------------------------------
  Max memory used:		   1.575 MB
******************************************************************************

Running: srun -c 1 -N 1 -n 36 ./ocean_model -n namelist.ocean -s streams.ocean
PE 0: MPICH processor detected:
PE 0:   AMD Rome (23:49:0) (family:model:stepping)
MPI VERSION    : CRAY MPICH version 8.1.28.29 (ANL base 3.4a2)
MPI BUILD INFO : Wed Nov 15 20:57 2023 (git hash 1cde46f) (CH4)
PE 0: MPICH environment settings =====================================
PE 0:   MPICH_ENV_DISPLAY                              = 1
PE 0:   MPICH_VERSION_DISPLAY                          = 1
PE 0:   MPICH_ABORT_ON_ERROR                           = 0
PE 0:   MPICH_CPUMASK_DISPLAY                          = 0
PE 0:   MPICH_STATS_DISPLAY                            = 0
PE 0:   MPICH_RANK_REORDER_METHOD                      = 1
PE 0:   MPICH_RANK_REORDER_DISPLAY                     = 0
PE 0:   MPICH_MEMCPY_MEM_CHECK                         = 0
PE 0:   MPICH_USE_SYSTEM_MEMCPY                        = 0
PE 0:   MPICH_OPTIMIZED_MEMCPY                         = 1
PE 0:   MPICH_ALLOC_MEM_PG_SZ                          = 4096
PE 0:   MPICH_ALLOC_MEM_POLICY                         = PREFERRED
PE 0:   MPICH_ALLOC_MEM_AFFINITY                       = SYS_DEFAULT
PE 0:   MPICH_MALLOC_FALLBACK                          = 0
PE 0:   MPICH_MEM_DEBUG_FNAME                          = 
PE 0:   MPICH_INTERNAL_MEM_AFFINITY                    = SYS_DEFAULT
PE 0:   MPICH_NO_BUFFER_ALIAS_CHECK                    = 0
PE 0:   MPICH_COLL_SYNC                                = MPI_Bcast
PE 0:   MPICH_SINGLE_HOST_ENABLED                        = 1
PE 0:   MPICH_USE_PERSISTENT_TOPS                      = 0
PE 0:   MPICH_DISABLE_PERSISTENT_RECV_TOPS             = 0
PE 0:   MPICH_MAX_TOPS_COUNTERS                        = 0
PE 0:   MPICH_ENABLE_ACTIVE_WAIT                       = 0
PE 0: MPICH/RMA environment settings =================================
PE 0:   MPICH_RMA_MAX_PENDING                          = 128
PE 0:   MPICH_RMA_SHM_ACCUMULATE                       = 0
PE 0: MPICH/Dynamic Process Management environment settings ==========
PE 0:   MPICH_DPM_DIR                                  = 
PE 0:   MPICH_LOCAL_SPAWN_SERVER                       = 0
PE 0:   MPICH_SPAWN_USE_RANKPOOL                       = 0
PE 0: MPICH/SMP environment settings =================================
PE 0:   MPICH_SMP_SINGLE_COPY_MODE                     = XPMEM
PE 0:   MPICH_SMP_SINGLE_COPY_SIZE                     = 8192
PE 0:   MPICH_SHM_PROGRESS_MAX_BATCH_SIZE              = 8
PE 0: MPICH/COLLECTIVE environment settings ==========================
PE 0:   MPICH_COLL_OPT_OFF                             = 0
PE 0:   MPICH_BCAST_ONLY_TREE                          = 1
PE 0:   MPICH_BCAST_INTERNODE_RADIX                    = 4
PE 0:   MPICH_BCAST_INTRANODE_RADIX                    = 4
PE 0:   MPICH_ALLTOALL_SHORT_MSG                       = 64-512
PE 0:   MPICH_ALLTOALL_SYNC_FREQ                       = 1-24
PE 0:   MPICH_ALLTOALLV_THROTTLE                       = 8
PE 0:   MPICH_ALLGATHER_VSHORT_MSG                     = 1024-4096
PE 0:   MPICH_ALLGATHERV_VSHORT_MSG                    = 1024-4096
PE 0:   MPICH_GATHERV_SHORT_MSG                        = 131072
PE 0:   MPICH_GATHERV_MIN_COMM_SIZE                    = 64
PE 0:   MPICH_GATHERV_MAX_TMP_SIZE                     = 536870912
PE 0:   MPICH_GATHERV_SYNC_FREQ                        = 16
PE 0:   MPICH_IGATHERV_MIN_COMM_SIZE                   = 1000
PE 0:   MPICH_IGATHERV_SYNC_FREQ                       = 100
PE 0:   MPICH_IGATHERV_RAND_COMMSIZE                   = 2048
PE 0:   MPICH_IGATHERV_RAND_RECVLIST                   = 0
PE 0:   MPICH_SCATTERV_SHORT_MSG                       = 2048-8192
PE 0:   MPICH_SCATTERV_MIN_COMM_SIZE                   = 64
PE 0:   MPICH_SCATTERV_MAX_TMP_SIZE                    = 536870912
PE 0:   MPICH_SCATTERV_SYNC_FREQ                       = 16
PE 0:   MPICH_SCATTERV_SYNCHRONOUS                     = 0
PE 0:   MPICH_ALLREDUCE_MAX_SMP_SIZE                   = 262144
PE 0:   MPICH_ALLREDUCE_BLK_SIZE                       = 716800
PE 0:   MPICH_GPU_ALLGATHER_VSHORT_MSG_ALGORITHM       = 1
PE 0:   MPICH_GPU_ALLREDUCE_USE_KERNEL                 = 0
PE 0:   MPICH_GPU_COLL_STAGING_BUF_SIZE                = 1048576
PE 0:   MPICH_GPU_ALLREDUCE_STAGING_THRESHOLD          = 256
PE 0:   MPICH_ALLREDUCE_NO_SMP                         = 0
PE 0:   MPICH_REDUCE_NO_SMP                            = 0
PE 0:   MPICH_REDUCE_SCATTER_COMMUTATIVE_LONG_MSG_SIZE = 524288
PE 0:   MPICH_REDUCE_SCATTER_MAX_COMMSIZE              = 1000
PE 0:   MPICH_SHARED_MEM_COLL_OPT                      = 1
PE 0:   MPICH_SHARED_MEM_COLL_NCELLS                   = 8
PE 0:   MPICH_SHARED_MEM_COLL_CELLSZ                   = 256
PE 0: MPICH MPIIO environment settings ===============================
PE 0:   MPICH_MPIIO_HINTS_DISPLAY                      = 0
PE 0:   MPICH_MPIIO_HINTS                              = NULL
PE 0:   MPICH_MPIIO_ABORT_ON_RW_ERROR                  = disable
PE 0:   MPICH_MPIIO_CB_ALIGN                           = 2
PE 0:   MPICH_MPIIO_DVS_MAXNODES                       = -1
PE 0:   MPICH_MPIIO_AGGREGATOR_PLACEMENT_DISPLAY       = 0
PE 0:   MPICH_MPIIO_AGGREGATOR_PLACEMENT_STRIDE        = -1
PE 0:   MPICH_MPIIO_MAX_NUM_IRECV                      = 50
PE 0:   MPICH_MPIIO_MAX_NUM_ISEND                      = 50
PE 0:   MPICH_MPIIO_MAX_SIZE_ISEND                     = 10485760
PE 0:   MPICH_MPIIO_OFI_STARTUP_CONNECT                = disable
PE 0:   MPICH_MPIIO_OFI_STARTUP_NODES_AGGREGATOR        = 2
PE 0: MPICH MPIIO statistics environment settings ====================
PE 0:   MPICH_MPIIO_STATS                              = 0
PE 0:   MPICH_MPIIO_TIMERS                             = 0
PE 0:   MPICH_MPIIO_WRITE_EXIT_BARRIER                 = 1
PE 0: MPICH Thread Safety settings ===================================
PE 0:   MPICH_ASYNC_PROGRESS                           = 0
PE 0:   MPICH_OPT_THREAD_SYNC                          = 1
PE 0:   rank 0 required = funneled, was provided = funneled
MPICH ERROR [Rank 0] [job id 21208684.35] [Fri Jan 10 09:53:02 2025] [nid001265] - Abort(1734831948) (rank 0 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1734831948) - process 0

aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 1734831948) - process 0
srun: error: nid001265: task 0: Exited with exit code 255
srun: Terminating StepId=21208684.35
slurmstepd: error: *** STEP 21208684.35 ON nid001265 CANCELLED AT 2025-01-10T09:53:02 ***
srun: error: nid001265: tasks 1-35: Terminated
srun: Force Terminated StepId=21208684.35

      Failed
Exception raised while running the steps of the test case
Traceback (most recent call last):
  File "/users/althea/code/compass/main/compass/run/serial.py", line 322, in _log_and_run_test
    _run_test(test_case, available_resources)
    ~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/users/althea/code/compass/main/compass/run/serial.py", line 419, in _run_test
    _run_step(test_case, step, test_case.new_step_log_file,
    ~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
              available_resources)
              ^^^^^^^^^^^^^^^^^^^^
  File "/users/althea/code/compass/main/compass/run/serial.py", line 470, in _run_step
    step.run()
    ~~~~~~~~^^
  File "/users/althea/code/compass/main/compass/ocean/tests/global_ocean/forward.py", line 224, in run
    run_model(self, update_pio=update_pio)
    ~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/users/althea/code/compass/main/compass/model.py", line 60, in run_model
    run_command(args=args, cpus_per_task=cpus_per_task, ntasks=ntasks,
    ~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                openmp_threads=openmp_threads, config=config, logger=logger)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/users/althea/code/compass/main/compass/parallel.py", line 149, in run_command
    check_call(command_line_args, logger, env=env)
    ~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/users/althea/miniforge3/envs/dev_compass_1.7.0-alpha.1/lib/python3.13/site-packages/mpas_tools/logging.py", line 59, in check_call
    raise subprocess.CalledProcessError(process.returncode,
                                        print_args)
subprocess.CalledProcessError: Command 'srun -c 1 -N 1 -n 36 ./ocean_model -n namelist.ocean -s streams.ocean' returned non-zero exit status 143.
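
Exit status 143 is 128 + SIGTERM, i.e. srun tearing the step down after rank 0 called MPI_Abort, so the CalledProcessError above is just the messenger; the real failure is in the model's own error log (posted below). For context, the error surfaces this way because compass runs the model through a logging wrapper around subprocess. A minimal sketch of that pattern (not the actual mpas_tools.logging.check_call implementation):

import logging
import subprocess


def check_call(args, logger):
    """Run a command, stream its output to a logger, and raise
    CalledProcessError on a non-zero exit code (a sketch of the wrapper
    seen in the traceback, not the real mpas_tools implementation)."""
    logger.info('Running: %s', ' '.join(args))
    process = subprocess.Popen(args, stdout=subprocess.PIPE,
                               stderr=subprocess.STDOUT, text=True)
    for line in process.stdout:
        logger.info(line.rstrip())
    process.wait()
    if process.returncode != 0:
        # 143 = 128 + SIGTERM: srun killed the remaining tasks after
        # rank 0 aborted
        raise subprocess.CalledProcessError(process.returncode, args)


logging.basicConfig(level=logging.INFO)
check_call(['srun', '-c', '1', '-N', '1', '-n', '36', './ocean_model',
            '-n', 'namelist.ocean', '-s', 'streams.ocean'],
           logging.getLogger('forward'))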
@altheaden added the bug label on Jan 10, 2025
@xylar (Collaborator) commented Jan 10, 2025

@altheaden, can you find the directory where this test is running and also post the oceanXXXX.err file? If there are several, just pick the first one.

@altheaden (Collaborator, Author)
@xylar Here is the error file from ocean/global_ocean/IcoswISC240/WOA23/performance_test/prognostic_ice_shelf_melt:

----------------------------------------------------------------------
Beginning MPAS-ocean Error Log File for task       0 of      36
    Opened at 2025/01/10 09:53:02
----------------------------------------------------------------------

CRITICAL ERROR: Decomoposition file: graph.info.part.36 contains less than 7302 cells
Logging complete.  Closing file at 2025/01/10 09:53:02

Let me know if you would like me to get any others.
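
For reference, the mismatch is easy to confirm by comparing the number of rows in the partition file (gpmetis writes one owning-rank row per graph vertex, i.e. per cell) with nCells in the mesh the model actually reads. A minimal sketch, assuming netCDF4 is available in the step directory; 'init.nc' is a placeholder name, use whatever file the step links as its input mesh:

import netCDF4

# one owning-rank line per graph vertex (= mesh cell)
with open('graph.info.part.36') as f:
    n_part = sum(1 for _ in f)

# nCells in the mesh/init file the model reads ('init.nc' is a guess)
with netCDF4.Dataset('init.nc') as ds:
    n_cells = ds.dimensions['nCells'].size

print(f'partition rows: {n_part}, mesh nCells: {n_cells}')
# here: 7301 rows vs 7302 cells, hence the CRITICAL ERROR above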

@xylar (Collaborator) commented Jan 10, 2025

That's really weird! It seems like one of the cached files needs to be replaced. I'm about to update them anyway so no worries for now.
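
For anyone who hits this later: the stale file can also be spotted directly in the cached graph.info, whose METIS-format header line holds the vertex and edge counts. A minimal sketch:

# The first line of a METIS graph file is '<nVertices> <nEdges> [fmt]';
# nVertices should match the mesh's nCells (apparently 7302 here),
# but the cached file reports 7301.
with open('graph.info') as f:
    n_vertices, n_edges = map(int, f.readline().split()[:2])
print(n_vertices, n_edges)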

xylar mentioned this issue Jan 10, 2025