Option to SIGQUIT or throw error during ESMF_Abort #296

Open
danrosen25 opened this issue Sep 11, 2024 · 5 comments
Labels
feature/enhancement New feature or request


@danrosen25
Member

The current method for debugging ESMF errors is to build a backtrace using ESMF_LogSetError and rc. This gives only a limited amount of information about the state at the time of the error. I started investigating raising SIGQUIT, which can print a backtrace and dump core. The core dump can then be analyzed to see the state that caused the error.

diff --git a/src/Infrastructure/VM/src/ESMCI_VMKernel.C b/src/Infrastructure/VM/src/ESMCI_VMKernel.C
index 63b85ad0c3..43c85c5c5c 100644
--- a/src/Infrastructure/VM/src/ESMCI_VMKernel.C
+++ b/src/Infrastructure/VM/src/ESMCI_VMKernel.C
@@ -899,6 +899,7 @@ struct SpawnArg{
 void VMK::abort(){
   // abort default (all MPI) virtual machine
   int finalized;
+  raise (SIGQUIT);
   MPI_Finalized(&finalized);
   if (!finalized)
     MPI_Abort(default_mpi_c, EXIT_FAILURE);

danrosen25 self-assigned this Sep 11, 2024
danrosen25 added the feature/enhancement label Sep 11, 2024

@anntsay

anntsay commented Oct 2, 2024

Dan proposes making this a runtime option so that ESMF quits on error and outputs diagnostic information. This allows easier troubleshooting and debugging.

Bob: Looks reasonable. Maybe put it in 8.8 because it is not heavyweight, and the new method will be optional.
Ann confirms that ESMF_LogSetError and rc will still be available and will remain the default.

@anntsay

anntsay commented Oct 2, 2024

Bill: CESM also uses this. It makes sense to offer this as an option.
Dan: This is only an optional method; the default is still the current method. It is set as a one-time flag at run time.

@danrosen25
Member Author

Look at the LogSetError option for abort on error.
Runtime flag (using environment) ESMF_RUNTIME_ABORT_ON_ERROR
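
A minimal sketch of how such an environment-driven flag could gate the signal, for illustration only. The variable name ESMF_RUNTIME_ABORT_ON_ERROR comes from the note above; the helper names `abortOnErrorRequested` and `raiseSigquitIfRequested` are hypothetical, and the rule for which values count as "on" (anything non-empty other than "0") is an assumption, not ESMF behavior.

```cpp
#include <signal.h>  // raise, SIGQUIT (POSIX)
#include <cstdlib>   // std::getenv
#include <cstring>   // std::strcmp

// Hypothetical helper: true when ESMF_RUNTIME_ABORT_ON_ERROR is set to a
// non-empty value other than "0". The accepted values are an assumption.
bool abortOnErrorRequested() {
  const char *value = std::getenv("ESMF_RUNTIME_ABORT_ON_ERROR");
  if (value == nullptr || value[0] == '\0') return false;
  return std::strcmp(value, "0") != 0;
}

// Hypothetical hook for the top of an abort path: raises SIGQUIT only when
// the flag is set, so the default behavior stays unchanged.
void raiseSigquitIfRequested() {
  if (abortOnErrorRequested()) raise(SIGQUIT);
}
```

With the flag unset, `raiseSigquitIfRequested()` returns immediately and the existing ESMF_LogSetError and rc path is unaffected, which matches the "optional, off by default" intent discussed above.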

@anntsay

anntsay commented Feb 26, 2025

Design consideration: how to handle MPI aborts. That makes this story a medium.

This ticket may be beneficial to CESM: CESM backtraces are only available with certain compilers, so this feature may help.

Bill: Is there a C mechanism for producing a backtrace?
GNU backtraces (execinfo)
Gerhard: can unwind the stacks.
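
For reference, the execinfo mechanism mentioned above can be sketched as follows. This assumes a platform that ships `<execinfo.h>` (glibc and macOS do); the function name `printBacktrace` and the 64-frame limit are hypothetical choices for illustration, not part of ESMF.

```cpp
#include <execinfo.h>  // backtrace, backtrace_symbols_fd
#include <unistd.h>    // STDERR_FILENO

// Hypothetical helper: captures up to 64 frames of the calling stack and
// writes one line per frame to the file descriptor fd. Returns the number
// of frames captured.
int printBacktrace(int fd) {
  void *frames[64];
  int count = backtrace(frames, 64);
  // backtrace_symbols_fd writes directly to fd without calling malloc.
  backtrace_symbols_fd(frames, count, fd);
  return count;
}
```

Calling `printBacktrace(STDERR_FILENO)` prints the current call stack to standard error. Function names typically appear only for exported symbols, so linking with `-rdynamic` (GNU toolchain) usually gives more readable output.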

@danrosen25
Member Author

Testing on macOS and Derecho
raise (SIGQUIT);
SIGQUIT terminates the current task, and mpirun then sends SIGTERM to the other processes.

Executing SIGQUIT on rank 2

dec2436.hsn.de.hpc.ucar.edu 1: rank-1 do nothing
dec2448.hsn.de.hpc.ucar.edu 5: rank-5 do nothing
dec2448.hsn.de.hpc.ucar.edu 6: rank-6 do nothing
dec2436.hsn.de.hpc.ucar.edu 0: rank_sum:28
rank-0 do nothing
dec2448.hsn.de.hpc.ucar.edu 7: rank-7 do nothing
dec2436.hsn.de.hpc.ucar.edu 3: rank-3 do nothing
dec2448.hsn.de.hpc.ucar.edu 4: rank-4 do nothing
dec2436.hsn.de.hpc.ucar.edu: rank 2 died from signal 3 and dumped core
dec2436.hsn.de.hpc.ucar.edu: rank 1 died from signal 15
RESULT=143
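
The "died from signal 3" and "signal 15" lines can be reproduced outside MPI with a small POSIX sketch. It forks a child that raises a signal and reports which signal terminated it. The helper names are hypothetical, and reading RESULT=143 as the shell convention 128 + 15 (SIGTERM) is an assumption about how RESULT was captured.

```cpp
#include <signal.h>    // signal, raise, SIGQUIT, SIGTERM
#include <sys/wait.h>  // waitpid, WIFSIGNALED, WTERMSIG
#include <unistd.h>    // fork, _exit

// Hypothetical helper: forks a child that raises sig with the default
// disposition. Returns the signal number that terminated the child, or -1
// if the child was not killed by a signal.
int signalThatKilledChild(int sig) {
  pid_t pid = fork();
  if (pid < 0) return -1;
  if (pid == 0) {
    signal(sig, SIG_DFL);  // make sure no inherited handler intercepts it
    raise(sig);
    _exit(0);              // reached only if the signal did not terminate us
  }
  int status = 0;
  if (waitpid(pid, &status, 0) < 0) return -1;
  return WIFSIGNALED(status) ? WTERMSIG(status) : -1;
}

// Common shell convention: exit status of a signal-killed process.
int shellExitCode(int sig) { return 128 + sig; }
```

On Linux and macOS, `signalThatKilledChild(SIGQUIT)` reports signal 3 and `signalThatKilledChild(SIGTERM)` reports signal 15, matching the rank 2 and rank 1 lines in the log. Whether SIGQUIT also leaves a core file depends on the core size limit (`ulimit -c`) in effect for the job.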

Adding sleep for longer than walltime

dec2436.hsn.de.hpc.ucar.edu 0: rank_sum:28
rank-0 do nothing
dec2436.hsn.de.hpc.ucar.edu 1: rank-1 do nothing
dec2448.hsn.de.hpc.ucar.edu 4: rank-4 do nothing
dec2436.hsn.de.hpc.ucar.edu 3: rank-3 do nothing
dec2448.hsn.de.hpc.ucar.edu 5: rank-5 do nothing
dec2448.hsn.de.hpc.ucar.edu 6: rank-6 do nothing
dec2448.hsn.de.hpc.ucar.edu 7: rank-7 do nothing
=>> PBS: job killed: walltime 77 exceeded limit 60
Terminated
dec2436.hsn.de.hpc.ucar.edu: rank 1 died from signal 15
