Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add MPI_Finalize "detouring" mechanism #765

Draft
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

program--
Copy link
Contributor

@program-- program-- commented Mar 15, 2024

This PR resolves #748 by implementing a shim library that overrides MPI_Finalize calls to become no-op. Additionally, calls to MPI_Finalize from NGen code are modified to PMPI_Finalize to call the MPI library's actual implementation.

Adding everyone as reviewers for visibility on this change.

Additions

  • CMake target MPIDetour in utilities/mpi/ with the source file MPIDetour.c.

Changes

  • Both calls to MPI_Finalize in src/NGen.cpp are converted to PMPI_Finalize.

TODO

Checklist

  • PR has an informative and human-readable title
  • Changes are limited to a single goal (no scope creep)
  • Code can be automatically merged (no conflicts)
  • Code follows project standards (link if applicable)
  • Passes all existing automated tests
  • Any change in functionality is tested
  • New functions are documented (with a description, list of inputs, and expected output)
  • Placeholder code is flagged / future todos are captured in comments
  • Project documentation has been updated (including the "Unreleased" section of the CHANGELOG)
  • Reviewers requested with the Reviewers tool ➡️

Target Environment support

  • Linux
  • macOS

program-- and others added 2 commits March 15, 2024 12:04
Co-authored-by: Phil Miller - NOAA <[email protected]>
Co-authored-by: Phil Miller - NOAA <[email protected]>
@PhilMiller
Copy link
Contributor

Something needs to say target_link_library(ngen MPIDetour), too

@program-- program-- force-pushed the jsm-pm-748-mpi-detour branch from 2861184 to 7cf8df5 Compare March 15, 2024 22:34
@hellkite500
Copy link
Member

hellkite500 commented Mar 16, 2024

Ubuntu linux tests failing due to a timeout running the tests. Downloading the logs and poking around, I see a LOT of these DEADLYSIGNAL messages comfing from running the test_bmi_c target

2024-03-15T22:38:01.4674964Z [100%] Linking CXX executable test_bmi_c
2024-03-15T22:38:02.1738113Z AddressSanitizer:DEADLYSIGNAL

and eventually the test times out

2024-03-15T22:38:17.3139827Z AddressSanitizer:DEADLYSIGNAL
2024-03-15T22:38:17.3140454Z CMake Error at /usr/local/share/cmake-3.28/Modules/GoogleTestAddTests.cmake:112 (message):
2024-03-15T22:38:17.3140582Z   Error running test executable.
2024-03-15T22:38:17.3140590Z 
2024-03-15T22:38:17.3140878Z     Path: '/home/runner/work/ngen/ngen/cmake_build/test/test_bmi_c'
2024-03-15T22:38:17.3141033Z     Result: Process terminated due to timeout
2024-03-15T22:38:17.3141126Z     Output:
2024-03-15T22:38:17.3141215Z       
2024-03-15T22:38:17.3141230Z 
2024-03-15T22:38:17.3141350Z Call Stack (most recent call first):
2024-03-15T22:38:17.3141791Z   /usr/local/share/cmake-3.28/Modules/GoogleTestAddTests.cmake:225 (gtest_discover_tests_impl)
2024-03-15T22:38:17.3141803Z 
2024-03-15T22:38:17.3141807Z 
2024-03-15T22:38:17.3142098Z gmake[3]: *** [test/CMakeFiles/test_bmi_c.dir/build.make:133: test/test_bmi_c] Error 1
2024-03-15T22:38:17.3142289Z gmake[3]: *** Deleting file 'test/test_bmi_c'
2024-03-15T22:38:17.3142567Z gmake[2]: *** [CMakeFiles/Makefile2:2163: test/CMakeFiles/test_bmi_c.dir/all] Error 2
2024-03-15T22:38:17.3142850Z gmake[1]: *** [CMakeFiles/Makefile2:2170: test/CMakeFiles/test_bmi_c.dir/rule] Error 2
2024-03-15T22:38:17.3142986Z gmake: *** [Makefile:921: test_bmi_c] Error 2
2024-03-15T22:38:17.3154227Z ##[error]Process completed with exit code 2.
2024-03-15T22:38:17.3379867Z Post job cleanup.

@program--
Copy link
Contributor Author

program-- commented Mar 16, 2024

Ubuntu linux tests failing due to a timeout running the tests. Downloading the logs and poking around, I see a LOT of these DEDLYSIGNAL messages comfing from running the test_bmi_c target
...

@hellkite500 Yeah this is happening across PR's and the master branch, see: #750 (comment). Not sure why this is happening, but it's not reproducible locally for neither Phil nor I

@hellkite500
Copy link
Member

Ah, I'm still catching up on a few things. Just wasn't sure if this was related to these changes or not...I looked at a few recent PR's and didn't see the failures. Might try getting an Ubuntu image spun up to test on.

@program-- program-- marked this pull request as ready for review March 18, 2024 16:42
CMakeLists.txt Outdated Show resolved Hide resolved
@PhilMiller PhilMiller marked this pull request as draft March 25, 2024 14:49
@PhilMiller
Copy link
Contributor

Converting back to draft, since we may not need this for the time being if ESMF PR esmf-org/esmf#234 gets merged and released quickly. It could be useful for other uncooperative libraries that insist on acting like they own MPI, though.

@donaldwj
Copy link
Contributor

donaldwj commented Mar 25, 2024

Where is the PMPI_Finalize coming from? I can not find documentation on It, I am guessing that it is directly called by the MPI library provided MPI_Finalize. Also is using this compatible across different MPI implementations? Ok this is the from section 15 of the MPI standard, documentation on this is not something easy to find. But It looks like this should be fine on all MPI implementations that meet the standard.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Segfault thrown when Python is finalizing ESMF through pybind11
4 participants