Right now you shouldn't use blocking communication calls within a VMEpoch region. Here is what Gerhard says (a minimal sketch of the resulting deadlock follows the quote):
I took a look at the ESMF_InfoBroadcast() implementation, and right now it uses just a simple blocking collective MPI_Bcast() under the hood for the mpionly case (which I am sure is the case here). It is therefore not currently safe to use this call within an active VMEpoch! It will hang.
The underlying reason for this is that within a VMEpoch, all non-blocking send & recv calls are intercepted. On the send side, all messages to the same dst are aggregated and not sent until the VMEpoch is exited. On the recv side, however, the first non-blocking recv blocks, probing for the incoming message to determine its size (since it was aggregated on the src side and its size is unknown on the dst side). Not until that message has been received will a receiving PET continue and process any of the other receives (potentially many, in the loop over routehandle-based SMM() and GridRedist() calls in this example). Because of this behavior, a blocking collective call like the MPI_Bcast() used by ESMF_InfoBroadcast() will deadlock inside the VMEpoch!
It would probably be pretty straightforward to extend the MPI_Bcast() to check whether it is being called from within an active VMEpoch, and if so use non-blocking calls instead. This would make Dusan's code safe, and probably also a bit more efficient. Could be something for 8.8? In fact, we could go through all of our VM collectives and make them VMEpoch-safe. I am not sure about the practical importance of this, though. After all, VMEpoch is meant to optimize very specific communication patterns, typically a tight loop of SMM() calls or such.
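To make the circular wait Gerhard describes concrete, here is a minimal sketch in plain MPI C (not ESMF code; the payloads and rank roles are illustrative only). Rank 0 stands in for a PET blocked in the epoch's receive-side probe, while the other ranks stand in for PETs whose point-to-point messages are held back until after the collective (i.e. "until epoch exit"). Neither side can reach the point that would unblock the other, so the program hangs by design when run on two or more ranks:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size, info = 0, payload = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 2) { MPI_Finalize(); return 0; }   /* needs at least two ranks */

    if (rank == 0) {
        /* "Receiving PET" inside the epoch: blocked probing for the message
         * that rank 1 has deferred; it never reaches the broadcast below. */
        MPI_Status st;
        MPI_Probe(1, 0, MPI_COMM_WORLD, &st);                  /* hangs here */
        MPI_Recv(&payload, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Bcast(&info, 1, MPI_INT, 0, MPI_COMM_WORLD);
    } else {
        /* "Sending PET" inside the epoch: its point-to-point message is held
         * back until after the collective (standing in for "until epoch
         * exit"), but a non-root MPI_Bcast cannot complete before the root
         * reaches it, so the collective hangs. */
        MPI_Bcast(&info, 1, MPI_INT, 0, MPI_COMM_WORLD);       /* hangs here */
        MPI_Send(&payload, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }

    printf("rank %d finished\n", rank);   /* never reached */
    MPI_Finalize();
    return 0;
}
```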
The idea of this issue is to remove this limitation.
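One possible shape for such an epoch-aware collective, sketched here in plain MPI C rather than in ESMF's VM internals: keep a flag recording whether an epoch is active, fall through to the normal blocking MPI_Bcast() outside an epoch, and inside an epoch post the broadcast with the non-blocking MPI_Ibcast() and complete it together with the rest of the deferred traffic when the epoch is flushed. The names epoch_active, pending_reqs, and epoch_flush() are hypothetical stand-ins for whatever bookkeeping the VM layer actually keeps, and this is not necessarily the mechanism Gerhard has in mind; a real fix would also have to decide when the broadcast result becomes available to the caller.

```c
#include <assert.h>
#include <mpi.h>

#define MAX_PENDING 64

static int epoch_active = 0;                  /* hypothetical: is an epoch open?  */
static MPI_Request pending_reqs[MAX_PENDING]; /* hypothetical: deferred requests  */
static int num_pending = 0;

/* Epoch-aware broadcast: blocking outside an epoch, deferred inside one. */
int epoch_aware_bcast(void *buf, int count, MPI_Datatype type, int root,
                      MPI_Comm comm) {
    if (!epoch_active)
        return MPI_Bcast(buf, count, type, root, comm);

    /* Inside an epoch: post non-blocking and complete at flush time. */
    assert(num_pending < MAX_PENDING);
    return MPI_Ibcast(buf, count, type, root, comm,
                      &pending_reqs[num_pending++]);
}

/* Called when the epoch is exited: after the aggregated point-to-point
 * messages have been sent (omitted here), complete all deferred collectives. */
void epoch_flush(void) {
    MPI_Waitall(num_pending, pending_reqs, MPI_STATUSES_IGNORE);
    num_pending = 0;
    epoch_active = 0;
}
```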
Gerhard: This may be low priority because its impact is on the fringe. The next time this ticket is reviewed, also consider the error-check ticket #360, to decide whether to go directly to fixing the root cause (this ticket) or to take the intermediate step of adding an error check.