Fix segfaults for MPCD with large particle counts under MPI #1897
Conversation
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
Not stale; I haven't had time to finish this off.
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
@joaander I will rebase this onto
Force-pushed from bba7e0f to f23d6e0.
@joaander this should be ready to take a look at! I updated the regular particle communicator to use these new data structures, but I didn't touch the use of
if (rank == root)
    {
    /*
     * First sort the particles into tag order. This is probably a slow step, but could be
     * sped up by sorting on each rank first then merging the sorted segments here.
I can make this optimization in a subsequent PR, as well as packing the buffers on the GPU. I left it off for now in order to get this finalized.
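A minimal sketch of that optimization, assuming std::vector containers of (tag, index) pairs as stand-ins for the actual MPCD arrays (the names here are placeholders, not the real data structures):

// Sketch only: sort by tag on each rank, then merge the already-sorted
// per-rank segments on the root instead of sorting the full gathered array.
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <utility>
#include <vector>

using TagIndex = std::pair<uint32_t, uint32_t>; // (particle tag, local index)

// Each rank sorts its own segment by tag before sending.
void sortLocalByTag(std::vector<TagIndex>& local)
    {
    std::sort(local.begin(), local.end());
    }

// On the root: merge the per-rank segments, each already sorted by tag.
std::vector<TagIndex> mergeSortedSegments(const std::vector<std::vector<TagIndex>>& segments)
    {
    std::vector<TagIndex> merged;
    for (const auto& seg : segments)
        {
        std::vector<TagIndex> next;
        next.reserve(merged.size() + seg.size());
        std::merge(merged.begin(), merged.end(), seg.begin(), seg.end(), std::back_inserter(next));
        merged = std::move(next);
        }
    return merged;
    }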
OK, no problem. You might find the GatherTagOrder class helpful in implementing the optimization. I wrote it to solve a similar problem when writing GSD files and gathering forces.
Lines 511 to 531 in f725ae3
/// Helper class that gathers local values from ranks and orders them in ascending tag order.
/**
    To use:

    1) Call setLocalTagsSorted() to establish the sizes and orders of the arrays to be gathered.
    2) Call gatherArray() to combine the per-rank local arrays into a global array in ascending tag
       order.
    3) Repeat step 2 for as many arrays as needed (all must be in the same tag order).

    GatherTagOrder maintains internal cache variables. Reuse an existing GatherTagOrder instance
    to avoid costly memory reallocations.

    Future expansion: Provide an alternate entry point setLocalTagsUnsorted() that allows callers
    to provide arrays directly in index order.
**/
class GatherTagOrder
    {
    public:
    /// Construct GatherTagOrder
    /** \param mpi_comm MPI Communicator.
        \param root Rank to gather on.
    */
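For orientation, a hypothetical usage sketch: the constructor arguments and the two-step call sequence follow the comment above, but the exact gatherArray signature (argument order, element types) is assumed rather than copied from the real header.

// Hypothetical sketch only: GatherTagOrder signatures are assumed, not taken
// from the actual HOOMD header.
std::vector<unsigned int> local_tags; // this rank's tags, sorted ascending
std::vector<Scalar4> local_pos;       // per-rank positions, in the same tag order
std::vector<Scalar4> global_pos;      // filled on the root rank only

GatherTagOrder gather(mpi_comm, root);

// 1) Establish the sizes and tag order of the arrays to be gathered.
gather.setLocalTagsSorted(local_tags);

// 2) Gather each per-rank array into a global, tag-ordered array on the root.
gather.gatherArray(global_pos, local_pos);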
Thanks! I remembered that you had written this and wanted to try it, but didn't have time right now. I will keep this in mind when I come back to it later. Gathering these snapshots can be very slow.
Thanks!
FYI: While running with one of the other branches that didn't have this fix, I found that some MPI libraries internally check and report the error we were unintentionally triggering (count too large). This supports the conclusion that this is the right fix.
It is strange that the MPI API hasn't addressed this defect more aggressively. It has been 25 years since 64-bit processors became commonplace.
It is indeed strange and frustrating. It should be addressed as part of the MPI-4 standard, but that will probably take a while to percolate through the implementations. https://www.mpi-forum.org/docs/mpi-4.1/mpi41-report/node499.htm#Node499
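For illustration, a minimal sketch of the difference, assuming an MPI 4.x library is available: the classic calls take a signed int count, while the MPI-4 _c variants (for example MPI_Bcast_c) take an MPI_Count, which is at least 64 bits.

#include <mpi.h>
#include <cstddef>
#include <limits>
#include <stdexcept>
#include <vector>

void broadcastBytes(std::vector<char>& buf, int root, MPI_Comm comm)
    {
    const std::size_t nbytes = buf.size();
#if MPI_VERSION >= 4
    // MPI-4 large-count variant: the count parameter is an MPI_Count.
    MPI_Bcast_c(buf.data(), static_cast<MPI_Count>(nbytes), MPI_CHAR, root, comm);
#else
    // Pre-MPI-4: the count is a signed int, so anything over 2^31-1 bytes overflows.
    if (nbytes > static_cast<std::size_t>(std::numeric_limits<int>::max()))
        {
        throw std::runtime_error("MPI broadcast exceeds the signed int byte limit");
        }
    MPI_Bcast(buf.data(), static_cast<int>(nbytes), MPI_CHAR, root, comm);
#endif
    }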
Description
This PR refactors parts of the MPCD code that relied on generic serialized MPI routines to instead use custom MPI datatypes. This increases the amount of data that can be sent in a single call because the 32-bit MPI count now refers to whole particle elements rather than individual serialized bytes. This is an attempt to fix segfaults reported when either initializing from or taking a snapshot with a large number of MPCD particles.
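As a general illustration of the technique (the struct below is a hypothetical stand-in, not the actual MPCD particle element), committing a custom MPI datatype lets the 32-bit count refer to whole particles instead of raw bytes:

#include <mpi.h>
#include <cstddef>

// Hypothetical per-particle payload; the real MPCD element differs.
struct pdata_element
    {
    double pos[3];
    double vel[3];
    unsigned int tag;
    unsigned int type;
    };

// Build an MPI datatype describing one particle so that collective calls
// count particles (sizeof(pdata_element) bytes each) instead of raw bytes.
MPI_Datatype makeParticleType()
    {
    MPI_Datatype tmp, particle_type;
    int block_lengths[3] = {3, 3, 2};
    MPI_Aint displacements[3] = {static_cast<MPI_Aint>(offsetof(pdata_element, pos)),
                                 static_cast<MPI_Aint>(offsetof(pdata_element, vel)),
                                 static_cast<MPI_Aint>(offsetof(pdata_element, tag))};
    MPI_Datatype types[3] = {MPI_DOUBLE, MPI_DOUBLE, MPI_UNSIGNED};
    MPI_Type_create_struct(3, block_lengths, displacements, types, &tmp);
    // Resize so the extent matches the C++ struct layout, including padding.
    MPI_Type_create_resized(tmp, 0, sizeof(pdata_element), &particle_type);
    MPI_Type_commit(&particle_type);
    MPI_Type_free(&tmp);
    return particle_type;
    }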
I replaced most uses of bcast with MPI_Bcast while I was at it, since this is a relatively simple MPI call. I also added exceptions when the serialized methods exceed the byte count that can be stored in a signed int.
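A rough sketch of the kind of guard described, with placeholder names (checkSerializedSize is hypothetical, not the actual HOOMD function):

#include <limits>
#include <stdexcept>
#include <string>
#include <vector>

// Placeholder for the serialized-send path: refuse byte counts that do not
// fit in the signed int that the classic MPI calls take.
void checkSerializedSize(const std::vector<char>& serialized_buffer)
    {
    const std::size_t nbytes = serialized_buffer.size();
    if (nbytes > static_cast<std::size_t>(std::numeric_limits<int>::max()))
        {
        throw std::runtime_error("Serialized MPI message is " + std::to_string(nbytes)
                                 + " bytes, which exceeds the 2^31-1 byte limit of int counts");
        }
    }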
Motivation and context
Large numbers of particles appear to overflow the signed int byte counts used by the serialized MPI methods.
Resolves #1895
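For a rough sense of scale (illustrative numbers, not measurements from the code): the classic serialized MPI calls take their byte count as a signed 32-bit int, which caps a single message at 2^31 - 1 bytes, about 2.1 GB. At, say, 50 bytes of serialized data per particle, a snapshot of roughly 40 million particles already exceeds that limit.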
How has this been tested?
Existing tests pass. I will confirm with the reporting user (or test on a cluster myself) that this fixes the segfaults they were observing.
Change log
Checklist:
My name is on the list of contributors (sphinx-doc/credits.rst) in the pull request source branch.