
migration: fix re-entrant bug #431

Merged
merged 1 commit into from
Apr 21, 2024


csegarragonz
Collaborator

This PR fixes a rare race condition that only happened in environments where the same app is migrated many times.

In particular, this bug only appeared when the same application migrated away from one host and then migrated back into it. Migrating into a new host (with respect to the previous scheduling decision) requires one of the migrated-to ranks to run the world initialisation, which sets the local-remote leaders and the in-memory queues. However, the second migration above did not trigger the "new world" migration procedure, because the world had lingered in the per-node registry of the host it had left.

The bug materialised as applications holding an old version of the host-port mappings and failing to start.

The fix detects when a host is being evicted for a given world id and, if so, clears that world from the registry.
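To illustrate the idea, here is a minimal sketch of a per-node registry that drops a world when its host is evicted. It is not the code in this PR: `World`, `WorldRegistry`, `onMigration`, and the `mappingsVersion` field are hypothetical names made up for the example, and the real world initialisation does far more than store a version number.

```cpp
#include <cassert>
#include <map>
#include <mutex>
#include <set>
#include <string>

// Hypothetical stand-in for an MPI world: it only remembers which version of
// the host-port mappings it was initialised with.
struct World
{
    int id;
    int mappingsVersion;
};

// Hypothetical per-node registry that caches worlds by id.
class WorldRegistry
{
  public:
    // Return the cached world, running "initialisation" only on a cache miss.
    World getOrInitialise(int worldId, int currentMappingsVersion)
    {
        std::lock_guard<std::mutex> lock(mx);
        auto it = worlds.find(worldId);
        if (it == worlds.end()) {
            it = worlds
                   .emplace(worldId, World{ worldId, currentMappingsVersion })
                   .first;
        }
        return it->second;
    }

    bool contains(int worldId)
    {
        std::lock_guard<std::mutex> lock(mx);
        return worlds.count(worldId) > 0;
    }

    // Forget a world so that the next lookup re-runs initialisation.
    void clear(int worldId)
    {
        std::lock_guard<std::mutex> lock(mx);
        worlds.erase(worldId);
    }

  private:
    std::mutex mx;
    std::map<int, World> worlds;
};

// Assumed hook, run on a host when a migration decision arrives: if this host
// no longer appears in the new decision, it is being evicted for this world,
// so the lingering registry entry is cleared.
void onMigration(WorldRegistry& registry,
                 int worldId,
                 const std::string& thisHost,
                 const std::set<std::string>& hostsAfterMigration)
{
    if (hostsAfterMigration.count(thisHost) == 0) {
        registry.clear(worldId);
    }
}
```

In this sketch, a world that is never cleared keeps returning the mappings version it was first created with, which mirrors the stale host-port mappings described above. Once `onMigration` has evicted the host, a later `getOrInitialise` for the same world id misses the cache and initialises the world afresh.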


codecov bot commented Apr 21, 2024

Codecov Report

Attention: Patch coverage is 17.94872%, with 32 lines in your changes missing coverage. Please review.

Project coverage is 81.76%. Comparing base (7304a61) to head (0b6d532).

| Files | Patch % | Lines |
|---|---|---|
| src/executor/Executor.cpp | 0.00% | 20 Missing ⚠️ |
| src/mpi/MpiWorld.cpp | 38.88% | 11 Missing ⚠️ |
| src/mpi/MpiWorldRegistry.cpp | 0.00% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #431      +/-   ##
==========================================
- Coverage   82.05%   81.76%   -0.30%     
==========================================
  Files         115      115              
  Lines        7628     7660      +32     
==========================================
+ Hits         6259     6263       +4     
- Misses       1369     1397      +28     


@csegarragonz csegarragonz merged commit aef8f6e into main Apr 21, 2024
11 of 12 checks passed
@csegarragonz csegarragonz deleted the fix-migration branch April 21, 2024 08:36