-
Notifications
You must be signed in to change notification settings - Fork 922
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Resolve race-condition in disable_module_accelerator
#17811
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks fine to me.
saved = self._use_fast_lib | ||
self._use_fast_lib = False | ||
try: | ||
yield |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be a dumb question Is it possible for self._use_fast_lib
to be modified during this yield statement?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Making this move seems to be also bringing back the race-condition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this makes sense. Suppose we have two threads, initially use_fast_lib
is True
.
Now let's run
with disabled():
some_code()
The idea with the context manager is that inside the block use_fast_lib
is False
.
However if there are two threads then we can have the following scenario ("time" is increasing downwards)
# Initially use_fast_lib = True
thread-A thread-B
acquire_lock()
saved = use_fast_lib # reads True acquire_lock() # blocks
use_fast_lib = False
release_lock()
saved = use_fast_lib # reads False
use_fast_lib = False
some_code() # sees use_fast_lib = False
release_lock()
acquire_lock()
use_fast_lib = saved # writes True
release_lock()
some_code() # sees use_fast_lib = True
acquire_lock()
use_fast_lib = saved # writes False
release_lock()
So there are two bugs here:
- On thread-B
some_code
thinks it is running withuse_fast_lib
disabled, but in fact is running withuse_fast_lib
enabled. - After thread-B's context manager exits, it restores the saved version of
use_fast_lib
, which isFalse
. And we end up outside the block withuse_fast_lib = False
on both threads, whereas it should beTrue
.
@@ -613,7 +616,7 @@ def disable_module_accelerator() -> contextlib.ExitStack: | |||
""" | |||
Temporarily disable any module acceleration. | |||
""" | |||
with contextlib.ExitStack() as stack: | |||
with ImportLock(), contextlib.ExitStack() as stack: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Is it ok if we call stack.enter_context(ImportLock())
after this line?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried this change too and the race-condition seems to be coming back.
Note: I tried both suggestions separately and together, all three combinations result in branch-25.02
behavior i.e., a race condition.
/merge |
133e0c8
into
rapidsai:branch-25.02
Description
Fixes: #17775
This PR fixes a race condition that arises when
disable_module_accelerator
is used in a multi-threaded setting.Checklist