Should the number of contexts correspond one-to-one with the number of threads? #759
Comments
@soumagne |
I can't see much from that trace but in general you'd want to make sure that you set |
@soumagne
In this situation, we found that crashes become likely once the number of threads reaches a certain level, such as 100. The call stack shows that the crash occurs in the na_sm_progress function. What is the reason for this crash? The call stack at the time of the crash is attached: mercury_crash_stack.txt Another question: |
It seems fine for that particular usage, though this combination of one class + multiple contexts is not extensively tested. This might be an issue specific to the shared-memory plugin; I will need to see if I can reproduce it. As I was mentioning, you could instead try to use one class + one context per thread and see if that works better. That would also prevent any contention that might occur underneath when using multiple contexts per class. |
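The "one class + one context per thread" layout suggested above can be sketched with plain pthreads. This is only an illustration of the ownership pattern, not Mercury code: `sketch_class_t`, `sketch_context_t`, and every `sketch_*` function are hypothetical stand-ins for the class and context handles and for the calls that would create, progress, and destroy them (in Mercury, roughly `HG_Init`, `HG_Context_create`, `HG_Progress`, `HG_Context_destroy`, and `HG_Finalize`). The helper `run_workers` is likewise an assumption made for the example.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-ins for a Mercury class and context. */
typedef struct sketch_class { int id; } sketch_class_t;
typedef struct sketch_context {
    sketch_class_t *cls;   /* class this context belongs to */
    int progress_calls;    /* how many times progress was made */
} sketch_context_t;

/* Stand-in for class initialization (HG_Init in Mercury). */
static sketch_class_t *sketch_class_init(int id) {
    sketch_class_t *cls = malloc(sizeof *cls);
    if (cls != NULL) cls->id = id;
    return cls;
}

/* Stand-in for context creation (HG_Context_create in Mercury). */
static sketch_context_t *sketch_context_create(sketch_class_t *cls) {
    sketch_context_t *ctx = malloc(sizeof *ctx);
    if (ctx != NULL) { ctx->cls = cls; ctx->progress_calls = 0; }
    return ctx;
}

/* Stand-in for a progress call; touches only this thread's context. */
static void sketch_progress(sketch_context_t *ctx) { ctx->progress_calls++; }

typedef struct worker_arg { int id; int iterations; int ok; } worker_arg_t;

/* Each thread creates, uses, and destroys its own class and context,
 * so no handle is ever shared between threads. */
static void *worker(void *p) {
    worker_arg_t *arg = p;
    sketch_class_t *cls = sketch_class_init(arg->id);
    if (cls == NULL) return NULL;
    sketch_context_t *ctx = sketch_context_create(cls);
    if (ctx == NULL) { free(cls); return NULL; }

    assert(ctx->cls == cls);
    for (int i = 0; i < arg->iterations; i++)
        sketch_progress(ctx);
    arg->ok = (ctx->progress_calls == arg->iterations);

    free(ctx);  /* context is destroyed before its class */
    free(cls);
    return NULL;
}

/* Starts nthreads workers and returns how many completed successfully,
 * or -1 if the arguments are invalid or a thread could not be started. */
int run_workers(int nthreads, int iterations) {
    if (nthreads <= 0 || iterations < 0) return -1;
    pthread_t *tids = calloc((size_t)nthreads, sizeof *tids);
    worker_arg_t *args = calloc((size_t)nthreads, sizeof *args);
    if (tids == NULL || args == NULL) { free(tids); free(args); return -1; }

    int started = 0;
    for (int i = 0; i < nthreads; i++) {
        args[i].id = i;
        args[i].iterations = iterations;
        if (pthread_create(&tids[i], NULL, worker, &args[i]) != 0) break;
        started++;
    }

    int ok = 0;
    for (int i = 0; i < started; i++) {
        pthread_join(tids[i], NULL);
        ok += args[i].ok;
    }
    free(tids);
    free(args);
    return started == nthreads ? ok : -1;
}
```

With real Mercury handles in place of the stand-ins, each thread would drive progress only on the context it created, so no context or class would be shared across threads. Whether this avoids the crash reported here is untested.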
Describe the bug
In Mercury 2.3.1, applications using the TensorFlow framework create many threads, roughly 100 within a process. If each thread has its own Mercury context, the application crashes at run time with the following output.
[0]:pid:2983066, tid:2984568, 08/21/24 15:26:31.838843 WARNING: PRINT_BACKTRACE: get a signal(11), pid:2983066, tid:2984568 in client.c(488)
[0]:pid:2983066, tid:2984568, 08/21/24 15:26:31.839184 WARNING: PRINT_BACKTRACE: symbols=0x7fb640080410 pid:2983066, tid:2984568 in client.c(488)
[0]:pid:2983066, tid:2984568, 08/21/24 15:26:31.839222 WARNING: Call stack: in client.c(488)
[0]:pid:2983066, tid:2984568, 08/21/24 15:26:31.839261 WARNING: /xxx/libxxx-client.so(releasexxxClient+0x82) [0x7fb8bc159882] in client.c(488)
[0]:pid:2983066, tid:2984568, 08/21/24 15:26:31.839290 WARNING: /usr/lib64/libc.so.6(+0x37400) [0x7fb8ba8ef400] in client.c(488)
[0]:pid:2983066, tid:2984568, 08/21/24 15:26:31.839319 WARNING: /xxx/mercurylib/libna.so.4(+0x1b6cb) [0x7fb8ba25e6cb] in client.c(488)
[0]:pid:2983066, tid:2984568, 08/21/24 15:26:31.839346 WARNING: /xxx/mercurylib/libmercury.so.2(+0x1084d) [0x7fb8ba48184d] in client.c(488)
[0]:pid:2983066, tid:2984568, 08/21/24 15:26:31.839373 WARNING: /xxx/mercurylib/libmercury.so.2(+0x1254a) [0x7fb8ba48354a] in client.c(488)
[0]:pid:2983066, tid:2984568, 08/21/24 15:26:31.839399 WARNING: /xxx/mercurylib/libmercury.so.2(HG_Core_progress+0x70) [0x7fb8ba48a840] in client.c(488)
[0]:pid:2983066, tid:2984568, 08/21/24 15:26:31.839433 WARNING: /xxx/mercurylib/libmercury.so.2(HG_Progress+0xe) [0x7fb8ba47a17e] in client.c(488)
[0]:pid:2983066, tid:2984568, 08/21/24 15:26:31.839460 WARNING: /xxx/libxxx-client.so(+0x5b301) [0x7fb8bc0ae301] in client.c(488)
[0]:pid:2983066, tid:2984568, 08/21/24 15:26:31.839486 WARNING: /xxx/mercurylib/libmercury_util.so.4(hg_request_wait+0xe6) [0x7fb8ba03e996] in client.c(488)
[0]:pid:2983066, tid:2984568, 08/21/24 15:26:31.839512 WARNING: /xxx/libxxx-client.so(send_cumemcpyhtodasync_v2+0x119) [0x7fb8bc0b3fa9] in client.c(488)
However, if only one context is created for all these threads, the application runs normally but with poor performance.
So how many contexts is it reasonable to create? Is there a recommended value?